
Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Wed Jun 25, 2008 5:35 am
by GTron
One of my dedicated folders is having a problem with Project: 2665 (Run 1, Clone 649, Gen 6). It immediately dies with CoreStatus = 66 (102), which the client reports as "Shutdown requested by user." I did not, of course, attempt to shut it down. I also see in the syslog that 3 of the 4 core processes segfault at that moment. It has done this multiple times. Since I'd love to get past this WU and get this folder folding again, I'll try to dump it, but I wanted to report it first.

System: Ubuntu 8.04, Linux 6.02beta1 client, Q6600 @ 2.88 GHz, 2 GB RAM.

FAHlog.txt:

Code:

--- Opening Log file [June 25 05:06:16] 


# SMP Client ##################################################################
###############################################################################

                       Folding@Home Client Version 6.02beta

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/smpfold/foldingathome/CPU1
Executable: /home/smpfold/foldingathome/CPU1/fah6
Arguments: -forceasm -smp -verbosity 9 

Warning:
 By using the -forceasm flag, you are overriding
 safeguards in the program. If you did not intend to
 do this, please restart the program without -forceasm.
 If work units are not completing fully (and particularly
 if your machine is overclocked), then please discontinue
 use of the flag.

[05:06:16] - Ask before connecting: No
[05:06:16] - User name: GTron (Team 0)
[05:06:16] - User ID: 76E5E3D439736F7C
[05:06:16] - Machine ID: 5
[05:06:16] 
[05:06:16] Could not open work queue, generating new queue...
[05:06:16] - Autosending finished units...
[05:06:16] Trying to send all finished work units
[05:06:16] + No unsent completed units remaining.
[05:06:16] - Autosend completed
[05:06:16] - Preparing to get new work unit...
[05:06:16] + Attempting to get work packet
[05:06:16] - Will indicate memory of 1536 MB
[05:06:16] - Detect CPU. Vendor: GenuineIntel, Family: 6, Model: 15, Stepping: 7
[05:06:16] - Connecting to assignment server
[05:06:16] Connecting to http://assign.stanford.edu:8080/
[05:06:16] Posted data.
[05:06:16] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[05:06:16] + News From Folding@Home: Welcome to Folding@Home
[05:06:16] Loaded queue successfully.
[05:06:16] Connecting to http://171.64.65.64:8080/
[05:06:21] Posted data.
[05:06:21] Initial: 0000; - Receiving payload (expected size: 4659162)
[05:06:30] - Downloaded at ~505 kB/s
[05:06:30] - Averaged speed for that direction ~505 kB/s
[05:06:30] + Received work.
[05:06:30] + Closed connections
[05:06:30] 
[05:06:30] + Processing work unit
[05:06:30] Core required: FahCore_a1.exe
[05:06:30] Core found.
[05:06:30] Working on Unit 01 [June 25 05:06:30]
[05:06:30] + Working ...
[05:06:30] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 15 -forceasm -verbose -lifeline 7412 -version 602'

[05:06:30] 
[05:06:30] *------------------------------*
[05:06:30] Folding@Home Gromacs SMP Core
[05:06:30] Version 1.74 (November 27, 2006)
[05:06:30] 
[05:06:30] Preparing to commence simulation
[05:06:30] - Ensuring status. Please wait.
[05:06:30] - Starting from initial work packet
[05:06:31] 
[05:06:31] Project: 2665 (Run 1, Clone 649, Gen 6)
[05:06:31] 
[05:06:31] Assembly optimizations on if available.
[05:06:31] Entering M.D.
[05:06:48]  on if available.
[05:06:48] Entering M.D.
[05:06:55] X in water
[05:06:55] Writing local files
[05:06:55] 
[05:06:55] Folding@hoFinalizing output
[05:06:55] Extra SSE boost OK.
[05:06:55] E boost OK.
[05:06:59] CoreStatus = 66 (102)
[05:06:59] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[05:06:59] Killing all core threads

Folding@Home Client Shutdown.
syslog:

Code:

Jun 24 23:06:55 GHARVCO-17 kernel: [  708.238208] FahCore_a1.exe[7492]: segfault at 11114c0 rip 5ce05e rsp 40ef3aa0 error 4
Jun 24 23:06:55 GHARVCO-17 kernel: [  708.256264] FahCore_a1.exe[7493]: segfault at 1112360 rip 5ce07f rsp 40ef3aa0 error 4
Jun 24 23:06:55 GHARVCO-17 kernel: [  708.298413] FahCore_a1.exe[7496]: segfault at 11154c0 rip 5ce074 rsp 408c5aa0 error 4
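
For anyone triaging similar reports: the trailing "error 4" in those kernel lines encodes the fault type. A hedged reading, with a grep to pull the relevant lines (the syslog path varies by distro; the bit meanings below are the standard x86 page-fault conventions, not anything FahCore-specific):

Code:

# Pull the FahCore segfaults out of the syslog
# (the file may be /var/log/messages on some distros).
grep 'FahCore_a1.exe.*segfault' /var/log/syslog

# "error 4" is binary 100:
#   bit 0 = 0 -> the page was not present
#   bit 1 = 0 -> the faulting access was a read
#   bit 2 = 1 -> the fault occurred in user mode
# i.e. a user-space read of an unmapped address, consistent with the core
# dereferencing a bad pointer while processing this work unit.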

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Wed Jun 25, 2008 8:15 am
by nwkelley
OK, thanks. I'll try to remember to check whether anyone else is able to return the unit. If you get assigned this work unit again later with similar results, please let us know!

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Thu Jun 26, 2008 1:00 pm
by sick willie
Same thing here. After 4 tries yesterday, I finally got another WU, and after finishing it, I'm back to this one: 2665 (Run 1, Clone 649, Gen 6), which does this:

Code:

[12:08:37] 
[12:08:37] *------------------------------*
[12:08:37] Folding@Home Gromacs SMP Core
[12:08:37] Version 1.74 (November 27, 2006)
[12:08:37] 
[12:08:37] Preparing to commence simulation
[12:08:37] - Ensuring status. Please wait.
[12:08:37] Created dyn
[12:08:37] - Files status OK
[12:08:38] - Expanded 4658650 -> 24111057 (decompressed 517.5 percent)
[12:08:38] - Starting from initial work packet
[12:08:38] 
[12:08:38] Project: 2665 (Run 1, Clone 649, Gen 6)
[12:08:38] 
[12:08:38] Assembly optimizations on if available.
[12:08:38] Entering M.D.
[12:08:55] 5 percent)
[12:08:55] - Starting from initial work packet
[12:08:55] 
[12:08:55] Project: 2Entering M.D.
[12:08:55] one 649, Gen 6)
[12:08:55] 
[12:08:55] Entering M.D.
[12:09:02] 
[12:09:02] cal files
[12:09:02] Extra SSE boost OK.
[12:09:02] ocal files
[12:09:02] Finalizing output
[12:09:02] Extra SSE boost OK.
[12:09:07] CoreStatus = 66 (102)
[12:09:07] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[12:09:07] Killing all core threads

Folding@Home Client Shutdown.

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Thu Jun 26, 2008 1:40 pm
by Gary480six
I had a quad machine running Ubuntu that got stuck on the same work unit today: P2665 R1 C649 G6.
It would segfault before it even started, and then Folding would just shut off. I'm not sure how long it sat there doing nothing before I found it.

Also, because it did not EUE or hit some other defined fault, every time I restarted Folding it would restart the same work unit. I was forced to dump the work folder and queue several times until the client was issued a different work unit (a sketch of the dump procedure follows below).
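
For reference, the dump procedure amounts to stopping the client and clearing its queue and work directory. A minimal sketch, assuming the default v6 SMP client layout (the install path below is illustrative, borrowed from the first post's log):

Code:

# Stop the running fah6 client first (Ctrl-C or kill), then:
cd /home/smpfold/foldingathome/CPU1   # illustrative path
rm -f queue.dat                       # client regenerates the queue on startup
rm -rf work/                          # discards the stuck work unit
./fah6 -smp -verbosity 9              # restart; a fresh unit is downloaded

As several posts in this thread show, you may still be reassigned the same unit afterwards.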

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Sat Jun 28, 2008 5:45 am
by ppetrone
OK, thank you. I will notify the researcher in charge of this project about your report.

Paula

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Sat Jun 28, 2008 3:06 pm
by sick willie
I've already reported this on one of my machines. I came in this morning and another machine has this same WU, with the same result. :( It'd be okay if the machine could dump it and start over, but it requires a manual restart. It's bad enough that the points suck on the 2665s, but this is just adding insult to injury.
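
Until a bad unit is pulled from circulation, an unattended machine can only be protected by something outside the client. A purely hypothetical watchdog sketch (not part of the official client; the path and thresholds are made up for illustration) that restarts fah6 and dumps the unit after repeated immediate deaths:

Code:

#!/bin/bash
# Hypothetical watchdog: restart fah6 whenever it exits, and dump the
# work unit if it dies almost immediately several times in a row.
cd /home/smpfold/foldingathome/CPU1   # illustrative path
fails=0
while true; do
    start=$(date +%s)
    ./fah6 -smp -verbosity 9          # blocks until the client exits
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -lt 120 ]; then   # died within two minutes of starting
        fails=$((fails + 1))
    else
        fails=0
    fi
    if [ "$fails" -ge 3 ]; then
        # Three immediate exits in a row: assume a bad WU and dump it.
        rm -f queue.dat
        rm -rf work/
        fails=0
    fi
    sleep 10
done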

Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Sun Jul 13, 2008 8:02 am
by Tigerbiten
Project: 2665 (Run 1, Clone 649, Gen 6) is just shutting the client down on my systems.

Code:

[07:56:24] Project: 2665 (Run 1, Clone 649, Gen 6)
[07:56:24] 
[07:56:25] Assembly optimizations on if available.
[07:56:25] Entering M.D.
[07:56:40] 
[07:56:41] - Expanded 4658650 -> 24111057 (decompressed 517.5 percent)
[07:56:41] 
[07:56:41] Project: 2665 (Run 1, Clone 649, Gen 6)
[07:56:41] 
[07:56:41] Entering M.D.
[07:56:50] Protein: IBX in water
[07:56:50] Writing local files
[07:56:51] Extra SSE boost OK.
[07:56:52] tra SSE boost OK.
[07:56:56] CoreStatus = 66 (102)
[07:56:56] + Shutdown requested by user. Exiting.***** Got a SIGTERM signal (15)
[07:56:56] Killing all core threads

Folding@Home Client Shutdown.
Tried four times to get it to run. That's all that happens.
Tried moving it to a different computer. Same result.
Deleted it and grabbed a different protein.
I've saved the work folder and queue.dat file if that helps.

Luck ............ :D

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Sun Jul 13, 2008 4:43 pm
by GTron
This appears to be a bad WU and has been previously reported (see http://foldingforum.org/viewtopic.php?f=19&t=3513). Too bad it is still in circulation...

Greg

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Mon Jul 14, 2008 7:16 am
by nwkelley
Thanks, it looks like it might indeed be a bad WU; it's not always easy to tell... I'll pass it along right away to Peter, who's monitoring the project...

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Mon Jul 14, 2008 7:17 am
by nwkelley
Thanks, guys!

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Mon Jul 14, 2008 1:49 pm
by toTOW
I see some people who were able to return that WU, but they all got partial credit for it...

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Sat Jul 26, 2008 3:07 pm
by sick willie
And all this time later, yet another of my machines gets this same WU, with the same result.

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Sat Jul 26, 2008 10:24 pm
by rada
Same symptoms as above: 2665 (Run 1, Clone 649, Gen 6) on the Linux 6.02 release client (Gigabyte GA-P35-DS3P mobo, Q6600, 2.6.25-gentoo-r6 kernel).

After multiple runs ending in the error and segfaults, I tried `./fah6 -delete <work_unit_number>`, but fah6 just seems to hang doing nothing and never deletes the work unit. Ctrl-C and SIGTERM do not end the 'fah6 -delete ...' process, but SIGHUP does. I moved queue.dat, the work directory, and FahCore_a1.exe to a fah_debug directory (effectively deleting them), but I kept getting the same unit assigned and having it immediately die.

Finally, I ended up starting fresh with only fah6, mpiexec, and client.cfg; up and working again.
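
One detail from the above worth keeping handy: if a `fah6 -delete` invocation hangs and ignores both Ctrl-C and SIGTERM, rada reports that SIGHUP does end it. A one-liner sketch (the process pattern is illustrative):

Code:

# Send SIGHUP to any hung 'fah6 -delete' invocation.
pkill -HUP -f 'fah6 -delete'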

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Mon Aug 18, 2008 3:20 am
by sick willie
It looks like sooner or later, I'm going to report this WU for every one of my machines. :(

Add another one to the count....

Re: Project: 2665 (Run 1, Clone 649, Gen 6)

Posted: Tue Aug 19, 2008 2:30 am
by sick willie
sick willie wrote: It looks like sooner or later, I'm going to report this WU for every one of my machines. :(

Add another one to the count....
Which is apparently the goal. Add yet another one to the list. This makes 5 of my machines that have gotten this WU. I first reported it on 06/26. If I've now received it on 5 machines in two months... well, never mind; it's obvious Stanford is going to keep circulating this WU ad infinitum. :o