
More Bad WUs From Project: 2665 (Run 3, Clone 807, Gen 37)

Posted: Wed Aug 13, 2008 10:59 am
by Andrius
Project: 2665 (Run 3, Clone 807, Gen 37)
Died after 2 frames, failed to finalize, killed the client (with the popup).

[August 13 ]
[07:38:53] Preparing to commence simulation
[07:38:53] - Looking at optimizations...
[07:38:53] - Created dyn
[07:38:53] - Files status OK
[07:39:08] - Expanded 4756305 -> 24426905 (decompressed 513.5 percent)
[07:39:08] - Starting from initial work packet
[07:39:08] 
[07:39:08] Project: 2665 (Run 3, Clone 807, Gen 37)
[07:39:08] 
[07:39:10] Assembly optimizations on if available.
[07:39:10] Entering M.D.
[07:39:16] Rejecting checkpoint
[07:39:17] 
[07:39:17] Writing local files
[07:39:18] 
[07:39:18] Writing local files
[07:39:26] Extra SSE boost OK.
[07:39:27] Writing local files
[07:39:27] Completed 0 out of 250000 steps  (0 percent)
[07:54:27] Timered checkpoint triggered.
[07:54:31] Writing local files
[07:54:31] Completed 2500 out of 250000 steps  (1 percent)
[08:09:31] Timered checkpoint triggered.
[08:09:36] Writing local files
[08:09:36] Completed 5000 out of 250000 steps  (2 percent)
[08:24:31] ning:  check for stray files
[08:24:31] 0.sas
[08:24:31] Warning:  check for stray files
[08:24:31] 
[08:24:31] Folding@home Core Shutdown: EARLY_UNIT_END
[08:24:31] Finalizing output
[08:24:31]  13501 bytes of core data to disk...
[08:24:31]   ... Done.
[08:26:31] 
[08:26:31] Folding@home Core Shutdown: EARLY_UNIT_END
[08:26:31] 
[08:26:31] Folding@home Core Shutdown: EARLY_UNIT_END
[08:26:35] CoreStatus = 63 (99)
[08:26:35] + Error starting Folding@Home core.
[08:26:40] 
[08:26:40] + Processing work unit
[08:26:40] Work type a1 not eligible for variable processors
[08:26:40] Core required: FahCore_a1.exe
[08:26:40] Core found.
[08:26:40] Working on queue slot 00 [August 13 08:26:40 UTC]
[08:26:40] + Working ...
[08:26:40] - Calling 'mpiexec -np 4 -channel shm -env MPICH_USE_SMP_OPTIMIZATIONS 1 -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 00 -checkpoint 15 -verbose -lifeline 1496 -version 622'

[08:26:41] 
[08:26:41] *------------------------------*
[08:26:41] Folding@Home Gromacs SMP Core
[08:26:41] Version 1.76 (February 23, 2008)
[08:26:41] 
[08:26:41] Preparing to commence simulation
[08:26:41] - Looking at optimizations...
[08:26:41] - Created dyn
[08:26:41] - Files status OK
[08:26:41] 
[08:26:41] Folding@home Core Shutdown: MISSING_WORK_FILES
[08:26:41] Finalizing output
[08:28:44] CoreStatus = 1 (1)
[08:28:44] Client-core communications error: ERROR 0x1
[08:28:44] This is a sign of more serious problems, shutting down.
[10:00:06] - Autosending finished units... [August 13 10:00:06 UTC]
[10:00:06] Trying to send all finished work units
[10:00:06] + No unsent completed units remaining.
[10:00:06] - Autosend completed
UPDATE:
I deleted the bad WU with "-delete x" and tried downloading a new WU, but I got the same one. It died again after 2 frames.
How do I get a different WU? Configure a new client with a different MachineID or something?

UPDATE2:
So after 3 failed attempts (and after deleting the work folder and queue.dat files 3 times) I got a different WU.
This time it's a Project: 2665 (Run 1, Clone 183, Gen 39) WU and it finished without problems.
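
For anyone comparing logs like the one above, here is a minimal sketch (not an official tool) that pulls the Project/Run/Clone/Gen line and the core shutdown statuses out of pasted log text. The two patterns are copied from the lines quoted in this post. The function name `summarize_log` is made up for illustration, and other client or core versions may word these lines differently.

```python
import re

# Patterns copied from log lines quoted in this thread (an assumption that
# other client/core versions use the same wording).
PRCG_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")


def summarize_log(text):
    """Return [(prcg, statuses), ...] in the order they appear in the log.

    prcg is a (project, run, clone, gen) tuple of ints. statuses lists the
    distinct core shutdown statuses logged after that Project line.
    """
    results = []
    for line in text.splitlines():
        match = PRCG_RE.search(line)
        if match:
            prcg = tuple(int(group) for group in match.groups())
            # Start a new entry only when the unit actually changes.
            if not results or results[-1][0] != prcg:
                results.append((prcg, []))
            continue
        match = SHUTDOWN_RE.search(line)
        if match and results:
            status = match.group(1)
            if status not in results[-1][1]:
                results[-1][1].append(status)
    return results


# Hypothetical usage with a few lines taken from the log above.
sample = """\
[07:39:08] Project: 2665 (Run 3, Clone 807, Gen 37)
[08:24:31] Folding@home Core Shutdown: EARLY_UNIT_END
[08:26:31] Folding@home Core Shutdown: EARLY_UNIT_END
[08:26:41] Folding@home Core Shutdown: MISSING_WORK_FILES
"""
for prcg, statuses in summarize_log(sample):
    print(prcg, statuses)
```

Run against the excerpt, this reports unit (2665, 3, 807, 37) with the statuses EARLY_UNIT_END and MISSING_WORK_FILES, which makes it easier to say exactly which unit failed and how when posting a report.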

Re: More Bad WUs From Project: 2665 (Run 3, Clone 807, Gen 37)

Posted: Wed Aug 13, 2008 6:30 pm
by d-con
Running 2665 (Run 3, Clone 165, Gen 40) I got a NaN at 19%.

This is Windows SMP client 5.91 on a stock box, no overclocking.

It's running the same WU again, now at 6%.

Re: More Bad WUs From Project: 2665 (Run 3, Clone 807, Gen 37)

Posted: Thu Aug 14, 2008 8:53 pm
by d-con
I finally killed the client and moved off queue.dat and the work directory when it was running the same WU/PRG for the 4th time.

Of course, it assigned the same WU/CRG again, so I killed it again, and finally I got a different PRG, same project.

2665 (3, 165, 40) doesn't work. It gets a NaN every time at 19% on an unmodified AMD-based system with the Windows SMP beta client 5.91.

It's now running 2665(1, 377, 41). I hope this one completes.

-David
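
The workaround described above (stop the client, then move queue.dat and the work directory out of the way) can be sketched as a small script. This is only an illustration under assumptions: `set_aside_work` and the backup folder name are hypothetical, the client directory is whatever your install uses, and the client must be fully stopped first. As the posts above show, the server may still hand out the same WU again, and any unfinished work in the moved files is abandoned.

```python
import shutil
import time
from pathlib import Path


def set_aside_work(client_dir, backup_root=None):
    """Move queue.dat and the work directory out of client_dir.

    Returns the backup directory that now holds them. Assumes the client
    is not running; moving files under a live client is not safe.
    """
    client_dir = Path(client_dir)
    backup_root = Path(backup_root) if backup_root else client_dir
    # Hypothetical naming scheme: one timestamped folder per attempt.
    backup = backup_root / time.strftime("set-aside-%Y%m%d-%H%M%S")
    backup.mkdir(parents=True, exist_ok=False)
    for name in ("queue.dat", "work"):
        src = client_dir / name
        if src.exists():
            shutil.move(str(src), str(backup / name))
    return backup
```

Calling `set_aside_work(path_to_client_folder)` keeps the old files in a timestamped folder instead of deleting them, so they can be put back if needed.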

Re: More Bad WUs From Project: 2665 (Run 3, Clone 807, Gen 37)

Posted: Mon Aug 18, 2008 10:20 am
by toTOW
Someone else completed Project: 2665 (Run 3, Clone 807, Gen 37) successfully ...

Same for Project: 2665 (Run 3, Clone 165, Gen 40).

Re: More Bad WUs From Project: 2665 (Run 3, Clone 807, Gen 37)

Posted: Wed Aug 20, 2008 7:54 pm
by Andrius
@toTOW
Any idea which client the other person used? I used 6.22 beta2 with the "SHM" fix.
Can you check this WU: Project: 2665 (Run 2, Clone 615, Gen 38)?
[12:05:34] Completed 50000 out of 250000 steps (20 percent)
[12:14:16] Warning: long 1-4 interactions
[12:14:17] Gromacs cannot continue further.

I updated to R3 and it's at 7%.
I've done 13 units so far with the SMP client.
This was my second error so I don't think my machine is unstable.

Re: More Bad WUs From Project: 2665 (Run 3, Clone 807, Gen 37)

Posted: Wed Aug 20, 2008 8:01 pm
by bruce
Nobody has returned Project: 2665 (Run 2, Clone 615, Gen 38) yet. (This may be related to the delayed stats announced earlier today.)

The data that the Mods can see doesn't tell us which client is being used. In any case, 6.22b2-shm and 5.91 are probably both using the same version of Gromacs, and it's unlikely that the client version number matters when the warning message is about a long 1-4 interaction.

Re: More Bad WUs From Project: 2665 (Run 3, Clone 807, Gen 37)

Posted: Wed Aug 20, 2008 8:25 pm
by Andrius
@bruce
True, but if it was a random instability the second run could fix it.
I'm guessing here, but if it was done on a Linux client, that could explain why it was completed (I'm not sure which projects are done on the Linux SMP clients).
If it fails again I'll try again.

UPDATE (August 21 21:10 UTC) : The unit finished successfully on the second run. :shock:
[20:39:19] Project: 2665 (Run 2, Clone 615, Gen 38)
[21:06:58] + Number of Units Completed: 15