
Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Tue Aug 26, 2008 3:53 pm
by 314159
Linux Client-Q6600-stock clock (not the same computer for which I have posted other EUEs)

At least this one did not "hang". :)

Code:

[12:19:47] Core required: FahCore_a1.exe
[12:19:47] Core found.
[12:19:47] Working on Unit 08 [August 26 12:19:47]
[12:19:47] + Working ...
[12:19:47] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 08 -checkpoint 15 -forceasm -verbose -lifeline 5847 -version 602'

[12:19:47] 
[12:19:47] *------------------------------*
[12:19:47] Folding@Home Gromacs SMP Core
[12:19:47] Version 1.74 (November 27, 2006)
[12:19:47] 
[12:19:47] Preparing to commence simulation
[12:19:47] - Ensuring status. Please wait.
[12:20:04] - Assembly optimizations manually forced on.
[12:20:04] - Not checking prior termination.
[12:20:05] - Expanded 4735055 -> 24426905 (decompressed 515.8 percent)
[12:20:05] - Starting from initial work packet
[12:20:05] 
[12:20:05] Project: 2665 (Run 0, Clone 479, Gen 20)
[12:20:05] 
[12:20:05] Assembly optimizations on if available.
[12:20:05] Entering M.D.
[12:20:11] Rejecting checkpoint
[12:20:13] Protein: HGG in waterExtra SSE boost OK.
[12:20:13] 
[12:20:13] Extra SSE boost OK.
[12:20:14] Writing local files
[12:20:14] Completed 0 out of 250000 steps  (0 percent)
[12:20:14] 
[12:20:14] Folding@home Core Shutdown: INTERRUPTED
[12:20:18] CoreStatus = 0 (0)
[12:20:18] Client-core communications error: ERROR 0x0
[12:20:18] Deleting current work unit & continuing...
[12:24:40] - Warning: Could not delete all work unit files (8): Core returned invalid code
[12:24:40] Trying to send all finished work units
[12:24:40] + No unsent completed units remaining.
[12:24:40] - Preparing to get new work unit...
[12:24:40] + Attempting to get work packet
[12:24:40] - Will indicate memory of 1024 MB
[12:24:40] - Connecting to assignment server
[12:24:40] Connecting to http://assign.stanford.edu:8080/
[12:24:40] Posted data.
[12:24:40] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[12:24:40] + News From Folding@Home: Welcome to Folding@Home
[12:24:40] Loaded queue successfully.
[12:24:40] Connecting to http://171.64.65.64:8080/
[12:24:46] Posted data.
[12:24:46] Initial: 0000; - Receiving payload (expected size: 4735567)
[12:24:49] - Downloaded at ~1541 kB/s
[12:24:49] - Averaged speed for that direction ~1133 kB/s
[12:24:49] + Received work.
[12:24:49] + Closed connections
[12:24:54] 
[12:24:54] + Processing work unit
[12:24:54] Core required: FahCore_a1.exe
[12:24:54] Core found.
[12:24:54] Working on Unit 09 [August 26 12:24:54]
[12:24:54] + Working ...
[12:24:54] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 09 -checkpoint 15 -forceasm -verbose -lifeline 5847 -version 602'

[12:24:54] 
[12:24:54] *------------------------------*
[12:24:54] Folding@Home Gromacs SMP Core
[12:24:54] Version 1.74 (November 27, 2006)
[12:24:54] 
[12:24:54] Preparing to commence simulation
[12:24:54] - Ensuring status. Please wait.
[12:25:11] - Assembly optimizations manually forced on.
[12:25:11] - Not checking prior termination.
[12:25:11] - Expanded 4735055 -> 24426905 (decompressed 515.8 percent)
[12:25:12] - Starting from initial work packet
[12:25:12] 
[12:25:12] Project: 2665 (Run 0, Clone 479, Gen 20)
[12:25:12] 
[12:25:12] Assembly optimizations on if available.
[12:25:12] Entering M.D.
[12:25:18] Rejecting checkpoint
[12:25:20] Protein: HGG in waterExtra SSE boost OK.
[12:25:20] 
[12:25:20] Extra SSE boost OK.
[12:25:21] Writing local files
[12:25:21] Completed 0 out of 250000 steps  (0 percent)
[12:25:21] 
[12:25:21] Folding@home Core Shutdown: INTERRUPTED
[12:25:25] CoreStatus = 0 (0)
[12:25:25] Client-core communications error: ERROR 0x0
[12:25:25] Deleting current work unit & continuing...
[12:29:47] - Warning: Could not delete all work unit files (9): Core returned invalid code
[12:29:47] Trying to send all finished work units
[12:29:47] + No unsent completed units remaining.
[12:29:47] - Preparing to get new work unit...
[12:29:47] + Attempting to get work packet
[12:29:47] - Will indicate memory of 1024 MB
[12:29:47] - Connecting to assignment server
[12:29:47] Connecting to http://assign.stanford.edu:8080/
[12:29:47] Posted data.
[12:29:47] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[12:29:47] + News From Folding@Home: Welcome to Folding@Home
[12:29:47] Loaded queue successfully.
[12:29:47] Connecting to http://171.64.65.64:8080/
[12:29:47] Posted data.
[12:29:47] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[12:29:48] - Attempt #1  to get work failed, and no other work to do.
             Waiting before retry.
[12:29:54] + Attempting to get work packet
[12:29:54] - Will indicate memory of 1024 MB
[12:29:54] - Connecting to assignment server
[12:29:54] Connecting to http://assign.stanford.edu:8080/
[12:29:54] Posted data.
[12:29:54] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[12:29:54] + News From Folding@Home: Welcome to Folding@Home
[12:29:54] Loaded queue successfully.
[12:29:54] Connecting to http://171.64.65.64:8080/
[12:30:00] Posted data.
[12:30:00] Initial: 0000; - Receiving payload (expected size: 4675358)
[12:30:03] - Downloaded at ~1521 kB/s
[12:30:03] - Averaged speed for that direction ~1211 kB/s
[12:30:03] + Received work.
[12:30:03] + Closed connections
[12:30:08] 
[12:30:08] + Processing work unit
[12:30:08] Core required: FahCore_a1.exe
[12:30:08] Core found.
[12:30:08] Working on Unit 00 [August 26 12:30:08]
[12:30:08] + Working ...
[12:30:08] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 00 -checkpoint 15 -forceasm -verbose -lifeline 5847 -version 602'

[12:30:08] 
[12:30:08] *------------------------------*
[12:30:08] Folding@Home Gromacs SMP Core
[12:30:08] Version 1.74 (November 27, 2006)
[12:30:08] 
[12:30:08] Preparing to commence simulation
[12:30:08] - Ensuring status. Please wait.
[12:30:25] - Assembly optimizations manually forced on.
[12:30:25] - Not checking prior termination.
[12:30:26] - Expanded 4674846 -> 24111057 (decompressed 515.7 percent)
[12:30:26] - Starting from initial work packet
[12:30:26] 
[12:30:26] Project: 2665 (Run 1, Clone 208, Gen 43)
[12:30:26] 
[12:30:26] Assembly optimizations on if available.
[12:30:26] Entering M.D.
[12:30:32] Rejecting checkpoint
[12:30:33] Protein: IBX in water
[12:30:33] Writing local files
[12:30:34] Extra SSE boost OK.
[12:30:34] Writing local files
[12:30:35] Completed 0 out of 250000 steps  (0 percent)
[12:44:17] Writing local files
[12:44:17] Completed 2500 out of 250000 steps  (1 percent)
[12:58:03] Writing local files
[12:58:03] Completed 5000 out of 250000 steps  (2 percent)
[13:11:52] Writing local files
[13:11:52] Completed 7500 out of 250000 steps  (3 percent)
[13:25:36] Writing local files
[13:25:36] Completed 10000 out of 250000 steps  (4 percent)
[13:39:19] Writing local files
[13:39:19] Completed 12500 out of 250000 steps  (5 percent)

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Wed Aug 27, 2008 1:06 am
by toTOW
I see more than 20 reports for this WU ... all are EUE :(

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Wed Aug 27, 2008 3:48 am
by 314159
:e?:

At least the darn thing doesn't simply cause the client to stop and remain idle until detected.

The thing that burns me is that I have one Quad that has been assigned the identical defective WU on at least 5 occasions. It runs for many hours and then 0xwhatevers and starts again from scratch.

Due to the lack of error trapping (in certain cases), the Powers that Be would be totally unaware of the failure(s) and others have undoubtedly suffered through the same thing.

I know that the SMP processing is extremely difficult to debug.

On the other hand, it seems to me that the code could be modified to revert to the last checkpoint and actually send results back to our friends at Stanford for ALL of the 0x cases for which the cause has not yet been determined. This communication should eliminate multiple assignments of the same WU.

Partial credit would be welcomed by many who fold for points (I fold "In Memory of").

The main benefit would be the elimination of the frustration felt by what I will continue to refer to as the Project's #1 Asset - i.e. its enthusiastic base of VOLUNTEERS. :)

The combined Linux Client is supposed to be a "final" stable release? :?:

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Wed Aug 27, 2008 3:56 am
by Leoslocks
I too fold [inMEMORYof]

I have seen a similar SMP situation where the client hangs or just plain shuts down after an EUE.
Using Vista Ultimate 64, dropping in the 6.22beta2r3 executable cured the 'shut down' after EUE issue.

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Sun Aug 31, 2008 1:09 pm
by ChelseaOilman
I couldn't get anywhere with this WU.

Code:

[12:04:27] Working on queue slot 08 [August 30 12:04:27 UTC]
[12:04:27] + Working ...
[12:04:27] - Calling 'mpiexec -np 4 -channel shm -env MPICH_USE_SMP_OPTIMIZATIONS 1 -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 08 -checkpoint 15 -forceasm -verbose -lifeline 2816 -version 622'

[12:04:28] 
[12:04:28] *------------------------------*
[12:04:28] Folding@Home Gromacs SMP Core
[12:04:28] Version 1.76 (February 23, 2008)
[12:04:28] 
[12:04:28] Preparing to commence simulation
[12:04:28] - Ensuring status. Please wait.
[12:04:45] - Assembly optimizations manually forced on.
[12:04:45] - Not checking prior termination.
[12:05:01] - Expanded 4735055 -> 24426905 (decompressed 515.8 percent)
[12:05:02] - Starting from initial work packet
[12:05:02] 
[12:05:02] Project: 2665 (Run 0, Clone 479, Gen 20)
[12:05:02] 
[12:05:39] Assembly optimizations on if available.
[12:05:39] Entering M.D.
[12:05:46] Rejecting checkpoint
[12:05:48] 
[12:05:48] Writing local files
[12:05:48] 
[12:05:48] Writing local files
[12:05:59] Extra SSE boost OK.
[12:05:59] Writing local files
[12:06:00]  send back what have done.
[12:06:00] logfile size: 9422Gromacs cannot continue further.
[12:06:00] Going to send back what have done.
[12:06:00] logfile size: 9422
[12:06:00] - Writing 9958 bytes of core data to disk...
[12:06:00]   ... Done.
[12:06:00] o delete work/wudata_08.bed
[12:06:00] - Failed to delete work/wudata_08.sas
[12:06:00] - Failed to delete work/wudata_08.goe
[12:06:00] Warning:  check for stray files
[12:06:00] 
[12:06:00] Folding@home Core Shutdown: EARLY_UNIT_END
[12:06:00] Finalizing output
[12:08:07] CoreStatus = 63 (99)
[12:08:07] + Error starting Folding@Home core.
[12:08:12] 
[12:08:12] + Processing work unit
[12:08:12] Work type a1 not eligible for variable processors
[12:08:12] Core required: FahCore_a1.exe
[12:08:12] Core found.
[12:08:12] Working on queue slot 08 [August 30 12:08:12 UTC]
[12:08:12] + Working ...
[12:08:12] - Calling 'mpiexec -np 4 -channel shm -env MPICH_USE_SMP_OPTIMIZATIONS 1 -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 08 -checkpoint 15 -forceasm -verbose -lifeline 2816 -version 622'

[12:08:13] 
[12:08:13] *------------------------------*
[12:08:13] Folding@Home Gromacs SMP Core
[12:08:13] Version 1.76 (February 23, 2008)
[12:08:13] 
[12:08:13] Preparing to commence simulation
[12:08:13] - Ensuring status. Please wait.
[12:08:30] - Assembly optimizations manually forced on.
[12:08:30] - Not checking prior termination.
[12:10:13] SING_WORK_FILES
[12:10:13] Finalizing output
[12:10:30] NG_WORK_FILES
[12:10:30] Finalizing output
[12:10:35] CoreStatus = 1 (1)
[12:10:35] Client-core communications error: ERROR 0x1
[12:10:35] Deleting current work unit & continuing...
[12:12:55] - Warning: Could not delete all work unit files (8): Core returned invalid code
[12:12:55] Trying to send all finished work units
[12:12:55] + No unsent completed units remaining.
[12:12:55] - Preparing to get new work unit...
[12:12:55] + Attempting to get work packet
[12:12:55] - Will indicate memory of 2047 MB
[12:12:55] - Connecting to assignment server
[12:12:55] Connecting to http://assign.stanford.edu:8080/
[12:12:55] Posted data.
[12:12:55] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[12:12:55] + News From Folding@Home: Welcome to Folding@Home
[12:12:55] Loaded queue successfully.
[12:12:55] Connecting to http://171.64.65.64:8080/
[12:13:01] Posted data.
[12:13:01] Initial: 0000; - Receiving payload (expected size: 4735567)
[12:13:10] - Downloaded at ~513 kB/s
[12:13:10] - Averaged speed for that direction ~498 kB/s
[12:13:10] + Received work.
[12:13:12] + Closed connections
[12:13:17] 
[12:13:17] + Processing work unit
[12:13:17] Work type a1 not eligible for variable processors
[12:13:17] Core required: FahCore_a1.exe
[12:13:17] Core found.
[12:13:17] Working on queue slot 09 [August 30 12:13:17 UTC]
[12:13:17] + Working ...
[12:13:17] - Calling 'mpiexec -np 4 -channel shm -env MPICH_USE_SMP_OPTIMIZATIONS 1 -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 09 -checkpoint 15 -forceasm -verbose -lifeline 2816 -version 622'

[12:13:18] 
[12:13:18] *------------------------------*
[12:13:18] Folding@Home Gromacs SMP Core
[12:13:18] Version 1.76 (February 23, 2008)
[12:13:18] 
[12:13:18] Preparing to commence simulation
[12:13:18] - Ensuring status. Please wait.
[12:13:35] - Assembly optimizations manually forced on.
[12:13:35] - Not checking prior termination.
[12:13:50] - Expanded 4735055 -> 24426905 (decompressed 515.8 percent)
[12:13:50] - Starting from initial work packet
[12:13:50] 
[12:13:50] Project: 2665 (Run 0, Clone 479, Gen 20)
[12:13:50] 
[12:14:41] Assembly optimizations on if available.
[12:14:41] Entering M.D.
[12:14:48] Rejecting checkpoint
[12:14:50] Protein: HGG in water
[12:14:50] Writing local files
[12:15:00] Extra SSE boost OK.
[12:15:01] ue further.
[12:15:01] Going to send back what have done.
[12:15:01] logfile size: 9422
[12:15:01] - Writing 9958 bytes of core data to disk...
[12:15:01]   ... Done.
[12:15:01] - Failed to delete work/wudata_09.arc
[12:15:01]  9958 bytes of core data to disk...
[12:15:01]   ... Done.
[12:15:01] o delete work/wudata_09.bed
[12:15:01] - Failed to delete work/wudata_09.sas
[12:15:01] - Failed to delete work/wudata_09.goe
[12:15:01] Warning:  check for stray files
[12:15:01] ck for stray files
[12:15:01] 9.xvg
[12:15:01] Warning:  check for stray files
[12:15:01] 
[12:15:01] Folding@home Core Shutdown: EARLY_UNIT_END
[12:15:01] Finalizing output
[12:17:06] CoreStatus = 63 (99)
[12:17:06] + Error starting Folding@Home core.
[12:17:06] - Attempting to download new core...
[12:17:06] + Downloading new core: FahCore_a1.exe
[12:17:06] Downloading core (/~pande/Win32/x86_Deino/Core_a1.fah from www.stanford.edu)
[12:17:06] Initial: AFDE; + 10240 bytes downloaded
<SNIP>
[12:17:08] Initial: 24B3; + 795847 bytes downloaded
[12:17:08] Verifying core Core_a1.fah...
[12:17:08] Signature is VALID
[12:17:08] 
[12:17:08] Trying to unzip core FahCore_a1.exe
[12:17:12] Decompressed FahCore_a1.exe (2117632 bytes) successfully
[12:17:17] + Core successfully engaged
[12:17:25] 
[12:17:25] + Processing work unit
[12:17:25] Work type a1 not eligible for variable processors
[12:17:25] Core required: FahCore_a1.exe
[12:17:25] Core found.
[12:17:25] Working on queue slot 09 [August 30 12:17:25 UTC]
[12:17:25] + Working ...
[12:17:25] - Calling 'mpiexec -np 4 -channel shm -env MPICH_USE_SMP_OPTIMIZATIONS 1 -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 09 -checkpoint 15 -forceasm -verbose -lifeline 2816 -version 622'

[12:17:26] 
[12:17:26] *------------------------------*
[12:17:26] Folding@Home Gromacs SMP Core
[12:17:26] Version 1.76 (February 23, 2008)
[12:17:26] 
[12:17:26] Preparing to commence simulation
[12:17:26] - Ensuring status. Please wait.
[12:17:43] - Assembly optimizations manually forced on.
[12:17:43] - Not checking prior termination.
[12:19:43] 
[12:19:43] Folding@home Core Shutdown: MISSING_WORK_FILES
[12:19:43] Finalizing output
[12:19:47] CoreStatus = 1 (1)
[12:19:47] Client-core communications error: ERROR 0x1
[12:19:47] Deleting current work unit & continuing...
[12:22:09] - Warning: Could not delete all work unit files (9): Core returned invalid code
[12:22:09] Trying to send all finished work units
[12:22:09] + No unsent completed units remaining.
[12:22:09] - Preparing to get new work unit...
[12:22:09] + Attempting to get work packet
[12:22:09] - Will indicate memory of 2047 MB
[12:22:09] - Connecting to assignment server
[12:22:09] Connecting to http://assign.stanford.edu:8080/
[12:22:10] Posted data.
[12:22:10] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[12:22:10] + News From Folding@Home: Welcome to Folding@Home
[12:22:10] Loaded queue successfully.
[12:22:10] Connecting to http://171.64.65.64:8080/
[12:22:15] Posted data.
[12:22:15] Initial: 0000; - Receiving payload (expected size: 4735567)
[12:22:26] - Downloaded at ~420 kB/s
[12:22:26] - Averaged speed for that direction ~482 kB/s
[12:22:26] + Received work.
[12:22:28] + Closed connections
[12:22:33] 
[12:22:33] + Processing work unit
[12:22:33] Work type a1 not eligible for variable processors
[12:22:33] Core required: FahCore_a1.exe
[12:22:33] Core found.
[12:22:33] Working on queue slot 00 [August 30 12:22:33 UTC]
[12:22:33] + Working ...
[12:22:33] - Calling 'mpiexec -np 4 -channel shm -env MPICH_USE_SMP_OPTIMIZATIONS 1 -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 00 -checkpoint 15 -forceasm -verbose -lifeline 2816 -version 622'

[12:22:33] 
[12:22:33] *------------------------------*
[12:22:33] Folding@Home Gromacs SMP Core
[12:22:33] Version 1.76 (February 23, 2008)
[12:22:33] 
[12:22:33] Preparing to commence simulation
[12:22:33] - Ensuring status. Please wait.
[12:22:50] - Assembly optimizations manually forced on.
[12:22:50] - Not checking prior termination.
[12:23:07] - Expanded 4735055 -> 24426905 (decompressed 515.8 percent)
[12:23:07] - Starting from initial work packet
[12:23:07] 
[12:23:07] Project: 2665 (Run 0, Clone 479, Gen 20)
[12:23:07] 
[12:23:56] Assembly optimizations on if available.
[12:23:56] Entering M.D.
[12:24:03] Rejecting checkpoint
[12:24:05] PWriting local files
[12:24:05] 
[12:24:05] Writing local files
[12:24:15] Extra SSE boost OK.
[12:24:16] ue further.
[12:24:16] Going to send back what have done.
[12:24:16] logfile size: 9421
[12:24:16] - Writing 9957 bytes of core data to disk...
[12:24:16]   ... Done.
[12:24:16] - Failed to delete work/wudata_00.arc
[12:24:16] - Failed to delete work/wudata_00.xtc
[12:24:16] - Failed to delete work/wudata_00.bed
[12:24:16] - Failed to delete work/wudata_00.sas
[12:24:16] - Failed to delete work/wudata_00.goe
[12:24:16] Warning:  check for stray files
[12:24:16] 
[12:24:16] Folding@home Core Shutdown: EARLY_UNIT_END
[12:24:16] Finalizing output
[12:26:29] CoreStatus = 63 (99)
[12:26:29] + Error starting Folding@Home core.
[12:26:34] 
[12:26:34] + Processing work unit
[12:26:34] Work type a1 not eligible for variable processors
[12:26:34] Core required: FahCore_a1.exe
[12:26:34] Core found.
[12:26:34] Working on queue slot 00 [August 30 12:26:34 UTC]
[12:26:34] + Working ...
[12:26:34] - Calling 'mpiexec -np 4 -channel shm -env MPICH_USE_SMP_OPTIMIZATIONS 1 -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 00 -checkpoint 15 -forceasm -verbose -lifeline 2816 -version 622'

[12:26:34] 
[12:26:34] *------------------------------*
[12:26:34] Folding@Home Gromacs SMP Core
[12:26:34] Version 1.76 (February 23, 2008)
[12:26:34] 
[12:26:34] Preparing to commence simulation
[12:26:34] - Ensuring status. Please wait.
[12:26:34] y forced on.
[12:26:34] - Not checking prior termination.
[12:26:34] 
[12:26:34] Folding@home Core Shutdown: MISSING_WORK_FILES
[12:26:34] Finalizing output
[12:28:51] NG_WORK_FILES
[12:28:51] Finalizing output
[12:28:54] CoreStatus = 1 (1)
[12:28:54] Client-core communications error: ERROR 0x1
[12:28:54] - Attempting to download new core...
[12:28:54] + Downloading new core: FahCore_a1.exe
[12:28:54] Downloading core (/~pande/Win32/x86_Deino/Core_a1.fah from www.stanford.edu)
[12:28:55] Initial: AFDE; + 10240 bytes downloaded
<SNIP>
[12:28:56] Initial: 24B3; + 795847 bytes downloaded
[12:28:56] Verifying core Core_a1.fah...
[12:28:56] Signature is VALID
[12:28:56] 
[12:28:56] Trying to unzip core FahCore_a1.exe
[12:28:57] Decompressed FahCore_a1.exe (2117632 bytes) successfully
[12:29:02] + Core successfully engaged
[12:29:03] Deleting current work unit & continuing...
[12:31:25] - Warning: Could not delete all work unit files (0): Core returned invalid code
[12:31:25] Trying to send all finished work units
[12:31:25] + No unsent completed units remaining.
[12:31:25] - Preparing to get new work unit...
[12:31:25] + Attempting to get work packet
[12:31:25] - Will indicate memory of 2047 MB
[12:31:25] - Connecting to assignment server
[12:31:25] Connecting to http://assign.stanford.edu:8080/
[12:31:25] Posted data.
[12:31:25] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[12:31:25] + News From Folding@Home: Welcome to Folding@Home
[12:31:25] Loaded queue successfully.
[12:31:25] Connecting to http://171.64.65.64:8080/
[12:31:26] Posted data.
[12:31:26] Initial: 0000; - Error: Bad packet type from server, expected work assignment
[12:31:26] - Attempt #1  to get work failed, and no other work to do.
Waiting before retry.
[12:31:34] + Attempting to get work packet
[12:31:34] - Will indicate memory of 2047 MB
[12:31:34] - Connecting to assignment server
[12:31:34] Connecting to http://assign.stanford.edu:8080/
[12:31:34] Posted data.
[12:31:34] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[12:31:34] + News From Folding@Home: Welcome to Folding@Home
[12:31:35] Loaded queue successfully.
[12:31:35] Connecting to http://171.64.65.64:8080/
[12:31:42] Posted data.
[12:31:42] Initial: 0000; - Receiving payload (expected size: 4682396)
[12:31:50] - Downloaded at ~571 kB/s
[12:31:50] - Averaged speed for that direction ~500 kB/s
[12:31:50] + Received work.
[12:31:52] + Closed connections
[12:31:57] 
[12:31:57] + Processing work unit
[12:31:57] Work type a1 not eligible for variable processors
[12:31:57] Core required: FahCore_a1.exe
[12:31:57] Core found.
[12:31:57] Working on queue slot 01 [August 30 12:31:57 UTC]
[12:31:57] + Working ...
[12:31:57] - Calling 'mpiexec -np 4 -channel shm -env MPICH_USE_SMP_OPTIMIZATIONS 1 -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 15 -forceasm -verbose -lifeline 2816 -version 622'

[12:31:58] 
[12:31:58] *------------------------------*
[12:31:58] Folding@Home Gromacs SMP Core
[12:31:58] Version 1.76 (February 23, 2008)
[12:31:58] 
[12:31:58] Preparing to commence simulation
[12:31:58] - Ensuring status. Please wait.
[12:32:15] - Assembly optimizations manually forced on.
[12:32:15] - Not checking prior termination.
[12:32:29] - Expanded 4681884 -> 24111057 (decompressed 514.9 percent)
[12:32:29] - Starting from initial work packet
[12:32:29] 
[12:32:29] Project: 2665 (Run 1, Clone 664, Gen 46)
[12:32:29] 
[12:33:14] Assembly optimizations on if available.
[12:33:14] Entering M.D.
[12:33:21] Rejecting checkpoint
[12:33:23] PWriting local files
[12:33:23] 
[12:33:23] Writing local files
[12:33:32] Extra SSE boost OK.
[12:33:33] Writing local files
[12:33:33] Completed 0 out of 250000 steps  (0 percent)
Had to use qfix to upload the three wuresults_0x.dat files.

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Sun Aug 31, 2008 2:21 pm
by 314159
Geesh!

We spend the time and expense of running these defective WUs.
We come up with a 0xwhatever - multiple times each in most cases.
We take the time to report the defective WU here - more than 20 times on this one (per toTOW).

Nothing happens!!! :?

Why has this one not been pulled?
Why are we even wasting our time and effort in reporting these?

The worst news is that I have had a defective one re-assigned after completion of a good WU, only to have to go through the entire process again.

I am a bit miffed. (to say it as politely and mildly as possible) :)

I know that you Forum Mods (and above) have contacts with the Pande folks.
Is there not something that you can do?
Perhaps it's time to bring up the subject in your Mods Forum or whatever you call it here?

People are dropping out of the project like flies!! :e(

Re: Project: 2665 (Run 0, Clone 479, Gen 20) - ERROR 0X0

Posted: Sun Aug 31, 2008 3:18 pm
by 314159
Geesh! (#2) :(

Did I mention that defective WUs are being assigned to the same machine that attempted to complete them previously - and often days later?
What a waste!!

Here is the evidence (at least it is not a failure at frame 99): :)

Code:

[14:38:02] Working on Unit 09 [August 31 14:38:02]
[14:38:02] + Working ...
[14:38:02] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 09 -checkpoint 15 -forceasm -verbose -lifeline 5623 -version 602'

[14:38:02] 
[14:38:02] *------------------------------*
[14:38:02] Folding@Home Gromacs SMP Core
[14:38:02] Version 1.74 (November 27, 2006)
[14:38:02] 
[14:38:02] Preparing to commence simulation
[14:38:02] - Ensuring status. Please wait.
[14:38:19] - Assembly optimizations manually forced on.
[14:38:19] - Not checking prior termination.
[14:38:20] - Expanded 4735055 -> 24426905 (decompressed 515.8 percent)
[14:38:20] - Starting from initial work packet
[14:38:20] 
[14:38:20] Project: 2665 (Run 0, Clone 479, Gen 20)
[14:38:20] 
[14:38:21] Assembly optimizations on if available.
[14:38:21] Entering M.D.
[14:38:27] Rejecting checkpoint
[14:38:28] Protein: HGG in waterExtra SSE boost OK.
[14:38:28] 
[14:38:28] Extra SSE boost OK.
[14:38:29] Writing local files
[14:38:29] Completed 0 out of 250000 steps  (0 percent)
[14:38:29] 
[14:38:29] Folding@home Core Shutdown: INTERRUPTED
[14:38:34] CoreStatus = 0 (0)
[14:38:34] Client-core communications error: ERROR 0x0
[14:38:34] Deleting current work unit & continuing...
[14:42:55] - Warning: Could not delete all work unit files (9): Core returned invalid code
[14:42:55] Trying to send all finished work units
[14:42:55] + No unsent completed units remaining.
[14:42:55] - Preparing to get new work unit...
[14:42:55] + Attempting to get work packet
[14:42:55] - Will indicate memory of 1000 MB
[14:42:55] - Connecting to assignment server
[14:42:55] Connecting to http://assign.stanford.edu:8080/
[14:42:56] Posted data.
[14:42:56] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[14:42:56] + News From Folding@Home: Welcome to Folding@Home
[14:42:56] Loaded queue successfully.
[14:42:56] Connecting to http://171.64.65.64:8080/
[14:43:01] Posted data.
[14:43:01] Initial: 0000; - Receiving payload (expected size: 4735567)
[14:43:04] - Downloaded at ~1541 kB/s
[14:43:04] - Averaged speed for that direction ~1255 kB/s
[14:43:04] + Received work.
[14:43:04] + Closed connections
[14:43:09] 
[14:43:09] + Processing work unit
[14:43:09] Core required: FahCore_a1.exe
[14:43:09] Core found.
[14:43:09] Working on Unit 00 [August 31 14:43:09]
[14:43:09] + Working ...
[14:43:09] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 00 -checkpoint 15 -forceasm -verbose -lifeline 5623 -version 602'

[14:43:09] 
[14:43:09] *------------------------------*
[14:43:09] Folding@Home Gromacs SMP Core
[14:43:09] Version 1.74 (November 27, 2006)
[14:43:09] 
[14:43:09] Preparing to commence simulation
[14:43:09] - Ensuring status. Please wait.
[14:43:26] - Assembly optimizations manually forced on.
[14:43:26] - Not checking prior termination.
[14:43:27] - Expanded 4735055 -> 24426905 (decompressed 515.8 percent)
[14:43:27] - Starting from initial work packet
[14:43:27] 
[14:43:27] Project: 2665 (Run 0, Clone 479, Gen 20)
[14:43:27] 
[14:43:27] Assembly optimizations on if available.
[14:43:27] Entering M.D.
[14:43:33] Rejecting checkpoint
[14:43:34] Protein: HGG in water
[14:43:35] xtra SSE boost OK.
[14:43:35] 
[14:43:35] Extra SSE boost OK.
[14:43:36] Writing local files
[14:43:36] Completed 0 out of 250000 steps  (0 percent)
[14:43:36] 
[14:43:36] Folding@home Core Shutdown: INTERRUPTED
[14:43:40] CoreStatus = 0 (0)
[14:43:40] Client-core communications error: ERROR 0x0
[14:43:40] Deleting current work unit & continuing...
[14:48:01] - Warning: Could not delete all work unit files (0): Core returned invalid code
[14:48:01] Trying to send all finished work units
[14:48:01] + No unsent completed units remaining.
[14:48:01] - Preparing to get new work unit...
[14:48:01] + Attempting to get work packet
[14:48:01] - Will indicate memory of 1000 MB
[14:48:01] - Connecting to assignment server
[14:48:01] Connecting to http://assign.stanford.edu:8080/
[14:48:02] Posted data.
[14:48:02] Initial: 40AB; - Successful: assigned to (171.64.65.64).
[14:48:02] + News From Folding@Home: Welcome to Folding@Home
[14:48:02] Loaded queue successfully.
[14:48:02] Connecting to http://171.64.65.64:8080/
[14:48:07] Posted data.
[14:48:07] Initial: 0000; - Receiving payload (expected size: 4735567)
[14:48:11] - Downloaded at ~1156 kB/s
[14:48:11] - Averaged speed for that direction ~1235 kB/s
[14:48:11] + Received work.
[14:48:11] + Closed connections
[14:48:16] 
[14:48:16] + Processing work unit
[14:48:16] Core required: FahCore_a1.exe
[14:48:16] Core found.
[14:48:16] Working on Unit 01 [August 31 14:48:16]
[14:48:16] + Working ...
[14:48:16] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 15 -forceasm -verbose -lifeline 5623 -version 602'

[14:48:16] 
[14:48:16] *------------------------------*
[14:48:16] Folding@Home Gromacs SMP Core
[14:48:16] Version 1.74 (November 27, 2006)
[14:48:16] 
[14:48:16] Preparing to commence simulation
[14:48:16] - Ensuring status. Please wait.
[14:48:34] - Assembly optimizations manually forced on.
[14:48:34] - Not checking prior termination.
[14:48:35] - Expanded 4735055 -> 24426905 (decompressed 515.8 percent)
[14:48:35] - Starting from initial work packet
[14:48:35] 
[14:48:35] Project: 2665 (Run 0, Clone 479, Gen 20)
[14:48:35] 
[14:48:35] Assembly optimizations on if available.
[14:48:35] Entering M.D.
[14:48:41] Rejecting checkpoint
[14:48:42] Protein: HGG in water
[14:48:42] xtra SSE boost OK.
[14:48:42] 
[14:48:43] Extra SSE boost OK.
[14:48:43] Writing local files
[14:48:43] Completed 0 out of 250000 steps  (0 percent)
[14:48:43] 
[14:48:43] Folding@home Core Shutdown: INTERRUPTED
[14:48:47] CoreStatus = 0 (0)
[14:48:47] Client-core communications error: ERROR 0x0
Would one of you Mods please change the topic of this thread to Project: 2665 (Run 0, Clone 479, Gen 20) - ERROR 0X0 if this post did not accomplish that.
Thanks!

Can we assign this one to the waste basket? :?: :?
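
For anyone who wants to check their own log for the same pattern (one WU failing again and again), here is a minimal sketch in Python. It assumes the log lines look like the ones quoted in this thread ("Project: ... (Run ..., Clone ..., Gen ...)" followed later by "CoreStatus = ..."). The function name `count_failures` and the default set of failure codes (0, 1 and 63, the only ones shown above) are my own assumptions, not anything taken from the client.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread.
WU_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")

# Assumption: only the codes seen in this thread are treated as failures.
DEFAULT_FAILURE_CODES = frozenset({"0", "1", "63"})


def count_failures(log_lines, failure_codes=DEFAULT_FAILURE_CODES):
    """Count failing CoreStatus reports per work unit.

    Each CoreStatus line is attributed to the most recent
    'Project: ...' line seen before it in the log.
    Returns a Counter keyed by (project, run, clone, gen).
    """
    failures = Counter()
    current_wu = None
    for line in log_lines:
        wu = WU_RE.search(line)
        if wu:
            current_wu = tuple(int(g) for g in wu.groups())
            continue
        status = STATUS_RE.search(line)
        if status and current_wu is not None and status.group(1) in failure_codes:
            failures[current_wu] += 1
    return failures


sample = [
    "[14:38:20] Project: 2665 (Run 0, Clone 479, Gen 20)",
    "[14:38:34] CoreStatus = 0 (0)",
    "[14:43:27] Project: 2665 (Run 0, Clone 479, Gen 20)",
    "[14:43:40] CoreStatus = 0 (0)",
]
result = count_failures(sample)
for wu, n in result.items():
    print(wu, "failed", n, "time(s)")
```

To run it against a real log, pass the open file instead of the sample list, e.g. `count_failures(open("FAHlog.txt"))`, assuming your client writes its log to that name. Any WU with a count above 1 was handed back to the same machine after failing.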

Re: Project: 2665 (Run 0, Clone 479, Gen 20) - ERROR 0X0

Posted: Sun Aug 31, 2008 3:50 pm
by ChelseaOilman
314159 wrote:Did I mention that defective WUs are being assigned to the same machine that attempted to complete them previously - and often days later?
Uploading the wuresults_0x.dat files will decrease the chance you get these bad WUs reassigned to you. You'll probably need to use qfix like I did. I checked the WU database for this WU; you're not listed.

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Sun Aug 31, 2008 5:18 pm
by 314159
No wuresults_x.dat file is generated in these cases - at least with the Linux Client.

I think that they need to kill this WU and re-run Run 0, Clone 479, Gen 19.

In any event, my expectation is that I should NOT have to watch the machines on my "farm" closely.
They should be "fed" WUs that are reasonably stable AND do not cause client hangs (if the WUs are the cause of this).
Client/Cores should be properly coded so that ALL errors are reported to Pande and partial credit awarded.
Multiple assignments of defective WUs to the same machine should be eliminated immediately.

That said I understand a few things, namely:

1. Errors generated by this genre of code can be extremely difficult to troubleshoot.
2. Apparently, the Linux Client is FAR down in the Project's priorities - quite subordinate to the GPU and WIN SMP work.
3. I am not particularly pleased with the release of what is purported to be a "final combined client"/core(s) with this many bugs in it.
IMHO it is still a "beta" release.

4. I greatly appreciate the efforts of those people at Pande who are trying to stabilize things.
I also appreciate their responsiveness.

Fold on!

John

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Sun Aug 31, 2008 9:43 pm
by Baowoulf
314159 wrote:3. I am not particularly pleased with the release of what is purported to be a "final combined client"/core(s) with this many bugs in it.
IMHO it is still a "beta" release.
I thought SMP was still in beta? And that even the CPU version 6.22, the one that lets you switch back and forth between the CPU and SMP clients, was also in beta, and was the only version 6 client in beta?

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Sun Aug 31, 2008 9:57 pm
by 314159
Linux
Linux (x86) and BSD *combined uniprocessor and SMP client* (64-bit required for SMP) 6.02

No "expiration date" that I know of and not labeled "beta" (as before).

You may, of course, be right. I am easily and gracefully correctable. :)

Are we talking about the same client?

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Sun Aug 31, 2008 10:07 pm
by Baowoulf
Turns out we're both right. The Windows SMP is still in beta but not the Linux one.

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Mon Sep 01, 2008 12:45 am
by 314159
:wink:

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Mon Sep 01, 2008 4:43 am
by arfyness
Have y'all tried running one instance per core? I ask cause I'm curious whether the Linux 6.02 client will work that way. That's what I do on my mom's winxp dual core anyway, with the 6.20 console/service version. It works fine. But that's in Windows, which sadly still seems to have higher priority. I chose that route to avoid possible beta SMP baloney. Besides, the CPU runs to its potential, which it didn't with the Windows SMP version. (It's also sad that I built my mom a computer way better than mine in every way when her most intensive tasks are email and freecell.) :e?:

But this is about Linux. I'm using the same (current Linux version) 6.02 on a uniprocessor (AthlonXP 3200+) and I'm having some similar trouble of my own. So maybe it's related?

I agree that it's a terrible thing to disenfranchise the user base, since this project clearly would be NOWHERE without a user base. On the other hand, I think the average Linux user tends to be a bit more patient with the process of working bugs out than the average Windows user. :lol:

The crash reporting process, on the other hand, should DEFINITELY be built into the client / server communication structure, whether directly to the Pande group, or via assignment servers. This seems to me a fairly obvious concept which still escapes the attention it deserves. C'mon, I even have a bittorrent client (Miro) that calls home with crash-report details.
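
Until something like that exists, the details a crash report would need can be pulled out of the client log by hand or by script. Below is a rough sketch, assuming the log lines look like the ones pasted in this thread. The function name `find_core_errors` and the record layout are made up for illustration; nothing here is part of the client, and the sketch does not send anything anywhere.

```python
import re

# Patterns modeled on the log lines quoted in this thread (assumed format).
WU_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
ERR_RE = re.compile(r"Client-core communications error: (.+)$")
TIME_RE = re.compile(r"^\[(\d\d:\d\d:\d\d)\]")


def find_core_errors(log_text):
    """Return one record per client-core error, tagged with the last WU seen."""
    current_wu = None
    reports = []
    for line in log_text.splitlines():
        wu = WU_RE.search(line)
        if wu:
            # Remember the most recent Project/Run/Clone/Gen line.
            current_wu = tuple(int(g) for g in wu.groups())
            continue
        err = ERR_RE.search(line)
        if err:
            stamp = TIME_RE.match(line)
            reports.append({
                "time": stamp.group(1) if stamp else None,
                "wu": current_wu,
                "error": err.group(1).strip(),
            })
    return reports


sample = """\
[14:56:40] Project: 2665 (Run 0, Clone 828, Gen 47)
[14:56:41] Entering M.D.
[14:58:16] Finalizing output
[14:58:20] CoreStatus = 0 (0)
[14:58:20] Client-core communications error: ERROR 0x0
"""
print(find_core_errors(sample))
# [{'time': '14:58:20', 'wu': (2665, 0, 828, 47), 'error': 'ERROR 0x0'}]
```

Each record carries the timestamp, the WU identifiers, and the error text, which is roughly what gets posted by hand in threads like this one.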

Next time it crashes, I'll try this qfix thing you guys are talking about.
(And it likely might; I'm still on the same work unit - Project: 781 (Run 0, Clone 83, Gen 2).) :x

-- Nate

Re: Project: 2665 (Run 0, Clone 479, Gen 20)

Posted: Mon Sep 01, 2008 2:16 pm
by crapiecorn

Code: Select all

Launch directory: /home/folding/folding
Executable: ./fah6
Arguments: -smp 

[14:56:06] - Ask before connecting: No
[14:56:06] - User name: StrikeTeam (Team 34517)
[14:56:06] - User ID: 6D2FDEB400C80E92
[14:56:06] - Machine ID: 2
[14:56:06] 
[14:56:06] Work directory not found. Creating...
[14:56:06] Could not open work queue, generating new queue...
[14:56:06] - Preparing to get new work unit...
[14:56:06] + Attempting to get work packet
[14:56:06] - Connecting to assignment server
[14:56:07] - Successful: assigned to (171.64.65.64).
[14:56:07] + News From Folding@Home: Welcome to Folding@Home
[14:56:07] Loaded queue successfully.
[14:56:39] + Closed connections
[14:56:39] 
[14:56:39] + Processing work unit
[14:56:39] Core required: FahCore_a1.exe
[14:56:39] Core found.
[14:56:39] Working on Unit 01 [September 1 14:56:39]
[14:56:39] + Working ...
[14:56:39] 
[14:56:39] *------------------------------*
[14:56:39] Folding@Home Gromacs SMP Core
[14:56:39] Version 1.74 (November 27, 2006)
[14:56:39] 
[14:56:39] Preparing to commence simulation
[14:56:39] - Ensuring status. Please wait.
[14:56:40] - Starting from initial work packet
[14:56:40] 
[14:56:40] Project: 2665 (Run 0, Clone 828, Gen 47)
[14:56:40] 
[14:56:41] Assembly optimizations on if available.
[14:56:41] Entering M.D.
[14:56:57]  percent)
[14:56:57] - Starting from initial work packet
[14:56:57] 
[14:56:57] Project: 2665 (Run 0, Clone 828, Gen 47)
[14:56:57] 
[14:56:57] Entering M.D.
[14:57:05] Protein: HGG in water
[14:57:05] Writing local files
[14:57:09] Extra SSE boost OK.
[14:58:16] Finalizing output
[14:58:20] CoreStatus = 0 (0)
[14:58:20] Client-core communications error: ERROR 0x0
[14:58:20] Deleting current work unit & continuing...

Folding@Home Client Shutdown.