Collection server dumping WUs

uncle fuzzy
Posts: 460
Joined: Sun Dec 02, 2007 10:15 pm
Location: Michigan

Post by uncle fuzzy »

One of my teammates reported this problem. He completed a WU and had the collection server dump it, then successfully uploaded the next one, and then had the server dump a third WU. All are P7903.

14:26:30:WU01:FS00:0xa4:Completed 2400000 out of 2500000 steps (96%)
14:30:35:WU01:FS00:0xa4:Completed 2425000 out of 2500000 steps (97%)
14:34:40:WU01:FS00:0xa4:Completed 2450000 out of 2500000 steps (98%)
14:38:46:WU01:FS00:0xa4:Completed 2475000 out of 2500000 steps (99%)
14:38:47:WU00:FS00:Connecting to assign3.stanford.edu:8080
14:38:47:WU00:FS00:News: Welcome to Folding@Home
14:38:47:WU00:FS00:Assigned to work server 128.113.12.161
14:38:47:WU00:FS00:Requesting new work unit for slot 00: RUNNING smp:8 from 128.113.12.161
14:38:47:WU00:FS00:Connecting to 128.113.12.161:8080
14:38:48:WU00:FS00:Downloading 646.38KiB
14:38:48:WU00:FS00:Download complete
14:38:48:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:OK project:7903 run:210 clone:9 gen:24 core:0xa4 unit:0x0000001900ac9c214eca68d81525fe45
14:42:51:WU01:FS00:0xa4:Completed 2500000 out of 2500000 steps (100%)
14:42:52:WU01:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
14:43:02:WU01:FS00:0xa4:
14:43:02:WU01:FS00:0xa4:Finished Work Unit:
14:43:02:WU01:FS00:0xa4:- Reading up to 35910936 from "01/wudata_01.trr": Read 35910936
14:43:02:WU01:FS00:0xa4:trr file hash check passed.
14:43:02:WU01:FS00:0xa4:edr file hash check passed.
14:43:02:WU01:FS00:0xa4:logfile size: 56875
14:43:02:WU01:FS00:0xa4:Leaving Run
14:43:05:WU01:FS00:0xa4:- Writing 35997727 bytes of core data to disk...
14:43:10:WU01:FS00:0xa4:Done: 35997215 -> 30222790 (compressed to 83.9 percent)
14:43:11:WU01:FS00:0xa4: ... Done.
14:43:14:WU01:FS00:0xa4:- Shutting down core
14:43:14:WU01:FS00:0xa4:
14:43:14:WU01:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
14:43:14:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
14:43:14:WU01:FS00:Sending unit results: id:01 state:SEND error:OK project:7903 run:216 clone:8 gen:16 core:0xa4 unit:0x0000001400ac9c214eca68e095e47aac
14:43:14:WU01:FS00:Uploading 28.82MiB to 128.113.12.161
14:43:14:WU01:FS00:Connecting to 128.113.12.161:8080
14:43:14:WU00:FS00:Starting
14:43:14:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" "C:/Users/User/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe" -dir 00 -suffix 01 -version 701 -lifeline 1312 -checkpoint 15 -np 8
14:43:14:WU00:FS00:Started FahCore on PID 1340
14:43:14:WU00:FS00:Core PID:3120
14:43:14:WU00:FS00:FahCore 0xa4 started
14:43:15:WU00:FS00:0xa4:
14:43:15:WU00:FS00:0xa4:*------------------------------*
14:43:15:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
14:43:15:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
14:43:15:WU00:FS00:0xa4:
14:43:15:WU00:FS00:0xa4:Preparing to commence simulation
14:43:15:WU00:FS00:0xa4:- Looking at optimizations...
14:43:15:WU00:FS00:0xa4:- Created dyn
14:43:15:WU00:FS00:0xa4:- Files status OK
14:43:15:WU00:FS00:0xa4:- Expanded 661380 -> 1008860 (decompressed 152.5 percent)
14:43:15:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=661380 data_size=1008860, decompressed_data_size=1008860 diff=0
14:43:15:WU00:FS00:0xa4:- Digital signature verified
14:43:15:WU00:FS00:0xa4:
14:43:15:WU00:FS00:0xa4:Project: 7903 (Run 210, Clone 9, Gen 24)
14:43:15:WU00:FS00:0xa4:
14:43:15:WU00:FS00:0xa4:Assembly optimizations on if available.
14:43:15:WU00:FS00:0xa4:Entering M.D.
14:43:21:WU00:FS00:0xa4:Mapping NT from 8 to 8
14:43:21:WU00:FS00:0xa4:Completed 0 out of 2500000 steps (0%)
14:43:57:WU01:FS00:Upload 76.98%
14:43:57:WARNING:WU01:FS00:Exception: Failed to send results to work server: Transfer failed
14:43:57:WU01:FS00:Trying to send results to collection server
14:43:57:WU01:FS00:Uploading 28.82MiB to 129.74.85.16
14:43:57:WU01:FS00:Connecting to 129.74.85.16:8080
14:44:03:WU01:FS00:Upload 30.79%
14:44:09:WU01:FS00:Upload 62.45%
14:44:15:WU01:FS00:Upload 94.54%
14:44:16:WU01:FS00:Upload complete
14:44:16:WU01:FS00:Server responded WORK_QUIT (404)
14:44:16:WARNING:WU01:FS00:Server did not like results, dumping
14:44:16:WU01:FS00:Cleaning up
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Re: Collection server dumping WUs

Post by codysluder »

As a general rule, it's a good idea to report the first error, not the second one. The Work Server failed before the Collection Server did. I would guess that the WU is corrupt.

Why else would it say the following?

14:43:57:WU01:FS00:Upload 76.98%
14:43:57:WARNING:WU01:FS00:Exception: Failed to send results to work server: Transfer failed
uncle fuzzy
Posts: 460
Joined: Sun Dec 02, 2007 10:15 pm
Location: Michigan

Re: Collection server dumping WUs

Post by uncle fuzzy »

I'm still running v6, so the wording didn't strike me as significant. I've often had WUs fail to go to the work server and end up going to the collection server.

Would you see the same wording if there was a problem with the WS and you had to go to the CS, or does this definitely indicate a problem with the WU/data/upload?
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Re: Collection server dumping WUs

Post by codysluder »

A bad result can be rejected by both a WS and a CS, and that looks like what happened here, though I'm not certain.

A good WU can fail to go to a WS because the WS is down and then successfully go to the CS. The message about 76.98% does prove the WS accepted part of the WU, which implies the WS was not down.

It's clearer if you look only at WU01 and ignore the messages about WU00.

    14:26:30:WU01:FS00:0xa4:Completed 2400000 out of 2500000 steps (96%)
    14:30:35:WU01:FS00:0xa4:Completed 2425000 out of 2500000 steps (97%)
    14:34:40:WU01:FS00:0xa4:Completed 2450000 out of 2500000 steps (98%)
    14:38:46:WU01:FS00:0xa4:Completed 2475000 out of 2500000 steps (99%)
    14:42:51:WU01:FS00:0xa4:Completed 2500000 out of 2500000 steps (100%)
    14:42:52:WU01:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
    14:43:02:WU01:FS00:0xa4:
    14:43:02:WU01:FS00:0xa4:Finished Work Unit:
    14:43:02:WU01:FS00:0xa4:- Reading up to 35910936 from "01/wudata_01.trr": Read 35910936
    14:43:02:WU01:FS00:0xa4:trr file hash check passed.
    14:43:02:WU01:FS00:0xa4:edr file hash check passed.
    14:43:02:WU01:FS00:0xa4:logfile size: 56875
    14:43:02:WU01:FS00:0xa4:Leaving Run
    14:43:05:WU01:FS00:0xa4:- Writing 35997727 bytes of core data to disk...
    14:43:10:WU01:FS00:0xa4:Done: 35997215 -> 30222790 (compressed to 83.9 percent)
    14:43:11:WU01:FS00:0xa4: ... Done.
    14:43:14:WU01:FS00:0xa4:- Shutting down core
    14:43:14:WU01:FS00:0xa4:
    14:43:14:WU01:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
    14:43:14:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
    14:43:14:WU01:FS00:Sending unit results: id:01 state:SEND error:OK project:7903 run:216 clone:8 gen:16 core:0xa4 unit:0x0000001400ac9c214eca68e095e47aac
    14:43:14:WU01:FS00:Uploading 28.82MiB to 128.113.12.161
    14:43:14:WU01:FS00:Connecting to 128.113.12.161:8080
    14:43:57:WU01:FS00:Upload 76.98%
    14:43:57:WARNING:WU01:FS00:Exception: Failed to send results to work server: Transfer failed
    14:43:57:WU01:FS00:Trying to send results to collection server
    14:43:57:WU01:FS00:Uploading 28.82MiB to 129.74.85.16
    14:43:57:WU01:FS00:Connecting to 129.74.85.16:8080
    14:44:03:WU01:FS00:Upload 30.79%
    14:44:09:WU01:FS00:Upload 62.45%
    14:44:15:WU01:FS00:Upload 94.54%
    14:44:16:WU01:FS00:Upload complete
    14:44:16:WU01:FS00:Server responded WORK_QUIT (404)
    14:44:16:WARNING:WU01:FS00:Server did not like results, dumping
    14:44:16:WU01:FS00:Cleaning up
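
For anyone who wants to do the same filtering on their own log, here is a minimal Python sketch. It assumes v7-style lines shaped like HH:MM:SS, an optional WARNING or ERROR tag, then the WUnn slot tag, as in the excerpts in this thread; the function name lines_for_wu is made up for illustration and is not part of FAHClient.

```python
import re

# Assumed line shape (taken from the excerpts in this thread):
#   HH:MM:SS[:WARNING|:ERROR]:WUnn:FSnn:message
WU_LINE = re.compile(r"^\s*\d{2}:\d{2}:\d{2}:(?:(?:WARNING|ERROR):)?(WU\d{2}):")


def lines_for_wu(log_text, wu):
    """Return only the log lines tagged with the given WU slot, e.g. 'WU01'."""
    kept = []
    for line in log_text.splitlines():
        match = WU_LINE.match(line)
        if match and match.group(1) == wu:
            kept.append(line.strip())
    return kept


# A few lines copied from the log earlier in the thread.
sample = """\
14:38:46:WU01:FS00:0xa4:Completed 2475000 out of 2500000 steps (99%)
14:38:47:WU00:FS00:Connecting to assign3.stanford.edu:8080
14:43:57:WU01:FS00:Upload 76.98%
14:43:57:WARNING:WU01:FS00:Exception: Failed to send results to work server: Transfer failed
14:44:16:WU01:FS00:Server responded WORK_QUIT (404)
"""

for line in lines_for_wu(sample, "WU01"):
    print(line)
```

The optional WARNING/ERROR group matters: without it, the "Transfer failed" line would be filtered out along with the WU00 noise.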
Toccatta
Posts: 1
Joined: Sat Jun 09, 2012 1:41 pm

Re: Collection server dumping WUs

Post by Toccatta »

And now another of his teammates is having a similar problem, but with project 7000.

06:55:37:WU00:FS00:0xa4:Completed 9600000 out of 10000000 steps  (96%)
07:02:12:WU00:FS00:0xa4:Completed 9700000 out of 10000000 steps  (97%)
07:08:47:WU00:FS00:0xa4:Completed 9800000 out of 10000000 steps  (98%)
07:15:23:WU00:FS00:0xa4:Completed 9900000 out of 10000000 steps  (99%)
07:21:57:WU00:FS00:0xa4:Completed 10000000 out of 10000000 steps  (100%)
07:21:58:WU00:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
07:22:08:WU00:FS00:0xa4:
07:22:08:WU00:FS00:0xa4:Finished Work Unit:
07:22:08:WU00:FS00:0xa4:- Reading up to 2128272 from "00/wudata_01.trr": Read 2128272
07:22:08:WU00:FS00:0xa4:trr file hash check passed.
07:22:08:WU00:FS00:0xa4:- Reading up to 221796 from "00/wudata_01.xtc": Read 221796
07:22:08:WU00:FS00:0xa4:xtc file hash check passed.
07:22:08:WU00:FS00:0xa4:edr file hash check passed.
07:22:08:WU00:FS00:0xa4:logfile size: 81409
07:22:08:WU00:FS00:0xa4:Leaving Run
07:22:09:WU00:FS00:0xa4:- Writing 2455949 bytes of core data to disk...
07:22:10:WU00:FS00:0xa4:Done: 2455437 -> 1862197 (compressed to 75.8 percent)
07:22:10:WU00:FS00:0xa4:  ... Done.
07:22:10:WU00:FS00:0xa4:- Shutting down core
07:22:10:WU00:FS00:0xa4:
07:22:10:WU00:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
07:22:11:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
07:22:11:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:7000 run:2 clone:4 gen:94 core:0xa4 unit:0x000000d00001329c4dfb826e99a01e6a
07:22:11:WU00:FS00:Uploading 1.78MiB to 129.74.85.15
07:22:11:WU00:FS00:Connecting to 129.74.85.15:8080
07:22:51:WU00:FS00:Upload 21.11%
07:22:51:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
07:22:51:WU00:FS00:Trying to send results to collection server
07:22:51:WU00:FS00:Uploading 1.78MiB to 129.74.85.16
07:22:51:WU00:FS00:Connecting to 129.74.85.16:8080
07:22:56:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:22:56:WU00:FS00:Connecting to 129.74.85.16:80
07:23:03:ERROR:WU00:FS00:Exception: Failed to connect to 129.74.85.16:80: No connection could be made because the target machine actively refused it.
07:23:03:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:7000 run:2 clone:4 gen:94 core:0xa4 unit:0x000000d00001329c4dfb826e99a01e6a
07:23:03:WU00:FS00:Uploading 1.78MiB to 129.74.85.15
07:23:03:WU00:FS00:Connecting to 129.74.85.15:8080
07:23:09:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:23:09:WU00:FS00:Connecting to 129.74.85.15:80
07:23:16:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 129.74.85.15:80: No connection could be made because the target machine actively refused it.
07:23:16:WU00:FS00:Trying to send results to collection server
07:23:16:WU00:FS00:Uploading 1.78MiB to 129.74.85.16
07:23:16:WU00:FS00:Connecting to 129.74.85.16:8080
07:23:23:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:23:23:WU00:FS00:Connecting to 129.74.85.16:80
07:23:29:ERROR:WU00:FS00:Exception: Failed to connect to 129.74.85.16:80: No connection could be made because the target machine actively refused it.
07:24:03:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:7000 run:2 clone:4 gen:94 core:0xa4 unit:0x000000d00001329c4dfb826e99a01e6a
07:24:03:WU00:FS00:Uploading 1.78MiB to 129.74.85.15
07:24:03:WU00:FS00:Connecting to 129.74.85.15:8080
07:24:09:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:24:09:WU00:FS00:Connecting to 129.74.85.15:80
07:24:15:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 129.74.85.15:80: No connection could be made because the target machine actively refused it.
07:24:15:WU00:FS00:Trying to send results to collection server
07:24:15:WU00:FS00:Uploading 1.78MiB to 129.74.85.16
07:24:15:WU00:FS00:Connecting to 129.74.85.16:8080
07:24:22:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:24:22:WU00:FS00:Connecting to 129.74.85.16:80
07:24:29:ERROR:WU00:FS00:Exception: Failed to connect to 129.74.85.16:80: No connection could be made because the target machine actively refused it.
07:25:41:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:7000 run:2 clone:4 gen:94 core:0xa4 unit:0x000000d00001329c4dfb826e99a01e6a
07:25:41:WU00:FS00:Uploading 1.78MiB to 129.74.85.15
07:25:41:WU00:FS00:Connecting to 129.74.85.15:8080
07:25:47:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:25:47:WU00:FS00:Connecting to 129.74.85.15:80
07:25:53:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 129.74.85.15:80: No connection could be made because the target machine actively refused it.
07:25:53:WU00:FS00:Trying to send results to collection server
07:25:53:WU00:FS00:Uploading 1.78MiB to 129.74.85.16
07:25:53:WU00:FS00:Connecting to 129.74.85.16:8080
07:26:00:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:26:00:WU00:FS00:Connecting to 129.74.85.16:80
07:26:07:ERROR:WU00:FS00:Exception: Failed to connect to 129.74.85.16:80: No connection could be made because the target machine actively refused it.
07:28:18:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:7000 run:2 clone:4 gen:94 core:0xa4 unit:0x000000d00001329c4dfb826e99a01e6a
07:28:18:WU00:FS00:Uploading 1.78MiB to 129.74.85.15
07:28:18:WU00:FS00:Connecting to 129.74.85.15:8080
07:28:24:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:28:24:WU00:FS00:Connecting to 129.74.85.15:80
07:28:31:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 129.74.85.15:80: No connection could be made because the target machine actively refused it.
07:28:31:WU00:FS00:Trying to send results to collection server
07:28:31:WU00:FS00:Uploading 1.78MiB to 129.74.85.16
07:28:31:WU00:FS00:Connecting to 129.74.85.16:8080
07:28:37:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:28:37:WU00:FS00:Connecting to 129.74.85.16:80
07:28:44:ERROR:WU00:FS00:Exception: Failed to connect to 129.74.85.16:80: No connection could be made because the target machine actively refused it.
07:32:32:WU00:FS00:Sending unit results: id:00 state:SEND error:OK project:7000 run:2 clone:4 gen:94 core:0xa4 unit:0x000000d00001329c4dfb826e99a01e6a
07:32:32:WU00:FS00:Uploading 1.78MiB to 129.74.85.15
07:32:32:WU00:FS00:Connecting to 129.74.85.15:8080
07:32:38:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
07:32:38:WU00:FS00:Connecting to 129.74.85.15:80
07:33:19:WU00:FS00:Upload 3.52%
07:33:19:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
07:33:19:WU00:FS00:Trying to send results to collection server
07:33:19:WU00:FS00:Uploading 1.78MiB to 129.74.85.16
07:33:19:WU00:FS00:Connecting to 129.74.85.16:8080
07:33:25:WU00:FS00:Upload 84.44%
07:33:26:WU00:FS00:Upload complete
07:33:26:WU00:FS00:Server responded WORK_QUIT (404)
07:33:26:WARNING:WU00:FS00:Server did not like results, dumping
07:33:26:WU00:FS00:Cleaning up

The number of refused connections to both the work server and the collection server seems to imply network connectivity issues. It's hard to believe they're entirely coincidental and that the problem is actually a corrupted WU.

[edit] Oh, by the way: SMP core, Windows XP, Client v7.1.52, Q6600, 3 GB RAM, no overclock
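
To put numbers on how many attempts were refused outright versus dropped mid-transfer, something like the following Python sketch could tally the outcome of each upload attempt per server. It is only a rough illustration: it keys on message fragments seen in the log above ("Uploading ... to", "actively refused", "Transfer failed", "Server responded"), the "actively refused" wording is assumed to be the Windows connection-refused message, and tally_upload_outcomes is a hypothetical helper, not anything from the client.

```python
import re
from collections import Counter

# Matches the start of an upload attempt and captures the target IP.
UPLOADING = re.compile(r"Uploading \S+ to (\d+\.\d+\.\d+\.\d+)")


def tally_upload_outcomes(log_text):
    """Count (server, outcome) pairs, one per upload attempt in the log text."""
    outcomes = Counter()
    server = None  # server of the attempt currently in progress, if any
    for line in log_text.splitlines():
        started = UPLOADING.search(line)
        if started:
            server = started.group(1)
        elif server is None:
            continue
        elif "actively refused" in line:
            outcomes[(server, "connection refused")] += 1
            server = None
        elif "Transfer failed" in line:
            outcomes[(server, "transfer failed")] += 1
            server = None
        elif "Server responded" in line:
            response = line.split("Server responded ", 1)[1].strip()
            outcomes[(server, response)] += 1
            server = None
    return outcomes


# Abridged excerpt: lines copied from the log above, with repeats left out.
sample = """\
07:22:11:WU00:FS00:Uploading 1.78MiB to 129.74.85.15
07:22:51:WARNING:WU00:FS00:Exception: Failed to send results to work server: Transfer failed
07:22:51:WU00:FS00:Uploading 1.78MiB to 129.74.85.16
07:23:03:ERROR:WU00:FS00:Exception: Failed to connect to 129.74.85.16:80: No connection could be made because the target machine actively refused it.
07:23:03:WU00:FS00:Uploading 1.78MiB to 129.74.85.15
07:23:16:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 129.74.85.15:80: No connection could be made because the target machine actively refused it.
07:33:19:WU00:FS00:Uploading 1.78MiB to 129.74.85.16
07:33:26:WU00:FS00:Server responded WORK_QUIT (404)
"""

for (server, outcome), count in sorted(tally_upload_outcomes(sample).items()):
    print(server, outcome, count)
```

A refused connection means the server never saw any data, while "Transfer failed" after a partial upload and WORK_QUIT after a complete one both mean the server was reachable, so keeping the three apart helps separate a connectivity problem from a rejected result.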