Failed unit after 1 long week... Project: 10722 (Run 0, Clon

Posted: Tue Feb 07, 2012 9:23 am
by new08
Project: 10722 (Run 0, Clone 3171, Gen 1) took over a week to do and was worth 5000 points or so.
When it finished it had a problem and gave up! Was this an A4 or GB core- did they get crossed?
2 hours per step is the longest I've seen- on the 'not that slow' Duo e6600 core- going flat out.

It is overclocked, but otherwise stable with no issues on the 'Panther check list' occurring- and folding on the GPU, and both CPU cores now OK.
The unit crashed a few times setting up the new cores- but always carried on working without any warnings or variation in step times.
Disappointing if lost as a result.. but maybe worth reporting here.

NB: Missed from log: Client- Windows CPU Systray Edition Folding@Home Client Version 6.23

[22:17:30] - Machine ID: 3
[22:17:30] 
[22:17:30] Loaded queue successfully.
[22:17:30] Initialization complete
[22:17:30] 
[22:17:30] + Processing work unit
[22:17:30] Core required: FahCore_a4.exe
[22:17:30] Core found.
[22:17:30] Working on queue slot 00 [February 2 22:17:30 UTC]
[22:17:30] + Working ...
[22:17:34] 
[22:17:34] *------------------------------*
[22:17:34] Folding@Home Gromacs GB Core
[22:17:34] Version 2.27 (Dec. 15, 2010)
[22:17:35] 
[22:17:35] Preparing to commence simulation
[22:17:35] - Ensuring status. Please wait.
[22:17:44] - Looking at optimizations...
[22:17:44] - Working with standard loops on this execution.
[22:17:44] Examination of work files indicates 8 consecutive improper terminations of core.
[22:17:44] - Expanded 271142 -> 354128 (decompressed 130.6 percent)
[22:17:44] Called DecompressByteArray: compressed_data_size=271142 data_size=354128, decompressed_data_size=354128 diff=0
[22:17:44] - Digital signature verified
[22:17:44] 
[22:17:44] Project: 10722 (Run 0, Clone 3171, Gen 1)
[22:17:44] 
[22:17:44] Entering M.D.
[22:17:50] Using Gromacs checkpoints
[22:17:50] Mapping NT from 1 to 1 
[22:17:51] Resuming from checkpoint
[22:17:51] Verified work/wudata_00.log
[22:17:52] Verified work/wudata_00.trr
[22:17:52] Verified work/wudata_00.xtc
[22:17:52] Verified work/wudata_00.edr
[22:17:52] Completed 3147040 out of 7000000 steps  (44%)
[22:22:52] Completed 3150000 out of 7000000 steps  (45%)
[00:16:43] Completed 3220000 out of 7000000 steps  (46%)
[02:00:35] Completed 3290000 out of 7000000 steps  (47%)
[03:44:34] Completed 3360000 out of 7000000 steps  (48%)
[04:17:30] + Working...
[05:30:53] Completed 3430000 out of 7000000 steps  (49%)
[07:17:47] Completed 3500000 out of 7000000 steps  (50%)
[09:04:39] Completed 3570000 out of 7000000 steps  (51%)
[10:17:30] + Working...
[10:55:48] Completed 3640000 out of 7000000 steps  (52%)
[12:49:58] Completed 3710000 out of 7000000 steps  (53%)
[14:36:18] Completed 3780000 out of 7000000 steps  (54%)
[16:17:30] + Working...
[16:23:32] Completed 3850000 out of 7000000 steps  (55%)
[18:12:23] Completed 3920000 out of 7000000 steps  (56%)
[20:08:31] Completed 3990000 out of 7000000 steps  (57%)
[22:03:51] Completed 4060000 out of 7000000 steps  (58%)
[22:17:30] + Working...
[23:50:04] Completed 4130000 out of 7000000 steps  (59%)
[01:36:14] Completed 4200000 out of 7000000 steps  (60%)
[03:23:46] Completed 4270000 out of 7000000 steps  (61%)
[04:17:30] + Working...
[05:09:38] Completed 4340000 out of 7000000 steps  (62%)
[06:55:29] Completed 4410000 out of 7000000 steps  (63%)
[08:43:21] Completed 4480000 out of 7000000 steps  (64%)
[10:17:30] + Working...
[10:32:24] Completed 4550000 out of 7000000 steps  (65%)
[12:19:13] Completed 4620000 out of 7000000 steps  (66%)
[14:07:14] Completed 4690000 out of 7000000 steps  (67%)
[15:53:50] Completed 4760000 out of 7000000 steps  (68%)
[16:17:30] + Working...
[17:24:28] Completed 4830000 out of 7000000 steps  (69%)
[18:55:43] Completed 4900000 out of 7000000 steps  (70%)
[20:27:00] Completed 4970000 out of 7000000 steps  (71%)
[21:58:46] Completed 5040000 out of 7000000 steps  (72%)
[22:17:30] + Working...
[23:30:03] Completed 5110000 out of 7000000 steps  (73%)
[01:01:23] Completed 5180000 out of 7000000 steps  (74%)
[02:31:42] Completed 5250000 out of 7000000 steps  (75%)
[04:02:07] Completed 5320000 out of 7000000 steps  (76%)
[04:17:30] + Working...
[05:32:27] Completed 5390000 out of 7000000 steps  (77%)
[07:03:31] Completed 5460000 out of 7000000 steps  (78%)
[08:34:00] Completed 5530000 out of 7000000 steps  (79%)
[10:04:28] Completed 5600000 out of 7000000 steps  (80%)
[10:17:30] + Working...
[11:49:45] Completed 5670000 out of 7000000 steps  (81%)
[13:53:49] Completed 5740000 out of 7000000 steps  (82%)
[15:59:37] Completed 5810000 out of 7000000 steps  (83%)
[16:17:30] + Working...
[17:57:06] Completed 5880000 out of 7000000 steps  (84%)
[19:42:44] Completed 5950000 out of 7000000 steps  (85%)
[21:54:50] Completed 6020000 out of 7000000 steps  (86%)
[22:17:30] + Working...
[00:10:39] Completed 6090000 out of 7000000 steps  (87%)
[02:17:47] Completed 6160000 out of 7000000 steps  (88%)
[04:16:35] Completed 6230000 out of 7000000 steps  (89%)
[04:17:30] + Working...
[06:15:57] Completed 6300000 out of 7000000 steps  (90%)
[08:23:42] Completed 6370000 out of 7000000 steps  (91%)
[10:17:30] + Working...
[10:28:09] Completed 6440000 out of 7000000 steps  (92%)
[12:26:50] Completed 6510000 out of 7000000 steps  (93%)
[14:29:47] Completed 6580000 out of 7000000 steps  (94%)
[16:17:30] + Working...
[16:27:54] Completed 6650000 out of 7000000 steps  (95%)
[18:27:33] Completed 6720000 out of 7000000 steps  (96%)
[20:42:20] Completed 6790000 out of 7000000 steps  (97%)
[22:17:30] + Working...
[22:37:58] Completed 6860000 out of 7000000 steps  (98%)
[00:36:31] Completed 6930000 out of 7000000 steps  (99%)
[02:27:20] Completed 7000000 out of 7000000 steps  (100%)
[02:27:21] DynamicWrapper: Finished Work Unit: sleep=10000
[02:27:31] 
[02:27:31] Finished Work Unit:
[02:27:31] - Reading up to 6825192 from "work/wudata_00.trr": Read 6825192
[02:27:31] - Checksum of file (work/wudata_00.trr) read from disk doesn't match
[02:27:31] 
[02:27:31] Folding@home Core Shutdown: FILE_IO_ERROR
[02:27:35] CoreStatus = 75 (117)
[02:27:35] Error opening or reading from a file.
[02:27:35] Deleting current work unit & continuing...
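The step times in the log above can be read off mechanically rather than by eye. What follows is a minimal sketch in Python, assuming the v6 log line format shown in this thread (`[HH:MM:SS] Completed N out of M steps`). `frame_times` is a hypothetical helper name, not part of the F@H client. Since the log carries no dates, the sketch assumes consecutive progress lines are less than 24 hours apart and treats a backwards jump in the clock as a midnight rollover.

```python
import re
from datetime import timedelta

# Matches progress lines such as:
# [22:22:52] Completed 3150000 out of 7000000 steps  (45%)
LINE_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\] Completed (\d+) out of (\d+) steps")

def frame_times(log_text):
    """Return (steps_completed, time since previous progress line) pairs."""
    results = []
    prev_secs = None
    for m in LINE_RE.finditer(log_text):
        h, mi, s, steps, _total = map(int, m.groups())
        secs = h * 3600 + mi * 60 + s
        if prev_secs is not None:
            delta = secs - prev_secs
            if delta < 0:  # clock wrapped past midnight (assumes gap < 24 h)
                delta += 24 * 3600
            results.append((steps, timedelta(seconds=delta)))
        prev_secs = secs
    return results

sample = """\
[22:22:52] Completed 3150000 out of 7000000 steps  (45%)
[00:16:43] Completed 3220000 out of 7000000 steps  (46%)
[02:00:35] Completed 3290000 out of 7000000 steps  (47%)
"""

for steps, delta in frame_times(sample):
    print(steps, delta)
# 3220000 1:53:51
# 3290000 1:43:52
```

Each progress line in this unit covers 1% of the work (70000 of 7000000 steps), so the figures printed are the time per 1%, which is consistent with the "2 hours per step" reported in the first post.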

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Tue Feb 07, 2012 2:01 pm
by sortofageek
Thanks for the report. That one appears to be a bad WU. I reported it, but I'm not sure the report went through. I'll bring that to the attention of Pande Group.

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Tue Feb 07, 2012 5:07 pm
by sortofageek
Thanks to Macaholic, the WU has been reported.
The WU (P10722,R0,C3171,G1) has been reported as a bad WU. Note that the list of reported WUs is stopped daily at 8 am Pacific time.

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Wed Feb 08, 2012 11:42 am
by new08
I think I found an anomaly that could explain this failed unit.
For unknown reasons the machine ID in the CPU config on one client had changed to 1, from 3 previously.
I have a double click sensitivity on my mouse that sometimes starts the client twice from the shortcut on the desktop.
I noticed a few of these events, which terminated with a warning that a client with the same ident was already running.
So I was caught off guard that the CPU2 config was using machine ID 1, the same as the GPU [main workhorse client, usually running].
Maybe, as my system is getting 3 or 4 times faster with CPU upgrades, the double click problem, irritating but minor in itself, has contributed to the glitch on units starting, given the much faster system response.
I've corrected the config issue now. I always previously ran with 3 IDs for safety, as there's no problem with running short of idents on my rig.
I suppose the results on that unit are no good?
It may still have a problem, of course; this is just an update on system function rather than a cure-all!
*** I'm not really sure whether a machine number can be shared between GPU and CPU clients, but it could explain why the A4 and GB cores got a mention in the same log file.
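A check for clients sharing a machine ID can be scripted. The sketch below is in Python and rests on two assumptions that are not confirmed in this thread: that each client folder holds a `client.cfg` in INI style, and that the ID is stored as `machineid` under a `[settings]` section. `find_shared_machine_ids` is a hypothetical helper, and whether a shared ID actually caused the failed unit is not established here.

```python
import configparser
import tempfile
from collections import defaultdict
from pathlib import Path

def find_shared_machine_ids(client_dirs):
    """Return {machine_id: [client dirs]} for IDs used by more than one client."""
    by_id = defaultdict(list)
    for d in client_dirs:
        cfg = configparser.ConfigParser(interpolation=None, strict=False)
        cfg.read(Path(d) / "client.cfg")  # a missing file is silently skipped
        # Assumed layout: [settings] section with a machineid key.
        machine_id = cfg.get("settings", "machineid", fallback=None)
        if machine_id is not None:
            by_id[machine_id.strip()].append(str(d))
    return {mid: dirs for mid, dirs in by_id.items() if len(dirs) > 1}

# Demo with throwaway directories standing in for three client folders.
with tempfile.TemporaryDirectory() as root:
    layout = {"gpu": "1", "cpu1": "2", "cpu2": "1"}
    dirs = []
    for name, machine_id in layout.items():
        d = Path(root) / name
        d.mkdir()
        (d / "client.cfg").write_text(f"[settings]\nmachineid={machine_id}\n")
        dirs.append(d)
    shared = find_shared_machine_ids(dirs)

print({mid: [Path(p).name for p in paths] for mid, paths in shared.items()})
# {'1': ['gpu', 'cpu2']}
```

Pointing the function at the real client folders (instead of the demo directories) would list any ID that appears in more than one config.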

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Wed Feb 08, 2012 1:59 pm
by *hondo*
new08 wrote: I have a double click sensitivity on my mouse that sometimes starts the client twice from the shortcut on the desktop.
I don't know if the above is the only issue; however, what I do to avoid the same issue occurring is Right click + Open.

Hope this helps :)

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Wed Feb 08, 2012 2:15 pm
by new08
Yeah, Hondo- I'll do that. I did have a couple of scripts to cure the mouse sensitivity, but after working for a while they decided to lose interest and refuse to re-run now- so I've yet to find a generic solution [Win XP SP3].

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Thu Feb 09, 2012 6:35 am
by new08
I know this is not the original unit for this thread, but the next unit on that same core got through in one go.
What happens? It completes OK, as done many times on these units, and then fails to upload.
I still have the results file, and a txt logfile in the results folder as follows:

*------------------------------*
Folding@Home Gromacs GB Core
Version 2.27 (Dec. 15, 2010)

Preparing to commence simulation
- Ensuring status. Please wait.
- Looking at optimizations...
- Working with standard loops on this execution.
- Previous termination of core was improper.
- Files status OK
- Expanded 50669 -> 197152 (decompressed 389.0 percent)
Called DecompressByteArray: compressed_data_size=50669 data_size=197152, decompressed_data_size=197152 diff=0
- Digital signature verified

Project: 7016 (Run 1, Clone 21, Gen 87)

Entering M.D.
Mapping NT from 1 to 1 
Completed 0 out of 10000000 steps  (0%)
Completed 100000 out of 10000000 steps  (1%)
Completed 200000 out of 10000000 steps  (2%)
Completed 300000 out of 10000000 steps  (3%)
Completed 400000 out of 10000000 steps  (4%)
Completed 500000 out of 10000000 steps  (5%)
Completed 600000 out of 10000000 steps  (6%)
Completed 700000 out of 10000000 steps  (7%)
Completed 800000 out of 10000000 steps  (8%)
Completed 900000 out of 10000000 steps  (9%)
Completed 1000000 out of 10000000 steps  (10%)
Completed 1100000 out of 10000000 steps  (11%)
Completed 1200000 out of 10000000 steps  (12%)
Completed 1300000 out of 10000000 steps  (13%)
Completed 1400000 out of 10000000 steps  (14%)
Completed 1500000 out of 10000000 steps  (15%)
Completed 1600000 out of 10000000 steps  (16%)
Completed 1700000 out of 10000000 steps  (17%)
Completed 1800000 out of 10000000 steps  (18%)
Completed 1900000 out of 10000000 steps  (19%)
Completed 2000000 out of 10000000 steps  (20%)
Completed 2100000 out of 10000000 steps  (21%)
Completed 2200000 out of 10000000 steps  (22%)
Completed 2300000 out of 10000000 steps  (23%)
Completed 2400000 out of 10000000 steps  (24%)
Completed 2500000 out of 10000000 steps  (25%)
Completed 2600000 out of 10000000 steps  (26%)
Completed 2700000 out of 10000000 steps  (27%)
Completed 2800000 out of 10000000 steps  (28%)
Completed 2900000 out of 10000000 steps  (29%)
Completed 3000000 out of 10000000 steps  (30%)
Completed 3100000 out of 10000000 steps  (31%)
Completed 3200000 out of 10000000 steps  (32%)
Completed 3300000 out of 10000000 steps  (33%)
Completed 3400000 out of 10000000 steps  (34%)
Completed 3500000 out of 10000000 steps  (35%)
Completed 3600000 out of 10000000 steps  (36%)
Completed 3700000 out of 10000000 steps  (37%)
Completed 3800000 out of 10000000 steps  (38%)
Completed 3900000 out of 10000000 steps  (39%)
Completed 4000000 out of 10000000 steps  (40%)
Completed 4100000 out of 10000000 steps  (41%)
Completed 4200000 out of 10000000 steps  (42%)
Completed 4300000 out of 10000000 steps  (43%)
Completed 4400000 out of 10000000 steps  (44%)
Completed 4500000 out of 10000000 steps  (45%)
Completed 4600000 out of 10000000 steps  (46%)
Completed 4700000 out of 10000000 steps  (47%)
Completed 4800000 out of 10000000 steps  (48%)
Completed 4900000 out of 10000000 steps  (49%)
Completed 5000000 out of 10000000 steps  (50%)
Completed 5100000 out of 10000000 steps  (51%)
Completed 5200000 out of 10000000 steps  (52%)
Completed 5300000 out of 10000000 steps  (53%)
Completed 5400000 out of 10000000 steps  (54%)
Completed 5500000 out of 10000000 steps  (55%)
Completed 5600000 out of 10000000 steps  (56%)
Completed 5700000 out of 10000000 steps  (57%)
Completed 5800000 out of 10000000 steps  (58%)
Completed 5900000 out of 10000000 steps  (59%)
Completed 6000000 out of 10000000 steps  (60%)
Completed 6100000 out of 10000000 steps  (61%)
Completed 6200000 out of 10000000 steps  (62%)
Completed 6300000 out of 10000000 steps  (63%)
Completed 6400000 out of 10000000 steps  (64%)
Completed 6500000 out of 10000000 steps  (65%)
Completed 6600000 out of 10000000 steps  (66%)
Completed 6700000 out of 10000000 steps  (67%)
Completed 6800000 out of 10000000 steps  (68%)
Completed 6900000 out of 10000000 steps  (69%)
Completed 7000000 out of 10000000 steps  (70%)
Completed 7100000 out of 10000000 steps  (71%)
Completed 7200000 out of 10000000 steps  (72%)
Completed 7300000 out of 10000000 steps  (73%)
Completed 7400000 out of 10000000 steps  (74%)
Completed 7500000 out of 10000000 steps  (75%)
Completed 7600000 out of 10000000 steps  (76%)
Completed 7700000 out of 10000000 steps  (77%)
Completed 7800000 out of 10000000 steps  (78%)
Completed 7900000 out of 10000000 steps  (79%)
Completed 8000000 out of 10000000 steps  (80%)
Completed 8100000 out of 10000000 steps  (81%)
Completed 8200000 out of 10000000 steps  (82%)
Completed 8300000 out of 10000000 steps  (83%)
Completed 8400000 out of 10000000 steps  (84%)
Completed 8500000 out of 10000000 steps  (85%)
Completed 8600000 out of 10000000 steps  (86%)
Completed 8700000 out of 10000000 steps  (87%)
Completed 8800000 out of 10000000 steps  (88%)
Completed 8900000 out of 10000000 steps  (89%)
Completed 9000000 out of 10000000 steps  (90%)
Completed 9100000 out of 10000000 steps  (91%)
Completed 9200000 out of 10000000 steps  (92%)
Completed 9300000 out of 10000000 steps  (93%)
Completed 9400000 out of 10000000 steps  (94%)
Completed 9500000 out of 10000000 steps  (95%)
Completed 9600000 out of 10000000 steps  (96%)
Completed 9700000 out of 10000000 steps  (97%)
Completed 9800000 out of 10000000 steps  (98%)
Completed 9900000 out of 10000000 steps  (99%)
Completed 10000000 out of 10000000 steps  (100%)
DynamicWrapper: Finished Work Unit: sleep=10000

Finished Work Unit:
- Reading up to 2026464 from "work/wudata_01.trr": Read 2026464
trr file hash check passed.
- Reading up to 210856 from "work/wudata_01.xtc": Read 210856
xtc file hash check passed.
edr file hash check passed.
logfile size: 80711
Leaving Run
- Writing 2342503 bytes of core data to disk...
Done: 2341991 -> 1548484 (compressed to 66.1 percent)
  ... Done.
- Shutting down core

Folding@home Core Shutdown: FINISHED_UNIT


From my [usual, not workfile] text log, the following was added after FINISHED_UNIT:
[21:01:17] CoreStatus = 64 (100)
[21:01:17] Sending work to server
[21:01:17] Project: 7016 (Run 1, Clone 21, Gen 87)

[21:01:17] + Attempting to send results [February 8 21:01:17 UTC]
[21:01:30] - Server reports problem with unit.
[21:01:30] + Closed connections
[21:01:30] + Paused after finishing unit

Folding@Home Client Shutdown.

Now, I have to ask- is the results file any use and can it be uploaded? [It's 1.5 MB]

The only thing that comes to mind is that the inability to shut down or pause on this core may be leaving a bad trace on restart.
It seems that, using standard loops, it can't clear on a restart and so goes back to zero work done.
Even if this doesn't occur when the unit finishes, a glitch occurs that the results server doesn't like.
The other core is fine for this, so not a connection problem.

Should I delete the core and let it reload after it finishes the current unit in two days?
This failure to upload will mean about 10 days' work on one core is lost.
PS: I noticed and corrected that the Bonus password was missing from the config- I don't think this would affect uploading of the raw results.

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Fri Feb 10, 2012 4:47 am
by codysluder
Starting two V6 clients in the same directory will always cause something to fail. I think they have made it impossible to do that in V7.
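One generic way to stop a stray double click from launching a second copy is to start the client through a small wrapper that takes an exclusive lock file first. The sketch below is in Python and is an illustration only: `acquire_lock` and `release_lock` are hypothetical helpers, the lock file name is made up, and nothing here is a feature of the V6 client itself.

```python
import os
import tempfile

def acquire_lock(lock_path):
    """Create lock_path exclusively; return True if created, False if it already exists."""
    try:
        # O_EXCL makes creation fail if the file is already there.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner for manual troubleshooting
    return True

def release_lock(lock_path):
    """Remove the lock so the next launch can proceed."""
    os.remove(lock_path)

# Demo: the second launch attempt is refused while the first holds the lock.
lock = os.path.join(tempfile.mkdtemp(), "fah_client.lock")  # hypothetical name
first = acquire_lock(lock)
second = acquire_lock(lock)
release_lock(lock)
print(first, second)
# True False
```

A real wrapper would start the client only when `acquire_lock` returns True and release the lock when the client exits. If the wrapper is killed without cleaning up, the stale lock file has to be deleted by hand before the next start.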

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Fri Feb 10, 2012 5:18 am
by new08
Yeah, Cody! I have the two cpu clients in separate disks just to remind me.
Mtm has commented on the other thread but I'm going to switch my response to here.
Historically, I had a client set up on C, the boot loader disk, but not the O/S holder. I was just happy that XP reloaded OK on processor upgrades, but I did find the O/C limited by a disk read failure rather than overstressed CPU cores.
I have looked into switching the ntldr to the D drive but, tbh, I've been distracted by the odd behaviour of my folding clients.
At least I've kept 80% production running, but as the failures to send back what look like good results are not followed through, it is harder to diagnose.
A report like- 'don't like the results' is not that helpful as feedback from the servers!
I will post O/C data later on this thread, but I'm not doing anything too heavy- just maxing the board.
I will leave a post for mtm as I don't want to hijack the other thread.
I could now easily not use the C folder if that is what the problem is and just use another D directory.
I've done troubleshooting before on here, helping with the GPU species issue which had caused a lot of grief over time.
So, even if it's old news, many people may find it explains odd behaviours- to thrash out why things happen.
In the PG overview, a superfluity of folders makes for a different dynamic than enthusiasts beating the clock :)

Adding hardware details for MtM :
LGA 775 for Intel® Dual Core Core™ 2 Extreme / Core™ 2 Duo / Pentium® Dual Core / Celeron®, supporting Dual Core Wolfdale processors
Intel® 945GC A2 Chipset
Compatible with all FSB1333/1066/800/533 MHz CPUs except Quad Core
Supports Dual Channel DDRII667/533 x 2 DIMM slots with max. capacity up to 4GB

Running [XP Pro] at 1333 FSB, 333 bus speed x9, on an e6600 Duo core Pentium [stock 2.4 GHz] o/c to 3.0 GHz. Core temps ~52C; case temp 48C, air cooled.
Memory: DDR2 6400, 2 x 1GB, running @ 250 MHz 4:3 with timing 4:4:4 12.
GPU is a GT240 o/c: 640 core, 1640 shaders, 1740 memory. Running at circa 64C [will run @ 70C], latest drivers.
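The clock figures quoted above hang together arithmetically. A short worked check in Python, using only the numbers from the post; the variable names are illustrative, and the 4:3 figure is read here as the FSB-to-memory clock ratio, which is an assumption.

```python
bus_mhz = 333          # bus speed from the post
multiplier = 9         # CPU multiplier from the post

cpu_mhz = bus_mhz * multiplier        # 2997 MHz, i.e. the quoted 3.0 GHz
fsb_rating = bus_mhz * 4              # 1332, i.e. "1333 FSB" (quad-pumped bus)
stock_bus_mhz = 2400 / multiplier     # bus speed implied by the 2.4 GHz stock clock
mem_mhz = bus_mhz * 3 / 4             # 249.75 MHz, i.e. the quoted 250 MHz at 4:3

print(cpu_mhz, fsb_rating, round(stock_bus_mhz, 1), mem_mhz)
# 2997 1332 266.7 249.75
```

On these numbers the overclock comes from raising the bus from about 267 MHz to 333 MHz at the same x9 multiplier, while the memory at 250 MHz runs well under the 400 MHz clock that DDR2 6400 is rated for.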

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Fri Feb 10, 2012 11:25 pm
by PantherX
new08 wrote: Yeah, Hondo- I'll do that. I did have a couple of scripts to cure the mouse sensitivity, but after working for a while they decided to lose interest and refuse to re-run now- so I've yet to find a generic solution [Win XP SP3].
What I (sometimes) do is a single click on the icon and then hit the enter key on my keyboard so I am always sure that the application has been started.

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Fri Feb 10, 2012 11:40 pm
by new08
Thanx for that little tip Panther- I'm sure as hell going to get that script working one day though!
I'm caught out on the CPU client prob as I keep getting 2 days long units and if I stop the one with a problem- it resets. Never seen that before- so it's a slow process debugging the bugger :) If the next one plays up in 5 hrs when it completes- I'll move it to another location and try again. I'm only doing this as a bug trace really!
I don't think it's a config problem- but you know Windows....

Re: Failed unit after 1 long week... Project: 10722 (Run 0,

Posted: Sat Feb 11, 2012 12:14 pm
by new08
Update: The C client finished and uploaded OK; the rest as normal.
The only change prior to this was to de-synch the PCIe bus from the FSB.
I can't imagine this had any real effect or that Ver 6.23 can detect BIOS config changes like this. Instability is a different matter, by its very name. That has not been seen in the course of these issues.
I'm pleased things have settled down, and the little discussion I've had in this thread has been useful.
I think many have just sat back and waited unable to comment- but I still think it's worth putting out the details for others to check on.
It's not that F@H is rocket science on the donors' end, relying on clever software implementation a lot, but it is a bit 'scatter gun', which is why many drop out, I'd wager. I suspect that many a sidelong look gets taken at some of these comments by PG!
I, like many mods and contributors here, must like a battle- and if it was easy it wouldn't be so much fun!
So long as the results stack in the end... losing 10 days of one core's output is no big deal.