
Project: 6897 (Run 378, Clone 2, Gen 4)

Posted: Fri Mar 23, 2012 3:26 am
by brooknet
Greetings,

My FAH client (Linux 6.02) is taking a very long time to process this WU. It takes so long that the hard drive spins down and the computer just sits there, using lots of CPU. After a week of processing the same WU, I deleted it from the queue, but just got it back again when I restarted FAH. Is this a bad unit or should I wait? It might be better to have it processed by someone with some powerful hardware (my computer is an old Sempron 2400+ PC).

Thank you,

From Lex.

P.S. Here's the FAHlog-Prev.txt file (edited).

Code:

[19:21:36] Initial: 0000; - Receiving payload (expected size: 661665)
[19:21:39] - Downloaded at ~215 kB/s
[19:21:39] - Averaged speed for that direction ~206 kB/s
[19:21:39] + Received work.
[19:21:39] Trying to send all finished work units
[19:21:39] + No unsent completed units remaining.
[19:21:39] + Closed connections
[19:21:39] 
[19:21:39] + Processing work unit
[19:21:39] Core required: FahCore_78.exe
[19:21:39] Core found.
[19:21:39] Working on Unit 04 [March 17 19:21:39]
[19:21:39] + Working ...
[19:21:39] - Calling './FahCore_78.exe -dir work/ -suffix 04 -checkpoint 15 -verbose -lifeline 2156 -version 602'

[19:21:39] 
[19:21:39] *------------------------------*
[19:21:39] Folding@Home Gromacs Core
[19:21:39] Version 1.90 (March 8, 2006)
[19:21:39] 
[19:21:39] Preparing to commence simulation
[19:21:39] - Looking at optimizations...

[19:21:39] - Created dyn
[19:21:39] - Files status OK
[19:21:40] - Expanded 661153 -> 3328716 (decompressed 503.4 percent)
[19:21:40] - Starting from initial work packet
[19:21:40] 
[19:21:40] Project: 6897 (Run 378, Clone 2, Gen 4)
[19:21:40] 
[19:21:40] Assembly optimizations on if available.
[19:21:40] Entering M.D.
[19:21:46] Protein: ALZHEIMER DISEASE AMYLOID
[19:21:46] 
[19:21:46] Writing local files
[19:22:21] Extra SSE boost OK.
[19:22:22] Writing local files
[19:22:22] Completed 0 out of 250000 steps  (0%)
[19:37:22] Timered checkpoint triggered.
[19:43:00] Writing local files
[19:43:00] Completed 2500 out of 250000 steps  (1%)
[19:57:59] Timered checkpoint triggered.
[20:03:36] Writing local files
[20:03:36] Completed 5000 out of 250000 steps  (2%)
[20:18:37] Timered checkpoint triggered.
[20:24:13] Writing local files
[20:24:13] Completed 7500 out of 250000 steps  (3%)
[20:39:13] Timered checkpoint triggered.
[20:44:50] Writing local files
[20:44:50] Completed 10000 out of 250000 steps  (4%)
[20:59:51] Timered checkpoint triggered.
[21:05:26] Writing local files

[21:05:26] Completed 12500 out of 250000 steps  (5%)
[21:20:26] Timered checkpoint triggered.
[21:26:02] Writing local files
[21:26:02] Completed 15000 out of 250000 steps  (6%)
[21:41:02] Timered checkpoint triggered.
[21:46:38] Writing local files
[21:46:39] Completed 17500 out of 250000 steps  (7%)
[21:57:59] - Autosending finished units...
[21:57:59] Trying to send all finished work units
[21:57:59] + No unsent completed units remaining.
[21:57:59] - Autosend completed
[22:01:38] Timered checkpoint triggered.
[22:07:15] Writing local files
[22:07:15] Completed 20000 out of 250000 steps  (8%)
[22:22:15] Timered checkpoint triggered.
[22:27:51] Writing local files
[22:27:51] Completed 22500 out of 250000 steps  (9%)
[22:42:51] Timered checkpoint triggered.
[22:48:27] Writing local files
[22:48:27] Completed 25000 out of 250000 steps  (10%)
[23:03:27] Timered checkpoint triggered.
[23:09:03] Writing local files
... [skipping a few percent] ...

Code:

[20:08:20] Completed 180000 out of 250000 steps  (72%)
[20:23:21] Timered checkpoint triggered.
[20:28:59] Writing local files
[20:28:59] Completed 182500 out of 250000 steps  (73%)
[20:43:59] Timered checkpoint triggered.
[20:49:38] Writing local files
[20:49:38] Completed 185000 out of 250000 steps  (74%)
[21:04:39] Timered checkpoint triggered.
[21:10:16] Writing local files
[21:10:16] Completed 187500 out of 250000 steps  (75%)
[21:25:17] Timered checkpoint triggered.
[21:40:18] Timered checkpoint triggered.
[21:56:36] Timered checkpoint triggered.
[21:57:59] - Autosending finished units...
[21:57:59] Trying to send all finished work units
Here, it started to slow down.

Code:

[21:57:59] + No unsent completed units remaining.
[21:57:59] - Autosend completed
[22:12:00] Timered checkpoint triggered.
[22:27:39] Timered checkpoint triggered.
[22:43:01] Timered checkpoint triggered.
[22:58:32] Timered checkpoint triggered.
[23:14:04] Timered checkpoint triggered.
[23:16:00] Writing local files
[23:16:06] Completed 190000 out of 250000 steps  (76%)
[23:31:41] Timered checkpoint triggered.
[23:47:09] Timered checkpoint triggered.
[00:03:05] Timered checkpoint triggered.
[00:18:45] Timered checkpoint triggered.
[00:33:57] Timered checkpoint triggered.
[00:49:20] Timered checkpoint triggered.
[01:04:30] Timered checkpoint triggered.
[01:19:51] Timered checkpoint triggered.
[01:35:12] Timered checkpoint triggered.
[01:50:52] Timered checkpoint triggered.
[02:06:14] Timered checkpoint triggered.
[02:21:55] Timered checkpoint triggered.
[02:37:07] Timered checkpoint triggered.
[02:52:15] Timered checkpoint triggered.
[03:07:37] Timered checkpoint triggered.
[03:23:01] Timered checkpoint triggered.
[03:38:16] Timered checkpoint triggered.
[03:53:47] Timered checkpoint triggered.
[03:57:59] - Autosending finished units...
[03:57:59] Trying to send all finished work units

[03:57:59] + No unsent completed units remaining.
[03:57:59] - Autosend completed
[04:09:03] Timered checkpoint triggered.
[04:24:18] Timered checkpoint triggered.
[04:39:48] Timered checkpoint triggered.
[04:55:10] Timered checkpoint triggered.
[05:11:05] Timered checkpoint triggered.
[05:26:42] Timered checkpoint triggered.
[05:42:04] Timered checkpoint triggered.
[05:57:59] Timered checkpoint triggered.
[06:13:29] Timered checkpoint triggered.
[06:28:43] Timered checkpoint triggered.
[06:44:32] Timered checkpoint triggered.
[07:00:03] Timered checkpoint triggered.
[07:15:19] Timered checkpoint triggered.
[07:31:08] Timered checkpoint triggered.
[07:46:44] Timered checkpoint triggered.
[08:01:57] Timered checkpoint triggered.
[08:17:08] Timered checkpoint triggered.
[08:32:36] Timered checkpoint triggered.
[08:48:03] Timered checkpoint triggered.
[09:03:14] Timered checkpoint triggered.
[09:18:52] Timered checkpoint triggered.
[09:34:21] Timered checkpoint triggered.
[09:49:35] Timered checkpoint triggered.
[09:57:59] - Autosending finished units...
[09:57:59] Trying to send all finished work units
[09:57:59] + No unsent completed units remaining.
[09:57:59] - Autosend completed

[10:05:22] Timered checkpoint triggered.
[10:20:55] Timered checkpoint triggered.
[10:36:11] Timered checkpoint triggered.
[10:52:01] Timered checkpoint triggered.
[11:07:32] Timered checkpoint triggered.
[11:22:45] Timered checkpoint triggered.
[11:38:31] Timered checkpoint triggered.
[11:54:01] Timered checkpoint triggered.
[12:09:15] Timered checkpoint triggered.
[12:25:03] Timered checkpoint triggered.
[12:40:33] Timered checkpoint triggered.
[12:55:46] Timered checkpoint triggered.
[13:11:35] Timered checkpoint triggered.
[13:27:07] Timered checkpoint triggered.
[13:42:22] Timered checkpoint triggered.
[13:58:08] Timered checkpoint triggered.
[14:13:36] Timered checkpoint triggered.
[14:28:47] Timered checkpoint triggered.
[14:44:32] Timered checkpoint triggered.
[15:00:02] Timered checkpoint triggered.
[15:15:13] Timered checkpoint triggered.
[15:30:56] Timered checkpoint triggered.
[15:46:22] Timered checkpoint triggered.
[15:57:59] - Autosending finished units...
[15:57:59] Trying to send all finished work units
[15:57:59] + No unsent completed units remaining.
[15:57:59] - Autosend completed
[16:01:34] Timered checkpoint triggered.
[16:17:22] Timered checkpoint triggered.

[16:32:54] Timered checkpoint triggered.
[16:48:09] Timered checkpoint triggered.
[17:03:57] Timered checkpoint triggered.
[17:19:26] Timered checkpoint triggered.
[17:34:35] Timered checkpoint triggered.
[17:50:17] Timered checkpoint triggered.
[18:05:43] Timered checkpoint triggered.
[18:20:53] Timered checkpoint triggered.
[18:36:39] Timered checkpoint triggered.
[18:52:07] Timered checkpoint triggered.
[19:07:17] Timered checkpoint triggered.
[19:22:59] Timered checkpoint triggered.
[19:38:23] Timered checkpoint triggered.
[19:53:32] Timered checkpoint triggered.
[20:09:17] Timered checkpoint triggered.
[20:24:46] Timered checkpoint triggered.
[20:39:59] Timered checkpoint triggered.
[20:55:46] Timered checkpoint triggered.
[21:11:16] Timered checkpoint triggered.
[21:26:28] Timered checkpoint triggered.
[21:42:12] Timered checkpoint triggered.
I Ctrl-C'd the client.

Code:

[21:54:45] ***** Got an Activate signal (2)
[21:54:45] Killing all core threads

Folding@Home Client Shutdown.

Re: Project: 6897 (Run 378, Clone 2, Gen 4)

Posted: Fri Mar 23, 2012 4:56 pm
by 7im
That's too bad, you were almost done with that one.

That was one of the larger work units, with a 27 day deadline, but you were in no danger of missing the deadline and could have gotten points for it.

Change the setting on your hard drive so it doesn't spin down, and the client won't have that problem again. ;)

Re: Project: 6897 (Run 378, Clone 2, Gen 4)

Posted: Sat Mar 24, 2012 10:04 am
by brooknet
Hello,

Oh - that's a pity, but it was taking so long to process that I thought that something was amiss. Unfortunately I lack the knowledge to disable the spin-down, because the drive is a USB-connected hard drive. If I use hdparm to address it (I'm using Ubuntu 10.04) I get various errors, depending on the command submitted:

Code:

lex@spatula:~$ sudo hdparm -S0 /dev/sdb

/dev/sdb:
 setting standby to 0 (off)
 HDIO_DRIVE_CMD(setidle) failed: Invalid exchange
It certainly would be nice if I could disable that spindown, because the Folding@Home client could then take as long as it wished.

Lex
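
When hdparm cannot reach a USB-attached drive, one possible workaround is to write a small file to the drive more often than its spin-down timeout. The following is only a minimal sketch under assumptions: the file path, the interval, and the helper names write_keepalive and keep_awake are hypothetical, and whether periodic writes actually stop a given enclosure from spinning down depends on its firmware.

```python
import os
import time


def write_keepalive(path):
    """Write the current time to `path` and ask the OS to flush it to the device."""
    with open(path, "w") as handle:
        handle.write(str(time.time()))
        handle.flush()
        # fsync so the write reaches the drive instead of sitting in the page cache.
        os.fsync(handle.fileno())


def keep_awake(path, interval_seconds, iterations=None):
    """Touch `path` every `interval_seconds`; run forever if `iterations` is None.

    Returns the number of writes performed.
    """
    count = 0
    while iterations is None or count < iterations:
        write_keepalive(path)
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)
    return count


# Hypothetical usage: a file on the USB drive, written every 4 minutes,
# assuming the drive's spin-down timeout is longer than that.
# keep_awake("/home/lex/.keepalive", 240)
```

The interval has to be shorter than the drive's own timeout, which is not known here, so it would need some experimenting.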

Re: Project: 6897 (Run 378, Clone 2, Gen 4)

Posted: Sat Mar 24, 2012 4:03 pm
by Joe_H
I doubt the problem was related to drive spin down. Every checkpoint would have spun the drive back up, and the checkpoints start getting to where multiple ones are being written between each frame anyways. I don't process many core 78's these days, but I used to occasionally see this happen. Absent some other process on the machine taking up CPU time, it may point to a bad WU. Other times I could get it to clear up by restarting the folding process and letting it resume from the last checkpoint.
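
The pattern described above, several checkpoints written between consecutive frames, can be spotted by counting the "Timered checkpoint triggered" lines between "Completed" lines. This is a rough sketch, not part of the FAH client: the function name checkpoints_per_frame is made up for illustration, and the sample lines are copied from the log in the first post.

```python
import re

# Matches lines such as "[19:43:00] Completed 2500 out of 250000 steps  (1%)"
COMPLETED = re.compile(r"Completed (\d+) out of (\d+) steps")


def checkpoints_per_frame(log_lines):
    """Return (percent_reached, checkpoints_since_previous_frame) for each frame."""
    counts = []
    pending = 0
    for line in log_lines:
        if "Timered checkpoint triggered" in line:
            pending += 1
            continue
        match = COMPLETED.search(line)
        if match:
            done, total = int(match.group(1)), int(match.group(2))
            counts.append((100 * done // total, pending))
            pending = 0
    return counts


sample = [
    "[21:04:39] Timered checkpoint triggered.",
    "[21:10:16] Completed 187500 out of 250000 steps  (75%)",
    "[21:25:17] Timered checkpoint triggered.",
    "[21:40:18] Timered checkpoint triggered.",
    "[21:56:36] Timered checkpoint triggered.",
    "[22:12:00] Timered checkpoint triggered.",
    "[22:27:39] Timered checkpoint triggered.",
    "[22:43:01] Timered checkpoint triggered.",
    "[22:58:32] Timered checkpoint triggered.",
    "[23:14:04] Timered checkpoint triggered.",
    "[23:16:06] Completed 190000 out of 250000 steps  (76%)",
]

print(checkpoints_per_frame(sample))  # [(75, 1), (76, 8)]
```

With a 15-minute checkpoint interval and frames that had been taking about 20 minutes, one checkpoint per frame is normal; eight before the 76% frame marks where the slowdown began.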

Re: Project: 6897 (Run 378, Clone 2, Gen 4)

Posted: Sat Mar 24, 2012 10:52 pm
by brooknet
To Joe_H:

Thank you for your message, which I am quoting.
"I doubt the problem was related to drive spin down."
Yes, whoops - I got a bit confused about what I was trying to say, there. Rather than the problem being caused by the drive spinning-down, I think I meant that the drive spinning-down meant that there was a problem. Or maybe I didn't - I haven't had much sleep lately.
"I don't process many core 78's these days, but I used to occasionally see this happen."
Oh yes, there it is - I hadn't noticed that.

Code:

[22:47:38] Core required: FahCore_78.exe
[22:47:38] Core found.
[22:47:38] Working on Unit 06 [March 24 22:47:38]
I admit that I don't look at the output of the client very closely, and neither do I understand the difference between one core and another. Thanks for the hint, though.
"Absent some other process on the machine taking up CPU time, it may point to a bad WU. Other times I could get it to clear up by restarting the folding process and letting it resume from the last checkpoint."
I've started the client again, and it's started work on this unit - from 0%. Serves me right for deleting it before, I know.

Lex

Re: Project: 6897 (Run 378, Clone 2, Gen 4)

Posted: Sat Mar 24, 2012 11:52 pm
by Joe_H
Other than noting that core 78 is one of the older cores and is used on older and smaller projects suitable for less powerful systems, there is no real reason to understand the differences between the various ones. It does have some issues or bugs specific to it, which can be useful to know when troubleshooting a problem. Eventually core 78 will be retired; in the meantime they have older projects that still have work to be finished. Hopefully this run will go better. If it takes the same amount of time per frame as it did up to the 75% point, it should take about 36 hours total. If it does the same thing again, the WU could be bad or the core unable to converge on a solution. Usually the core detects that and terminates the WU on its own, but the older code sometimes does not under the right (wrong) conditions.
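
The per-frame estimate above can be reproduced from the "Completed" lines in the log. The following is a back-of-the-envelope sketch, not anything the client provides: the function name estimate_total_hours is hypothetical, the log carries no dates so midnight rollover is handled by assuming less than 24 hours between frames, and the sample lines are taken from the first post.

```python
import re

# Matches "[HH:MM:SS] Completed N out of M steps"
FRAME = re.compile(r"\[(\d\d):(\d\d):(\d\d)\] Completed (\d+) out of (\d+) steps")


def estimate_total_hours(log_lines):
    """Extrapolate the total run time in hours from the frames seen so far."""
    frames = []
    for line in log_lines:
        match = FRAME.search(line)
        if match:
            hours, minutes, seconds, done, total = map(int, match.groups())
            frames.append((hours * 3600 + minutes * 60 + seconds, done, total))
    if len(frames) < 2:
        return None
    elapsed = 0
    for (start, _, _), (end, _, _) in zip(frames, frames[1:]):
        # Modulo one day copes with timestamps that wrap past midnight.
        elapsed += (end - start) % 86400
    steps_done = frames[-1][1] - frames[0][1]
    if steps_done <= 0:
        return None
    total_steps = frames[-1][2]
    return elapsed * total_steps / steps_done / 3600


sample = [
    "[19:22:22] Completed 0 out of 250000 steps  (0%)",
    "[19:43:00] Completed 2500 out of 250000 steps  (1%)",
    "[22:48:27] Completed 25000 out of 250000 steps  (10%)",
]

print(round(estimate_total_hours(sample), 1))  # 34.3
```

The first ten frames in the posted log work out to roughly 20.6 minutes each, or about 34 hours for the whole WU, which is in the same ballpark as the 36-hour figure.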

Re: Project: 6897 (Run 378, Clone 2, Gen 4)

Posted: Mon Mar 26, 2012 8:05 pm
by bruce
Because you missed the deadline, the same WU was reassigned to someone else who has successfully returned it:
The WU (P6897 R378 C2 G4) was added to the stats database on 2012-03-26 05:04:23 for 135 points of credit.

If you do start it again, you should get a new assignment, since that one has already been completed.
"It might be better to have it processed by someone with some powerful hardware (my computer is an old Sempron 2400+ PC)."
Whether they're assigned to a more powerful computer or a less powerful computer only matters if you miss the deadlines. All WUs must be completed. It all works out in the end, whether your 2400+ works on one long WU or several short ones in the same time period.

Re: Project: 6897 (Run 378, Clone 2, Gen 4)

Posted: Mon Mar 26, 2012 8:52 pm
by brooknet
Hello Bruce,
"Because you missed the deadline, the same WU was reassigned to someone else who has successfully returned it:
The WU (P6897 R378 C2 G4) was added to the stats database on 2012-03-26 05:04:23 for 135 points of credit."
Oh.. that's really bad luck, as when I started Folding@Home again, it loaded up this WU and this time, it zipped through it and completed it at 12:22 UTC today (March 26). It's now halfway through the next WU.

Thanks for your assistance, everyone. I think that I should learn to be a little more patient with this; it's just that I kept hearing the clicking sound of the drive spinning up, and I was worried that it would cause it to fail, if it kept doing that. As I may have mentioned before, the computer has a problem with random data corruption on its SATA controller, so I used the USB drive instead.

Lex

Re: Project: 6897 (Run 378, Clone 2, Gen 4)

Posted: Mon Mar 26, 2012 9:37 pm
by 7im
After a 2nd look, the hard drive shouldn't spin down since the client is writing a check point and updating the fahlog every 15 minutes. Something else spun down to power saving mode, or something took over, like an AV scan. It's also possible a memory glitch or something else put the client in a loop it couldn't work out.

Running a few diagnostic tests, when you have time, wouldn't hurt to make sure the hardware is still working correctly. Memtest, disk scan, OCCT, things like that.

Anyway, good learning tool... ;)

Re: Project: 6897 (Run 378, Clone 2, Gen 4)

Posted: Tue Mar 27, 2012 1:26 pm
by brooknet
"After a 2nd look, the hard drive shouldn't spin down since the client is writing a check point and updating the fahlog every 15 minutes. Something else spun down to power saving mode, or something took over, like an AV scan. It's also possible a memory glitch or something else put the client in a loop it couldn't work out."
Yes, you're absolutely right. I think that the issue is that the drive's default spindown time is set to a very short period - possibly as short as 5 minutes. Nothing else should spin the drive down because it's just handling the /home dirs, and the system happily carries on when the drive is detached (well, I say 'happily' but no-one's happy without a home!).
"Running a few diagnostic tests, when you have time, wouldn't hurt to make sure the hardware is still working correctly. Memtest, disk scan, OCCT, things like that."
I (usually) do all of these things regularly - apart from 'OCCT', because as far as I am aware, it's not available for Linux. This motherboard (ABit AN7) has been a problem since I first set it up: it's the chipset. Without wishing to libel the manufacturer, may I just say that it's 'a little below par'? The AGP implementation causes lock-ups occasionally, and the built-in SATA adaptor causes random r/w errors. The AN8 is a much better board.

Lex