Re: The P8101 is killing my SR2

Posted: Wed Apr 25, 2012 10:22 pm
by Punchy
Normally only one of the two choices (NUMA and Node Interleaving) will be available in the BIOS menus, since they are basically the opposite of each other. You either enable NUMA or disable node interleaving. In your case, make sure Node Interleaving is disabled, and don't worry about finding a NUMA setting, though you might have an SRAT setting separately, which should be enabled.
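
One rough way to confirm from Linux which mode the BIOS ended up in is to count the NUMA nodes the kernel exposes under /sys/devices/system/node. The sketch below assumes a Linux system with sysfs mounted; the function name count_numa_nodes is made up for illustration. On a dual-socket board, two nodes would suggest NUMA is active, while a single node would be consistent with Node Interleaving being enabled.

```python
import re
from pathlib import Path


def count_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Count nodeN directories under the sysfs NUMA root (0 if it is absent)."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return 0
    # Only directories named node0, node1, ... are NUMA nodes; the root also
    # holds entries such as "online" and "possible" that must be skipped.
    return sum(
        1
        for entry in root.iterdir()
        if entry.is_dir() and re.fullmatch(r"node\d+", entry.name)
    )


if __name__ == "__main__":
    print(f"NUMA nodes visible to the kernel: {count_numa_nodes()}")
```

This only reports what the kernel sees; it does not change any BIOS setting.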

Re: The P8101 is killing my SR2

Posted: Thu Apr 26, 2012 7:07 am
by -alias-
RozSummer wrote:-alias-
I thought this was a particularly good article explaining NUMA and how it relates to Node Interleaving and the construction of a System Resource Affinity Table (SRAT). From this article, it makes sense to have NUMA enabled and Node Interleaving disabled. NUMA is a memory access scheme used by both Intel and AMD. Perhaps your Tyan board defaults to "NUMA Enabled", in which case you'd simply need to make sure Node Interleaving is disabled.

http://frankdenneman.nl/2010/12/node-in ... r-disable/

Roz
Thanks, the article was clarifying, and helped me to figure it out.

Re: The P8101 is killing my SR2

Posted: Tue May 15, 2012 7:53 pm
by -alias-
I lost another 8101 WU on my SR-2. :thumbdown:

The rig stopped (froze) after the final result was written to disk, and when I started it again, there was obviously something wrong with the file: it was deleted and a new 8101 was downloaded, with the result that 245k went off to folding hell.

My (subjective) conclusion is that my SR-2 and the P8101 WU are not quite friends, and that this will happen from time to time. I will try clocking it down a couple of notches again. It never runs hot, barely above 50 Celsius, so it cannot be overheating or throttling of the CPU.

Re: The P8101 is killing my SR2

Posted: Tue May 15, 2012 7:57 pm
by 7im
Please post the log showing the error.

Re: The P8101 is killing my SR2

Posted: Tue May 15, 2012 8:31 pm
by bruce
-alias- wrote:I lost another 8101 WU on my SR-2. :thumbdown:

The rig stopped (froze) after the final result was written to disk, and when I started it again, there was obviously something wrong with the file: it was deleted and a new 8101 was downloaded, with the result that 245k went off to folding hell.

My (subjective) conclusion is that my SR-2 and the P8101 WU are not quite friends, and that this will happen from time to time. I will try clocking it down a couple of notches again. It never runs hot, barely above 50 Celsius, so it cannot be overheating or throttling of the CPU.
Are you absolutely sure it had finished physically writing the data to disk? It sounds very much like you're not giving Linux enough time to actually finish writing the cached data to the HD.

Re: The P8101 is killing my SR2

Posted: Tue May 15, 2012 8:38 pm
by -alias-
7im wrote:Please post the log showing the error.

Code: Select all

[23:51:32] - Autosending finished units... [May 14 23:51:32 UTC]
[23:51:32] Trying to send all finished work units
[23:51:32] + No unsent completed units remaining.
[23:51:32] - Autosend completed
[23:55:44] Completed 147500 out of 250000 steps  (59%)
[00:20:02] Completed 150000 out of 250000 steps  (60%)
[00:44:22] Completed 152500 out of 250000 steps  (61%)
[01:08:41] Completed 155000 out of 250000 steps  (62%)
[01:32:58] Completed 157500 out of 250000 steps  (63%)
[01:57:17] Completed 160000 out of 250000 steps  (64%)
[02:21:35] Completed 162500 out of 250000 steps  (65%)
[02:45:53] Completed 165000 out of 250000 steps  (66%)
[03:10:13] Completed 167500 out of 250000 steps  (67%)
[03:34:31] Completed 170000 out of 250000 steps  (68%)
[03:58:51] Completed 172500 out of 250000 steps  (69%)
[04:23:11] Completed 175000 out of 250000 steps  (70%)
[04:47:31] Completed 177500 out of 250000 steps  (71%)
[05:11:49] Completed 180000 out of 250000 steps  (72%)
[05:36:09] Completed 182500 out of 250000 steps  (73%)
[05:51:32] - Autosending finished units... [May 15 05:51:32 UTC]
[05:51:32] Trying to send all finished work units
[05:51:32] + No unsent completed units remaining.
[05:51:32] - Autosend completed
[06:00:39] Completed 185000 out of 250000 steps  (74%)
[06:24:57] Completed 187500 out of 250000 steps  (75%)
[06:49:17] Completed 190000 out of 250000 steps  (76%)
[07:13:36] Completed 192500 out of 250000 steps  (77%)
[07:37:55] Completed 195000 out of 250000 steps  (78%)
[08:02:14] Completed 197500 out of 250000 steps  (79%)
[08:26:33] Completed 200000 out of 250000 steps  (80%)
[08:50:52] Completed 202500 out of 250000 steps  (81%)
[09:15:10] Completed 205000 out of 250000 steps  (82%)
[09:39:27] Completed 207500 out of 250000 steps  (83%)
[10:03:46] Completed 210000 out of 250000 steps  (84%)
[10:28:07] Completed 212500 out of 250000 steps  (85%)
[10:52:22] Completed 215000 out of 250000 steps  (86%)
[11:16:40] Completed 217500 out of 250000 steps  (87%)
[11:40:58] Completed 220000 out of 250000 steps  (88%)
[11:51:32] - Autosending finished units... [May 15 11:51:32 UTC]
[11:51:32] Trying to send all finished work units
[11:51:32] + No unsent completed units remaining.
[11:51:32] - Autosend completed
[12:05:16] Completed 222500 out of 250000 steps  (89%)
[12:29:35] Completed 225000 out of 250000 steps  (90%)
[12:53:55] Completed 227500 out of 250000 steps  (91%)
[13:18:12] Completed 230000 out of 250000 steps  (92%)
[13:42:31] Completed 232500 out of 250000 steps  (93%)
[14:06:50] Completed 235000 out of 250000 steps  (94%)
[14:31:11] Completed 237500 out of 250000 steps  (95%)
[14:55:28] Completed 240000 out of 250000 steps  (96%)
[15:19:48] Completed 242500 out of 250000 steps  (97%)
[15:44:06] Completed 245000 out of 250000 steps  (98%)
[16:08:25] Completed 247500 out of 250000 steps  (99%)
[16:32:42] Completed 250000 out of 250000 steps  (100%)
[16:32:50] DynamicWrapper: Finished Work Unit: sleep=10000
[16:33:00] 
[16:33:00] Finished Work Unit:
[16:33:00] - Reading up to 64340496 from "work/wudata_09.trr": Read 64340496
[16:33:01] trr file hash check passed.
[16:33:01] - Reading up to 31617124 from "work/wudata_09.xtc": Read 31617124
[16:33:01] xtc file hash check passed.
[16:33:01] edr file hash check passed.
[16:33:01] logfile size: 233960
[16:33:01] Leaving Run
[16:33:06] - Writing 96352456 bytes of core data to disk...
[16:33:21] Done: 96351944 -> 91577767 (compressed to 5.8 percent)
[16:33:21]   ... Done.
[16:33:28] - Shutting down core
[16:33:28] 
[16:33:28] Folding@home Core Shutdown: FINISHED_UNIT
This was the last line in the log, and I had to restart the rig to get access again. As you can see, there is no error in the log because the computer froze after FINISHED_UNIT. Exactly when the rig froze, I do not know!

Re: The P8101 is killing my SR2

Posted: Tue May 15, 2012 8:43 pm
by 7im
How long did you wait? As bruce said, maybe you didn't wait long enough and interrupted the client while writing the data to disk... BA WUs take a long time to write to disk.

Re: The P8101 is killing my SR2

Posted: Tue May 15, 2012 8:48 pm
by -alias-
The log says it had finished writing to disk. "Done", I believe, means completed?

[16:33:06] - Writing 96352456 bytes of core data to disk...
[16:33:21] Done: 96351944 -> 91577767 (compressed to 5.8 percent)
[16:33:21] ... Done.
[16:33:28] - Shutting down core
[16:33:28]
[16:33:28] Folding@home Core Shutdown: FINISHED_UNIT

Here is the log from when the rig was started again, after about 2 hours:

Code: Select all

--- Opening Log file [May 15 18:33:46 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/vidar/fah
Executable: /home/vidar/fah/fah6
Arguments: -smp 24 -bigadv -verbosity 9 

[18:33:46] - Ask before connecting: No
[18:33:46] - Proxy: 127.0.0.1:8880
[18:33:46] - User name: -alias- (Team 37651)
[18:33:46] - User ID: 1051488426ADAF6A
[18:33:46] - Machine ID: 1
[18:33:46] 
[18:33:46] Loaded queue successfully.
[18:33:46] 
[18:33:46] + Processing work unit
[18:33:46] - Autosending finished units... [May 15 18:33:46 UTC]
[18:33:46] Core required: FahCore_a5.exe
[18:33:46] Trying to send all finished work units
[18:33:46] Core found.
[18:33:46] + No unsent completed units remaining.
[18:33:46] - Autosend completed
[18:33:46] Working on queue slot 09 [May 15 18:33:46 UTC]
[18:33:46] + Working ...
[18:33:46] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 09 -np 24 -checkpoint 3 -verbose -lifeline 1880 -version 634'

[18:33:46] 
[18:33:46] *------------------------------*
[18:33:46] Folding@Home Gromacs SMP Core
[18:33:46] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[18:33:46] 
[18:33:46] Preparing to commence simulation
[18:33:46] - Ensuring status. Please wait.
[18:33:53] - Looking at optimizations...
[18:33:53] - Working with standard loops on this execution.
[18:33:53] - Created dyn
[18:33:53] - Files status OK
[18:33:53] Error: Missing work file=<>
[18:33:53] 
[18:33:53] Folding@home Core Shutdown: MISSING_WORK_FILES
[18:33:54] CoreStatus = 74 (116)
[18:33:54] The core could not find the work files specified. Removing from queue
[18:33:54] Deleting current work unit & continuing...
[18:33:54] Trying to send all finished work units
[18:33:54] + No unsent completed units remaining.
[18:33:54] - Preparing to get new work unit...
[18:33:54] Cleaning up work directory
[18:33:54] + Attempting to get work packet
[18:33:54] Passkey found
[18:33:54] - Will indicate memory of 12033 MB
[18:33:54] - Connecting to assignment server
[18:33:54] Connecting to http://assign.stanford.edu:8080/
[18:33:55] Posted data.
[18:33:55] Initial: 8F80; - Successful: assigned to (128.143.231.201).
[18:33:55] + News From Folding@Home: Welcome to Folding@Home
[18:33:55] Loaded queue successfully.
[18:33:55] Sent data
[18:33:55] Connecting to http://128.143.231.201:8080/
[18:34:43] Posted data.
[18:34:43] Initial: 0000; - Receiving payload (expected size: 30311442)
[18:35:52] - Downloaded at ~429 kB/s
[18:35:52] - Averaged speed for that direction ~384 kB/s
[18:35:52] + Received work.
[18:35:52] + Closed connections
[18:35:57] 
[18:35:57] + Processing work unit
[18:35:57] Core required: FahCore_a5.exe
[18:35:57] Core found.
[18:35:57] Working on queue slot 00 [May 15 18:35:57 UTC]
[18:35:57] + Working ...
[18:35:57] - Calling './FahCore_a5.exe -dir work/ -nice 19 -suffix 00 -np 24 -checkpoint 3 -verbose -lifeline 1880 -version 634'

[18:35:57] 
[18:35:57] *------------------------------*
[18:35:57] Folding@Home Gromacs SMP Core
[18:35:57] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[18:35:57] 
[18:35:57] Preparing to commence simulation
[18:35:57] - Looking at optimizations...
[18:35:57] - Created dyn
[18:35:57] - Files status OK
[18:35:59] - Expanded 30310930 -> 33158016 (decompressed 109.3 percent)
[18:35:59] Called DecompressByteArray: compressed_data_size=30310930 data_size=33158016, decompressed_data_size=33158016 diff=0
[18:35:59] - Digital signature verified
[18:35:59] 
[18:35:59] Project: 8101 (Run 12, Clone 6, Gen 5)
[18:35:59] 
[18:35:59] Assembly optimizations on if available.
[18:35:59] Entering M.D.
[18:36:06] Mapping NT from 24 to 24 
[18:36:09] Completed 0 out of 250000 steps  (0%)
[18:40:29] ng M.D.
[18:40:35] Using Gromacs checkpoints
[18:40:37] Mapping NT from 24 to 24 
[18:40:54] Resuming from checkpoint
[18:40:56] Verified work/wudata_00.log
[18:40:56] Verified work/wudata_00.trr
[18:40:56] Verified work/wudata_00.xtc
[18:40:56] Verified work/wudata_00.edr
[18:40:56] Completed 320 out of 250000 steps  (0%)
[19:01:59] Completed 2500 out of 250000 steps  (1%)

Re: The P8101 is killing my SR2

Posted: Tue May 15, 2012 9:12 pm
by 7im
"Writing to disk" is actually writing to disk cache. Depending on your operating system settings, disk cache is not immediately written to the hard disk.

But I can only assume 2 hours is long enough to wait... Is this an ext3 or ext4 file system?
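
To illustrate the difference between "written" and "physically on disk", here is a minimal Python sketch, assuming Linux. It is not how the Folding@home client writes its results; the helper name write_durably is invented for the example. A plain write only reaches the kernel's page cache, and os.fsync asks the kernel to push that file's data to the storage device before returning.

```python
import os


def write_durably(path, data):
    """Write bytes to path and ask the kernel to flush them to the device."""
    with open(path, "wb") as f:
        f.write(data)          # goes to Python's buffer, then the page cache
        f.flush()              # hand Python's buffer over to the kernel
        os.fsync(f.fileno())   # block until the kernel reports the data flushed
    # Also fsync the containing directory so the new file name is persisted.
    # Opening a directory this way is a Linux/Unix assumption.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Without the fsync calls, a freeze or power loss shortly after the write returns can still lose the data, which is the scenario being asked about here. Running the sync command from a shell before a manual reset has a similar effect for everything still cached.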

Re: The P8101 is killing my SR2

Posted: Tue May 15, 2012 9:36 pm
by -alias-
ext3

Re: The P8101 is killing my SR2

Posted: Tue May 15, 2012 11:05 pm
by Leonardo
1. Are you running Langouste?
2. To see if the completed unit still resides on your hard drive: go to the client folder, Home/fah/work, and look for a file named wuresults_xx.dat, where 'xx' is the work unit's number in the queue. If the file is there and it was the last work unit, you need to manually send it to the server: cd fah; ./fah6 -send all


EDIT: I forgot one instruction: if you intend to manually send the unit and you have Langouste configured, you will need to enter the Folding client configuration and change Proxy to "No."
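
The check in step 2 can also be scripted. This is only a sketch: the wuresults_xx.dat naming comes from the post above, while the function name find_results and the default ~/fah/work path are assumptions for illustration. It lists any result files still sitting in the work folder along with their sizes.

```python
from pathlib import Path


def find_results(work_dir="~/fah/work"):
    """Return (file name, size in bytes) for each wuresults_*.dat found."""
    work = Path(work_dir).expanduser()
    # glob() yields nothing if the directory is missing, so this is safe
    # to run even when the path guess is wrong.
    return sorted((p.name, p.stat().st_size) for p in work.glob("wuresults_*.dat"))


if __name__ == "__main__":
    results = find_results()
    if not results:
        print("No wuresults_*.dat files found.")
    for name, size in results:
        print(f"{name}: {size} bytes")
```

A file that is present but much smaller than the size reported in the log may have been cut short; if a plausible file is there, the manual send described above is the next step.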

Re: The P8101 is killing my SR2

Posted: Wed May 16, 2012 7:41 am
by -alias-
Thanks, I will try it as you suggest, and yes, I use Langouste too. I will report back if it goes well.

Edit: It did not work, see the log below. I guess some files were damaged, but shit happens. Thank you for your help anyway!

Code: Select all

--- Opening Log file [May 16 09:04:13 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/vidar/fah
Executable: ./fah6
Arguments: -smp -bigadv -verbosity 9 -configonly 

[09:04:13] - Ask before connecting: No
[09:04:13] Failed to resolve hostname for proxy. Will connect directly.
[09:04:13] - Proxy: local:8080
[09:04:13] - User name: -alias- (Team 37651)
[09:04:13] - User ID: 1051488426ADAF6A
[09:04:13] - Machine ID: 1
[09:04:13] 
[09:04:13] Configuring Folding@Home...


[09:04:24] - Ask before connecting: No
[09:04:24] - User name: -alias- (Team 37651)
[09:04:24] - User ID: 1051488426ADAF6A
[09:04:24] - Machine ID: 1
[09:04:24] 
[09:04:24] -configonly flag given, so exiting.


--- Opening Log file [May 16 09:04:33 UTC] 


# Linux Console Edition #######################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/vidar/fah
Executable: ./fah6
Arguments: -send all 

[09:04:33] - Ask before connecting: No
[09:04:33] - User name: -alias- (Team 37651)
[09:04:33] - User ID: 1051488426ADAF6A
[09:04:33] - Machine ID: 1
[09:04:33] 
[09:04:33] Loaded queue successfully.
[09:04:33] Attempting to return result(s) to server...

Folding@Home Client Shutdown.


--- Opening Log file [May 16 09:05:04 UTC] 


# Linux Console Edition #######################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/vidar/fah
Executable: ./fah6
Arguments: -send 09 

[09:05:04] - Ask before connecting: No
[09:05:04] - User name: -alias- (Team 37651)
[09:05:04] - User ID: 1051488426ADAF6A
[09:05:04] - Machine ID: 1
[09:05:04] 
[09:05:04] Loaded queue successfully.
[09:05:04] Attempting to return result(s) to server...
[09:05:04] Project: 8101 (Run 13, Clone 6, Gen 7)
[09:05:04] - Failed to send unit 09 to server

Folding@Home Client Shutdown.