Project: 5722 (Run 2, Clone 0, Gen 17)

Moderators: Site Moderators, FAHC Science Team

Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Project: 5722 (Run 2, Clone 0, Gen 17)

Post by Ivoshiee »

I had another mysterious case of a new WU starting beyond 0% completion (http://foldingforum.org/viewtopic.php?f=19&t=7347). This time it started from 18%:

Code: Select all

[06:59:12] - Preparing to get new work unit...
[06:59:12] + Attempting to get work packet
[06:59:12] - Will indicate memory of 3327 MB
[06:59:12] - Detect CPU. Vendor: AuthenticAMD, Family: 15, Model: 2, Stepping: 2
[06:59:12] - Connecting to assignment server
[06:59:12] Connecting to http://assign-GPU.stanford.edu:8080/
[06:59:13] Posted data.
[06:59:13] Initial: 40AB; - Successful: assigned to (171.64.65.102).
[06:59:13] + News From Folding@Home: GPU folding beta
[06:59:13] Loaded queue successfully.
[06:59:13] Connecting to http://171.64.65.102:8080/
[06:59:14] Posted data.
[06:59:14] Initial: 0000; - Receiving payload (expected size: 97236)
[06:59:16] - Downloaded at ~47 kB/s
[06:59:16] - Averaged speed for that direction ~47 kB/s
[06:59:16] + Received work.
[06:59:16] Trying to send all finished work units
[06:59:16] + No unsent completed units remaining.
[06:59:16] + Closed connections
[06:59:16] 
[06:59:16] + Processing work unit
[06:59:16] Core required: FahCore_12.exe
[06:59:16] Core found.
[06:59:16] Working on queue slot 09 [December 8 06:59:16 UTC]
[06:59:16] + Working ...
[06:59:16] - Calling '.\FahCore_12.exe -dir work/ -suffix 09 -checkpoint 15 -service -forceasm -verbose -lifeline 416 -version 620'

[06:59:16] 
[06:59:16] *------------------------------*
[06:59:16] Folding@Home GPU Core - Beta
[06:59:16] Version 1.20 (Wed Nov 19 14:15:53 PST 2008)
[06:59:16] 
[06:59:16] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[06:59:16] Build host: amoeba
[06:59:16] Board Type: AMD
[06:59:16] Core      : 
[06:59:16] Preparing to commence simulation
[06:59:16] - Assembly optimizations manually forced on.
[06:59:16] - Not checking prior termination.
[06:59:16] - Expanded 96724 -> 489152 (decompressed 505.7 percent)
[06:59:16] Called DecompressByteArray: compressed_data_size=96724 data_size=489152, decompressed_data_size=489152 diff=0
[06:59:16] - Digital signature verified
[06:59:16] 
[06:59:16] Project: 5722 (Run 2, Clone 0, Gen 17)
[06:59:16] 
[06:59:16] Assembly optimizations on if available.
[06:59:16] Entering M.D.
[06:59:22] Will resume from checkpoint file
[06:59:22] Working on Protein
[06:59:23] Client config found, loading data.
[06:59:23] Starting GUI Server
[06:59:32] Resuming from checkpoint
[06:59:32] fcSaveRestoreState: I/O failed dir=0, var=00DE5A90, varsize=16704
[06:59:32] fcSaveRestoreState: I/O failed dir=0, var=0124C0E0, varsize=4
[06:59:32] fcSaveRestoreState: I/O failed dir=0, var=0124C0E8, varsize=4
[06:59:32] fcSaveRestoreState: I/O failed dir=0, var=00E50868, varsize=52
[06:59:32] fcSaveRestoreState: I/O failed dir=0, var=00E50C6C, varsize=36
[06:59:32] fcSaveRestoreState: I/O failed dir=0, var=00E50CB4, varsize=36
[06:59:32] fcSaveRestoreState: I/O failed dir=0, var=00E50CD8, varsize=36
[06:59:32] Verified work/wudata_09.log
[06:59:32] Verified work/wudata_09.edr
[06:59:32] Verified work/wudata_09.xtc
[06:59:32] Completed 18%
[06:59:32] mdrun_gpu returned 
[06:59:32] Calculated & specified T inconsisitent
[06:59:32] 
[06:59:32] Folding@home Core Shutdown: UNSTABLE_MACHINE
[06:59:36] CoreStatus = 7A (122)
[06:59:36] Sending work to server
[06:59:36] Project: 5722 (Run 2, Clone 0, Gen 17)
[06:59:36] - Error: Could not get length of results file work/wuresults_09.dat
[06:59:36] - Error: Could not read unit 09 file. Removing from queue.
[06:59:36] Trying to send all finished work units
[06:59:36] + No unsent completed units remaining.
[06:59:36] - Preparing to get new work unit...
[06:59:36] + Attempting to get work packet
Some new messages have appeared in FAHlog. I don't know what those messages mean, but I verified that the drive has plenty of free space.
I see that the core in use is not the latest one; I'll update it and check whether that changes anything.


System: HD4830, Windows XP-Pro 32-bit, CAL 8.10.
VijayPande
Pande Group Member
Posts: 2058
Joined: Fri Nov 30, 2007 6:25 am
Location: Stanford

Re: Project: 5722 (Run 2, Clone 0, Gen 17)

Post by VijayPande »

I've asked Mark to look into this.
Prof. Vijay Pande, PhD
Departments of Chemistry, Structural Biology, and Computer Science
Chair, Biophysics
Director, Folding@home Distributed Computing Project
Stanford University
Xilikon
Posts: 155
Joined: Sun Dec 02, 2007 1:34 pm

Re: Project: 5722 (Run 2, Clone 0, Gen 17)

Post by Xilikon »

If it's indeed an issue, it looks like some work files are being written badly, so the core shows an incorrect % and surely reads incorrect data, hence the failure.
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Project: 5722 (Run 2, Clone 0, Gen 17)

Post by Ivoshiee »

Xilikon wrote:If it's indeed an issue, it looks like some work files are being written badly, so the core shows an incorrect % and surely reads incorrect data, hence the failure.
Indeed something is funny:

Code: Select all

[17:47:14] + Results successfully sent
[17:47:14] Thank you for your contribution to Folding@Home.
[17:47:14] + Number of Units Completed: 330

[17:47:18] Trying to send all finished work units
[17:47:18] + No unsent completed units remaining.
[17:47:18] - Preparing to get new work unit...
[17:47:18] + Attempting to get work packet
[17:47:18] - Will indicate memory of 3327 MB
[17:47:18] - Detect CPU. Vendor: AuthenticAMD, Family: 15, Model: 2, Stepping: 2
[17:47:18] - Connecting to assignment server
[17:47:18] Connecting to http://assign-GPU.stanford.edu:8080/
[17:47:19] Posted data.
[17:47:19] Initial: 40AB; - Successful: assigned to (171.64.65.102).
[17:47:19] + News From Folding@Home: GPU folding beta
[17:47:20] Loaded queue successfully.
[17:47:20] Connecting to http://171.64.65.102:8080/
[17:47:20] Posted data.
[17:47:20] Initial: 0000; - Receiving payload (expected size: 69100)
[17:47:22] - Downloaded at ~33 kB/s
[17:47:22] - Averaged speed for that direction ~47 kB/s
[17:47:22] + Received work.
[17:47:22] Trying to send all finished work units
[17:47:22] + No unsent completed units remaining.
[17:47:22] + Closed connections
[17:47:22] 
[17:47:22] + Processing work unit
[17:47:22] Core required: FahCore_12.exe
[17:47:22] Core found.
[17:47:22] Working on queue slot 07 [December 8 17:47:22 UTC]
[17:47:22] + Working ...
[17:47:22] - Calling '.\FahCore_12.exe -dir work/ -suffix 07 -checkpoint 15 -service -forceasm -verbose -lifeline 184 -version 620'

[17:47:22] 
[17:47:22] *------------------------------*
[17:47:22] Folding@Home GPU Core - Beta
[17:47:22] Version 1.21 (Tue Nov 25 14:54:02 PST 2008)
[17:47:22] 
[17:47:22] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[17:47:22] Build host: amoeba
[17:47:22] Board Type: AMD
[17:47:22] Core      : 
[17:47:22] Preparing to commence simulation
[17:47:22] - Assembly optimizations manually forced on.
[17:47:22] - Not checking prior termination.
[17:47:22] - Expanded 68588 -> 357580 (decompressed 521.3 percent)
[17:47:22] Called DecompressByteArray: compressed_data_size=68588 data_size=357580, decompressed_data_size=357580 diff=0
[17:47:22] - Digital signature verified
[17:47:22] 
[17:47:22] Project: 5731 (Run 1, Clone 0, Gen 18)
[17:47:22] 
[17:47:23] Assembly optimizations on if available.
[17:47:23] Entering M.D.
[17:47:29] Will resume from checkpoint file
[17:47:29] Working on Protein
[17:47:29] Client config found, loading data.
[17:47:29] Starting GUI Server
[17:47:39] Resuming from checkpoint
[17:47:39] Verified work/wudata_07.log
[17:47:39] Verified work/wudata_07.edr
[17:47:39] Verified work/wudata_07.xtc
[17:47:39] Completed 91%
[17:47:39] mdrun_gpu returned 
[17:47:39] Calculated & specified T inconsisitent
[17:47:39] 
[17:47:39] Folding@home Core Shutdown: UNSTABLE_MACHINE
[17:47:42] CoreStatus = 7A (122)
[17:47:42] Sending work to server
[17:47:42] Project: 5731 (Run 1, Clone 0, Gen 18)
[17:47:42] - Error: Could not get length of results file work/wuresults_07.dat
[17:47:42] - Error: Could not read unit 07 file. Removing from queue.
[17:47:42] Trying to send all finished work units
[17:47:42] + No unsent completed units remaining.
[17:47:42] - Preparing to get new work unit...
[17:47:42] + Attempting to get work packet
[17:47:42] - Will indicate memory of 3327 MB
[17:47:42] - Connecting to assignment server
Different WU but same WU slot (7) and same starting %. That is simply not possible.
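
To see which queue slots keep failing across a longer FAHlog, the slot and shutdown lines can be paired up. This is only a rough sketch: the two patterns are taken from the log lines quoted in this thread, and the function name `shutdowns_by_slot` is made up for illustration, not part of the client.

```python
import re

# Patterns based on the log lines quoted in this thread (assumed stable).
SLOT_RE = re.compile(r"Working on queue slot (\d+)")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")


def shutdowns_by_slot(log_text):
    """Return (slot, shutdown_status) pairs in the order they appear."""
    results = []
    current_slot = None
    for line in log_text.splitlines():
        slot_match = SLOT_RE.search(line)
        if slot_match:
            # Remember the slot until its core reports a shutdown status.
            current_slot = slot_match.group(1)
            continue
        shutdown_match = SHUTDOWN_RE.search(line)
        if shutdown_match and current_slot is not None:
            results.append((current_slot, shutdown_match.group(1)))
            current_slot = None
    return results


if __name__ == "__main__":
    sample = (
        "[17:47:22] Working on queue slot 07 [December 8 17:47:22 UTC]\n"
        "[17:47:39] Completed 91%\n"
        "[17:47:39] Folding@home Core Shutdown: UNSTABLE_MACHINE\n"
    )
    for slot, status in shutdowns_by_slot(sample):
        print(f"slot {slot}: {status}")
```

A shutdown line with no preceding "Working on queue slot" line (for example, when the log excerpt starts mid-unit) is skipped rather than guessed at.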

Code: Select all

C:\folding\work>dir *
 Volume in drive C has no label.
 Volume Serial Number is 7E13-6968

 Directory of C:\folding\work

12/08/2008  19:48    <DIR>          .
12/08/2008  19:48    <DIR>          ..
12/08/2008  19:47                 0 core78.sta
12/08/2008  15:13               986 logfile_01.txt
12/08/2008  15:14               935 logfile_02.txt
12/08/2008  15:14               935 logfile_03.txt
12/08/2008  15:14               935 logfile_04.txt
12/08/2008  15:15               935 logfile_05.txt
12/08/2008  19:47               987 logfile_07.txt
12/08/2008  21:08             1 025 logfile_08.txt
12/08/2008  08:59             1 429 logfile_09.txt
11/29/2008  16:13           149 532 wudata_01.ckp
12/08/2008  15:13            68 962 wudata_01.dat
12/08/2008  15:13               560 wudata_01.edr
12/08/2008  15:13            22 728 wudata_01.log
12/08/2008  15:13           353 437 wudata_01.tpr
12/08/2008  15:13           178 324 wudata_01.xtc
11/22/2008  05:08           149 532 wudata_02.ckp
12/08/2008  15:13            99 000 wudata_02.dat
12/07/2008  13:23               560 wudata_02.edr
12/08/2008  15:13               560 wudata_02.edr_tmp
12/07/2008  13:23            16 207 wudata_02.log
12/08/2008  15:14            15 489 wudata_02.log_tmp
12/08/2008  15:13           488 045 wudata_02.tpr
12/07/2008  13:23            24 128 wudata_02.xtc
12/08/2008  15:14             5 336 wudata_02.xtc_tmp
12/07/2008  14:05           149 532 wudata_03.ckp
12/08/2008  15:14            99 093 wudata_03.dat
12/07/2008  13:23               560 wudata_03.edr
12/08/2008  15:14               560 wudata_03.edr_tmp
12/07/2008  14:08            26 961 wudata_03.log
12/08/2008  15:14            15 489 wudata_03.log_tmp
12/08/2008  15:14           488 045 wudata_03.tpr
12/07/2008  14:08           289 204 wudata_03.xtc
12/08/2008  15:14             5 352 wudata_03.xtc_tmp
11/24/2008  21:48           149 532 wudata_04.ckp
12/08/2008  15:14            99 095 wudata_04.dat
12/07/2008  14:08               560 wudata_04.edr
12/08/2008  15:14               560 wudata_04.edr_tmp
12/07/2008  14:08            16 553 wudata_04.log
12/08/2008  15:14            15 489 wudata_04.log_tmp
12/08/2008  15:14           488 045 wudata_04.tpr
12/07/2008  14:08            24 128 wudata_04.xtc
12/08/2008  15:14             5 352 wudata_04.xtc_tmp
11/22/2008  20:40           149 532 wudata_05.ckp
12/08/2008  15:15            99 077 wudata_05.dat
12/07/2008  14:09               560 wudata_05.edr
12/08/2008  15:15               560 wudata_05.edr_tmp
12/07/2008  14:09            16 511 wudata_05.log
12/08/2008  15:15            15 487 wudata_05.log_tmp
12/08/2008  15:15           488 045 wudata_05.tpr
12/07/2008  14:09            24 128 wudata_05.xtc
12/08/2008  15:15             5 364 wudata_05.xtc_tmp
11/15/2008  16:52           149 532 wudata_07.ckp
12/08/2008  19:47            69 100 wudata_07.dat
12/08/2008  19:47               560 wudata_07.edr
12/08/2008  19:47            86 789 wudata_07.log
12/08/2008  19:47           353 437 wudata_07.tpr
12/07/2008  21:52            13 176 wudata_07.trr_tmp
12/08/2008  19:47         1 756 756 wudata_07.xtc
12/08/2008  21:08           157 812 wudata_08.ckp
12/08/2008  19:47            99 084 wudata_08.dat
12/08/2008  19:47               560 wudata_08.edr
12/08/2008  21:09            23 212 wudata_08.log
12/08/2008  19:47           488 045 wudata_08.tpr
12/08/2008  19:47            52 232 wudata_08.xtc
11/10/2008  23:02           149 532 wudata_09.ckp
12/08/2008  08:59            97 236 wudata_09.dat
12/08/2008  08:59               560 wudata_09.edr
12/08/2008  08:59            29 717 wudata_09.log
12/08/2008  08:59           485 009 wudata_09.tpr
11/21/2008  15:09            27 312 wudata_09.trr_tmp
12/08/2008  08:59           351 004 wudata_09.xtc
12/08/2008  15:13               512 wuinfo_01.dat
12/08/2008  15:13               512 wuinfo_02.dat
12/08/2008  15:14               512 wuinfo_03.dat
12/08/2008  15:14               512 wuinfo_04.dat
12/08/2008  15:15               512 wuinfo_05.dat
12/08/2008  19:47               512 wuinfo_07.dat
12/08/2008  21:08               512 wuinfo_08.dat
12/08/2008  08:59               512 wuinfo_09.dat
              79 File(s)      8 648 702 bytes
               2 Dir(s)   2 937 634 816 bytes free

C:\folding\work>
It seems that several .ckp files are not being overwritten and that is causing the issue:

Code: Select all

C:\folding\work>dir *.ckp
 Volume in drive C has no label.
 Volume Serial Number is 7E13-6968

 Directory of C:\folding\work

11/29/2008  16:13           149 532 wudata_01.ckp
11/22/2008  05:08           149 532 wudata_02.ckp
12/07/2008  14:05           149 532 wudata_03.ckp
11/24/2008  21:48           149 532 wudata_04.ckp
11/22/2008  20:40           149 532 wudata_05.ckp
11/15/2008  16:52           149 532 wudata_07.ckp
12/08/2008  21:08           157 812 wudata_08.ckp
11/10/2008  23:02           149 532 wudata_09.ckp
               8 File(s)      1 204 536 bytes
               0 Dir(s)   2 937 630 720 bytes free

C:\folding\work>
If that is correct then only 2 or 3 WUs out of 10 can ever be submitted by my FAH client. I have to investigate....
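
The comparison done by eye above (a .ckp file dated weeks before the .dat file of the same slot) could be automated roughly as follows. This is a heuristic sketch, assuming the `wudata_NN.ckp` / `wudata_NN.dat` naming seen in the listing; `stale_checkpoints` is a hypothetical helper, and an older checkpoint is only a hint, not proof, that it belongs to a previous WU.

```python
import re
from pathlib import Path

# File naming taken from the directory listing in this thread.
CKP_RE = re.compile(r"wudata_(\d+)\.ckp$")


def stale_checkpoints(work_dir):
    """Return slot numbers whose .ckp is older than the matching .dat file."""
    stale = []
    work = Path(work_dir)
    for ckp in sorted(work.glob("wudata_*.ckp")):
        match = CKP_RE.search(ckp.name)
        if not match:
            continue
        dat = work / f"wudata_{match.group(1)}.dat"
        # A checkpoint written before the WU was downloaded cannot belong to it.
        if dat.exists() and ckp.stat().st_mtime < dat.stat().st_mtime:
            stale.append(match.group(1))
    return stale


if __name__ == "__main__":
    # Path from the listing above; adjust to the local client directory.
    print(stale_checkpoints(r"C:\folding\work"))
```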

Update:

As suspected, 3 WU slots are good and 7 are broken by an "invalid" .ckp file:

Code: Select all

[13:05:25] Completed 98%
[13:09:05] Completed 99%
[13:12:45] Completed 100%
[13:12:45] Successful run
[13:12:45] DynamicWrapper: Finished Work Unit: sleep=10000
[13:12:56] Reserved 220984 bytes for xtc file; Cosm status=0
[13:12:56] Allocated 220984 bytes for xtc file
[13:12:56] - Reading up to 220984 from "work/wudata_00.xtc": Read 220984
[13:12:56] Read 220984 bytes from xtc file; available packet space=786209480
[13:12:56] xtc file hash check passed.
[13:12:56] Reserved 33528 33528 786209480 bytes for arc file=<work/wudata_00.trr> Cosm status=0
[13:12:56] Allocated 33528 bytes for arc file
[13:12:56] - Reading up to 33528 from "work/wudata_00.trr": Read 33528
[13:12:56] Read 33528 bytes from arc file; available packet space=786175952
[13:12:56] trr file hash check passed.
[13:12:56] Allocated 560 bytes for edr file
[13:12:56] Read bedfile
[13:12:56] edr file hash check passed.
[13:12:56] Allocated 55022 bytes for logfile
[13:12:56] Read logfile
[13:12:56] GuardedRun: success in DynamicWrapper
[13:12:56] GuardedRun: done
[13:12:56] Run: GuardedRun completed.
[13:12:59] - Writing 310606 bytes of core data to disk...
[13:13:00] Done: 310094 -> 266916 (compressed to 86.0 percent)
[13:13:00]   ... Done.
[13:13:00] - Shutting down core 
[13:13:00] 
[13:13:00] Folding@home Core Shutdown: FINISHED_UNIT
[13:13:03] CoreStatus = 64 (100)
[13:13:03] Unit 0 finished with 91 percent of time to deadline remaining.
[13:13:03] Updated performance fraction: 0.893249
[13:13:03] Sending work to server
[13:13:03] Project: 5723 (Run 4, Clone 0, Gen 5)


[13:13:03] + Attempting to send results [December 8 13:13:03 UTC]
[13:13:03] - Reading file work/wuresults_00.dat from core
[13:13:03]   (Read 267428 bytes from disk)
[13:13:03] Connecting to http://171.64.65.102:8080/
[13:13:07] Posted data.
[13:13:07] Initial: 0000; - Uploaded at ~52 kB/s
[13:13:08] - Averaged speed for that direction ~50 kB/s
[13:13:08] + Results successfully sent
[13:13:08] Thank you for your contribution to Folding@Home.
[13:13:08] + Number of Units Completed: 329

[13:13:12] Trying to send all finished work units
[13:13:12] + No unsent completed units remaining.
[13:13:12] - Preparing to get new work unit...
[13:13:12] + Attempting to get work packet
[13:13:12] - Will indicate memory of 3327 MB
[13:13:12] - Detect CPU. Vendor: AuthenticAMD, Family: 15, Model: 2, Stepping: 2
[13:13:12] - Connecting to assignment server
[13:13:12] Connecting to http://assign-GPU.stanford.edu:8080/
[13:13:13] Posted data.
[13:13:13] Initial: 40AB; - Successful: assigned to (171.64.65.102).
[13:13:13] + News From Folding@Home: GPU folding beta
[13:13:13] Loaded queue successfully.
[13:13:13] Connecting to http://171.64.65.102:8080/
[13:13:14] Posted data.
[13:13:14] Initial: 0000; - Receiving payload (expected size: 68962)
[13:13:15] - Downloaded at ~67 kB/s
[13:13:15] - Averaged speed for that direction ~59 kB/s
[13:13:15] + Received work.
[13:13:15] Trying to send all finished work units
[13:13:15] + No unsent completed units remaining.
[13:13:15] + Closed connections
[13:13:15] 
[13:13:15] + Processing work unit
[13:13:15] Core required: FahCore_12.exe
[13:13:15] Core found.
[13:13:15] Working on queue slot 01 [December 8 13:13:15 UTC]
[13:13:15] + Working ...
[13:13:15] - Calling '.\FahCore_12.exe -dir work/ -suffix 01 -checkpoint 15 -service -forceasm -verbose -lifeline 864 -version 620'

[13:13:15] 
[13:13:15] *------------------------------*
[13:13:15] Folding@Home GPU Core - Beta
[13:13:15] Version 1.21 (Tue Nov 25 14:54:02 PST 2008)
[13:13:15] 
[13:13:15] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[13:13:15] Build host: amoeba
[13:13:15] Board Type: AMD
[13:13:15] Core      : 
[13:13:15] Preparing to commence simulation
[13:13:15] - Assembly optimizations manually forced on.
[13:13:15] - Not checking prior termination.
[13:13:15] - Expanded 68450 -> 357580 (decompressed 522.3 percent)
[13:13:15] Called DecompressByteArray: compressed_data_size=68450 data_size=357580, decompressed_data_size=357580 diff=0
[13:13:15] - Digital signature verified
[13:13:15] 
[13:13:15] Project: 5731 (Run 0, Clone 0, Gen 40)
[13:13:15] 
[13:13:15] Assembly optimizations on if available.
[13:13:15] Entering M.D.
[13:13:21] Will resume from checkpoint file
[13:13:22] Working on Protein
[13:13:22] Client config found, loading data.
[13:13:22] Starting GUI Server
[13:13:31] Resuming from checkpoint
[13:13:31] Verified work/wudata_01.log
[13:13:31] Verified work/wudata_01.edr
[13:13:31] Verified work/wudata_01.xtc
[13:13:31] Completed 9%
[13:13:31] mdrun_gpu returned 
[13:13:31] Calculated & specified T inconsisitent
[13:13:31] 
[13:13:31] Folding@home Core Shutdown: UNSTABLE_MACHINE
[13:13:36] CoreStatus = 7A (122)
[13:13:36] Sending work to server
[13:13:36] Project: 5731 (Run 0, Clone 0, Gen 40)
[13:13:36] - Error: Could not get length of results file work/wuresults_01.dat
[13:13:36] - Error: Could not read unit 01 file. Removing from queue.
[13:13:36] Trying to send all finished work units
[13:13:36] + No unsent completed units remaining.
[13:13:36] - Preparing to get new work unit...
[13:13:36] + Attempting to get work packet
[13:13:36] - Will indicate memory of 3327 MB
[13:13:36] - Connecting to assignment server
[13:13:36] Connecting to http://assign-GPU.stanford.edu:8080/
[13:13:36] Posted data.
[13:13:36] Initial: 40AB; - Successful: assigned to (171.64.65.102).
[13:13:36] + News From Folding@Home: GPU folding beta
[13:13:37] Loaded queue successfully.
[13:13:37] Connecting to http://171.64.65.102:8080/
[13:13:37] Posted data.
[13:13:37] Initial: 0000; - Receiving payload (expected size: 99000)
[13:13:39] - Downloaded at ~48 kB/s
[13:13:39] - Averaged speed for that direction ~56 kB/s
[13:13:39] + Received work.
[13:13:39] Trying to send all finished work units
[13:13:39] + No unsent completed units remaining.
[13:13:39] + Closed connections
[13:13:44] 
[13:13:44] + Processing work unit
[13:13:44] Core required: FahCore_12.exe
[13:13:44] Core found.
[13:13:44] Working on queue slot 02 [December 8 13:13:44 UTC]
[13:13:44] + Working ...
[13:13:44] - Calling '.\FahCore_12.exe -dir work/ -suffix 02 -checkpoint 15 -service -forceasm -verbose -lifeline 864 -version 620'

[13:13:44] 
[13:13:44] *------------------------------*
[13:13:44] Folding@Home GPU Core - Beta
[13:13:44] Version 1.21 (Tue Nov 25 14:54:02 PST 2008)
[13:13:44] 
[13:13:44] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[13:13:44] Build host: amoeba
[13:13:44] Board Type: AMD
[13:13:44] Core      : 
[13:13:44] Preparing to commence simulation
[13:13:44] - Assembly optimizations manually forced on.
[13:13:44] - Not checking prior termination.
[13:13:44] - Expanded 98488 -> 492188 (decompressed 499.7 percent)
[13:13:44] Called DecompressByteArray: compressed_data_size=98488 data_size=492188, decompressed_data_size=492188 diff=0
[13:13:44] - Digital signature verified
[13:13:44] 
[13:13:44] Project: 5716 (Run 1, Clone 0, Gen 39)
[13:13:44] 
[13:13:44] Assembly optimizations on if available.
[13:13:44] Entering M.D.
[13:13:50] Will resume from checkpoint file
[13:13:50] Working on Protein
[13:13:51] Client config found, loading data.
[13:13:51] Starting GUI Server
[13:14:00] Resuming from checkpoint
[13:14:00] fcSaveRestoreState: I/O failed dir=0, var=00DF69B8, varsize=16704
[13:14:00] mdrun_gpu returned 
[13:14:00] Checkpoint failure
[13:14:00] 
[13:14:00] Folding@home Core Shutdown: UNSTABLE_MACHINE
[13:14:04] CoreStatus = 7A (122)
[13:14:04] Sending work to server
[13:14:04] Project: 5716 (Run 1, Clone 0, Gen 39)
[13:14:04] - Error: Could not get length of results file work/wuresults_02.dat
[13:14:04] - Error: Could not read unit 02 file. Removing from queue.
[13:14:04] Trying to send all finished work units
[13:14:04] + No unsent completed units remaining.
[13:14:04] - Preparing to get new work unit...
[13:14:04] + Attempting to get work packet
[13:14:04] - Will indicate memory of 3327 MB
[13:14:04] - Connecting to assignment server
[13:14:04] Connecting to http://assign-GPU.stanford.edu:8080/
[13:14:05] Posted data.
[13:14:05] Initial: 40AB; - Successful: assigned to (171.64.65.102).
[13:14:05] + News From Folding@Home: GPU folding beta
[13:14:05] Loaded queue successfully.
[13:14:05] Connecting to http://171.64.65.102:8080/
[13:14:06] Posted data.
[13:14:06] Initial: 0000; - Receiving payload (expected size: 99093)
[13:14:07] - Downloaded at ~96 kB/s
[13:14:07] - Averaged speed for that direction ~64 kB/s
[13:14:07] + Received work.
[13:14:07] Trying to send all finished work units
[13:14:07] + No unsent completed units remaining.
[13:14:07] + Closed connections
[13:14:12] 
[13:14:12] + Processing work unit
[13:14:12] Core required: FahCore_12.exe
[13:14:12] Core found.
[13:14:12] Working on queue slot 03 [December 8 13:14:12 UTC]
[13:14:12] + Working ...
[13:14:12] - Calling '.\FahCore_12.exe -dir work/ -suffix 03 -checkpoint 15 -service -forceasm -verbose -lifeline 864 -version 620'

[13:14:13] 
[13:14:13] *------------------------------*
[13:14:13] Folding@Home GPU Core - Beta
[13:14:13] Version 1.21 (Tue Nov 25 14:54:02 PST 2008)
[13:14:13] 
[13:14:13] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[13:14:13] Build host: amoeba
[13:14:13] Board Type: AMD
[13:14:13] Core      : 
[13:14:13] Preparing to commence simulation
[13:14:13] - Assembly optimizations manually forced on.
[13:14:13] - Not checking prior termination.
[13:14:13] - Expanded 98581 -> 492188 (decompressed 499.2 percent)
[13:14:13] Called DecompressByteArray: compressed_data_size=98581 data_size=492188, decompressed_data_size=492188 diff=0
[13:14:13] - Digital signature verified
[13:14:13] 
[13:14:13] Project: 5717 (Run 0, Clone 0, Gen 32)
[13:14:13] 
[13:14:13] Assembly optimizations on if available.
[13:14:13] Entering M.D.
[13:14:19] Will resume from checkpoint file
[13:14:19] Working on Protein
[13:14:20] Client config found, loading data.
[13:14:20] Starting GUI Server
[13:14:29] Resuming from checkpoint
[13:14:29] fcSaveRestoreState: I/O failed dir=0, var=00DF69B8, varsize=16704
[13:14:29] mdrun_gpu returned 
[13:14:29] Checkpoint failure
[13:14:29] 
[13:14:29] Folding@home Core Shutdown: UNSTABLE_MACHINE
[13:14:33] CoreStatus = 7A (122)
[13:14:33] Sending work to server
[13:14:33] Project: 5717 (Run 0, Clone 0, Gen 32)
[13:14:33] - Error: Could not get length of results file work/wuresults_03.dat
[13:14:33] - Error: Could not read unit 03 file. Removing from queue.
[13:14:33] Trying to send all finished work units
[13:14:33] + No unsent completed units remaining.
[13:14:33] - Preparing to get new work unit...
[13:14:33] + Attempting to get work packet
[13:14:33] - Will indicate memory of 3327 MB
[13:14:33] - Connecting to assignment server
[13:14:33] Connecting to http://assign-GPU.stanford.edu:8080/
[13:14:33] Posted data.
[13:14:33] Initial: 40AB; - Successful: assigned to (171.64.65.102).
[13:14:33] + News From Folding@Home: GPU folding beta
[13:14:34] Loaded queue successfully.
[13:14:34] Connecting to http://171.64.65.102:8080/
[13:14:34] Posted data.
[13:14:34] Initial: 0000; - Receiving payload (expected size: 99095)
[13:14:36] - Downloaded at ~48 kB/s
[13:14:36] - Averaged speed for that direction ~61 kB/s
[13:14:36] + Received work.
[13:14:36] Trying to send all finished work units
[13:14:36] + No unsent completed units remaining.
[13:14:36] + Closed connections
[13:14:41] 
[13:14:41] + Processing work unit
[13:14:41] Core required: FahCore_12.exe
[13:14:41] Core found.
[13:14:41] Working on queue slot 04 [December 8 13:14:41 UTC]
[13:14:41] + Working ...
[13:14:41] - Calling '.\FahCore_12.exe -dir work/ -suffix 04 -checkpoint 15 -service -forceasm -verbose -lifeline 864 -version 620'

[13:14:41] 
[13:14:41] *------------------------------*
[13:14:41] Folding@Home GPU Core - Beta
[13:14:41] Version 1.21 (Tue Nov 25 14:54:02 PST 2008)
[13:14:41] 
[13:14:41] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[13:14:41] Build host: amoeba
[13:14:41] Board Type: AMD
[13:14:41] Core      : 
[13:14:41] Preparing to commence simulation
[13:14:41] - Assembly optimizations manually forced on.
[13:14:41] - Not checking prior termination.
[13:14:41] - Expanded 98583 -> 492188 (decompressed 499.2 percent)
[13:14:41] Called DecompressByteArray: compressed_data_size=98583 data_size=492188, decompressed_data_size=492188 diff=0
[13:14:41] - Digital signature verified
[13:14:41] 
[13:14:41] Project: 5718 (Run 2, Clone 0, Gen 23)
[13:14:41] 
[13:14:41] Assembly optimizations on if available.
[13:14:41] Entering M.D.
[13:14:47] Will resume from checkpoint file
[13:14:48] Working on Protein
[13:14:48] Client config found, loading data.
[13:14:48] Starting GUI Server
[13:14:57] Resuming from checkpoint
[13:14:57] fcSaveRestoreState: I/O failed dir=0, var=00DF69B8, varsize=16704
[13:14:57] mdrun_gpu returned 
[13:14:57] Checkpoint failure
[13:14:57] 
[13:14:57] Folding@home Core Shutdown: UNSTABLE_MACHINE
[13:15:01] CoreStatus = 7A (122)
[13:15:01] Sending work to server
[13:15:01] Project: 5718 (Run 2, Clone 0, Gen 23)
[13:15:01] - Error: Could not get length of results file work/wuresults_04.dat
[13:15:01] - Error: Could not read unit 04 file. Removing from queue.
[13:15:01] Trying to send all finished work units
[13:15:01] + No unsent completed units remaining.
[13:15:01] - Preparing to get new work unit...
[13:15:01] + Attempting to get work packet
[13:15:01] - Will indicate memory of 3327 MB
[13:15:01] - Connecting to assignment server
[13:15:01] Connecting to http://assign-GPU.stanford.edu:8080/
[13:15:02] Posted data.
[13:15:02] Initial: 40AB; - Successful: assigned to (171.64.65.102).
[13:15:02] + News From Folding@Home: GPU folding beta
[13:15:02] Loaded queue successfully.
[13:15:02] Connecting to http://171.64.65.102:8080/
[13:15:03] Posted data.
[13:15:03] Initial: 0000; - Receiving payload (expected size: 99077)
[13:15:05] - Downloaded at ~48 kB/s
[13:15:05] - Averaged speed for that direction ~58 kB/s
[13:15:05] + Received work.
[13:15:05] Trying to send all finished work units
[13:15:05] + No unsent completed units remaining.
[13:15:05] + Closed connections
[13:15:10] 
[13:15:10] + Processing work unit
[13:15:10] Core required: FahCore_12.exe
[13:15:10] Core found.
[13:15:10] Working on queue slot 05 [December 8 13:15:10 UTC]
[13:15:10] + Working ...
[13:15:10] - Calling '.\FahCore_12.exe -dir work/ -suffix 05 -checkpoint 15 -service -forceasm -verbose -lifeline 864 -version 620'

[13:15:10] 
[13:15:10] *------------------------------*
[13:15:10] Folding@Home GPU Core - Beta
[13:15:10] Version 1.21 (Tue Nov 25 14:54:02 PST 2008)
[13:15:10] 
[13:15:10] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[13:15:10] Build host: amoeba
[13:15:10] Board Type: AMD
[13:15:10] Core      : 
[13:15:10] Preparing to commence simulation
[13:15:10] - Assembly optimizations manually forced on.
[13:15:10] - Not checking prior termination.
[13:15:10] - Expanded 98565 -> 492188 (decompressed 499.3 percent)
[13:15:10] Called DecompressByteArray: compressed_data_size=98565 data_size=492188, decompressed_data_size=492188 diff=0
[13:15:10] - Digital signature verified
[13:15:10] 
[13:15:10] Project: 5719 (Run 1, Clone 0, Gen 34)
[13:15:10] 
[13:15:10] Assembly optimizations on if available.
[13:15:10] Entering M.D.
[13:15:16] Will resume from checkpoint file
[13:15:16] Working on Protein
[13:15:17] Client config found, loading data.
[13:15:17] Starting GUI Server
[13:15:26] Resuming from checkpoint
[13:15:26] fcSaveRestoreState: I/O failed dir=0, var=00DF69B8, varsize=16704
[13:15:26] mdrun_gpu returned 
[13:15:26] Checkpoint failure
[13:15:26] 
[13:15:26] Folding@home Core Shutdown: UNSTABLE_MACHINE
[13:15:30] CoreStatus = 7A (122)
[13:15:30] Sending work to server
[13:15:30] Project: 5719 (Run 1, Clone 0, Gen 34)
[13:15:30] - Error: Could not get length of results file work/wuresults_05.dat
[13:15:30] - Error: Could not read unit 05 file. Removing from queue.
[13:15:30] EUE limit exceeded. Pausing 24 hours.
Xilikon
Posts: 155
Joined: Sun Dec 02, 2007 1:34 pm

Re: Project: 5722 (Run 2, Clone 0, Gen 17)

Post by Xilikon »

Using my "programmer logic", .ckp file = checkpoint marker ?
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Project: 5722 (Run 2, Clone 0, Gen 17)

Post by Ivoshiee »

Xilikon wrote:Using my "programmer logic", .ckp file = checkpoint marker ?
Maybe. I cannot be the only one having that issue. What will it take to fix it?
friedrim
Pande Group Member
Posts: 48
Joined: Wed Apr 02, 2008 5:25 pm

Re: Project: 5722 (Run 2, Clone 0, Gen 17)

Post by friedrim »

It appears the program is trying to read ckp files that are smaller than the expected size -- at least that is what the error messages/code and your log file postings suggest. Could you delete the ckp files and/or directory and see if the problem recurs? I am not sure why the ckp files are being truncated -- the code that writes a checkpoint file and restarts a run from the checkpoint file has not been modified in quite a while.

Let me know if the errors recur, and I will see if I can reproduce them here.
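
A cautious variant of the suggested cleanup is to move the suspect .ckp files into a backup folder instead of deleting them, so they stay available for inspection. The sketch below assumes the `wudata_NN.ckp` naming from the directory listing earlier in the thread and that the client is stopped first; `set_aside_checkpoints` and the backup path are hypothetical.

```python
import shutil
from pathlib import Path


def set_aside_checkpoints(work_dir, backup_dir, slots):
    """Move wudata_<slot>.ckp for the given slots into backup_dir.

    Returns the names of the files that were actually moved.
    """
    work = Path(work_dir)
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    moved = []
    for slot in slots:
        ckp = work / f"wudata_{slot}.ckp"
        if ckp.is_file():
            shutil.move(str(ckp), str(backup / ckp.name))
            moved.append(ckp.name)
    return moved


if __name__ == "__main__":
    # Example slots and paths only; stop the client before running this.
    print(set_aside_checkpoints(r"C:\folding\work", r"C:\folding\ckp_backup", ["07", "09"]))
```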
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Project: 5722 (Run 2, Clone 0, Gen 17)

Post by Ivoshiee »

friedrim wrote:It appears the program is trying to read ckp files that are smaller than the expected size -- at least that is what the error messages/code and your log file postings suggest. Could you delete the ckp files and/or directory and see if the problem recurs? I am not sure why the ckp files are being truncated -- the code that writes a checkpoint file and restarts a run from the checkpoint file has not been modified in quite a while.

Let me know if the errors recur, and I will see if I can reproduce them here.
I deleted all invalid .ckp files and let the client run (on slot 08). Then I conducted an experiment: while slot 08 was still running, I copied wudata_08.ckp to wudata_09.ckp. After slot 08 finished, a new WU was loaded into slot 09, but again it was dumped on sight (with slightly different messages this time).

Code: Select all

[07:40:04] Completed 98%
[07:43:42] Completed 99%
[07:47:22] Completed 100%
[07:47:22] Successful run
[07:47:22] DynamicWrapper: Finished Work Unit: sleep=10000
[07:47:32] Reserved 219164 bytes for xtc file; Cosm status=0
[07:47:32] Allocated 219164 bytes for xtc file
[07:47:32] - Reading up to 219164 from "work/wudata_08.xtc": Read 219164
[07:47:32] Read 219164 bytes from xtc file; available packet space=786211300
[07:47:32] xtc file hash check passed.
[07:47:32] Reserved 33528 33528 786211300 bytes for arc file=<work/wudata_08.trr> Cosm status=0
[07:47:32] Allocated 33528 bytes for arc file
[07:47:32] - Reading up to 33528 from "work/wudata_08.trr": Read 33528
[07:47:32] Read 33528 bytes from arc file; available packet space=786177772
[07:47:32] trr file hash check passed.
[07:47:32] Allocated 560 bytes for edr file
[07:47:32] Read bedfile
[07:47:32] edr file hash check passed.
[07:47:32] Allocated 54812 bytes for logfile
[07:47:32] Read logfile
[07:47:32] GuardedRun: success in DynamicWrapper
[07:47:32] GuardedRun: done
[07:47:32] Run: GuardedRun completed.
[07:47:37] - Writing 308576 bytes of core data to disk...
[07:47:37] Done: 308064 -> 264440 (compressed to 85.8 percent)
[07:47:37]   ... Done.
[07:47:37] - Shutting down core 
[07:47:37] 
[07:47:37] Folding@home Core Shutdown: FINISHED_UNIT
[07:47:41] CoreStatus = 64 (100)
[07:47:41] Unit 8 finished with 81 percent of time to deadline remaining.
[07:47:41] Updated performance fraction: 0.883982
[07:47:41] Sending work to server
[07:47:41] Project: 5716 (Run 2, Clone 0, Gen 21)


[07:47:41] + Attempting to send results [December 9 07:47:41 UTC]
[07:47:41] - Reading file work/wuresults_08.dat from core
[07:47:41]   (Read 264952 bytes from disk)
[07:47:41] Connecting to http://171.64.65.102:8080/
[07:47:46] Posted data.
[07:47:46] Initial: 0000; - Uploaded at ~43 kB/s
[07:47:47] - Averaged speed for that direction ~51 kB/s
[07:47:47] + Results successfully sent
[07:47:47] Thank you for your contribution to Folding@Home.
[07:47:47] + Number of Units Completed: 331

[07:47:51] Trying to send all finished work units
[07:47:51] + No unsent completed units remaining.
[07:47:51] - Preparing to get new work unit...
[07:47:51] + Attempting to get work packet
[07:47:51] - Will indicate memory of 3327 MB
[07:47:51] - Detect CPU. Vendor: AuthenticAMD, Family: 15, Model: 2, Stepping: 2
[07:47:51] - Connecting to assignment server
[07:47:51] Connecting to http://assign-GPU.stanford.edu:8080/
[07:47:52] Posted data.
[07:47:52] Initial: 40AB; - Successful: assigned to (171.64.65.102).
[07:47:52] + News From Folding@Home: GPU folding beta
[07:47:52] Loaded queue successfully.
[07:47:52] Connecting to http://171.64.65.102:8080/
[07:47:53] Posted data.
[07:47:53] Initial: 0000; - Receiving payload (expected size: 97069)
[07:47:55] - Downloaded at ~47 kB/s
[07:47:55] - Averaged speed for that direction ~47 kB/s
[07:47:55] + Received work.
[07:47:55] Trying to send all finished work units
[07:47:55] + No unsent completed units remaining.
[07:47:55] + Closed connections
[07:47:55] 
[07:47:55] + Processing work unit
[07:47:55] Core required: FahCore_12.exe
[07:47:55] Core found.
[07:47:55] Working on queue slot 09 [December 9 07:47:55 UTC]
[07:47:55] + Working ...
[07:47:55] - Calling '.\FahCore_12.exe -dir work/ -suffix 09 -checkpoint 15 -service -forceasm -verbose -lifeline 956 -version 620'

[07:47:55] 
[07:47:55] *------------------------------*
[07:47:55] Folding@Home GPU Core - Beta
[07:47:55] Version 1.21 (Tue Nov 25 14:54:02 PST 2008)
[07:47:55] 
[07:47:55] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[07:47:55] Build host: amoeba
[07:47:55] Board Type: AMD
[07:47:55] Core      : 
[07:47:55] Preparing to commence simulation
[07:47:55] - Assembly optimizations manually forced on.
[07:47:55] - Not checking prior termination.
[07:47:55] - Expanded 96557 -> 489152 (decompressed 506.5 percent)
[07:47:55] Called DecompressByteArray: compressed_data_size=96557 data_size=489152, decompressed_data_size=489152 diff=0
[07:47:55] - Digital signature verified
[07:47:55] 
[07:47:55] Project: 5720 (Run 3, Clone 0, Gen 23)
[07:47:55] 
[07:47:55] Assembly optimizations on if available.
[07:47:55] Entering M.D.
[07:48:01] Will resume from checkpoint file
[07:48:02] Working on Protein
[07:48:02] Client config found, loading data.
[07:48:02] Starting GUI Server
[07:48:12] Resuming from checkpoint
[07:48:12] File work/wudata_08.log was open at checkpoint,but is now closedmdrun_gpu returned 
[07:48:12] Checkpoint failure
[07:48:12] 
[07:48:12] Folding@home Core Shutdown: UNSTABLE_MACHINE
[07:48:15] CoreStatus = 7A (122)
[07:48:15] Sending work to server
[07:48:15] Project: 5720 (Run 3, Clone 0, Gen 23)
[07:48:15] - Error: Could not get length of results file work/wuresults_09.dat
[07:48:15] - Error: Could not read unit 09 file. Removing from queue.
[07:48:15] Trying to send all finished work units
[07:48:15] + No unsent completed units remaining.
[07:48:15] - Preparing to get new work unit...
[07:48:15] + Attempting to get work packet
[07:48:15] - Will indicate memory of 3327 MB
[07:48:15] - Connecting to assignment server
[07:48:15] Connecting to http://assign-GPU.stanford.edu:8080/
[07:48:16] Posted data.
[07:48:16] Initial: 40AB; - Successful: assigned to (171.64.65.102).
[07:48:16] + News From Folding@Home: GPU folding beta
[07:48:16] Loaded queue successfully.
[07:48:16] Connecting to http://171.64.65.102:8080/
[07:48:17] Posted data.
[07:48:17] Initial: 0000; - Receiving payload (expected size: 97062)
[07:48:19] - Downloaded at ~47 kB/s
[07:48:19] - Averaged speed for that direction ~47 kB/s
[07:48:19] + Received work.
[07:48:19] Trying to send all finished work units
[07:48:19] + No unsent completed units remaining.
[07:48:19] + Closed connections
[07:48:24] 
[07:48:24] + Processing work unit
[07:48:24] Core required: FahCore_12.exe
[07:48:24] Core found.
[07:48:24] Working on queue slot 00 [December 9 07:48:24 UTC]
[07:48:24] + Working ...
[07:48:24] - Calling '.\FahCore_12.exe -dir work/ -suffix 00 -checkpoint 15 -service -forceasm -verbose -lifeline 956 -version 620'

[07:48:24] 
[07:48:24] *------------------------------*
[07:48:24] Folding@Home GPU Core - Beta
[07:48:24] Version 1.21 (Tue Nov 25 14:54:02 PST 2008)
[07:48:24] 
[07:48:24] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
[07:48:24] Build host: amoeba
[07:48:24] Board Type: AMD
[07:48:24] Core      : 
[07:48:24] Preparing to commence simulation
[07:48:24] - Assembly optimizations manually forced on.
[07:48:24] - Not checking prior termination.
[07:48:24] - Expanded 96550 -> 489152 (decompressed 506.6 percent)
[07:48:24] Called DecompressByteArray: compressed_data_size=96550 data_size=489152, decompressed_data_size=489152 diff=0
[07:48:24] - Digital signature verified
[07:48:24] 
[07:48:24] Project: 5721 (Run 0, Clone 0, Gen 36)
[07:48:24] 
[07:48:24] Assembly optimizations on if available.
[07:48:24] Entering M.D.
[07:48:30] Working on Protein
[07:48:31] Client config found, loading data.
[07:48:31] Starting GUI Server
[07:52:18] Completed 1%
[07:55:57] Completed 2%
[07:59:35] Completed 3%
[08:03:14] Completed 4%
[08:06:53] Completed 5%
[08:10:32] Completed 6%
To conclude:
It is not the size of a .ckp file that causes the issue, but its mere presence. Whatever the reason it was left behind, it renders that WU slot useless. Why does the FAH client not remove the file(s) before starting a new WU?
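
The cleanup suggested above could look roughly like the sketch below. It is only an illustration of the idea, not how the FAH client actually behaves: the function name `clear_stale_checkpoint` is hypothetical, and the only detail taken from the logs is the `work/wudata_NN.ckp` naming with a two-digit slot number.

```python
from pathlib import Path


def clear_stale_checkpoint(work_dir, slot):
    """Remove a leftover checkpoint from a slot that is about to get a NEW work unit.

    Assumption: checkpoints are named wudata_NN.ckp inside the work directory,
    as seen in the logs. Returns True if a file was removed, False otherwise.
    """
    ckp = Path(work_dir) / f"wudata_{slot:02d}.ckp"
    try:
        ckp.unlink()
        return True
    except FileNotFoundError:
        # Nothing was left behind in this slot.
        return False
```

This would only be safe to call when the slot has just received a freshly downloaded WU. Calling it when restarting an interrupted WU would throw away the progress the checkpoint is meant to preserve.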
friedrim
Pande Group Member
Posts: 48
Joined: Wed Apr 02, 2008 5:25 pm

Re: Project: 5722 (Run 2, Clone 0, Gen 17)

Post by friedrim »

I was unable to reproduce the problem here -- the logic associated w/ the ckp files works as desired.

Background:
The ckp files are not removed unless the core terminates successfully. The idea is that if the core dies
prematurely, the next run can start near the same point where the previous run was before it died by reading
the state of the simulation from the ckp file. Hence, it would not make sense to delete the ckp file
unless the job successfully completed.
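
A minimal sketch of the lifecycle described above, in Python rather than the core's own code. Everything here is an assumption for illustration: the JSON format, the function names, and the `fail_at` parameter used to simulate a crash. The write-to-temp-then-rename step is an addition of this sketch (one common way to avoid truncated files), not something the core is known to do.

```python
import json
import os
from pathlib import Path


def save_checkpoint(path, state):
    """Write to a temporary file first, then rename it over the real one,
    so a crash in the middle of a write cannot leave a truncated checkpoint."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run_work_unit(ckp_path, total_steps, fail_at=None):
    """Resume from the checkpoint if one exists, checkpoint after every step,
    and delete the checkpoint only when the run completes."""
    ckp_path = Path(ckp_path)
    if ckp_path.exists():
        with open(ckp_path) as f:
            step = json.load(f)["step"]  # continue where the last run died
    else:
        step = 0
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated crash")  # checkpoint stays on disk
        step += 1
        save_checkpoint(ckp_path, {"step": step})
    ckp_path.unlink(missing_ok=True)  # reached only on successful completion
    return step
```

In this sketch a run that dies leaves its checkpoint behind on purpose, which is the behaviour the background note describes.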

Your experiment is invalid: the core has no way of knowing that the 08 ckp file copied to the 09 slot is invalid;
for example, if the runs are for different proteins, the size of the arrays, ... will be incorrect.
When files are moved/renamed like this, in general bad things will happen.
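
One way a checkpoint reader could tell that a file belongs to a different WU is to store an identifier alongside the state and compare it on load. The sketch below is hypothetical and does not reflect the real .ckp format: it assumes a JSON file and uses the (project, run, clone, gen) tuple from the client log as the identifier.

```python
import json
from pathlib import Path


def write_checkpoint(path, wu_id, state):
    """Store the work unit identity together with the simulation state."""
    Path(path).write_text(json.dumps({"wu_id": list(wu_id), "state": state}))


def load_checkpoint_for(path, wu_id):
    """Return the saved state only if the checkpoint belongs to wu_id.

    Returns None when the file is missing, unreadable (for example truncated),
    or was written for a different work unit.
    """
    p = Path(path)
    if not p.exists():
        return None
    try:
        data = json.loads(p.read_text())
    except ValueError:
        return None  # truncated or corrupt file
    if not isinstance(data, dict) or data.get("wu_id") != list(wu_id):
        return None  # checkpoint left behind by some other work unit
    return data.get("state")
```

With a check like this, a mismatched or damaged file would be treated as "no checkpoint" and the run would start from 0% instead of failing.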

For the 'non-experimental' runs:

Are the runs w/ ckp files present running to completion and then for some reason the ckp files are not getting deleted?
If the jobs are terminating prematurely, any idea why the ckp files are getting corrupted?
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Project: 5722 (Run 2, Clone 0, Gen 17)

Post by Ivoshiee »

friedrim wrote:I was unable to reproduce the problem here -- the logic associated w/ the ckp files works as desired.

Background:
The ckp files are not removed unless the core terminates successfully. The idea is that if the core dies
prematurely, the next run can start near the same point where the previous run was before it died by reading
the state of the simulation from the ckp file. Hence, it would not make sense to delete the ckp file
unless the job successfully completed.
But there is nothing to resume from a checkpoint if the FAH client has just started a new WU - a .ckp file left behind by some other WU has nothing to do with the current one, so why not delete it before starting a new WU?

Your experiment is invalid: the core has no way of knowing that the 08 ckp file copied to the 09 slot is invalid;
for example, if the runs are for different proteins, the size of the arrays, ... will be incorrect.
When files are moved/renamed like this, in general bad things will happen.
I would argue with that. I would say that the .ckp logic is a bit invalid here: we would not have this error if the .ckp file were removed before a new WU is loaded into that particular WU slot - regardless of its content, this file is ALWAYS invalid for a new WU. In essence my experiment is the same thing I have with the FAH client - there are .ckp files in 7 out of 10 WU slots. Why are those files there, and are they corrupt? I have no way of telling, but their mere presence dumps 7 WUs out of 10 for me. If that is not an issue, then what would make it one?
For the 'non-experimental' runs:

Are the runs w/ ckp files present running to completion and then for some reason the ckp files are not getting deleted?
I do not know. If you look at the .ckp file dates, they are usually a week or so apart. I tend to run the system during the day, and it should complete 1 or even 2 WUs per day for me. It is safe to say that 1 out of 7 WUs will cause the issue.
If the jobs are terminating prematurely, any idea why the ckp files are getting corrupted?
I tried rebooting, but a couple of those didn't do the trick. Maybe killing the FAH cores at the "proper moment" will do it, but I doubt that I am lucky enough.
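
For anyone who wants to check their own slots, the leftover .ckp files and their dates can be listed with a few lines of Python. This is just a helper sketch: `list_checkpoints` is a made-up name, and it assumes the `wudata_NN.ckp` naming seen in the logs.

```python
from datetime import datetime, timezone
from pathlib import Path


def list_checkpoints(work_dir):
    """Return (file name, size in bytes, last-modified time in UTC) for each
    wudata_*.ckp file in work_dir, sorted by name."""
    rows = []
    for p in sorted(Path(work_dir).glob("wudata_*.ckp")):
        st = p.stat()
        modified = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
        rows.append((p.name, st.st_size, modified))
    return rows
```

Running `for row in list_checkpoints("work"): print(*row)` from the client directory prints one line per leftover checkpoint, which makes it easy to compare the dates against the times WUs were dumped in the log.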