Timeout/Expiration limits and specific work units.
Posted: Tue Feb 04, 2014 12:59 am
As mentioned in another thread, I recently had to dump a series of work units recently because not only couldn't I complete within the timeout limit, they wouldn't make it before hitting the expiration limit either.
The box is a quad-core, it's available for ~ 8.5 hours per day, and a lot of the time I can let fahclient have all of the CPU's, but sometimes (like at the moment) I'm playing with VM's and need to drop it back to two or three cores otherwise the box is unusable. With this particular WU, it was assigned on a Thursday, spent most of Friday processing... and then I realized that it was set to timeout on the weekend and expire on the Tuesday, and at the rate it had progressed it needed around 34 - 36 hours of processing in total (two cores). Only a few days before I had to dump a WU because on the full four cores it would have required ~28 hours to complete, but on two cores it was heading towards ~100 hours of processing.
Usually, expiration and timeout values are more than sufficient to complete the work, but on occasions they just seem completely unrealistic, and recently it has been more than 'occasional'. Possibly this is because the performance on two cores isn't 50% of four cores, it's more like 30% (based on some back-of-envelope calculations).
Log files below :
And here's the previous one I dumped after realizing that it just wasn't going to complete in time (there was a long weekend in the middle etc. etc.).
Mod Edit: Added Code Tags - PantherX
The box is a quad-core, it's available for ~ 8.5 hours per day, and a lot of the time I can let fahclient have all of the CPU's, but sometimes (like at the moment) I'm playing with VM's and need to drop it back to two or three cores otherwise the box is unusable. With this particular WU, it was assigned on a Thursday, spent most of Friday processing... and then I realized that it was set to timeout on the weekend and expire on the Tuesday, and at the rate it had progressed it needed around 34 - 36 hours of processing in total (two cores). Only a few days before I had to dump a WU because on the full four cores it would have required ~28 hours to complete, but on two cores it was heading towards ~100 hours of processing.
Usually, expiration and timeout values are more than sufficient to complete the work, but on occasions they just seem completely unrealistic, and recently it has been more than 'occasional'. Possibly this is because the performance on two cores isn't 50% of four cores, it's more like 30% (based on some back-of-envelope calculations).
Log files below :
Code: Select all
*********************** Log Started 2014-01-30T22:17:58Z ***********************
22:17:58:************************* Folding@home Client *************************
22:17:58: Website: http://folding.stanford.edu/
22:17:58: Copyright: (c) 2009-2013 Stanford University
22:17:58: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
22:17:58: Args:
22:17:58: Config: C:/LocalFiles/FAHClient/config.xml
22:17:58:******************************** Build ********************************
22:17:58: Version: 7.3.6
22:17:58: Date: Feb 18 2013
22:17:58: Time: 15:25:17
22:17:58: SVN Rev: 3923
22:17:58: Branch: fah/trunk/client
22:17:58: Compiler: Intel(R) C++ MSVC 1500 mode 1200
22:17:58: Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
22:17:58: /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
22:17:58: Platform: win32 XP
22:17:58: Bits: 32
22:17:58: Mode: Release
22:17:58:******************************* System ********************************
22:17:58: CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
22:17:58: CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
22:17:58: CPUs: 4
22:17:58: Memory: 3.95GiB
22:17:58: Free Memory: 1.76GiB
22:17:58: Threads: WINDOWS_THREADS
22:17:58: Has Battery: false
22:17:58: On Battery: false
22:17:58: UTC offset: 11
22:17:58: PID: 6280
22:17:58: CWD: C:/LocalFiles/FAHClient
22:17:58: OS: Windows 7 Enterprise
22:17:58: OS Arch: AMD64
22:17:58: GPUs: 2
22:17:58: GPU 0: ATI:5 Caicos [Radeon HD 7000 Series]
22:17:58: GPU 1: ATI:4 Cedar PRO [Radeon HD 5450]
22:17:58: CUDA: Not detected
22:17:58:Win32 Service: false
22:17:58:***********************************************************************
22:17:58:<config>
22:17:58: <!-- Folding Core -->
22:17:58: <checkpoint v='5'/>
22:17:58: <cpu-affinity v='true'/>
22:17:58: <cpu-usage v='75'/>
22:17:58:
22:17:58: <!-- Folding Slot Configuration -->
22:17:58: <cause v='CANCER'/>
22:17:58:
22:17:58: <!-- Network -->
xxxxxxxxxxx
22:17:58:
22:17:58: <!-- User Information -->
22:17:58: <user v='Twizzle'/>
22:17:58:
22:17:58: <!-- Folding Slots -->
22:17:58: <slot id='0' type='CPU'>
22:17:58: <cpus v='-1'/>
22:17:58: </slot>
22:17:58:</config>
22:17:58:Trying to access database...
22:17:58:Successfully acquired database lock
22:17:58:Enabled folding slot 00: READY cpu:3
22:17:58:WU01:FS00:Starting
22:17:58:WARNING:WU01:FS00:Changed SMP threads from 2 to 3 this can cause some work units to fail
22:17:58:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/LocalFiles/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a3.fah/FahCore_a3.exe -dir 01 -suffix 01 -version 703 -lifeline 6280 -checkpoint 5 -cpu 75 -np 3
22:17:58:WU01:FS00:Started FahCore on PID 5252
22:17:58:WU01:FS00:Core PID:6836
22:17:58:WU01:FS00:FahCore 0xa3 started
22:17:59:WU01:FS00:0xa3:
22:17:59:WU01:FS00:0xa3:*------------------------------*
22:17:59:WU01:FS00:0xa3:Folding@Home Gromacs SMP Core
22:17:59:WU01:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
22:17:59:WU01:FS00:0xa3:
22:17:59:WU01:FS00:0xa3:Preparing to commence simulation
22:17:59:WU01:FS00:0xa3:- Looking at optimizations...
22:17:59:WU01:FS00:0xa3:- Files status OK
22:17:59:WU01:FS00:0xa3:- Expanded 1766091 -> 2700840 (decompressed 152.9 percent)
22:17:59:WU01:FS00:0xa3:Called DecompressByteArray: compressed_data_size=1766091 data_size=2700840, decompressed_data_size=2700840 diff=0
22:17:59:WU01:FS00:0xa3:- Digital signature verified
22:17:59:WU01:FS00:0xa3:
22:17:59:WU01:FS00:0xa3:Project: 7504 (Run 5, Clone 34, Gen 64)
22:17:59:WU01:FS00:0xa3:
22:17:59:WU01:FS00:0xa3:Assembly optimizations on if available.
22:17:59:WU01:FS00:0xa3:Entering M.D.
22:18:05:WU01:FS00:0xa3:Using Gromacs checkpoints
22:18:06:WU01:FS00:0xa3:Mapping NT from 3 to 3
22:18:06:WU01:FS00:0xa3:Resuming from checkpoint
22:18:07:WU01:FS00:0xa3:Verified 01/wudata_01.log
22:18:07:WU01:FS00:0xa3:Verified 01/wudata_01.trr
22:18:07:WU01:FS00:0xa3:Verified 01/wudata_01.xtc
22:18:07:WU01:FS00:0xa3:Verified 01/wudata_01.edr
22:18:07:WU01:FS00:0xa3:Completed 29130 out of 500000 steps (5%)
22:19:45:WU01:FS00:0xa3:Completed 30000 out of 500000 steps (6%)
22:28:23:WU01:FS00:0xa3:Completed 35000 out of 500000 steps (7%)
22:35:34:FS00:Shutting core down
22:35:35:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
22:35:35:WU01:FS00:Starting
22:35:35:WARNING:WU01:FS00:Changed SMP threads from 3 to 4 this can cause some work units to fail
22:35:35:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/LocalFiles/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a3.fah/FahCore_a3.exe -dir 01 -suffix 01 -version 703 -lifeline 6280 -checkpoint 5 -cpu 75 -np 4
22:35:35:WU01:FS00:Started FahCore on PID 6520
22:35:35:WU01:FS00:Core PID:5616
22:35:35:WU01:FS00:FahCore 0xa3 started
22:35:36:WU01:FS00:0xa3:
22:35:36:WU01:FS00:0xa3:*------------------------------*
22:35:36:WU01:FS00:0xa3:Folding@Home Gromacs SMP Core
22:35:36:WU01:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
22:35:36:WU01:FS00:0xa3:
22:35:36:WU01:FS00:0xa3:Preparing to commence simulation
22:35:36:WU01:FS00:0xa3:- Looking at optimizations...
22:35:36:WU01:FS00:0xa3:- Files status OK
22:35:36:WU01:FS00:0xa3:- Expanded 1766091 -> 2700840 (decompressed 152.9 percent)
22:35:36:WU01:FS00:0xa3:Called DecompressByteArray: compressed_data_size=1766091 data_size=2700840, decompressed_data_size=2700840 diff=0
22:35:36:WU01:FS00:0xa3:- Digital signature verified
22:35:36:WU01:FS00:0xa3:
22:35:36:WU01:FS00:0xa3:Project: 7504 (Run 5, Clone 34, Gen 64)
22:35:36:WU01:FS00:0xa3:
22:35:36:WU01:FS00:0xa3:Assembly optimizations on if available.
22:35:36:WU01:FS00:0xa3:Entering M.D.
22:35:42:WU01:FS00:0xa3:Using Gromacs checkpoints
22:35:42:WU01:FS00:0xa3:Mapping NT from 4 to 4
22:35:44:WU01:FS00:0xa3:Resuming from checkpoint
22:35:44:WU01:FS00:0xa3:Verified 01/wudata_01.log
22:35:44:WU01:FS00:0xa3:Verified 01/wudata_01.trr
22:35:44:WU01:FS00:0xa3:Verified 01/wudata_01.xtc
22:35:44:WU01:FS00:0xa3:Verified 01/wudata_01.edr
22:35:44:WU01:FS00:0xa3:Completed 37770 out of 500000 steps (7%)
22:38:49:WU01:FS00:0xa3:Completed 40000 out of 500000 steps (8%)
22:45:27:WU01:FS00:0xa3:Completed 45000 out of 500000 steps (9%)
22:52:06:WU01:FS00:0xa3:Completed 50000 out of 500000 steps (10%)
22:58:45:WU01:FS00:0xa3:Completed 55000 out of 500000 steps (11%)
23:05:22:WU01:FS00:0xa3:Completed 60000 out of 500000 steps (12%)
23:13:16:WU01:FS00:0xa3:Completed 65000 out of 500000 steps (13%)
23:21:25:WU01:FS00:0xa3:Completed 70000 out of 500000 steps (14%)
23:28:28:WU01:FS00:0xa3:Completed 75000 out of 500000 steps (15%)
23:35:23:WU01:FS00:0xa3:Completed 80000 out of 500000 steps (16%)
23:42:12:WU01:FS00:0xa3:Completed 85000 out of 500000 steps (17%)
23:49:21:WU01:FS00:0xa3:Completed 90000 out of 500000 steps (18%)
23:55:56:WU01:FS00:0xa3:Completed 95000 out of 500000 steps (19%)
00:02:55:WU01:FS00:0xa3:Completed 100000 out of 500000 steps (20%)
00:09:39:WU01:FS00:0xa3:Completed 105000 out of 500000 steps (21%)
00:17:03:WU01:FS00:0xa3:Completed 110000 out of 500000 steps (22%)
00:19:51:FS00:Shutting core down
00:19:54:WU01:FS00:0xa3:Client no longer detected. Shutting down core
00:19:54:WU01:FS00:0xa3:
00:19:54:WU01:FS00:0xa3:Folding@home Core Shutdown: CLIENT_DIED
00:19:55:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
00:19:55:WU01:FS00:Starting
00:19:55:WARNING:WU01:FS00:Changed SMP threads from 4 to 2 this can cause some work units to fail
00:19:55:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/LocalFiles/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/Core_a3.fah/FahCore_a3.exe -dir 01 -suffix 01 -version 703 -lifeline 6280 -checkpoint 5 -cpu 75 -np 2
00:19:55:WU01:FS00:Started FahCore on PID 7748
00:19:55:WU01:FS00:Core PID:668
00:19:55:WU01:FS00:FahCore 0xa3 started
00:19:56:WU01:FS00:0xa3:
00:19:56:WU01:FS00:0xa3:*------------------------------*
00:19:56:WU01:FS00:0xa3:Folding@Home Gromacs SMP Core
00:19:56:WU01:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
00:19:56:WU01:FS00:0xa3:
00:19:56:WU01:FS00:0xa3:Preparing to commence simulation
00:19:56:WU01:FS00:0xa3:- Looking at optimizations...
00:19:56:WU01:FS00:0xa3:- Files status OK
00:19:56:WU01:FS00:0xa3:- Expanded 1766091 -> 2700840 (decompressed 152.9 percent)
00:19:56:WU01:FS00:0xa3:Called DecompressByteArray: compressed_data_size=1766091 data_size=2700840, decompressed_data_size=2700840 diff=0
00:19:56:WU01:FS00:0xa3:- Digital signature verified
00:19:56:WU01:FS00:0xa3:
00:19:56:WU01:FS00:0xa3:Project: 7504 (Run 5, Clone 34, Gen 64)
00:19:56:WU01:FS00:0xa3:
00:19:56:WU01:FS00:0xa3:Assembly optimizations on if available.
00:19:56:WU01:FS00:0xa3:Entering M.D.
00:20:02:WU01:FS00:0xa3:Using Gromacs checkpoints
00:20:02:WU01:FS00:0xa3:Mapping NT from 2 to 2
00:20:03:WU01:FS00:0xa3:Resuming from checkpoint
00:20:03:WU01:FS00:0xa3:Verified 01/wudata_01.log
00:20:03:WU01:FS00:0xa3:Verified 01/wudata_01.trr
00:20:03:WU01:FS00:0xa3:Verified 01/wudata_01.xtc
00:20:03:WU01:FS00:0xa3:Verified 01/wudata_01.edr
00:20:03:WU01:FS00:0xa3:Completed 109200 out of 500000 steps (21%)
00:22:46:WU01:FS00:0xa3:Completed 110000 out of 500000 steps (22%)
00:44:48:WU01:FS00:0xa3:Completed 115000 out of 500000 steps (23%)
01:07:12:WU01:FS00:0xa3:Completed 120000 out of 500000 steps (24%)
01:25:02:WU01:FS00:0xa3:Completed 125000 out of 500000 steps (25%)
01:39:06:WU01:FS00:0xa3:Completed 130000 out of 500000 steps (26%)
01:55:34:WU01:FS00:0xa3:Completed 135000 out of 500000 steps (27%)
02:10:59:WU01:FS00:0xa3:Completed 140000 out of 500000 steps (28%)
02:32:36:WU01:FS00:0xa3:Completed 145000 out of 500000 steps (29%)
02:57:30:WU01:FS00:0xa3:Completed 150000 out of 500000 steps (30%)
03:20:38:WU01:FS00:0xa3:Completed 155000 out of 500000 steps (31%)
03:47:18:WU01:FS00:0xa3:Completed 160000 out of 500000 steps (32%)
04:10:45:WU01:FS00:0xa3:Completed 165000 out of 500000 steps (33%)
******************************* Date: 2014-01-31 *******************************
04:32:39:WU01:FS00:0xa3:Completed 170000 out of 500000 steps (34%)
04:57:34:WU01:FS00:0xa3:Completed 175000 out of 500000 steps (35%)
05:25:57:WU01:FS00:0xa3:Completed 180000 out of 500000 steps (36%)
05:54:35:WU01:FS00:0xa3:Completed 185000 out of 500000 steps (37%)
05:59:35:Removing old file 'configs/config-20130830-053052.xml'
05:59:35:Saving configuration to config.xml
05:59:35:<config>
05:59:35: <!-- Folding Core -->
05:59:35: <checkpoint v='5'/>
05:59:35: <cpu-affinity v='true'/>
05:59:35: <cpu-usage v='75'/>
05:59:35:
05:59:35: <!-- Folding Slot Configuration -->
05:59:35: <cause v='CANCER'/>
05:59:35: <power v='LIGHT'/>
05:59:35:
05:59:35: <!-- Network -->
05:59:35: <proxy v='172.25.105.80:8080'/>
05:59:35: <proxy-enable v='true'/>
05:59:35: <proxy-pass v='*******'/>
05:59:35: <proxy-user v='exitm2'/>
05:59:35:
05:59:35: <!-- User Information -->
05:59:35: <user v='Twizzle'/>
05:59:35:
05:59:35: <!-- Folding Slots -->
05:59:35:</config>
05:59:35:FS00:Shutting core down
05:59:39:WU01:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
05:59:39:WARNING:WU01:Slot ID 0 no longer exists and there are no other matching slots, dumping
05:59:39:WU01:Sending unit results: id:01 state:SEND error:DUMPED project:7504 run:5 clone:34 gen:64 core:0xa3 unit:0x0000004ffbcb017d4e39b921cbbceeb5
05:59:39:WU01:Connecting to 172.25.105.80:8080
05:59:39:WARNING:WU01:Exception: Failed to send results to work server: Transfer failed
05:59:39:WU01:Trying to send results to collection server
05:59:39:WU01:Connecting to 172.25.105.80:8080
05:59:40:WU01:Server responded WORK_ACK (400)
05:59:40:WU01:Cleaning up
And here's the previous one I dumped after realizing that it just wasn't going to complete in time (there was a long weekend in the middle etc. etc.).
Code: Select all
22:48:43:WU01:FS02:0xa3:*------------------------------*
22:48:43:WU01:FS02:0xa3:Folding@Home Gromacs SMP Core
22:48:43:WU01:FS02:0xa3:Version 2.27 (Dec. 15, 2010)
22:48:43:WU01:FS02:0xa3:
22:48:43:WU01:FS02:0xa3:Preparing to commence simulation
22:48:43:WU01:FS02:0xa3:- Looking at optimizations...
22:48:43:WU01:FS02:0xa3:- Files status OK
22:48:43:WU01:FS02:0xa3:- Expanded 3849950 -> 4393460 (decompressed 114.1 percent)
22:48:43:WU01:FS02:0xa3:Called DecompressByteArray: compressed_data_size=3849950 data_size=4393460, decompressed_data_size=4393460 diff=0
22:48:44:WU01:FS02:0xa3:- Digital signature verified
22:48:44:WU01:FS02:0xa3:
22:48:44:WU01:FS02:0xa3:Project: 8573 (Run 0, Clone 5, Gen 383)
22:48:44:WU01:FS02:0xa3:
22:48:44:WU01:FS02:0xa3:Assembly optimizations on if available.
22:48:44:WU01:FS02:0xa3:Entering M.D.
22:48:50:WU01:FS02:0xa3:Using Gromacs checkpoints
22:48:50:WU01:FS02:0xa3:Mapping NT from 4 to 4
22:48:51:WU01:FS02:0xa3:Resuming from checkpoint
22:48:51:WU01:FS02:0xa3:Verified 01/wudata_01.log
22:48:52:WU01:FS02:0xa3:Verified 01/wudata_01.trr
22:48:52:WU01:FS02:0xa3:Verified 01/wudata_01.edr
22:48:52:WU01:FS02:0xa3:Completed 124685 out of 500000 steps (24%)
22:49:58:WU01:FS02:0xa3:Completed 125000 out of 500000 steps (25%)
23:07:14:WU01:FS02:0xa3:Completed 130000 out of 500000 steps (26%)
23:24:32:WU01:FS02:0xa3:Completed 135000 out of 500000 steps (27%)
23:41:50:WU01:FS02:0xa3:Completed 140000 out of 500000 steps (28%)
23:58:51:WU01:FS02:0xa3:Completed 145000 out of 500000 steps (29%)
00:16:00:WU01:FS02:0xa3:Completed 150000 out of 500000 steps (30%)
00:33:07:WU01:FS02:0xa3:Completed 155000 out of 500000 steps (31%)
00:50:17:WU01:FS02:0xa3:Completed 160000 out of 500000 steps (32%)
01:07:20:WU01:FS02:0xa3:Completed 165000 out of 500000 steps (33%)
01:24:32:WU01:FS02:0xa3:Completed 170000 out of 500000 steps (34%)
01:42:00:WU01:FS02:0xa3:Completed 175000 out of 500000 steps (35%)
01:59:36:WU01:FS02:0xa3:Completed 180000 out of 500000 steps (36%)
02:17:08:WU01:FS02:0xa3:Completed 185000 out of 500000 steps (37%)
02:34:13:WU01:FS02:0xa3:Completed 190000 out of 500000 steps (38%)
02:51:46:WU01:FS02:0xa3:Completed 195000 out of 500000 steps (39%)
03:11:09:WU01:FS02:0xa3:Completed 200000 out of 500000 steps (40%)
03:31:46:WU01:FS02:0xa3:Completed 205000 out of 500000 steps (41%)
03:49:23:WU01:FS02:0xa3:Completed 210000 out of 500000 steps (42%)
04:06:54:WU01:FS02:0xa3:Completed 215000 out of 500000 steps (43%)
04:24:43:WU01:FS02:0xa3:Completed 220000 out of 500000 steps (44%)
******************************* Date: 2014-01-22 *******************************
04:41:57:WU01:FS02:0xa3:Completed 225000 out of 500000 steps (45%)
04:59:12:WU01:FS02:0xa3:Completed 230000 out of 500000 steps (46%)
05:18:09:WU01:FS02:0xa3:Completed 235000 out of 500000 steps (47%)
05:36:34:WU01:FS02:0xa3:Completed 240000 out of 500000 steps (48%)
05:53:45:WU01:FS02:0xa3:Completed 245000 out of 500000 steps (49%)