Project: 14004 (Run:0, Clone:115, Gen:16)
Posted: Tue Jan 02, 2018 8:09 pm
I have a bad work unit from Project 14004 that has been hung for the better part of a day. It hung at 99.99%, and I have now restarted it three times. Although I stand to lose up to two days of processing, I thought it important to restart it a third time and track it to see what is going on.
The log reinitializes each time the unit is restarted. Here is the entire log with the unit now at 99.08%:
*********************** Log Started 2018-01-02T17:55:17Z ***********************
17:55:17:************************* Folding@home Client *************************
17:55:17: Website: http://folding.stanford.edu/
17:55:17: Copyright: (c) 2009-2014 Stanford University
17:55:17: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
17:55:17: Args: --open-web-control
17:55:17: Config: C:/Users/sinbad/AppData/Roaming/FAHClient/config.xml
17:55:17:******************************** Build ********************************
17:55:17: Version: 7.4.4
17:55:17: Date: Mar 4 2014
17:55:17: Time: 20:26:54
17:55:17: SVN Rev: 4130
17:55:17: Branch: fah/trunk/client
17:55:17: Compiler: Intel(R) C++ MSVC 1500 mode 1200
17:55:17: Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
17:55:17: /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
17:55:17: Platform: win32 XP
17:55:17: Bits: 32
17:55:17: Mode: Release
17:55:17:******************************* System ********************************
17:55:17: CPU: AMD Phenom(tm) 9150e Quad-Core Processor
17:55:17: CPU ID: AuthenticAMD Family 16 Model 2 Stepping 3
17:55:17: CPUs: 4
17:55:17: Memory: 3.75GiB
17:55:17: Free Memory: 959.96MiB
17:55:17: Threads: WINDOWS_THREADS
17:55:17: OS Version: 6.0
17:55:17: Has Battery: false
17:55:17: On Battery: false
17:55:17: UTC Offset: -5
17:55:17: PID: 6504
17:55:17: CWD: C:/Users/sinbad/AppData/Roaming/FAHClient
17:55:17: OS: Windows (TM) Vista Home Premium
17:55:17: OS Arch: AMD64
17:55:17: GPUs: 1
17:55:17: GPU 0: UNSUPPORTED: [Radeon HD 3200]
17:55:17: CUDA: Not detected
17:55:17:Win32 Service: false
17:55:17:***********************************************************************
17:55:17:<config>
17:55:17: <!-- Folding Core -->
17:55:17: <core-priority v='low'/>
17:55:17:
17:55:17: <!-- Network -->
17:55:17: <proxy v=':8080'/>
17:55:17:
17:55:17: <!-- Slot Control -->
17:55:17: <pause-on-battery v='false'/>
17:55:17: <power v='full'/>
17:55:17:
17:55:17: <!-- User Information -->
17:55:17: <passkey v='********************************'/>
17:55:17: <team v='13915'/>
17:55:17: <user v='Spidermaster'/>
17:55:17:
17:55:17: <!-- Folding Slots -->
17:55:17: <slot id='0' type='CPU'/>
17:55:17:</config>
17:55:17:Trying to access database...
17:55:17:Successfully acquired database lock
17:55:17:Enabled folding slot 00: READY cpu:4
17:55:17:WU01:FS00:Starting
17:55:17:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/sinbad/AppData/Roaming/FAHClient/cores/fahwebx.stanford.edu/cores/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 01 -suffix 01 -version 704 -lifeline 6504 -checkpoint 15 -np 4
17:55:17:WU01:FS00:Started FahCore on PID 6856
17:55:17:WU01:FS00:Core PID:6080
17:55:17:WU01:FS00:FahCore 0xa4 started
17:55:17:WU01:FS00:0xa4:
17:55:17:WU01:FS00:0xa4:*------------------------------*
17:55:17:WU01:FS00:0xa4:Folding@Home Gromacs GB Core
17:55:17:WU01:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
17:55:17:WU01:FS00:0xa4:
17:55:17:WU01:FS00:0xa4:Preparing to commence simulation
17:55:17:WU01:FS00:0xa4:- Looking at optimizations...
17:55:17:WU01:FS00:0xa4:- Files status OK
17:55:17:WU01:FS00:0xa4:- Expanded 740484 -> 1939364 (decompressed 261.9 percent)
17:55:17:WU01:FS00:0xa4:Called DecompressByteArray: compressed_data_size=740484 data_size=1939364, decompressed_data_size=1939364 diff=0
17:55:17:WU01:FS00:0xa4:- Digital signature verified
17:55:17:WU01:FS00:0xa4:
17:55:17:WU01:FS00:0xa4:Project: 14004 (Run 0, Clone 115, Gen 16)
17:55:17:WU01:FS00:0xa4:
17:55:18:WU01:FS00:0xa4:Assembly optimizations on if available.
17:55:18:WU01:FS00:0xa4:Entering M.D.
17:55:24:WU01:FS00:0xa4:Using Gromacs checkpoints
17:55:24:WU01:FS00:0xa4:Mapping NT from 4 to 4
17:55:25:15:127.0.0.1:New Web connection
17:59:39:WU01:FS00:0xa4:Resuming from checkpoint
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.log
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.trr
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.xtc
17:59:39:WU01:FS00:0xa4:Verified 01/wudata_01.edr
18:00:26:WU01:FS00:0xa4:Completed 2358960 out of 2500000 steps (94%)
18:00:26:WARNING:WU01:FS00:Detected clock skew (5 mins 09 secs), adjusting time estimates
Please note that after the clock skew warning (origin unknown, as my computer clock has not been adjusted), there are NO MORE CHECKPOINTS.
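For anyone tracking a similar hang: because the log is reinitialized on every restart, it can help to save a copy and pull out the progress and clock-skew lines before restarting again. Below is a minimal Python sketch that does this. It assumes the log lines look exactly like the ones above (e.g. `...:0xa4:Completed N out of M steps` and `...:WARNING:...Detected clock skew (...)`). The function name `scan_log` is hypothetical and not part of FAHClient; other cores or client versions may word these lines differently.

```python
import re

# Matches core progress lines such as:
# 18:00:26:WU01:FS00:0xa4:Completed 2358960 out of 2500000 steps (94%)
PROGRESS_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WU\d+:FS\d+:0x[0-9a-f]+:Completed (\d+) out of (\d+) steps"
)

# Matches the client's clock skew warning, capturing the reported skew text.
SKEW_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):WARNING:.*Detected clock skew \(([^)]*)\)"
)


def scan_log(lines):
    """Return (progress, skews) extracted from FAHClient log lines.

    progress: list of (timestamp, steps_done, steps_total, percent)
    skews:    list of (timestamp, skew_text)
    """
    progress = []
    skews = []
    for line in lines:
        line = line.rstrip("\n")
        m = PROGRESS_RE.match(line)
        if m:
            stamp, done, total = m.group(1), int(m.group(2)), int(m.group(3))
            # Exact percentage; the client rounds this for display.
            progress.append((stamp, done, total, 100.0 * done / total))
            continue
        s = SKEW_RE.match(line)
        if s:
            skews.append((s.group(1), s.group(2)))
    return progress, skews
```

Usage, assuming a saved copy of the log named `log_copy.txt`: `with open("log_copy.txt") as f: progress, skews = scan_log(f)`. If the last progress entry stops changing while the core keeps running, that timestamp marks roughly where the hang began.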
If the unit hangs again, I will be forced to dump it to resume normal processing, and this log will then be lost. I therefore hope this information is useful in diagnosing the problem.
The Spidermaster
P.S. I've emptied the work unit queue, but I saved all the associated files rather than deleting them. If you require these files for further analysis, please let me know.