
WU has reset overnight

Posted: Mon Mar 27, 2017 6:29 am
by Redamancy
Hello!
I've recently started folding on a decent Acer laptop that I wasn't using for anything else. Folding had gone well until yesterday, when I paused the folding and shut off my computer, just as usual. Normally, when I start it again the next morning, it picks up where it stopped the night before and keeps folding. But when I started it today, the entire WU had reset. It's a shame, since this was a big WU worth 3300 points (for this laptop, it's big :P) and it was about 50% done. Is this a common thing? I would like to know if there's any solution.
Log will be posted below:

Code:

*********************** Log Started 2017-03-27T06:18:12Z ***********************
06:18:12:************************* Folding@home Client *************************
06:18:12:      Website: http://folding.stanford.edu/
06:18:12:    Copyright: (c) 2009-2014 Stanford University
06:18:12:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
06:18:12:         Args: 
06:18:12:       Config: C:/Users/Jakob/AppData/Roaming/FAHClient/config.xml
06:18:12:******************************** Build ********************************
06:18:12:      Version: 7.4.4
06:18:12:         Date: Mar 4 2014
06:18:12:         Time: 20:26:54
06:18:12:      SVN Rev: 4130
06:18:12:       Branch: fah/trunk/client
06:18:12:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
06:18:12:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
06:18:12:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
06:18:12:     Platform: win32 XP
06:18:12:         Bits: 32
06:18:12:         Mode: Release
06:18:12:******************************* System ********************************
06:18:12:          CPU: Intel(R) Core(TM) i3-3227U CPU @ 1.90GHz
06:18:12:       CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
06:18:12:         CPUs: 4
06:18:12:       Memory: 3.82GiB
06:18:12:  Free Memory: 2.56GiB
06:18:12:      Threads: WINDOWS_THREADS
06:18:12:   OS Version: 6.1
06:18:12:  Has Battery: true
06:18:12:   On Battery: true
06:18:12:   UTC Offset: 2
06:18:12:          PID: 3564
06:18:12:          CWD: C:/Users/Jakob/AppData/Roaming/FAHClient
06:18:12:           OS: Windows 7 Professional
06:18:12:      OS Arch: AMD64
06:18:12:         GPUs: 0
06:18:12:         CUDA: Not detected
06:18:12:Win32 Service: false
06:18:12:***********************************************************************
06:18:12:<config>
06:18:12:  <!-- Network -->
06:18:12:  <proxy v=':8080'/>
06:18:12:
06:18:12:  <!-- Slot Control -->
06:18:12:  <pause-on-battery v='false'/>
06:18:12:  <power v='full'/>
06:18:12:
06:18:12:  <!-- User Information -->
06:18:12:  <team v='143016'/>
06:18:12:  <user v='Redamancy'/>
06:18:12:
06:18:12:  <!-- Folding Slots -->
06:18:12:  <slot id='0' type='CPU'/>
06:18:12:</config>
06:18:12:Trying to access database...
06:18:15:Successfully acquired database lock
06:18:15:Enabled folding slot 00: READY cpu:4
06:18:15:WU00:FS00:Starting
06:18:15:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Jakob/AppData/Roaming/FAHClient/cores/fahwebx.stanford.edu/cores/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 00 -suffix 01 -version 704 -lifeline 3564 -checkpoint 15 -np 4
06:18:21:WU00:FS00:Started FahCore on PID 4776
06:18:21:WU00:FS00:Core PID:4788
06:18:21:WU00:FS00:FahCore 0xa4 started
06:18:23:WU00:FS00:0xa4:
06:18:23:WU00:FS00:0xa4:*------------------------------*
06:18:23:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
06:18:23:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
06:18:23:WU00:FS00:0xa4:
06:18:23:WU00:FS00:0xa4:Preparing to commence simulation
06:18:23:WU00:FS00:0xa4:- Looking at optimizations...
06:18:23:WU00:FS00:0xa4:- Files status OK
06:18:24:WU00:FS00:0xa4:- Expanded 1948126 -> 6261824 (decompressed 321.4 percent)
06:18:24:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=1948126 data_size=6261824, decompressed_data_size=6261824 diff=0
06:18:24:WU00:FS00:0xa4:- Digital signature verified
06:18:24:WU00:FS00:0xa4:
06:18:24:WU00:FS00:0xa4:Project: 11660 (Run 91, Clone 3, Gen 93)
06:18:24:WU00:FS00:0xa4:
06:18:24:WU00:FS00:0xa4:Assembly optimizations on if available.
06:18:24:WU00:FS00:0xa4:Entering M.D.
06:18:30:WU00:FS00:0xa4:Using Gromacs checkpoints
06:18:30:WU00:FS00:0xa4:Mapping NT from 4 to 4 
06:18:36:WU00:FS00:0xa4:Resuming from checkpoint
06:18:36:WU00:FS00:0xa4:Verified 00/wudata_01.log
06:18:37:WU00:FS00:0xa4:Verified 00/wudata_01.trr
06:18:38:WU00:FS00:0xa4:File 00/wudata_01.xtc has changed since last checkpoint
06:18:38:WU00:FS00:0xa4:mdrun returned 3
06:18:40:WU00:FS00:0xa4:Gromacs detected an invalid checkpoint.  Restarting...
06:18:40:WU00:FS00:0xa4:Folding@home Core Shutdown: UNKNOWN_ERROR
06:18:41:WARNING:WU00:FS00:FahCore returned: CORE_RESTART (98 = 0x62)
06:18:41:WU00:FS00:Starting
06:18:41:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Jakob/AppData/Roaming/FAHClient/cores/fahwebx.stanford.edu/cores/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 00 -suffix 01 -version 704 -lifeline 3564 -checkpoint 15 -np 4
06:18:41:WU00:FS00:Started FahCore on PID 3732
06:18:41:WU00:FS00:Core PID:2688
06:18:41:WU00:FS00:FahCore 0xa4 started
06:18:41:WU00:FS00:0xa4:
06:18:41:WU00:FS00:0xa4:*------------------------------*
06:18:41:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
06:18:41:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
06:18:41:WU00:FS00:0xa4:
06:18:41:WU00:FS00:0xa4:Preparing to commence simulation
06:18:41:WU00:FS00:0xa4:- Looking at optimizations...
06:18:42:WU00:FS00:0xa4:- Created dyn
06:18:42:WU00:FS00:0xa4:- Files status OK
06:18:42:WU00:FS00:0xa4:- Expanded 1948126 -> 6261824 (decompressed 321.4 percent)
06:18:42:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=1948126 data_size=6261824, decompressed_data_size=6261824 diff=0
06:18:42:WU00:FS00:0xa4:- Digital signature verified
06:18:42:WU00:FS00:0xa4:
06:18:42:WU00:FS00:0xa4:Project: 11660 (Run 91, Clone 3, Gen 93)
06:18:42:WU00:FS00:0xa4:
06:18:42:WU00:FS00:0xa4:Assembly optimizations on if available.
06:18:42:WU00:FS00:0xa4:Entering M.D.
06:18:48:WU00:FS00:0xa4:Mapping NT from 4 to 4 
06:18:53:WU00:FS00:0xa4:Completed 0 out of 1250000 steps  (0%)

Re: WU has reset overnight

Posted: Mon Mar 27, 2017 7:25 am
by Redamancy
I noticed this while enabling errors and warnings in my log.

*********************** Log Started 2017-03-27T06:18:12Z ***********************
06:18:41:WARNING:WU00:FS00:FahCore returned: CORE_RESTART (98 = 0x62)

That is probably what reset the core. Does anyone know what the cause could be?

Re: WU has reset overnight

Posted: Mon Mar 27, 2017 7:50 am
by Joe_H
Welcome to the folding support forum.

The important part of the message is just a couple lines before:

Code:

06:18:38:WU00:FS00:0xa4:File 00/wudata_01.xtc has changed since last checkpoint
06:18:38:WU00:FS00:0xa4:mdrun returned 3
06:18:40:WU00:FS00:0xa4:Gromacs detected an invalid checkpoint.  Restarting...
06:18:40:WU00:FS00:0xa4:Folding@home Core Shutdown: UNKNOWN_ERROR
Something corrupted part of the checkpoint that was written before you shut down the laptop. It could be as simple as not leaving enough time between pausing folding and shutting down, so the checkpoint files might not have been completely written to disk first. Windows is supposed to wait long enough for this to happen during a shutdown, but from personal experience that does not always happen. In this case, without a valid checkpoint to start from, the client restarted the WU from the beginning.

There are some other possible causes for the file getting corrupted, for example a failing drive. The best way to avoid this happening in the future is to pause folding and wait a minute or two before shutting down.
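
For anyone who wants to check their own log for the same symptom, here is a minimal Python sketch. It only searches for the two messages quoted above; treating them as the markers of a rejected checkpoint is an assumption based on this one log, and the function name `find_checkpoint_problems` is made up for illustration.

```python
import re

# Messages copied from the log in this thread. Treating them as the
# markers of a rejected checkpoint is an assumption of this sketch.
CHECKPOINT_PROBLEMS = (
    "has changed since last checkpoint",
    "Gromacs detected an invalid checkpoint",
)

# Log lines in this thread start with an HH:MM:SS timestamp and a colon.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}):(.*)$")


def find_checkpoint_problems(log_text):
    """Return (timestamp, message) pairs for lines that report a bad checkpoint."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # banner lines and blank lines carry no timestamp
        timestamp, message = match.groups()
        if any(marker in message for marker in CHECKPOINT_PROBLEMS):
            hits.append((timestamp, message))
    return hits


sample = """\
06:18:37:WU00:FS00:0xa4:Verified 00/wudata_01.trr
06:18:38:WU00:FS00:0xa4:File 00/wudata_01.xtc has changed since last checkpoint
06:18:38:WU00:FS00:0xa4:mdrun returned 3
06:18:40:WU00:FS00:0xa4:Gromacs detected an invalid checkpoint.  Restarting...
"""

for timestamp, message in find_checkpoint_problems(sample):
    print(timestamp, message)
```

To run it against a real log, read the file into a string and pass it to the function in place of `sample`.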

Re: WU has reset overnight

Posted: Mon Mar 27, 2017 8:19 am
by Redamancy
Drats, I suspected the cause was something like this. I was tired and may well have shut off the laptop too quickly.
Thanks for the reply.

Re: WU has reset overnight

Posted: Mon Mar 27, 2017 1:05 pm
by SteveWillis
Does FAHControl also write a checkpoint when paused? I was under the impression that there was no way to force a checkpoint.

Re: WU has reset overnight

Posted: Mon Mar 27, 2017 2:53 pm
by Joe_H
With the GPU folding cores currently in use, that is correct: they write a checkpoint every so many steps. For the CPU cores, the checkpoint is done at the time interval set through FAHControl, and it is written as soon as the current iteration completes. However, by bad timing, a shutdown done at the same moment a checkpoint is being created can leave it partially written and therefore corrupted.

Re: WU has reset overnight

Posted: Sat Apr 01, 2017 5:49 pm
by bruce
SteveWillis wrote:Does FAHControl also write a checkpoint when paused? I was under the impression that there was no way to force a checkpoint.
As Joe suggested, FAHControl does not initiate a checkpoint when paused, but it should complete one if the checkpoint process has already started. Even once the checkpoint has been written, it may still be in cache memory, and it's up to the OS to finish storing the data permanently on disk.

And, no, there is no way to force a checkpoint, so upon restart, FAH will have to reprocess whatever work has been done since the last checkpoint was written.
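
The point about cache memory can be illustrated with a general technique for writing a file so that an interrupted write is less likely to leave a half-written copy behind. This is a Python sketch of that general idea only, not a description of how the FahCore writes its checkpoints; the function name, file name, and contents are made up for illustration.

```python
import os
import tempfile


def write_checkpoint_atomically(path, data):
    """Write data to a temporary file, flush it to disk, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push its cache to the disk
        # Replace the old file only after the new one is complete.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Hypothetical usage: the second write replaces the first in one step.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "state.ckpt")
    write_checkpoint_atomically(target, b"step=625000")
    write_checkpoint_atomically(target, b"step=640000")
    with open(target, "rb") as f:
        print(f.read())
```

Until the replace happens, the previous file is left untouched, so a shutdown in the middle of the write should cost at most the newest checkpoint rather than corrupting the existing one. How strong that guarantee is depends on the OS, filesystem, and drive.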