Re: Fatal Error with WU

Posted: Fri Jun 12, 2020 2:29 pm
by _r2w_ben
jaos wrote: Another bad WU
Can you also include the following lines in your reports? (The first line will always be present but the second one is rare.)

Code:

12:57:58:WU03:FS00:Requesting new work unit for slot 00: RUNNING cpu:21 from 128.252.203.4
01:30:04:WARNING:WU02:FS00:AS lowered CPUs from 21 to 18
Then it's clear how many CPUs the assignment server based its decision on and whether it instructed the client to use fewer. Thanks!
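For anyone who wants to pull these two lines out of a long log automatically, here is a minimal Python sketch. The regular expressions are modelled only on the two sample lines quoted above, so other client versions may format these lines differently. The function name `extract_cpu_lines` and the dictionary keys are made up for illustration and are not part of the client.

```python
import re

# Patterns modelled on the two sample lines quoted above (an assumption:
# other client versions may word these lines differently).
REQUEST_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"Requesting new work unit for slot \d+: \w+ cpu:(?P<cpus>\d+) "
    r"from (?P<server>\S+)"
)
LOWERED_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):WARNING:(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"AS lowered CPUs from (?P<old>\d+) to (?P<new>\d+)"
)


def extract_cpu_lines(log_text):
    """Return (requests, lowered): lists of dicts parsed from the log text."""
    requests, lowered = [], []
    for line in log_text.splitlines():
        line = line.strip()
        m = REQUEST_RE.match(line)
        if m:
            # CPU count the client asked for, plus the server it asked.
            requests.append({**m.groupdict(), "cpus": int(m["cpus"])})
            continue
        m = LOWERED_RE.match(line)
        if m:
            # Thread count before and after the assignment server lowered it.
            lowered.append(
                {**m.groupdict(), "old": int(m["old"]), "new": int(m["new"])}
            )
    return requests, lowered


if __name__ == "__main__":
    sample = (
        "12:57:58:WU03:FS00:Requesting new work unit for slot 00: "
        "RUNNING cpu:21 from 128.252.203.4\n"
        "01:30:04:WARNING:WU02:FS00:AS lowered CPUs from 21 to 18\n"
    )
    requests, lowered = extract_cpu_lines(sample)
    print(requests)
    print(lowered)
```

An empty `lowered` list for a failing WU would suggest the assignment server never told the client to reduce the thread count.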

Re: Fatal Error with WU

Posted: Sat Jun 13, 2020 9:18 pm
by uyaem
I had the same failure on the same project, but the 2nd line isn't present.
Should it be on default log level?

On the upside, the client discards the WU very quickly, so I'd guess it will be available to another user soon.
For the sake of completeness, here's my log (Translation of "Der Prozess kann nicht ..." = "The process cannot access the file because it is used by another process", but it does get cleaned up eventually):

Code:

22:39:20:WU00:FS00:0xa7:*********************** Log Started 2020-06-09T22:39:19Z ***********************
22:39:20:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
22:39:20:WU00:FS00:0xa7:       Type: 0xa7
22:39:20:WU00:FS00:0xa7:       Core: Gromacs
22:39:20:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 706 -lifeline 15928 -checkpoint 15 -np
22:39:20:WU00:FS00:0xa7:             21
22:39:20:WU00:FS00:0xa7:************************************ CBang *************************************
22:39:20:WU00:FS00:0xa7:       Date: Oct 26 2019
22:39:20:WU00:FS00:0xa7:       Time: 01:38:25
22:39:20:WU00:FS00:0xa7:   Revision: c46a1a011a24143739ac7218c5a435f66777f62f
22:39:20:WU00:FS00:0xa7:     Branch: master
22:39:20:WU00:FS00:0xa7:   Compiler: Visual C++ 2008
22:39:20:WU00:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
22:39:20:WU00:FS00:0xa7:   Platform: win32 10
22:39:20:WU00:FS00:0xa7:       Bits: 64
22:39:20:WU00:FS00:0xa7:       Mode: Release
22:39:20:WU00:FS00:0xa7:************************************ System ************************************
22:39:20:WU00:FS00:0xa7:        CPU: AMD Ryzen 9 3900X 12-Core Processor
22:39:20:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
22:39:20:WU00:FS00:0xa7:       CPUs: 24
22:39:20:WU00:FS00:0xa7:     Memory: 31.95GiB
22:39:20:WU00:FS00:0xa7:Free Memory: 19.42GiB
22:39:20:WU00:FS00:0xa7:    Threads: WINDOWS_THREADS
22:39:20:WU00:FS00:0xa7: OS Version: 6.2
22:39:20:WU00:FS00:0xa7:Has Battery: false
22:39:20:WU00:FS00:0xa7: On Battery: false
22:39:20:WU00:FS00:0xa7: UTC Offset: 2
22:39:20:WU00:FS00:0xa7:        PID: 21004
22:39:20:WU00:FS00:0xa7:        CWD: C:\Users\X\AppData\Roaming\FAHClient\work
22:39:20:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
22:39:20:WU00:FS00:0xa7:    Version: 0.0.18
22:39:20:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
22:39:20:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
22:39:20:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
22:39:20:WU00:FS00:0xa7:       Date: Oct 26 2019
22:39:20:WU00:FS00:0xa7:       Time: 01:52:30
22:39:20:WU00:FS00:0xa7:   Revision: c1e3513b1bc0c16013668f2173ee969e5995b38e
22:39:20:WU00:FS00:0xa7:     Branch: master
22:39:20:WU00:FS00:0xa7:   Compiler: Visual C++ 2008
22:39:20:WU00:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
22:39:20:WU00:FS00:0xa7:   Platform: win32 10
22:39:20:WU00:FS00:0xa7:       Bits: 64
22:39:20:WU00:FS00:0xa7:       Mode: Release
22:39:20:WU00:FS00:0xa7:************************************ Build *************************************
22:39:20:WU00:FS00:0xa7:       SIMD: avx_256
22:39:20:WU00:FS00:0xa7:********************************************************************************
22:39:20:WU00:FS00:0xa7:Project: 14524 (Run 553, Clone 3, Gen 19)
22:39:20:WU00:FS00:0xa7:Unit: 0x0000001e80fccb0a5e781bdd6f4762b6
22:39:20:WU00:FS00:0xa7:Reading tar file core.xml
22:39:20:WU00:FS00:0xa7:Reading tar file frame19.tpr
22:39:20:WU00:FS00:0xa7:Digital signatures verified
22:39:20:WU00:FS00:0xa7:Calling: mdrun -s frame19.tpr -o frame19.trr -x frame19.xtc -cpt 15 -nt 21
22:39:20:WU00:FS00:0xa7:Steps: first=4750000 total=250000
22:39:20:WU00:FS00:0xa7:ERROR:
22:39:20:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
22:39:20:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
22:39:20:WU00:FS00:0xa7:ERROR:Source code file: C:\build\fah\core-a7-avx-release\windows-10-64bit-core-a7-avx-release\gromacs-core\build\gromacs\src\gromacs\mdlib\domdec.c, line: 6902
22:39:20:WU00:FS00:0xa7:ERROR:
22:39:20:WU00:FS00:0xa7:ERROR:Fatal error:
22:39:20:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 16 ranks that is compatible with the given box and a minimum cell size of 1.4227 nm
22:39:20:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
22:39:20:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
22:39:20:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
22:39:20:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
22:39:20:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
22:39:24:WU00:FS00:0xa7:WARNING:Unexpected exit() call
22:39:24:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
22:39:24:WU00:FS00:0xa7:Saving result file ..\logfile_01.txt
22:39:24:WU00:FS00:0xa7:Saving result file md.log
22:39:24:WU00:FS00:0xa7:Saving result file science.log
22:39:24:WU00:FS00:0xa7:WARNING:While cleaning up: boost::filesystem::remove: Der Prozess kann nicht auf die Datei zugreifen, da sie von einem anderen Prozess verwendet wird: "01/md.log"
22:39:24:WU00:FS00:0xa7:Folding@home Core Shutdown: BAD_WORK_UNIT
22:39:25:WU02:FS00:Upload 42.28%
22:39:25:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
22:39:25:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:14524 run:553 clone:3 gen:19 core:0xa7 unit:0x0000001e80fccb0a5e781bdd6f4762b6

Re: Fatal Error with WU

Posted: Sat Jun 13, 2020 9:29 pm
by PantherX
The line about AS lowered CPUs will be present in your log file at a default logging level.

Can you please show the log file where it requested that WU from the Server?

Re: Fatal Error with WU

Posted: Sat Jun 13, 2020 9:31 pm
by _r2w_ben
uyaem wrote: I had the same failure on the same project, but the 2nd line isn't present.
Should it be on default log level?
Since this project is not supposed to be assigned to 21 threads, the second line being present would mean that the servers realised that and said, "You can have this work unit, but please run it on 18 threads because that number is allowed." If it was working as expected, the message would be there at the default log level.

Re: Fatal Error with WU

Posted: Sat Jun 13, 2020 9:44 pm
by uyaem
Posting everything up to the line posted previously. I'm keeping it complete, even if it contains the download of the next WU, just so we don't miss anything.
Please keep in mind that this log is a few days old... :)

Code:

22:38:48:WU00:FS00:Connecting to assign1.foldingathome.org:80
22:38:48:WU00:FS00:Assigned to work server 128.252.203.10
22:38:48:WU00:FS00:Requesting new work unit for slot 00: RUNNING cpu:21 from 128.252.203.10
22:38:48:WU00:FS00:Connecting to 128.252.203.10:8080
22:38:49:WU00:FS00:Downloading 1.06MiB
22:38:50:WU00:FS00:Download complete
22:38:50:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:14524 run:553 clone:3 gen:19 core:0xa7 unit:0x0000001e80fccb0a5e781bdd6f4762b6
22:39:17:WU02:FS00:0xa7:Completed 250000 out of 250000 steps (100%)
22:39:19:WU02:FS00:0xa7:Saving result file ..\logfile_01.txt
22:39:19:WU02:FS00:0xa7:Saving result file dhdl.xvg
22:39:19:WU02:FS00:0xa7:Saving result file frame101.trr
22:39:19:WU02:FS00:0xa7:Saving result file md.log
22:39:19:WU02:FS00:0xa7:Saving result file pullf.xvg
22:39:19:WU02:FS00:0xa7:Saving result file pullx.xvg
22:39:19:WU02:FS00:0xa7:Saving result file science.log
22:39:19:WU02:FS00:0xa7:Saving result file traj_comp.xtc
22:39:19:WU02:FS00:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
22:39:19:WU02:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
22:39:19:WU02:FS00:Sending unit results: id:02 state:SEND error:NO_ERROR project:14722 run:38 clone:0 gen:101 core:0xa7 unit:0x0000007c9bf7a4d65ea0712cb6852f9b
22:39:19:WU02:FS00:Uploading 6.80MiB to 155.247.164.214
22:39:19:WU00:FS00:Starting
22:39:19:WU02:FS00:Connecting to 155.247.164.214:8080
22:39:19:WU00:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\X\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe -dir 00 -suffix 01 -version 706 -lifeline 8288 -checkpoint 15 -np 21
22:39:19:WU00:FS00:Started FahCore on PID 15928
22:39:19:WU00:FS00:Core PID:21004
22:39:19:WU00:FS00:FahCore 0xa7 started
22:39:20:WU00:FS00:0xa7:*********************** Log Started 2020-06-09T22:39:19Z ***********************

Re: Fatal Error with WU

Posted: Sun Jun 14, 2020 7:55 am
by PantherX
Thanks for the log file, uyaem. I have notified the researcher about this.

FYI, I personally use this:
<next-unit-percentage v='100'/>
Since I don't want to wait for 1% of the WU to be over before starting the downloaded WU. Thus, I do gain a tiny amount of points :)

Re: Fatal Error with WU

Posted: Sun Jun 14, 2020 1:57 pm
by uyaem
PantherX wrote: Thanks for the log file, uyaem. I have notified the researcher about this.

FYI, I personally use this:
<next-unit-percentage v='100'/>
Since I don't want to wait for 1% of the WU to be over before starting the downloaded WU. Thus, I do gain a tiny amount of points :)
Maybe I shouldn't be asking this, but are you sure this is correct?

Code:

  next-unit-percentage <integer=99>
    Pre-download the next work unit when the current one is this far along.
The way I understand this is that the next WU will be downloaded once the current one is at X percent.
Wouldn't a setting of 100 mean that the download only starts after the current one completes?
I don't understand how this would gain you extra points.

Re: Fatal Error with WU

Posted: Sun Jun 14, 2020 3:29 pm
by Joe_H
uyaem wrote: The way I understand this is that the next WU will be downloaded once the current one is at X percent.
Wouldn't a setting of 100 mean that the download only starts after the current one completes?
I don't understand how this would gain you extra points.
There are two parts to this. The bonus is based on the download and upload times for the WU. Depending on the TPF for the current WU, a download at 99% can be sitting on your computer waiting for 1 minute, or 20 minutes. That reduces the bonus slightly.

When a WU gets to 100% there is still some post-processing to be completed before it gets sent in, but the download starts right then. On anything but a slow internet connection most WUs can download before the post-processing finishes and start immediately.

You can see this in the log file entries: 100% is reached, a download starts, there are some messages about the just-completed WU being prepared for return, and then the folding core exits. Depending on its size, the download should also show as completing somewhere in that message sequence, and as soon as the folding core exits for the just-completed WU, the new one is started.

Personally I leave the setting at the default of 99%. I am on a DSL connection, so some of the time a WU download has not finished before the upload starts. Both an upload and a download happening at the same time has a very negative effect on my connection.
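To put rough numbers on the waiting time described above, here is a small Python sketch. It assumes one frame is 1% of the WU, so a download triggered at 99% happens about one TPF before the current WU reaches 100%, and it ignores post-processing time. The function `idle_wait_seconds` and its parameters are hypothetical names used for illustration only, not anything the client exposes.

```python
def idle_wait_seconds(tpf_seconds, next_unit_percentage=99, download_seconds=0.0):
    """Estimate how long a pre-downloaded WU sits idle before it can start.

    Assumptions: one frame equals 1% of the WU, the download is triggered
    exactly at next_unit_percentage, and post-processing time is ignored.
    """
    if not 0 <= next_unit_percentage <= 100:
        raise ValueError("next_unit_percentage must be between 0 and 100")
    # Time the current WU still needs once the download has been triggered.
    remaining = (100 - next_unit_percentage) * tpf_seconds
    # The new WU only waits for whatever is left after its download finishes.
    return max(0.0, remaining - download_seconds)


if __name__ == "__main__":
    # The two TPF cases mentioned above: 1 minute and 20 minutes.
    for tpf in (60, 20 * 60):
        print(tpf, idle_wait_seconds(tpf, next_unit_percentage=99))
    # With the setting at 100 nothing is downloaded ahead of time.
    print(idle_wait_seconds(60, next_unit_percentage=100))
```

With the setting at 100 the estimate drops to zero, which is the small gain being discussed; the trade-off is that the download then overlaps with post-processing and possibly the upload.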

Re: Fatal Error with WU

Posted: Sun Jun 14, 2020 3:47 pm
by uyaem
Ah of course, gotcha. :)

EDIT: If only it wasn't an integer value, I'd be min/maxing to 99.8 ;)

Re: Fatal Error with WU

Posted: Sun Jun 14, 2020 4:21 pm
by bruce
WUs come in different sizes, which means the data consolidation step can take a varying amount of time. In the log above, it took from 22:39:17 to 22:39:19 ... 2 seconds. From what I'm reading about unreleased cores, the sizes of the upload package and the download package are growing, though there still are plenty of small WUs. They're adding compression to the sequence, which will add a certain amount of time to that 2 seconds.

The download of the new WU took from 22:38:48 to 22:39:19 or 31 seconds, so in this case, it might have cost you another 28 seconds. Hardly worth worrying about, either way, given that the processing generally takes many hours.
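For anyone who wants to check intervals like these against their own log, here is a short Python sketch. The timestamps are copied by hand from uyaem's log excerpt earlier in the thread; the event labels in the dictionary are invented for readability and do not appear in the client's output.

```python
from datetime import datetime


def seconds_between(start, end):
    """Whole seconds from start to end, both 'HH:MM:SS' on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Timestamps taken from the log posted above; the keys are made-up labels.
events = {
    "new_wu_requested": "22:38:48",    # "Requesting new work unit ..."
    "download_complete": "22:38:50",   # "Download complete"
    "old_wu_100_percent": "22:39:17",  # "Completed 250000 out of 250000 steps"
    "old_core_shutdown": "22:39:19",   # "Core Shutdown: FINISHED_UNIT"
    "new_wu_starting": "22:39:19",     # "WU00:FS00:Starting"
}

request_to_download = seconds_between(
    events["new_wu_requested"], events["download_complete"]
)
download_to_start = seconds_between(
    events["download_complete"], events["new_wu_starting"]
)
request_to_start = seconds_between(
    events["new_wu_requested"], events["new_wu_starting"]
)
consolidation = seconds_between(
    events["old_wu_100_percent"], events["old_core_shutdown"]
)

print("request -> download complete:", request_to_download)
print("download complete -> new WU start:", download_to_start)
print("request -> new WU start:", request_to_start)
print("100% -> core shutdown:", consolidation)
```

The helper assumes both timestamps fall on the same day, so an interval that crosses midnight would need the date added.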

Re: Fatal Error with WU

Posted: Sun Jun 14, 2020 4:47 pm
by Rel25917
Next unit percent was much more useful before everyone had super fast broadband connections; it can still be useful for some people.