Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 8:38 pm
by _r2w_ben
kyleedwardsny wrote:Alright, so, is there anything I can do to skip over this work unit, or otherwise get it to work? It's been crashing over and over since yesterday afternoon and keeping my computer idle :(
Follow JimboPalmer's post and lower the number of CPUs assigned to the slot.
Set it to 23, click OK, click Save, and then check the log to see if it started successfully.
Repeat lowering by 1 until it works.

After the work unit finishes you can set the CPUs back to 24.

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 9:17 pm
by kyleedwardsny
This is on a Linux server. There is no graphical application for me to click buttons. I'm guessing I need to edit a config file instead?

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 9:53 pm
by Joe_H
The answer was given in the first response: change the number of CPU threads it tries to run on. You can move the slider to Medium or Light if you have left control to that, or change the number in FAHControl's Configure. I would suggest 18 or 16 threads as a starting place.

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 9:55 pm
by _r2w_ben
kyleedwardsny wrote:This is on a Linux server. There is no graphical application for me to click buttons. I'm guessing I need to edit a config file instead?
Yes, config.xml should be located in /etc/fahclient/

Change

<slot id='0' type='CPU'/>
to

<slot id='0' type='CPU'>
    <cpus v='23' />
</slot>
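
If you would rather script the change than edit by hand, here is a minimal sketch in Python. It only assumes what is shown above (a `slot` element that gets a `cpus` child with a `v` attribute); the function name `set_slot_cpus` and the one-line sample config are made up for illustration. Note that ElementTree drops XML comments when it parses, and it is probably safest to stop the client before touching the real file in /etc/fahclient/ (that last point is my assumption, not something confirmed in this thread).

```python
import xml.etree.ElementTree as ET


def set_slot_cpus(config_xml: str, slot_id: str, cpus: int) -> str:
    """Return config_xml with <cpus v='N'/> set on the slot with the given id."""
    root = ET.fromstring(config_xml)
    for slot in root.iter("slot"):
        if slot.get("id") == slot_id:
            entry = slot.find("cpus")
            if entry is None:
                # The slot had no explicit CPU count yet, so add one.
                entry = ET.SubElement(slot, "cpus")
            entry.set("v", str(cpus))
            return ET.tostring(root, encoding="unicode")
    raise ValueError(f"no slot with id {slot_id!r}")


# Hypothetical, stripped-down config used only to exercise the function.
sample = "<config><slot id='0' type='CPU'/></config>"
print(set_slot_cpus(sample, "0", 23))
```

To apply it for real you would read config.xml into a string, pass it through the function, and write the result back with the same ownership and permissions.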

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 10:04 pm
by PantherX
kyleedwardsny wrote:Hi PantherX, sorry about the log file confusion on my part. Here is the system configuration portion of /config/log...
That's all good, the important thing to remember is we got there in the end :)

I have noticed that you're not using a passkey. While it is recommended for security and bonus points, it's optional so have a read here and then make a decision: https://foldingathome.org/support/faq/points/passkey/

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 10:16 pm
by kyleedwardsny
_r2w_ben wrote:Yes, config.xml should be located in /etc/fahclient/

Change

<slot id='0' type='CPU'/>
to

<slot id='0' type='CPU'>
    <cpus v='23' />
</slot>
Thank you _r2w_ben, this was exactly the information I needed! My client is now churning through the work unit with 16 out of 24 cores and has not crashed.

Re: _a7 core crashing in Gromacs

Posted: Sat Apr 11, 2020 10:32 pm
by kyleedwardsny
PantherX wrote:I have noticed that you're not using a passkey.
Thanks PantherX, I have heeded your advice and generated a passkey.

Re: _a7 core crashing in Gromacs

Posted: Sun Apr 12, 2020 12:09 am
by PantherX
FYI, I have confirmation from the Project owner that Project 16417 will no longer be assigned to 24 CPUs. Thanks all for your report :)

Re: _a7 core crashing in Gromacs

Posted: Sun Apr 12, 2020 3:45 am
by kyleedwardsny
Excellent! I will consider this issue closed then.

Re: _a7 core crashing in Gromacs

Posted: Wed Apr 15, 2020 4:12 pm
by m1geo
PantherX wrote:FYI, I have confirmation from the Project owner that Project 16417 will no longer be assigned to 24 CPUs. Thanks all for your report :)
Just received the same from project 16403. 24 CPUs. I've changed to 23 now, and it's working. I guess I need to put a GPU in here to make use of the "spare" CPU ;)

AMD Ryzen 9 3900X, 64 GB RAM, No GPU, Ubuntu Linux 19.10 x64.

Error below:

15:58:00:WU00:FS00:Starting
15:58:00:WU00:FS00:Removing old file './work/00/logfile_01-20200415-152559.txt'
15:58:00:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 705 -lifeline 1639 -checkpoint 15 -np 24
15:58:00:WU00:FS00:Started FahCore on PID 7868
15:58:00:WU00:FS00:Core PID:7872
15:58:00:WU00:FS00:FahCore 0xa7 started
15:58:00:WU00:FS00:0xa7:*********************** Log Started 2020-04-15T15:58:00Z ***********************
15:58:00:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
15:58:00:WU00:FS00:0xa7:       Type: 0xa7
15:58:00:WU00:FS00:0xa7:       Core: Gromacs
15:58:00:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 705 -lifeline 7868 -checkpoint 15 -np
15:58:00:WU00:FS00:0xa7:             24
15:58:00:WU00:FS00:0xa7:************************************ CBang *************************************
15:58:00:WU00:FS00:0xa7:       Date: Nov 5 2019
15:58:00:WU00:FS00:0xa7:       Time: 06:06:57
15:58:00:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
15:58:00:WU00:FS00:0xa7:     Branch: master
15:58:00:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
15:58:00:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
15:58:00:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
15:58:00:WU00:FS00:0xa7:       Bits: 64
15:58:00:WU00:FS00:0xa7:       Mode: Release
15:58:00:WU00:FS00:0xa7:************************************ System ************************************
15:58:00:WU00:FS00:0xa7:        CPU: AMD Ryzen 9 3900X 12-Core Processor
15:58:00:WU00:FS00:0xa7:     CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
15:58:00:WU00:FS00:0xa7:       CPUs: 24
15:58:00:WU00:FS00:0xa7:     Memory: 62.79GiB
15:58:00:WU00:FS00:0xa7:Free Memory: 61.72GiB
15:58:00:WU00:FS00:0xa7:    Threads: POSIX_THREADS
15:58:00:WU00:FS00:0xa7: OS Version: 5.3
15:58:00:WU00:FS00:0xa7:Has Battery: false
15:58:00:WU00:FS00:0xa7: On Battery: false
15:58:00:WU00:FS00:0xa7: UTC Offset: 1
15:58:00:WU00:FS00:0xa7:        PID: 7872
15:58:00:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
15:58:00:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
15:58:00:WU00:FS00:0xa7:    Version: 0.0.18
15:58:00:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:58:00:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
15:58:00:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
15:58:00:WU00:FS00:0xa7:       Date: Nov 5 2019
15:58:00:WU00:FS00:0xa7:       Time: 06:13:26
15:58:00:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
15:58:00:WU00:FS00:0xa7:     Branch: master
15:58:00:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
15:58:00:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
15:58:00:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
15:58:00:WU00:FS00:0xa7:       Bits: 64
15:58:00:WU00:FS00:0xa7:       Mode: Release
15:58:00:WU00:FS00:0xa7:************************************ Build *************************************
15:58:00:WU00:FS00:0xa7:       SIMD: avx_256
15:58:00:WU00:FS00:0xa7:********************************************************************************
15:58:00:WU00:FS00:0xa7:Project: 16403 (Run 1069, Clone 0, Gen 34)
15:58:00:WU00:FS00:0xa7:Unit: 0x0000002696880e6e5e8be09915f36d54
15:58:00:WU00:FS00:0xa7:Reading tar file core.xml
15:58:00:WU00:FS00:0xa7:Reading tar file frame34.tpr
15:58:00:WU00:FS00:0xa7:Digital signatures verified
15:58:00:WU00:FS00:0xa7:Calling: mdrun -s frame34.tpr -o frame34.trr -x frame34.xtc -cpt 15 -nt 24
15:58:00:WU00:FS00:0xa7:Steps: first=17000000 total=500000
15:58:00:WU00:FS00:0xa7:ERROR:
15:58:00:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
15:58:00:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
15:58:00:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/domdec.c, line: 6902
15:58:00:WU00:FS00:0xa7:ERROR:
15:58:00:WU00:FS00:0xa7:ERROR:Fatal error:
15:58:00:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
15:58:00:WU00:FS00:0xa7:ERROR:Change the number of ranks or mdrun option -rcon or -dds or your LINCS settings
15:58:00:WU00:FS00:0xa7:ERROR:Look in the log file for details on the domain decomposition
15:58:00:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
15:58:00:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
15:58:00:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
15:58:05:WU00:FS00:0xa7:WARNING:Unexpected exit() call
15:58:05:WU00:FS00:0xa7:WARNING:Unexpected exit from science code
15:58:05:WU00:FS00:0xa7:Saving result file ../logfile_01.txt
15:58:05:WU00:FS00:0xa7:Saving result file md.log
15:58:05:WU00:FS00:0xa7:Saving result file science.log
15:58:05:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Thanks
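
For anyone who wants to spot this failure without reading the whole log, here is a rough Python sketch. It looks for the two lines visible in the log above: the `mdrun ... -nt N` call and the "no domain decomposition for N ranks" error. The function name `diagnose` and the two-line sample are only for illustration, and the patterns are based on this one log, so other core versions may word things differently.

```python
import re

# Patterns taken from the log lines quoted in this post.
RANKS_ERROR = re.compile(r"There is no domain decomposition for (\d+) ranks")
THREADS_ARG = re.compile(r"-nt (\d+)")


def diagnose(log_text: str):
    """Return (threads, ranks) if the log shows the decomposition failure, else None."""
    err = RANKS_ERROR.search(log_text)
    if err is None:
        return None
    nt = THREADS_ARG.search(log_text)
    threads = int(nt.group(1)) if nt else None
    return threads, int(err.group(1))


sample = """\
15:58:00:WU00:FS00:0xa7:Calling: mdrun -s frame34.tpr -o frame34.trr -x frame34.xtc -cpt 15 -nt 24
15:58:00:WU00:FS00:0xa7:ERROR:There is no domain decomposition for 20 ranks that is compatible with the given box and a minimum cell size of 1.45733 nm
"""
print(diagnose(sample))  # prints (24, 20)
```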

Re: _a7 core crashing in Gromacs

Posted: Wed Apr 15, 2020 4:28 pm
by bruce
You can reduce the number of CPUs used to some number that works. The current version of Gromacs is allocating some threads to the PME processing, and we're still figuring out the details and what to do about it. All 24 CPUs are being used: four are allocated to PME and 20 to the domain processing, which is not ideal. Those 20 are the "20 ranks" named in the error in your log.
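
To give a feel for why a particular rank count can fail, here is a simplified Python sketch. It treats the box as a plain rectangular box, tries every nx x ny x nz grid whose product equals the rank count, and keeps the grids where every cell is at least the minimum cell size. The 1.45733 nm value comes from the error message; the 6 nm box is invented for illustration (the real box for these projects is not given in this thread), and the real Gromacs decomposition has more constraints than this.

```python
from itertools import product


def feasible_grids(ranks, box_nm, min_cell_nm):
    """List (nx, ny, nz) grids with nx*ny*nz == ranks whose cells are all >= min_cell_nm."""
    # Largest number of cells that fits along each box edge.
    limits = [int(edge // min_cell_nm) for edge in box_nm]
    grids = []
    for nx, ny, nz in product(*(range(1, m + 1) for m in limits)):
        if nx * ny * nz == ranks:
            grids.append((nx, ny, nz))
    return grids


BOX = (6.0, 6.0, 6.0)   # hypothetical box edges in nm, NOT from the work unit
MIN_CELL = 1.45733      # minimum cell size in nm, from the error message

for ranks in (20, 16):
    print(ranks, feasible_grids(ranks, BOX, MIN_CELL))
```

With this made-up box at most 4 cells fit along each edge, so 20 ranks (which needs a factor of 5 somewhere) has no valid grid, while 16 ranks does, for example 4 x 2 x 2.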

In config.xml you'll find something like this:
<slot type="CPU" id="0">
<cpus v="16"/>
</slot>

Adding the cpus entry will allow you to control the setting. 18 might work, or 16, or 12.

When we figure out about PME, we'll get back to you. 24 should work, but it's not.

If you want maximum utilization right now, the earlier suggestion of using two slots, one of 16 and one of 8 is a good one.

That would look something like this:

<slot type="CPU" id="0">
<cpus v="16"/>
</slot>
<slot type="CPU" id="1">
<cpus v="8"/>
</slot>

(Off-topic and just for my information.) Is it impossible to make FAHControl work in your environment? Would a text-based editor be better?

Re: _a7 core crashing in Gromacs

Posted: Wed Apr 15, 2020 4:34 pm
by Neil-B
For the most part 24-core works fine, so if you have a 24-core slot don't feel you have to change it down (unless, of course, it is to clear an issue like this one). Occasionally a project comes along that has issues, but hopefully these get picked up in beta/advanced and the project is simply not assigned to 24-core slots. On advanced, my 24-core slot was kept working pretty much 24/7 over the last month's leaner period, but I may just have been lucky. I'll check back, but I'm fairly sure I had no failures (there might have been one).

Re: _a7 core crashing in Gromacs

Posted: Wed Apr 15, 2020 6:16 pm
by m1geo
I'm not the OP. I was just reporting that this is still ongoing, since the last post before mine said it was resolved.

I switched it down to 23 CPUs for the time being and that's working fine.

Keep up the good work.

Re: _a7 core crashing in Gromacs

Posted: Wed Apr 15, 2020 8:32 pm
by PantherX
m1geo wrote:
PantherX wrote:FYI, I have confirmation from the Project owner that Project 16417 will no longer be assigned to 24 CPUs. Thanks all for your report :)
Just received the same from project 16403. 24 CPUs. I've changed to 23 now, and it's working. I guess I need to put a GPU in here to make use of the "spare" CPU ;)...
Please note that Project 16403 is different from Project 16417. However, I have informed the researcher, so let's see what happens :)

Re: _a7 core crashing in Gromacs

Posted: Wed Apr 15, 2020 10:37 pm
by PantherX
FYI, the researcher has decided to err on the side of caution and has prevented 24-CPU slots from receiving Project 16403.