Project: 14531 (Run 0, Clone 1305, Gen 17)

Posted: Thu Apr 16, 2020 8:39 am
by AkrionXxarr
This WU appears to have stalled out on one of my Linux servers. You have a lot of dead links in your "Troubleshooting WUs" sticky (such as links to a non-existent wiki and to your FAH-specific CPU stress-testing software), so I'm not exactly sure where to start with regard to troubleshooting on my end.

Here's the log:

08:14:31:WU00:FS00:Starting
08:14:31:WU00:FS00:Removing old file './work/00/logfile_01-20200416-074250.txt'
08:14:31:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/v7/lin/64bit/avx/Core_a7.fah/FahCore_a7 -dir 00 -suffix 01 -version 704 -lifeline 513 -checkpoint 15 -np 8
08:14:31:WU00:FS00:Started FahCore on PID 855
08:14:31:WU00:FS00:Core PID:859
08:14:31:WU00:FS00:FahCore 0xa7 started
08:14:31:WU00:FS00:0xa7:*********************** Log Started 2020-04-16T08:14:31Z ***********************
08:14:31:WU00:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
08:14:31:WU00:FS00:0xa7:       Type: 0xa7
08:14:31:WU00:FS00:0xa7:       Core: Gromacs
08:14:31:WU00:FS00:0xa7:       Args: -dir 00 -suffix 01 -version 704 -lifeline 855 -checkpoint 15 -np 8
08:14:31:WU00:FS00:0xa7:************************************ CBang *************************************
08:14:31:WU00:FS00:0xa7:       Date: Nov 5 2019
08:14:31:WU00:FS00:0xa7:       Time: 06:06:57
08:14:31:WU00:FS00:0xa7:   Revision: 46c96f1aa8419571d83f3e63f9c99a0d602f6da9
08:14:31:WU00:FS00:0xa7:     Branch: master
08:14:31:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
08:14:31:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie -fPIC
08:14:31:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:14:31:WU00:FS00:0xa7:       Bits: 64
08:14:31:WU00:FS00:0xa7:       Mode: Release
08:14:31:WU00:FS00:0xa7:************************************ System ************************************
08:14:31:WU00:FS00:0xa7:        CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
08:14:31:WU00:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
08:14:31:WU00:FS00:0xa7:       CPUs: 8
08:14:31:WU00:FS00:0xa7:     Memory: 15.63GiB
08:14:31:WU00:FS00:0xa7:Free Memory: 15.40GiB
08:14:31:WU00:FS00:0xa7:    Threads: POSIX_THREADS
08:14:31:WU00:FS00:0xa7: OS Version: 4.9
08:14:31:WU00:FS00:0xa7:Has Battery: false
08:14:31:WU00:FS00:0xa7: On Battery: false
08:14:31:WU00:FS00:0xa7: UTC Offset: -7
08:14:31:WU00:FS00:0xa7:        PID: 859
08:14:31:WU00:FS00:0xa7:        CWD: /var/lib/fahclient/work
08:14:31:WU00:FS00:0xa7:******************************** Build - libFAH ********************************
08:14:31:WU00:FS00:0xa7:    Version: 0.0.18
08:14:31:WU00:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
08:14:31:WU00:FS00:0xa7:  Copyright: 2019 foldingathome.org
08:14:31:WU00:FS00:0xa7:   Homepage: https://foldingathome.org/
08:14:31:WU00:FS00:0xa7:       Date: Nov 5 2019
08:14:31:WU00:FS00:0xa7:       Time: 06:13:26
08:14:31:WU00:FS00:0xa7:   Revision: 490c9aa2957b725af319379424d5c5cb36efb656
08:14:31:WU00:FS00:0xa7:     Branch: master
08:14:31:WU00:FS00:0xa7:   Compiler: GNU 8.3.0
08:14:31:WU00:FS00:0xa7:    Options: -std=c++11 -O3 -funroll-loops -fno-pie
08:14:31:WU00:FS00:0xa7:   Platform: linux2 4.19.0-5-amd64
08:14:31:WU00:FS00:0xa7:       Bits: 64
08:14:31:WU00:FS00:0xa7:       Mode: Release
08:14:31:WU00:FS00:0xa7:************************************ Build *************************************
08:14:31:WU00:FS00:0xa7:       SIMD: avx_256
08:14:31:WU00:FS00:0xa7:********************************************************************************
08:14:31:WU00:FS00:0xa7:Project: 14531 (Run 0, Clone 1305, Gen 17)
08:14:31:WU00:FS00:0xa7:Unit: 0x0000001a80fccb0a5e6978bc26ce4efd
08:14:31:WU00:FS00:0xa7:Digital signatures verified
08:14:31:WU00:FS00:0xa7:Calling: mdrun -s frame17.tpr -o frame17.trr -cpi state.cpt -cpt 15 -nt 8
08:14:31:WU00:FS00:0xa7:Steps: first=4250000 total=250000
08:14:34:WU00:FS00:0xa7:Completed 223571 out of 250000 steps (89%)
08:14:36:WU00:FS00:0xa7:ERROR:
08:14:36:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
08:14:36:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:14:36:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
08:14:36:WU00:FS00:0xa7:ERROR:
08:14:36:WU00:FS00:0xa7:ERROR:Fatal error:
08:14:36:WU00:FS00:0xa7:ERROR:7 particles communicated to PME rank 0 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
08:14:36:WU00:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
08:14:36:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:14:36:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:14:36:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
08:14:36:WU00:FS00:0xa7:ERROR:
08:14:36:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
08:14:36:WU00:FS00:0xa7:ERROR:Program GROMACS, VERSION 5.0.4-20191026-456f0d636-unknown
08:14:36:WU00:FS00:0xa7:ERROR:Source code file: /host/debian-stable-64bit-core-a7-avx-release/gromacs-core/build/gromacs/src/gromacs/mdlib/pme.c, line: 754
08:14:36:WU00:FS00:0xa7:ERROR:
08:14:36:WU00:FS00:0xa7:ERROR:Fatal error:
08:14:36:WU00:FS00:0xa7:ERROR:2 particles communicated to PME rank 7 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.
08:14:36:WU00:FS00:0xa7:ERROR:This usually means that your system is not well equilibrated.
08:14:36:WU00:FS00:0xa7:ERROR:For more information and tips for troubleshooting, please check the GROMACS
08:14:36:WU00:FS00:0xa7:ERROR:website at http://www.gromacs.org/Documentation/Errors
08:14:36:WU00:FS00:0xa7:ERROR:-------------------------------------------------------
08:14:41:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
This machine has been chewing through plenty of WUs for the past few weeks so either there's an issue with the WU or my machine is giving up the ghost (it's not overclocked). It runs a minimal GUI-less install of Debian 9 and has been running F@H exclusively.
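
For anyone triaging a similar log, here is a minimal sketch (not an official F@H tool) that pulls the GROMACS fatal-error messages and the FahCore return status out of client log lines. The regular expressions assume the line prefix format shown in the log above, and `summarize_log` is a hypothetical helper name.

```python
import re

# Assumed prefix format, taken from the log above:
# "08:14:36:WU00:FS00:0xa7:ERROR:<text>"
ERROR_LINE = re.compile(r"^\d{2}:\d{2}:\d{2}:WU\d+:FS\d+:0x[0-9a-f]+:ERROR:(.*)$")
RETURN_LINE = re.compile(r"FahCore returned: (\w+) \((\d+) = 0x[0-9a-fA-F]+\)")

def summarize_log(lines):
    """Return (fatal_messages, status) for an iterable of log lines.

    fatal_messages holds the first non-empty ERROR line after each
    'Fatal error:' marker; status is (name, code) or None.
    """
    fatal_messages = []
    status = None
    expect_message = False
    for line in lines:
        line = line.rstrip("\n")
        m = ERROR_LINE.match(line)
        if m:
            text = m.group(1).strip()
            if expect_message and text:
                fatal_messages.append(text)
                expect_message = False
            elif text == "Fatal error:":
                expect_message = True
            continue
        r = RETURN_LINE.search(line)
        if r:
            status = (r.group(1), int(r.group(2)))
    return fatal_messages, status

# Three lines copied from the log above.
sample = [
    "08:14:36:WU00:FS00:0xa7:ERROR:Fatal error:",
    "08:14:36:WU00:FS00:0xa7:ERROR:7 particles communicated to PME rank 0 are more than 2/3 times the cut-off out of the domain decomposition cell of their charge group in dimension x.",
    "08:14:41:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)",
]
messages, status = summarize_log(sample)
print(status)
```

Running the same function over a whole log file (for example, `summarize_log(open(path))` with a path of your choosing) should make it easier to see whether the same error repeats across several WUs or only one.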

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Posted: Thu Apr 16, 2020 9:34 am
by PantherX
Welcome to the F@H Forum AkrionXxarr,

Apologies for the outdated sticky; a new version that fixes it will be published by the end of this week.

I have to say that this is a different type of domain decomposition issue that I haven't seen before. You're folding with 8 CPUs which is a stable number. It seems that you did manage to fold until 89% and then something didn't go to plan. Do you know if there were any changes made to the system or was it restarted?

It would be a good idea to get the first ~100 lines from your log file which will include the system configuration and the client settings.

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Posted: Thu Apr 16, 2020 10:36 am
by AkrionXxarr
Thanks for the welcome! And no worries about the sticky.

I hadn't made any changes to the machine, no. It stalled overnight. After a pause/unpause failed to work, I reset the server in case it needed a reboot after running for ~3 weeks, but that didn't fix it. I had the FAH client paused while waiting for a reply to this thread, and when I went to unpause it the WU was immediately sent (presumably at 89%) and I got a new one, so it doesn't look like I can generate any further logs from that particular WU.

The log I pasted in my initial post was what the advanced client control let me copy. I just went into the server and reloaded the client, so hopefully this is what you were looking for:

*********************** Log Started 2020-04-16T10:31:40Z ***********************
10:31:40:************************* Folding@home Client *************************
10:31:40:    Website: http://folding.stanford.edu/
10:31:40:  Copyright: (c) 2009-2014 Stanford University
10:31:40:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
10:31:40:       Args: --child --lifeline 511 /etc/fahclient/config.xml --run-as fahclient
10:31:40:             --pid-file=/var/run/fahclient.pid --daemon
10:31:40:     Config: /etc/fahclient/config.xml
10:31:40:******************************** Build ********************************
10:31:40:    Version: 7.4.4
10:31:40:       Date: Mar 4 2014
10:31:40:       Time: 12:02:38
10:31:40:    SVN Rev: 4130
10:31:40:     Branch: fah/trunk/client
10:31:40:   Compiler: GNU 4.4.7
10:31:40:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
10:31:40:             -fno-unsafe-math-optimizations -msse2
10:31:40:   Platform: linux2 3.2.0-1-amd64
10:31:40:       Bits: 64
10:31:40:       Mode: Release
10:31:40:******************************* System ********************************
10:31:40:        CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
10:31:40:     CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
10:31:40:       CPUs: 8
10:31:40:     Memory: 15.63GiB
10:31:40:Free Memory: 15.42GiB
10:31:40:    Threads: POSIX_THREADS
10:31:40: OS Version: 4.9
10:31:40:Has Battery: false
10:31:40: On Battery: false
10:31:40: UTC Offset: -7
10:31:40:        PID: 1046
10:31:40:        CWD: /var/lib/fahclient
10:31:40:         OS: Linux 4.9.0-3-amd64 x86_64
10:31:40:    OS Arch: AMD64
10:31:40:       GPUs: 1
10:31:40:      GPU 0: NVIDIA:2 GF116 [GeForce GTX 550 Ti] 691
10:31:40:       CUDA: Not detected
10:31:40:***********************************************************************
10:31:40:<config>
10:31:40:  <!-- HTTP Server -->
10:31:40:  <allow v='192.168.0.0/24'/>
10:31:40:
10:31:40:  <!-- Network -->
10:31:40:  <proxy v=':8080'/>
10:31:40:
10:31:40:  <!-- Remote Command Server -->
10:31:40:  <command-allow-no-pass v='192.168.0.0/24'/>
10:31:40:
10:31:40:  <!-- Slot Control -->
10:31:40:  <power v='full'/>
10:31:40:
10:31:40:  <!-- User Information -->
10:31:40:  <user v='AkrionXxarr'/>
10:31:40:
10:31:40:  <!-- Folding Slots -->
10:31:40:  <slot id='0' type='CPU'>
10:31:40:    <paused v='true'/>
10:31:40:  </slot>
10:31:40:</config>
10:31:40:Switching to user fahclient
10:31:40:Trying to access database...
10:31:40:Successfully acquired database lock
10:31:40:Enabled folding slot 00: PAUSED cpu:8 (by user)
Edit:
Also, when I did some searching around for this error message prior to posting, the only stuff I could find seemed to indicate it's an issue with the simulation.
See article 4.3 on this page: http://www.gromacs.org/Documentation/Errors
(This is the best I could find with my admittedly limited research)

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Posted: Sat Apr 18, 2020 1:24 am
by PantherX
Using 8 CPUs and getting that error is weird. There's a possibility that it could be a bad WU.

I also noticed that you're not using a passkey, which is recommended both for security reasons and for bonus points. You can read more about it here: https://foldingathome.org/support/faq/points/passkey/

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Posted: Sun Apr 19, 2020 8:05 am
by AkrionXxarr
Yeah I've got the passkey set up now.

In any case the machine that ran into trouble appears to have comfortably chewed through another handful of WUs (no idea how many), so I'm leaning towards this being a problem with that specific WU.

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Posted: Sun Apr 19, 2020 2:38 pm
by HendricksSA
I am fuzzy on my core AVX-256 specifics. Does that require just AVX on the processor or does it need a higher version, like AVX2? Since AkrionXxarr has been completing WUs, I guess original AVX on the i7-2600K is enough. Can someone jog my memory?

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Posted: Sun Apr 19, 2020 6:03 pm
by Joe_H
Original AVX supports some operations as 256-bit, and also supports 128-bit SIMD operations. AVX2 adds additional 256-bit operations.

On Intel processors, AVX has been supported since the Sandy Bridge Core i-series as 256-bit. Some of the early AMD processors which support AVX do it with 128-bit operations.
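
To check which of these instruction sets a Linux machine reports, one option is to read the `flags` line in `/proc/cpuinfo`. The sketch below is a minimal illustration; `simd_flags` is a hypothetical helper name, and the sample text is a shortened, made-up flags line for a Sandy Bridge-class CPU rather than real output from the machine in this thread.

```python
def simd_flags(cpuinfo_text):
    """Report whether 'avx' and 'avx2' appear in the first 'flags' line
    of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags = set(value.split())
            return {"avx": "avx" in flags, "avx2": "avx2" in flags}
    # No flags line found (for example, a non-x86 system).
    return {"avx": False, "avx2": False}

# Hypothetical, shortened sample: AVX present, AVX2 absent.
sample = (
    "model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz\n"
    "flags : fpu sse sse2 ssse3 sse4_1 sse4_2 avx\n"
)
print(simd_flags(sample))  # {'avx': True, 'avx2': False}
```

On a real system this could be called as `simd_flags(open("/proc/cpuinfo").read())`.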

Re: Project: 14531 (Run 0, Clone 1305, Gen 17)

Posted: Sun Apr 19, 2020 9:02 pm
by AkrionXxarr
To add some extra information, I've got a second, identical Linux machine that's been folding without issue. As of right now I've completed 582 WUs total, and I'd guess that my desktop has completed maybe 40-45% of them (it's far more powerful than the Linux machines CPU-wise and is also running GPU projects, whereas the Linux machines are limited to CPU projects), so I'd say my two Linux machines have completed roughly 160 WUs each. This is why I figured it was either an issue with the WU or a very recent issue with my hardware.
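
As a quick sanity check on that estimate, the arithmetic can be sketched as follows. It assumes the remaining WUs are split evenly between the two Linux machines, which is an assumption for illustration; `per_linux_machine` is a hypothetical helper name.

```python
def per_linux_machine(total_wus, desktop_share, n_linux=2):
    """WUs per Linux machine if the non-desktop share is split evenly."""
    return total_wus * (1 - desktop_share) / n_linux

total_wus = 582  # total reported in the post above
for desktop_share in (0.40, 0.45):
    estimate = per_linux_machine(total_wus, desktop_share)
    print(f"desktop share {desktop_share:.0%}: about {estimate:.0f} WUs per Linux machine")
```

With a desktop share of 45% this gives about 160 WUs per machine, and with 40% about 175, so "roughly 160 each" sits at the lower end of the stated range.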