13422 failing on RX 5700XT Linux
-
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21 - Location: UK
Re: 13422 failing on RX 5700XT Linux
It could be that the sanity checks/checkpoints come after steps 250 or 501. That may be why the errors always show up at that point: until the checks are done, the core doesn't know there is an issue.
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
Re: 13422 failing on RX 5700XT Linux
> If open science (for example science done at universities, financed by public money) will help us to have drugs approximately at the price of generics from the beginning, this will be a big advance for health and life of people, for which otherwise treatment will not be available.
Vijay Pande, the founder of FAH, is currently trying to make a lot of money for startup companies doing innovative work. I hope that includes anything they find useful from FAH. They need a lot of money from the successes to cover the failures, which are usually about 90% in risky fields that are being newly developed. They will use patents I expect to help them compete and bring products to market. If universities can do it better, that is fine with me. But I haven't seen it yet. (Universities get patents too by the way, and like to make as much money from them as they can.)
-
- Posts: 79
- Joined: Fri May 29, 2020 4:10 pm
Re: 13422 failing on RX 5700XT Linux
JimF wrote:
> > If open science (for example science done at universities, financed by public money) will help us to have drugs approximately at the price of generics from the beginning, this will be a big advance for health and life of people, for which otherwise treatment will not be available.
> Vijay Pande, the founder of FAH, is currently trying to make a lot of money for startup companies doing innovative work. I hope that includes anything they find useful from FAH. They need a lot of money from the successes to cover the failures, which are usually about 90% in risky fields that are being newly developed. They will use patents I expect to help them compete and bring products to market. If universities can do it better, that is fine with me. But I haven't seen it yet. (Universities get patents too by the way, and like to make as much money from them as they can.)
I hope that the results from FAH and project moonshot are published for unrestricted use instead of patented. At least that is what the promises look like and what should be self-evident for results based on "donors" electricity bills.
Re: 13422 failing on RX 5700XT Linux
I'm not in the drug business ... and I'm not one to defend their high prices. When I went to work in aerospace engineering, it was very clear that our job was to design products that could be sold to our customers ... and if it included an innovation that could be patented because it was a really novel idea, that was especially good for the engineer who invented it. I don't think it's any different in Big Pharma. Research performed for a public university cannot be patented, but the original work that happens to use that research can be.
Posting FAH's log:
How to provide enough info to get helpful support.
Re: 13422 failing on RX 5700XT Linux
Neil-B wrote:
> It could be that the sanity checks/checkpoints are after steps 250 or 501 ... That may be why the errors always show up at that point as until checks are done then the core doesn't know there is an issue?
That could be right, but I don't think so. If a NaN occurs before reaching a sanity check, it should fail there. Another possibility is that the message being issued isn't an accurate description of the problem. Running the error condition locally, whether in John's lab or ThWuensche's, should quickly test that possibility.
The first part of the log lists events that happen periodically. Do any of them happen at multiples of 250 steps?
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Pande Group Member
- Posts: 467
- Joined: Fri Feb 22, 2013 9:59 pm
Re: 13422 failing on RX 5700XT Linux
> I hope that the results from FAH and project moonshot are published for unrestricted use instead of patented. At least that is what the promises look like and what should be self-evident for results based on "donors" electricity bills.
The COVID Moonshot data is all being put into the public domain, and the project has committed to making all data public and keeping everything free of IP (no patents!): https://postera.ai/covid.
For now, the molecules we designed are all public and free of any IP protections, and all the data we collect will also be public. See here for more info: https://foldingathome.org/2020/08/24/co ... -sprint-3/
We have some philanthropic institutes lined up to help push to clinical trials so that we can deliver a drug that is as low-cost as possible that can be made by multiple manufacturers around the world.
We're working hard on automating the whole pipeline to bring you a real-time leaderboard for all compounds and compound statistics for each sprint so you can see the results of all calculations in real time.
We're also working to start automatically sweeping the raw simulation data (which is going to be much less exciting to the public, but useful to scientists a bit later) online, and will then be pushing it to the public server as soon as it is produced. It's just taking us a little while to get that infrastructure worked out, but it's coming very soon!
~ John Chodera // MSKCC
-
- Pande Group Member
- Posts: 467
- Joined: Fri Feb 22, 2013 9:59 pm
Re: 13422 failing on RX 5700XT Linux
ThWuensche wrote:
> With the information in your post I have installed miniconda on one more computer, installed openmm, downloaded the zip-file with tests and run RUN9. First it would break with particle coordinate nan before reporting any iterations, after setting the steps_per_iteration to one and increasing niterations it breaks after "completed 250 steps". That is a first hint, since I verified in the logs (FAH) on that computer that in most cases it also breaks at step 250. This indicates that it might be not a result of a calculation running away, but something systematically linked to step 250. Besides many occurrences of step 250 there are some with step 501. From the FAH logs I assumed that maybe 250 would be a first verification point and that would be the reason for that coincidence, but from running openmm in single iterations leading to the same result (counting up all steps before in the console output) that coincidence raises questions.
@ThWuensche: This is great! We now have a test case that makes it very easy to reproduce the issue! Can you post this on the OpenMM issue tracker? Our lead OpenMM developer Peter Eastman can work with you there to try a few more things to debug:
http://github.com/openmm/openmm
Please tag me there as @jchodera and I will chime in with more information and the test scripts.
If you can also post more details of your configuration (a FAH science logs header will do!) there, it will help keep the information organized.
Thank you for helping us get to the bottom of this!
~ John Chodera // MSKCC
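The single-step procedure described in the quoted post (advance one step per iteration, check the coordinates after each step) can be sketched in plain Python. This is only an illustration under stated assumptions: `first_nan_step` and `toy_step` are hypothetical names, and the toy system simply injects a NaN at step 250 to stand in for the real failure. It does not use OpenMM and does not reproduce the actual bug.

```python
import math
from typing import Callable, List, Optional

Positions = List[float]


def has_nan(positions: Positions) -> bool:
    """Return True if any coordinate is NaN."""
    return any(math.isnan(x) for x in positions)


def first_nan_step(step: Callable[[Positions, int], Positions],
                   positions: Positions,
                   max_steps: int) -> Optional[int]:
    """Advance one step at a time and return the first step whose
    output contains a NaN, or None if all max_steps are clean."""
    for n in range(1, max_steps + 1):
        positions = step(positions, n)
        if has_nan(positions):
            return n
    return None


def toy_step(positions: Positions, n: int) -> Positions:
    """Hypothetical stand-in for one integrator step.

    Drifts every coordinate slightly and injects a NaN at step 250
    (an assumption made purely for illustration).
    """
    moved = [x + 0.001 for x in positions]
    if n == 250:
        moved[0] = float("nan")
    return moved


if __name__ == "__main__":
    print(first_nan_step(toy_step, [0.0, 1.0, 2.0], max_steps=1000))
```

With a real simulation, `step` would wrap a single integrator step plus a read-back of the positions. Checking after every step is slow, but it separates "the failure happens at step 250" from "step 250 is merely the first point where anything is checked".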
-
- Posts: 79
- Joined: Fri May 29, 2020 4:10 pm
Re: 13422 failing on RX 5700XT Linux
JohnChodera wrote:
> @ThWuensche: This is great! We now have a test case that makes it very easy to reproduce the issue! Can you post this on the OpenMM issue tracker? Our lead OpenMM developer Peter Eastman can work with you there to try a few more things to debug:
> http://github.com/openmm/openmm
> Please tag me there as @jchodera and I will chime in with more information and the test scripts.
> If you can also post more details of your configuration (a FAH science logs header will do!) there, it will help keep the information organized.
> Thank you for helping us get to the bottom of this!
> ~ John Chodera // MSKCC
The issue is already open: https://github.com/openmm/openmm/issues/2813. As mentioned, it does not break when run with double precision.
The captured WU is 13422,4371,95,2. Here is the science.log from the time I rsynced the directory:
*************************** Core22 Folding@home Core ***************************
Core: Core22
Type: 0x22
Version: 0.0.11
Author: Joseph Coffland <joseph@cauldrondevelopment.com>
Copyright: 2020 foldingathome.org
Homepage: https://foldingathome.org/
Date: Jun 27 2020
Time: 22:50:00
Revision: cfc2940c5dd1aa80f60daa6e28d4a2a417f74edb
Branch: core22-0.0.11
Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
-funroll-loops
Platform: linux2 4.19.76-linuxkit
Bits: 64
Mode: Release
Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
<peastman@stanford.edu>
Args: -dir 03 -suffix 01 -version 706 -lifeline 12914 -checkpoint 15
-gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
************************************ libFAH ************************************
Date: Jun 27 2020
Time: 22:11:04
Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
Branch: HEAD
Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
-funroll-loops
Platform: linux2 4.19.76-linuxkit
Bits: 64
Mode: Release
************************************ CBang *************************************
Date: Jun 27 2020
Time: 22:10:11
Revision: f8529962055b0e7bde23e429f5072ff758089dee
Branch: HEAD
Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
Options: -std=c++11 -fsigned-char -ffunction-sections -fdata-sections -O3
-funroll-loops -fPIC
Platform: linux2 4.19.76-linuxkit
Bits: 64
Mode: Release
************************************ System ************************************
CPU: AMD Ryzen 7 3700X 8-Core Processor
CPU ID: AuthenticAMD Family 23 Model 113 Stepping 0
CPUs: 16
Memory: 15.59GiB
Free Memory: 11.82GiB
Threads: POSIX_THREADS
OS Version: 5.7
Has Battery: false
On Battery: false
UTC Offset: 2
PID: 12918
CWD: /var/lib/fahclient/work
********************************************************************************
Folding@home GPU Core22 Folding@home Core
Version 0.0.11
[1] compatible platform(s):
-- 0 --
PROFILE = FULL_PROFILE
VERSION = OpenCL 2.0 AMD-APP (3137.0)
NAME = AMD Accelerated Parallel Processing
VENDOR = Advanced Micro Devices, Inc.
(2) device(s) found on platform 0:
-- 0 --
DEVICE_NAME = gfx906+sram-ecc
DEVICE_VENDOR = Advanced Micro Devices, Inc.
DEVICE_VERSION = OpenCL 2.0
-- 1 --
DEVICE_NAME = gfx906+sram-ecc
DEVICE_VENDOR = Advanced Micro Devices, Inc.
DEVICE_VERSION = OpenCL 2.0
[ Entering Init ]
Launch time: 2020-08-24T17:43:14Z
Arguments passed: -dir 03 -suffix 01 -version 706 -lifeline 12914 -checkpoint 15 -gpu-vendor amd -opencl-platform 0 -opencl-device 0 -gpu 0
For testState comparison of CPU and GPU, will use:
forceTolerance: 5 kJ/mol/nm
energyTolerance: 10 kJ/mol
[ Leaving Init ]
[ Entering Main ]
Reading core settings...
Total number of steps: 1000000
Checkpoint write interval: 50000 steps (5%) [20 total]
JSON viewer frame write interval: 10000 steps (1%) [100 total]
XTC frame write interval: 250000 steps (25%) [4 total]
Global context and integrator variables write interval: 25000 steps (2.5%) [40 total]
[ Initializing Core Contexts ]
Using platform OpenCL
Looking for vendor: amd...found on platformId 0
Setting platform precision to mixed
Setting DisablePmeStream to 1
Checking for integrator.xml
Found integrator.xml
Loading integrator from integrator.xml
Stream copied, deserializing...
Checking for integrator.xml
Found integrator.xml
Loading integrator from integrator.xml
Stream copied, deserializing...
Checking for system.xml
Checking for system.xml.gz
Checking for system.xml.bz2
Found system.xml.bz2
Deserializing System...successful.
Found 90551 atoms, 10 forces.
Finding State XML file...
Checking for state.xml
Checking for state.xml.gz
Checking for state.xml.bz2
Found state.xml.bz2
Deserializing State...successful.
Ewald error tolerance in force 7 is 0.00025
Ewald parameters: alpha 2.75697 nx 96 ny 96 nz 96
Integrator Type: N6OpenMM16CustomIntegratorE
Constraint Tolerance: 1e-08
Time Step in PS: 0.004
Using CPU platform for reference calculations.
Performing initial sanity checks before starting work...
Comparing forces and energies between initial State and CPU...
Comparing forces and energies between GPU and CPU...
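As a quick check on the question raised earlier in the thread (whether any periodic event falls on step 250 or 501), the write intervals listed in the log above can be tested directly. This is a minimal sketch: the regular expression and the names `write_intervals` and `fires_at` are assumptions of this example, not part of FAH, and the excerpt contains only the four interval lines copied from the log.

```python
import re

# The four "write interval" lines, copied verbatim from the science.log above.
LOG_EXCERPT = """\
Checkpoint write interval: 50000 steps (5%) [20 total]
JSON viewer frame write interval: 10000 steps (1%) [100 total]
XTC frame write interval: 250000 steps (25%) [4 total]
Global context and integrator variables write interval: 25000 steps (2.5%) [40 total]
"""

# Hypothetical pattern: "<name> write interval: <N> steps"
PATTERN = re.compile(r"^(?P<name>.+?) write interval: (?P<steps>\d+) steps", re.M)


def write_intervals(log_text):
    """Map each periodic event name to its interval in steps."""
    return {m.group("name"): int(m.group("steps"))
            for m in PATTERN.finditer(log_text)}


def fires_at(interval, step):
    """True if an event with this interval fires exactly at `step`."""
    return step % interval == 0


if __name__ == "__main__":
    intervals = write_intervals(LOG_EXCERPT)
    for name, interval in intervals.items():
        hits = [s for s in (250, 501) if fires_at(interval, s)]
        print(f"{name}: every {interval} steps, fires at 250/501: {hits or 'no'}")
    print("earliest periodic event at step", min(intervals.values()))
```

None of the listed intervals fires at step 250 or 501 (the smallest is 10000 steps), so on this evidence the periodic writes do not explain the step-250 pattern. Any check at those steps would have to come from something not listed in this header.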
-
- Pande Group Member
- Posts: 467
- Joined: Fri Feb 22, 2013 9:59 pm
Re: 13422 failing on RX 5700XT Linux
Thanks, @ThWuensche! We'll continue the investigation on the OpenMM issue tracker and update everyone here with what we find.
~ John Chodera // MSKCC
-
- Pande Group Member
- Posts: 467
- Joined: Fri Feb 22, 2013 9:59 pm
Re: 13422 failing on RX 5700XT Linux
There's another VEGA-specific issue reported in the OpenMM issue tracker that may be related to the issues we're seeing here:
https://github.com/openmm/openmm/issues/2817
I'll keep you folks updated with what we find.
~ John Chodera // MSKCC