Corrupted / bad job 18237/1069/0/71 (failing for all users)

Moderators: Site Moderators, FAHC Science Team

PaulTV
Posts: 210
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Post by PaulTV »

Hi,

Job https://apps.foldingathome.org/wu#proje ... e=0&gen=71 is failing every time on different systems. Please pull it.

Ryzen 9800X3D / RTX 4090 / Windows 11
Ryzen 5600X / RTX 3070 Ti / Ubuntu 22.04
Ryzen 5600 / RTX 3060 Ti / Windows 11
Nicolas_orleans
Posts: 114
Joined: Wed Aug 08, 2012 3:08 am

Post by Nicolas_orleans »

Hi Paul,
Of the 17 different GPU projects assigned to my system since October, P18237 is the only one that fails regularly (though not for 100% of WUs), with Force RMSE errors.
Could the issue be wider than your particular WU?
Best regards
Nicolas
PaulTV
Posts: 210
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Post by PaulTV »

Hello,

So far I've completed 120 jobs from P18237 successfully; this particular one is the first to fail.

I don't know the science behind the cores and the jobs. If this project fails more often on your machine, while other projects all run fine, it makes me wonder what it's doing that's so special...
PaulTV
Posts: 210
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Post by PaulTV »

See below the full log for this particular job. It appears the potential energy is off at the starting point for this job.

Code: Select all

12:30:27:WU00:FS01:Connecting to assign1.foldingathome.org:80
12:30:27:WU00:FS01:Assigned to work server 158.130.118.23
12:30:27:WU00:FS01:Requesting new work unit for slot 01: gpu:7:0 AD102 [GeForce RTX 4090] from 158.130.118.23
12:30:27:WU00:FS01:Connecting to 158.130.118.23:8080
12:30:28:WU00:FS01:Downloading 10.83MiB
12:30:29:WU00:FS01:Download complete
12:30:29:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:18237 run:1069 clone:0 gen:71 core:0x24 unit:0x00000000000000470000473d0000042d
12:30:59:WU00:FS01:Starting
12:30:59:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/openmm-core-24/windows-10-64bit/release/0x24-8.1.4/Core_24.fah/FahCore_24.exe -dir 00 -suffix 01 -version 706 -lifeline 21196 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
12:30:59:WU00:FS01:Started FahCore on PID 8148
12:30:59:WU00:FS01:Core PID:9244
12:30:59:WU00:FS01:FahCore 0x24 started
12:30:59:WU00:FS01:0x24:*********************** Log Started 2024-11-05T12:30:59Z ***********************
12:30:59:WU00:FS01:0x24:*************************** Core24 Folding@home Core ***************************
12:30:59:WU00:FS01:0x24:       Core: Core24
12:30:59:WU00:FS01:0x24:       Type: 0x24
12:30:59:WU00:FS01:0x24:    Version: 8.1.4
12:30:59:WU00:FS01:0x24:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
12:30:59:WU00:FS01:0x24:  Copyright: 2022 foldingathome.org
12:30:59:WU00:FS01:0x24:   Homepage: https://foldingathome.org/
12:30:59:WU00:FS01:0x24:       Date: Jul 25 2024
12:30:59:WU00:FS01:0x24:       Time: 05:42:49
12:30:59:WU00:FS01:0x24:   Revision: cf9f0139862b8945a2091772770e4631aac37792
12:30:59:WU00:FS01:0x24:     Branch: HEAD
12:30:59:WU00:FS01:0x24:   Compiler: Visual C++
12:30:59:WU00:FS01:0x24:    Options: $( /TP $) /std:c++14 /nologo /EHa /wd4297 /wd4103 /O2
12:30:59:WU00:FS01:0x24:             /Zc:throwingNew /MT -DOPENMM_VERSION="\"8.1.1\"" /Ox /std:c++14
12:30:59:WU00:FS01:0x24:   Platform: win32 10
12:30:59:WU00:FS01:0x24:       Bits: 64
12:30:59:WU00:FS01:0x24:       Mode: Release
12:30:59:WU00:FS01:0x24:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
12:30:59:WU00:FS01:0x24:             <peastman@stanford.edu>
12:30:59:WU00:FS01:0x24:       Args: -dir 00 -suffix 01 -version 706 -lifeline 8148 -checkpoint 15
12:30:59:WU00:FS01:0x24:             -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor
12:30:59:WU00:FS01:0x24:             nvidia -gpu 0 -gpu-usage 100
12:30:59:WU00:FS01:0x24:************************************ libFAH ************************************
12:30:59:WU00:FS01:0x24:       Date: Jul 25 2024
12:30:59:WU00:FS01:0x24:       Time: 05:23:50
12:30:59:WU00:FS01:0x24:   Revision: c7d2824a47eb025fa8cda8968c7a5e971585d90c
12:30:59:WU00:FS01:0x24:     Branch: HEAD
12:30:59:WU00:FS01:0x24:   Compiler: Visual C++
12:30:59:WU00:FS01:0x24:    Options: $( /TP $) /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
12:30:59:WU00:FS01:0x24:   Platform: win32 10
12:30:59:WU00:FS01:0x24:       Bits: 64
12:30:59:WU00:FS01:0x24:       Mode: Release
12:30:59:WU00:FS01:0x24:************************************ CBang *************************************
12:30:59:WU00:FS01:0x24:    Version: 1.7.2
12:30:59:WU00:FS01:0x24:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
12:30:59:WU00:FS01:0x24:        Org: Cauldron Development LLC
12:30:59:WU00:FS01:0x24:  Copyright: Cauldron Development LLC, 2003-2024
12:30:59:WU00:FS01:0x24:   Homepage: https://cauldrondevelopment.com/
12:30:59:WU00:FS01:0x24:    License: LGPL-2.1-or-later
12:30:59:WU00:FS01:0x24:       Date: Jul 25 2024
12:30:59:WU00:FS01:0x24:       Time: 05:22:43
12:30:59:WU00:FS01:0x24:   Revision: f1cd4c791e8c40a35dcfeab3ab85d910949cc0cb
12:30:59:WU00:FS01:0x24:     Branch: HEAD
12:30:59:WU00:FS01:0x24:   Compiler: Visual C++
12:30:59:WU00:FS01:0x24:    Options: $( /TP $) /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
12:30:59:WU00:FS01:0x24:   Platform: win32 10
12:30:59:WU00:FS01:0x24:       Bits: 64
12:30:59:WU00:FS01:0x24:       Mode: Release
12:30:59:WU00:FS01:0x24:************************************ System ************************************
12:30:59:WU00:FS01:0x24:        CPU: AMD Ryzen 7 5800X 8-Core Processor
12:30:59:WU00:FS01:0x24:     CPU ID: AuthenticAMD Family 25 Model 33 Stepping 0
12:30:59:WU00:FS01:0x24:       CPUs: 16
12:30:59:WU00:FS01:0x24:     Memory: 31.89GiB
12:30:59:WU00:FS01:0x24:Free Memory: 25.69GiB
12:30:59:WU00:FS01:0x24: OS Version: 10.0
12:30:59:WU00:FS01:0x24:Has Battery: false
12:30:59:WU00:FS01:0x24: On Battery: false
12:30:59:WU00:FS01:0x24:   Hostname: Desktop
12:30:59:WU00:FS01:0x24: UTC Offset: 1
12:30:59:WU00:FS01:0x24:        PID: 9244
12:30:59:WU00:FS01:0x24:        CWD: C:\ProgramData\FAHClient\work
12:30:59:WU00:FS01:0x24:       Exec: C:\ProgramData\FAHClient\cores\cores.foldingathome.org\openmm-core-24\windows-10-64bit\release\0x24-8.1.4\Core_24.fah\FahCore_24.exe
12:30:59:WU00:FS01:0x24:************************************ OpenMM ************************************
12:30:59:WU00:FS01:0x24:    Version: 8.1.1
12:30:59:WU00:FS01:0x24:********************************************************************************
12:30:59:WU00:FS01:0x24:Project: 18237 (Run 1069, Clone 0, Gen 71)
12:30:59:WU00:FS01:0x24:Reading tar file core.xml
12:30:59:WU00:FS01:0x24:Reading tar file integrator.xml
12:30:59:WU00:FS01:0x24:Reading tar file state.xml.bz2
12:30:59:WU00:FS01:0x24:Reading tar file system.xml.bz2
12:30:59:WU00:FS01:0x24:Digital signatures verified
12:30:59:WU00:FS01:0x24:Folding@home GPU Core24 Folding@home Core
12:30:59:WU00:FS01:0x24:Version 8.1.4
12:30:59:WU00:FS01:0x24:  Checkpoint write interval: 50000 steps (2%) [50 total]
12:30:59:WU00:FS01:0x24:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
12:30:59:WU00:FS01:0x24:  XTC frame write interval: 10000 steps (0.4%) [250 total]
12:30:59:WU00:FS01:0x24:  TRR frame write interval: disabled
12:30:59:WU00:FS01:0x24:  Global context and integrator variables write interval: disabled
12:30:59:WU00:FS01:0x24:There are 4 platforms available.
12:30:59:WU00:FS01:0x24:Platform 0: Reference
12:30:59:WU00:FS01:0x24:Platform 1: CPU
12:30:59:WU00:FS01:0x24:Platform 2: OpenCL
12:30:59:WU00:FS01:0x24:  opencl-device 0 specified
12:30:59:WU00:FS01:0x24:Platform 3: CUDA
12:30:59:WU00:FS01:0x24:  cuda-device 0 specified
12:31:07:WU00:FS01:0x24:Attempting to create CUDA context:
12:31:07:WU00:FS01:0x24:  Configuring platform CUDA
12:31:09:WU00:FS01:0x24:ERROR:Potential energy error of 296.63, threshold of 20
12:31:09:WU00:FS01:0x24:ERROR:Reference Potential Energy: -1.94858e+06 | Given Potential Energy: -1.94887e+06
12:31:09:WU00:FS01:0x24:Saving result file ..\logfile_01.txt
12:31:10:WU00:FS01:0x24:Saving result file science.log
12:31:10:WU00:FS01:0x24:Saving result file state.xml.bz2
12:31:10:WU00:FS01:0x24:Folding@home Core Shutdown: BAD_WORK_UNIT
12:31:10:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:31:10:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:18237 run:1069 clone:0 gen:71 core:0x24 unit:0x00000000000000470000473d0000042d
12:31:10:WU00:FS01:Uploading 9.68MiB to 158.130.118.23
12:31:10:WU00:FS01:Connecting to 158.130.118.23:8080
12:31:12:WU00:FS01:Upload complete
12:31:12:WU00:FS01:Server responded WORK_ACK (400)
12:31:12:WU00:FS01:Cleaning up
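
For anyone who wants to check the same thing in their own logs, here is a minimal Python sketch (not part of the FAH client or core) that pulls the numbers out of the two ERROR lines above and compares them. The function name `parse_energy_error` and the regular expressions are my own assumptions, based only on the wording of those two log lines. It also assumes the reported error is simply the absolute difference between the two energies, which the log does not state explicitly.

```python
import re

# Patterns are assumptions based on the wording of the two ERROR lines in the log above.
ERROR_RE = re.compile(
    r"Potential energy error of ([-+0-9.eE]+), threshold of ([-+0-9.eE]+)"
)
ENERGY_RE = re.compile(
    r"Reference Potential Energy: ([-+0-9.eE]+) \| Given Potential Energy: ([-+0-9.eE]+)"
)


def parse_energy_error(log_text):
    """Return the energy-check numbers found in a core log, or None if absent."""
    err = ERROR_RE.search(log_text)
    ene = ENERGY_RE.search(log_text)
    if not (err and ene):
        return None
    error, threshold = float(err.group(1)), float(err.group(2))
    reference, given = float(ene.group(1)), float(ene.group(2))
    return {
        "error": error,
        "threshold": threshold,
        "reference": reference,
        "given": given,
        # Difference of the energies as printed in the log (assumed comparison).
        "printed_difference": abs(reference - given),
        "exceeds_threshold": error > threshold,
    }


# The two ERROR lines copied from the log above.
sample = """\
12:31:09:WU00:FS01:0x24:ERROR:Potential energy error of 296.63, threshold of 20
12:31:09:WU00:FS01:0x24:ERROR:Reference Potential Energy: -1.94858e+06 | Given Potential Energy: -1.94887e+06
"""

result = parse_energy_error(sample)
print(result)
```

The energies as printed differ by about 290, which is in the same range as the reported 296.63; the gap is presumably because the log prints the energies to only six significant digits. Either way, the mismatch is far above the threshold of 20.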
PaulTV
Posts: 210
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Post by PaulTV »

Oh wow... what a coincidence. I hit a similar issue with 16780/17/0/107, and I'm not the first to encounter it either: https://apps.foldingathome.org/wu#proje ... =0&gen=107. I have literally run thousands of jobs on this machine since the last time jobs blew up (and that was not my setup's fault either).

Code: Select all

20:07:48:WU01:FS01:Connecting to assign1.foldingathome.org:80
20:07:48:WU01:FS01:Assigned to work server 128.104.69.82
20:07:48:WU01:FS01:Requesting new work unit for slot 01: gpu:7:0 AD102 [GeForce RTX 4090] from 128.104.69.82
20:07:48:WU01:FS01:Connecting to 128.104.69.82:8080
20:08:00:WU01:FS01:Downloading 50.07MiB
20:08:05:WU01:FS01:Download complete
20:08:05:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:16780 run:17 clone:0 gen:107 core:0x23 unit:0x6b00000000000000110000008c410000
20:08:21:WU01:FS01:Starting
20:08:21:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/openmm-core-23/windows-10-64bit/release/0x23-8.0.3/Core_23.fah/FahCore_23.exe -dir 01 -suffix 01 -version 706 -lifeline 21196 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
20:08:21:WU01:FS01:Started FahCore on PID 26128
20:08:21:WU01:FS01:Core PID:26220
20:08:21:WU01:FS01:FahCore 0x23 started
20:08:21:WU01:FS01:0x23:*********************** Log Started 2024-11-06T20:08:21Z ***********************
20:08:21:WU01:FS01:0x23:*************************** Core23 Folding@home Core ***************************
20:08:21:WU01:FS01:0x23:       Core: Core23
20:08:21:WU01:FS01:0x23:       Type: 0x23
20:08:21:WU01:FS01:0x23:    Version: 8.0.3
20:08:21:WU01:FS01:0x23:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
20:08:21:WU01:FS01:0x23:  Copyright: 2022 foldingathome.org
20:08:21:WU01:FS01:0x23:   Homepage: https://foldingathome.org/
20:08:21:WU01:FS01:0x23:       Date: Aug 3 2023
20:08:21:WU01:FS01:0x23:       Time: 08:39:06
20:08:21:WU01:FS01:0x23:   Compiler: Visual C++
20:08:21:WU01:FS01:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
20:08:21:WU01:FS01:0x23:             -DOPENMM_VERSION="\"8.0.0\""
20:08:21:WU01:FS01:0x23:   Platform: win32 10
20:08:21:WU01:FS01:0x23:       Bits: 64
20:08:21:WU01:FS01:0x23:       Mode: Release
20:08:21:WU01:FS01:0x23:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
20:08:21:WU01:FS01:0x23:             <peastman@stanford.edu>
20:08:21:WU01:FS01:0x23:       Args: -dir 01 -suffix 01 -version 706 -lifeline 26128 -checkpoint 15
20:08:21:WU01:FS01:0x23:             -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor
20:08:21:WU01:FS01:0x23:             nvidia -gpu 0 -gpu-usage 100
20:08:21:WU01:FS01:0x23:************************************ libFAH ************************************
20:08:21:WU01:FS01:0x23:       Date: Aug 3 2023
20:08:21:WU01:FS01:0x23:       Time: 08:37:55
20:08:21:WU01:FS01:0x23:   Compiler: Visual C++
20:08:21:WU01:FS01:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
20:08:21:WU01:FS01:0x23:   Platform: win32 10
20:08:21:WU01:FS01:0x23:       Bits: 64
20:08:21:WU01:FS01:0x23:       Mode: Release
20:08:21:WU01:FS01:0x23:************************************ CBang *************************************
20:08:21:WU01:FS01:0x23:    Version: 1.7.2
20:08:21:WU01:FS01:0x23:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
20:08:21:WU01:FS01:0x23:        Org: Cauldron Development LLC
20:08:21:WU01:FS01:0x23:  Copyright: Cauldron Development LLC, 2003-2023
20:08:21:WU01:FS01:0x23:   Homepage: https://cauldrondevelopment.com/
20:08:21:WU01:FS01:0x23:    License: GPL 2+
20:08:21:WU01:FS01:0x23:       Date: Aug 3 2023
20:08:21:WU01:FS01:0x23:       Time: 08:37:14
20:08:21:WU01:FS01:0x23:   Compiler: Visual C++
20:08:21:WU01:FS01:0x23:    Options: /TP /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
20:08:21:WU01:FS01:0x23:   Platform: win32 10
20:08:21:WU01:FS01:0x23:       Bits: 64
20:08:21:WU01:FS01:0x23:       Mode: Release
20:08:21:WU01:FS01:0x23:************************************ System ************************************
20:08:21:WU01:FS01:0x23:        CPU: AMD Ryzen 7 5800X 8-Core Processor
20:08:21:WU01:FS01:0x23:     CPU ID: AuthenticAMD Family 25 Model 33 Stepping 0
20:08:21:WU01:FS01:0x23:       CPUs: 16
20:08:21:WU01:FS01:0x23:     Memory: 31.89GiB
20:08:21:WU01:FS01:0x23:Free Memory: 23.85GiB
20:08:21:WU01:FS01:0x23:    Threads: WINDOWS_THREADS
20:08:21:WU01:FS01:0x23: OS Version: 6.2
20:08:21:WU01:FS01:0x23:Has Battery: false
20:08:21:WU01:FS01:0x23: On Battery: false
20:08:21:WU01:FS01:0x23: UTC Offset: 1
20:08:21:WU01:FS01:0x23:        PID: 26220
20:08:21:WU01:FS01:0x23:        CWD: C:\ProgramData\FAHClient\work
20:08:21:WU01:FS01:0x23:       Exec: C:\ProgramData\FAHClient\cores\cores.foldingathome.org\openmm-core-23\windows-10-64bit\release\0x23-8.0.3\Core_23.fah\FahCore_23.exe
20:08:21:WU01:FS01:0x23:************************************ OpenMM ************************************
20:08:21:WU01:FS01:0x23:    Version: 8.0.0
20:08:21:WU01:FS01:0x23:********************************************************************************
20:08:21:WU01:FS01:0x23:Project: 16780 (Run 17, Clone 0, Gen 107)
20:08:21:WU01:FS01:0x23:Reading tar file core.xml
20:08:21:WU01:FS01:0x23:Reading tar file integrator.xml
20:08:21:WU01:FS01:0x23:Reading tar file state.xml
20:08:22:WU01:FS01:0x23:Reading tar file system.xml
20:08:22:WU01:FS01:0x23:Digital signatures verified
20:08:22:WU01:FS01:0x23:Folding@home GPU Core23 Folding@home Core
20:08:22:WU01:FS01:0x23:Version 8.0.3
20:08:22:WU01:FS01:0x23:  Checkpoint write interval: 50000 steps (2%) [50 total]
20:08:22:WU01:FS01:0x23:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
20:08:22:WU01:FS01:0x23:  XTC frame write interval: 25000 steps (1%) [100 total]
20:08:22:WU01:FS01:0x23:  Global context and integrator variables write interval: disabled
20:08:23:WU01:FS01:0x23:There are 4 platforms available.
20:08:23:WU01:FS01:0x23:Platform 0: Reference
20:08:23:WU01:FS01:0x23:Platform 1: CPU
20:08:23:WU01:FS01:0x23:Platform 2: OpenCL
20:08:23:WU01:FS01:0x23:  opencl-device 0 specified
20:08:23:WU01:FS01:0x23:Platform 3: CUDA
20:08:23:WU01:FS01:0x23:  cuda-device 0 specified
20:08:51:WU01:FS01:0x23:Attempting to create CUDA context:
20:08:51:WU01:FS01:0x23:  Configuring platform CUDA
20:08:56:WU01:FS01:0x23:ERROR:Discrepancy: Forces are blowing up! 132637 0
20:08:56:WU01:FS01:0x23:Saving result file ..\logfile_01.txt
20:08:56:WU01:FS01:0x23:Saving result file science.log
20:08:56:WU01:FS01:0x23:Saving result file state.xml
20:09:01:WU01:FS01:0x23:Folding@home Core Shutdown: BAD_WORK_UNIT
20:09:02:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
20:09:02:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:16780 run:17 clone:0 gen:107 core:0x23 unit:0x6b00000000000000110000008c410000
20:09:02:WU01:FS01:Uploading 41.84MiB to 128.104.69.82
20:09:02:WU01:FS01:Connecting to 128.104.69.82:8080
20:09:07:WU01:FS01:Upload complete
20:09:07:WU01:FS01:Server responded WORK_ACK (400)
20:09:07:WU01:FS01:Cleaning up
Joe_H
Site Admin
Posts: 7936
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Post by Joe_H »

I have reported the Project 18237 WU to the researcher. It should have stopped being assigned after multiple failures; it looks like that happened back in October, but it then started being assigned again 2 days ago. The 5 failures on the Project 16780 WU should be enough to automatically keep it from being reassigned. I will check on that in a day or so to see whether that happens.

As for what can be different between WUs for the same project: each Run starts with a different set of initial conditions, so the trajectory calculated from there will be different for each. The final results use statistical analysis to determine the most likely states and the pathways between them. Some trajectories do "blow up" and cannot proceed further.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Nicolas_orleans
Posts: 114
Joined: Wed Aug 08, 2012 3:08 am

Post by Nicolas_orleans »

Hi Paul

I have browsed my logs and here is one sample of the first Force RMSE error I saw mid-October. I have dozens like this, only for this particular project.

Code: Select all

17:41:08:I3:WU40:Started FahCore on PID 8605
17:41:09:I1:WU40:*********************** Log Started 2024-10-18T17:41:09Z ***********************
17:41:09:I1:WU40:*************************** Core24 Folding@home Core ***************************
17:41:09:I1:WU40:       Core: Core24
17:41:09:I1:WU40:       Type: 0x24
17:41:09:I1:WU40:    Version: 8.1.4
17:41:09:I1:WU40:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
17:41:09:I1:WU40:  Copyright: 2022 foldingathome.org
17:41:09:I1:WU40:   Homepage: https://foldingathome.org/
17:41:09:I1:WU40:       Date: Jul 25 2024
17:41:09:I1:WU40:       Time: 05:19:51
17:41:09:I1:WU40:   Revision: cf9f0139862b8945a2091772770e4631aac37792
17:41:09:I1:WU40:     Branch: HEAD
17:41:09:I1:WU40:   Compiler: GNU 7.5.0
17:41:09:I1:WU40:    Options: -faligned-new -std=c++14 -fsigned-char -ffunction-sections
17:41:09:I1:WU40:             -fdata-sections -O3 -funroll-loops -fno-pie
17:41:09:I1:WU40:             -DOPENMM_VERSION="\"8.1.1\""
17:41:09:I1:WU40:   Platform: linux 6.5.0-1024-azure
17:41:09:I1:WU40:       Bits: 64
17:41:09:I1:WU40:       Mode: Release
17:41:09:I1:WU40:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
17:41:09:I1:WU40:             <peastman@stanford.edu>
17:41:09:I1:WU40:       Args: -dir B0nhuCVFSLERWJi2TZDzOpMXnDN1YnKynaIiF7aX4OU -suffix 01
17:41:09:I1:WU40:             -version 8.3.18 -lifeline 1299 -gpu-vendor nvidia -opencl-platform
17:41:09:I1:WU40:             0 -opencl-device 0 -cuda-platform 0 -cuda-device 0 -gpu 0
17:41:09:I1:WU40:************************************ libFAH ************************************
17:41:09:I1:WU40:       Date: Jul 25 2024
17:41:09:I1:WU40:       Time: 05:13:14
17:41:09:I1:WU40:   Revision: c7d2824a47eb025fa8cda8968c7a5e971585d90c
17:41:09:I1:WU40:     Branch: HEAD
17:41:09:I1:WU40:   Compiler: GNU 7.5.0
17:41:09:I1:WU40:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
17:41:09:I1:WU40:             -fdata-sections -O3 -funroll-loops -fno-pie
17:41:09:I1:WU40:   Platform: linux 6.5.0-1024-azure
17:41:09:I1:WU40:       Bits: 64
17:41:09:I1:WU40:       Mode: Release
17:41:09:I1:WU40:************************************ CBang *************************************
17:41:09:I1:WU40:    Version: 1.7.2
17:41:09:I1:WU40:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
17:41:09:I1:WU40:        Org: Cauldron Development LLC
17:41:09:I1:WU40:  Copyright: Cauldron Development LLC, 2003-2024
17:41:09:I1:WU40:   Homepage: https://cauldrondevelopment.com/
17:41:09:I1:WU40:    License: LGPL-2.1-or-later
17:41:09:I1:WU40:       Date: Jul 25 2024
17:41:09:I1:WU40:       Time: 05:12:47
17:41:09:I1:WU40:   Revision: f1cd4c791e8c40a35dcfeab3ab85d910949cc0cb
17:41:09:I1:WU40:     Branch: HEAD
17:41:09:I1:WU40:   Compiler: GNU 7.5.0
17:41:09:I1:WU40:    Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
17:41:09:I1:WU40:             -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
17:41:09:I1:WU40:   Platform: linux 6.5.0-1024-azure
17:41:09:I1:WU40:       Bits: 64
17:41:09:I1:WU40:       Mode: Release
17:41:09:I1:WU40:************************************ System ************************************
17:41:09:I1:WU40:        CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
17:41:09:I1:WU40:     CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
17:41:09:I1:WU40:       CPUs: 4
17:41:09:I1:WU40:     Memory: 15.57GiB
17:41:09:I1:WU40:Free Memory: 10.00GiB
17:41:09:I1:WU40: OS Version: 6.8
17:41:09:I1:WU40:Has Battery: false
17:41:09:I1:WU40: On Battery: false
17:41:09:I1:WU40:   Hostname: amandine-MS-7751
17:41:09:I1:WU40: UTC Offset: 2
17:41:09:I1:WU40:        PID: 8605
17:41:09:I1:WU40:        CWD: /var/lib/fah-client/work
17:41:09:I1:WU40:       Exec: /var/lib/fah-client/cores/openmm-core-24/centos-7.9.2009-64bit/release/fahcore-24-centos-7.9.2009-64bit-release-8.1.4/FahCore_24
17:41:09:I1:WU40:************************************ OpenMM ************************************
17:41:09:I1:WU40:    Version: 8.1.1
17:41:09:I1:WU40:********************************************************************************
17:41:09:I1:WU40:Project: 18237 (Run 712, Clone 0, Gen 40)
17:41:09:I1:WU40:Reading tar file core.xml
17:41:09:I1:WU40:Reading tar file integrator.xml
17:41:09:I1:WU40:Reading tar file state.xml.bz2
17:41:09:I1:WU40:Reading tar file system.xml.bz2
17:41:09:I1:WU40:Digital signatures verified
17:41:09:I1:WU40:Folding@home GPU Core24 Folding@home Core
17:41:09:I1:WU40:Version 8.1.4
17:41:09:I1:WU40:  Checkpoint write interval: 50000 steps (2%) [50 total]
17:41:09:I1:WU40:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
17:41:09:I1:WU40:  XTC frame write interval: 10000 steps (0.4%) [250 total]
17:41:09:I1:WU40:  TRR frame write interval: disabled
17:41:09:I1:WU40:  Global context and integrator variables write interval: disabled
17:41:09:I1:WU40:There are 4 platforms available.
17:41:09:I1:WU40:Platform 0: Reference
17:41:09:I1:WU40:Platform 1: CPU
17:41:09:I1:WU40:Platform 2: OpenCL
17:41:09:I1:WU40:  opencl-device 0 specified
17:41:09:I1:WU40:Platform 3: CUDA
17:41:09:I1:WU40:  cuda-device 0 specified
17:41:15:I1:WU40:Attempting to create CUDA context:
17:41:15:I1:WU40:  Configuring platform CUDA
17:41:21:I1:WU40:  Using CUDA on CUDA Platform and gpu 0
17:41:21:I1:WU40:  GPU info: Platform: CUDA
17:41:21:I1:WU40:  GPU info: PlatformIndex: 0
17:41:21:I1:WU40:  GPU info: Device: NVIDIA GeForce RTX 4080 SUPER
17:41:21:I1:WU40:  GPU info: DeviceIndex: 0
17:41:21:I1:WU40:  GPU info: Vendor: 0x10de
17:41:21:I1:WU40:  GPU info: PCI: 01:00:00
17:41:21:I1:WU40:  GPU info: Compute: 8.9
17:41:21:I1:WU40:  GPU info: Driver: 12.4
17:41:21:I1:WU40:  GPU info: GPU: true
17:41:21:I1:WU40:Completed 0 out of 2500000 steps (0%)
17:41:21:I1:WU40:Checkpoint completed at step 0
17:41:54:I1:WU40:Completed 25000 out of 2500000 steps (1%)
17:42:27:I1:WU40:Completed 50000 out of 2500000 steps (2%)
[…]
17:55:32:I1:WU40:Checkpoint completed at step 650000
17:56:04:I1:WU40:Completed 675000 out of 2500000 steps (27%)
17:56:37:I1:WU40:Completed 700000 out of 2500000 steps (28%)
17:56:37:I1:WU40:An exception occurred at step 700000: Force RMSE error of 11.7448 with threshold of 10
17:56:37:I1:WU40:ERROR:98: Attempting to restart from last good checkpoint by restarting core.
17:56:37:I1:WU40:Folding@home Core Shutdown: CORE_RESTART
17:56:38:W :WU40:Core returned CORE_RESTART (98)
17:56:38:I1:Default:Added new work unit: cpus:1 gpus:gpu:01:00:00
17:56:38:I1:WU40:Sending dump report
17:56:38:I1:WU41:Requesting WU assignment for user Nicolas_orleans team 33
17:56:38:I1:OUT14:> POST https://highland1.seas.upenn.edu/api/results HTTP/1.1
17:56:38:I1:OUT15:> POST https://assign5.foldingathome.org/api/assign HTTP/1.1
17:56:38:I1:OUT14:< HTTP/1.1 200 HTTP_OK
17:56:38:I1:WU40:Dumped
17:56:38:I1:OUT15:< HTTP/1.1 200 HTTP_OK
As we can see at https://apps.foldingathome.org/wu#proje ... e=0&gen=40, this WU failed 8 times before being completed. However, 6 of those failures had no runtime, so they could be related to the driver or to CUDA 12 not being available. That means that, in the best case, it genuinely failed twice before being completed. I will look in the other logs...

I don't know why this happens "only" (for my machine) with this specific project.
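
To check whether these errors really cluster on one project, a small script can tally them per project from the client log. The sketch below is only an illustration: the function name `count_force_rmse_by_project` is made up, and the regular expressions assume the line formats visible in the log excerpt above (a `WUnn` tag, a `Project: N (Run ...)` line, and the `Force RMSE error of X with threshold of Y` message).

```python
import re
from collections import Counter

# Assumed line formats, taken from the log excerpt above.
WU_TAG_RE = re.compile(r":(WU\d+):")
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
RMSE_RE = re.compile(r"Force RMSE error of ([0-9.]+) with threshold of ([0-9.]+)")


def count_force_rmse_by_project(lines):
    """Count Force RMSE errors per project number."""
    project_of_wu = {}  # WU tag -> project number last seen for that tag
    counts = Counter()
    for line in lines:
        tag = WU_TAG_RE.search(line)
        if not tag:
            continue  # not a per-WU log line
        wu = tag.group(1)
        proj = PROJECT_RE.search(line)
        if proj:
            project_of_wu[wu] = int(proj.group(1))
            continue
        if RMSE_RE.search(line):
            # Attribute the error to the project last announced for this WU tag.
            counts[project_of_wu.get(wu, "unknown")] += 1
    return counts


# Three lines copied from the log above.
sample_lines = [
    "17:41:09:I1:WU40:Project: 18237 (Run 712, Clone 0, Gen 40)",
    "17:56:37:I1:WU40:Completed 700000 out of 2500000 steps (28%)",
    "17:56:37:I1:WU40:An exception occurred at step 700000: "
    "Force RMSE error of 11.7448 with threshold of 10",
]
counts = count_force_rmse_by_project(sample_lines)
print(counts)
```

To run it over a whole log file, pass an open file object instead of the sample list, since the function only needs an iterable of lines. Note that WU tags get reused over time, so the sketch always attributes an error to the most recent `Project:` line seen for that tag.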
PaulTV
Posts: 210
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Post by PaulTV »

Hey Nicolas

If you frequently see 'attempting to restart from last good checkpoint' and then see the job continue, that may indicate that the rig needs some maintenance (e.g. cleaning) or has a hardware issue. I saw those messages now and then on a rig other than my main one, but after a folding pause in the summer and a thorough cleaning, it has been folding fine for the last couple of weeks.

If a job encounters an error too often this way, it'll be dropped (I don't know the threshold).

Hey Joe,

Thanks for that!
jjmiller
Scientist
Posts: 139
Joined: Fri Apr 09, 2021 4:43 pm

Post by jjmiller »

Hi all,

Thanks for the reports. We're in a bit of a pickle with core 0x24 projects at the moment. As folks have suggested above, after 5 failed attempts a WU will no longer be sent out. Currently on 0x24 projects we're seeing many WUs failing because there's what seems to be a mismatch between how the FAH Client and OpenMM talk to one another. In these cases, the WU fails as FAH Client attempts to initialize the WU, not because the WU itself is bad. The error codes are predominantly one of the following:
  • ERROR:125: Failed to create a GPU-enabled OpenMM context
  • ERROR:126: Neither CUDA nor OpenCL is available
These failures accumulate rapidly and tank otherwise stable projects before any data can be collected. Accordingly, I have been periodically resetting the error counts on my 0x24 projects to try and actually collect data on the WUs that are stable but fell victim to ERROR125/126s. Unfortunately, there are a few WUs that have legitimately reached problematic/unstable states (e.g. 18237/1069/0/71). At the moment, it's very hard on our end to discriminate between legitimate failures and failures that are due to ERROR125/126. We have both the FAH developer and the OpenMM core developers working to get a fix out on this, but it's proven a bit difficult.
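
To make the bookkeeping problem concrete, here is a hypothetical Python sketch of how failure reports could be split into the two groups. It is not the work server's actual logic: the names `INIT_ERROR_CODES`, `MAX_FAILURES`, `classify_failures` and `would_be_blocked` are invented for illustration, and so is the input format. The only facts taken from this thread are the limit of 5 failed attempts and the error codes 125 and 126.

```python
from collections import defaultdict

INIT_ERROR_CODES = {125, 126}  # context-creation failures named in this post
MAX_FAILURES = 5               # failed attempts after which a WU is no longer sent out


def classify_failures(reports):
    """Split failure reports per WU into init failures and other failures.

    `reports` is an iterable of (wu_id, error_code) pairs; this input
    format is an assumption made for the example.
    """
    summary = defaultdict(lambda: {"init": 0, "other": 0})
    for wu_id, code in reports:
        kind = "init" if code in INIT_ERROR_CODES else "other"
        summary[wu_id][kind] += 1
    return dict(summary)


def would_be_blocked(stats, ignore_init=False):
    """Return True if the counted failures reach MAX_FAILURES."""
    total = stats["other"] if ignore_init else stats["init"] + stats["other"]
    return total >= MAX_FAILURES


# Made-up reports: "wu-a" mostly fails at initialization, while "wu-b"
# fails every time with BAD_WORK_UNIT (114).
reports = [("wu-a", 125)] * 3 + [("wu-a", 126), ("wu-a", 114)] + [("wu-b", 114)] * 5
summary = classify_failures(reports)
for wu_id, stats in summary.items():
    print(wu_id, stats, would_be_blocked(stats), would_be_blocked(stats, ignore_init=True))
```

In this made-up input both WUs reach the limit when every failure is counted, but only `wu-b` still does once the 125/126 failures are set aside. That is the distinction that is currently hard to make on our end.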

I'll go in and manually pull 18237/1069/0/71. If folks see other instances of unstable states I'm happy to go in and manually pull them as well. Apologies for the problematic WUs and thanks for folding.
Last edited by jjmiller on Thu Nov 07, 2024 8:34 pm, edited 1 time in total.
Nicolas_orleans
Posts: 114
Joined: Wed Aug 08, 2012 3:08 am

Post by Nicolas_orleans »

Hi Paul,

It's a brand new card and, again, it only happens with this particular project. I would suspect the hardware more if it also happened with the 16 other projects I am currently being assigned.

Regarding Core24, it runs without any error on all P18230 WUs received so far, but not with P18237 on my rig.

I don't want to hijack this thread, but I have a candid question for you: Core22 scales well on my 4080 Super (about 92-94% GPU utilization), Core23 scales fantastically (96-100% GPU utilization), but Core24 does not (below 90% GPU utilization most of the time). Is there a reason for that? Is it a "regression" in recent OpenMM versions? Do you see it with your 4090 as well?
PaulTV
Posts: 210
Joined: Mon Jan 25, 2021 4:53 pm
Location: Netherlands

Post by PaulTV »

Hmmm... I don't really watch utilization numbers. I think that may have more to do with the number of atoms in a project: lower atom counts won't keep all CUDA cores busy.
Andre_Ti
Posts: 35
Joined: Sat Mar 21, 2020 7:51 am

Post by Andre_Ti »

Another broken WU - Project: 18237 (Run 904, Clone 1, Gen 32).

Code: Select all

18:15:07:WU01:FS01:Connecting to assign1.foldingathome.org:80
18:15:08:WU01:FS01:Assigned to work server 158.130.118.23
18:15:08:WU01:FS01:Requesting new work unit for slot 01: gpu:1:0 GA104 [GeForce RTX 3070] from 158.130.118.23
18:15:08:WU01:FS01:Connecting to 158.130.118.23:8080
18:15:09:WU01:FS01:Downloading 10.84MiB
18:15:13:WU00:FS01:0x24:Saving result file ..\logfile_01.txt
18:15:13:WU00:FS01:0x24:Saving result file checkpointIntegrator.xml
18:15:13:WU00:FS01:0x24:Saving result file checkpointState.xml.bz2
18:15:13:WU00:FS01:0x24:Saving result file positions.xtc
18:15:13:WU00:FS01:0x24:Saving result file science.log
18:15:13:WU00:FS01:0x24:Saving result file xtcAtoms.csv.bz2
18:15:13:WU00:FS01:0x24:Folding@home Core Shutdown: FINISHED_UNIT
18:15:14:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
18:15:14:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:18237 run:1026 clone:0 gen:94 core:0x24 unit:0x000000000000005e0000473d00000402
18:15:14:WU00:FS01:Uploading 12.54MiB to 158.130.118.23
18:15:14:WU00:FS01:Connecting to 158.130.118.23:8080
18:15:15:WU01:FS01:Download 81.26%
18:15:16:WU01:FS01:Download complete
18:15:16:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:18237 run:904 clone:1 gen:32 core:0x24 unit:0x00000001000000200000473d00000388
18:15:16:WU01:FS01:Starting
18:15:16:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\ProgramData\FAHClient\cores/cores.foldingathome.org/openmm-core-24/windows-10-64bit/release/0x24-8.1.4/Core_24.fah/FahCore_24.exe -dir 01 -suffix 01 -version 706 -lifeline 5872 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
18:15:16:WU01:FS01:Started FahCore on PID 6976
18:15:16:WU01:FS01:Core PID:2052
18:15:16:WU01:FS01:FahCore 0x24 started
18:15:17:WU01:FS01:0x24:*********************** Log Started 2024-11-08T18:15:16Z ***********************
18:15:17:WU01:FS01:0x24:*************************** Core24 Folding@home Core ***************************
18:15:17:WU01:FS01:0x24:       Core: Core24
18:15:17:WU01:FS01:0x24:       Type: 0x24
18:15:17:WU01:FS01:0x24:    Version: 8.1.4
18:15:17:WU01:FS01:0x24:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
18:15:17:WU01:FS01:0x24:  Copyright: 2022 foldingathome.org
18:15:17:WU01:FS01:0x24:   Homepage: https://foldingathome.org/
18:15:17:WU01:FS01:0x24:       Date: Jul 25 2024
18:15:17:WU01:FS01:0x24:       Time: 05:42:49
18:15:17:WU01:FS01:0x24:   Revision: cf9f0139862b8945a2091772770e4631aac37792
18:15:17:WU01:FS01:0x24:     Branch: HEAD
18:15:17:WU01:FS01:0x24:   Compiler: Visual C++
18:15:17:WU01:FS01:0x24:    Options: $( /TP $) /std:c++14 /nologo /EHa /wd4297 /wd4103 /O2
18:15:17:WU01:FS01:0x24:             /Zc:throwingNew /MT -DOPENMM_VERSION=\"\\\"8.1.1\\\"\" /Ox /std:c++14
18:15:17:WU01:FS01:0x24:   Platform: win32 10
18:15:17:WU01:FS01:0x24:       Bits: 64
18:15:17:WU01:FS01:0x24:       Mode: Release
18:15:17:WU01:FS01:0x24:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
18:15:17:WU01:FS01:0x24:             <peastman@stanford.edu>
18:15:17:WU01:FS01:0x24:       Args: -dir 01 -suffix 01 -version 706 -lifeline 6976 -checkpoint 15
18:15:17:WU01:FS01:0x24:             -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor
18:15:17:WU01:FS01:0x24:             nvidia -gpu 0 -gpu-usage 100
18:15:17:WU01:FS01:0x24:************************************ libFAH ************************************
18:15:17:WU01:FS01:0x24:       Date: Jul 25 2024
18:15:17:WU01:FS01:0x24:       Time: 05:23:50
18:15:17:WU01:FS01:0x24:   Revision: c7d2824a47eb025fa8cda8968c7a5e971585d90c
18:15:17:WU01:FS01:0x24:     Branch: HEAD
18:15:17:WU01:FS01:0x24:   Compiler: Visual C++
18:15:17:WU01:FS01:0x24:    Options: $( /TP $) /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
18:15:17:WU01:FS01:0x24:   Platform: win32 10
18:15:17:WU01:FS01:0x24:       Bits: 64
18:15:17:WU01:FS01:0x24:       Mode: Release
18:15:17:WU01:FS01:0x24:************************************ CBang *************************************
18:15:17:WU01:FS01:0x24:    Version: 1.7.2
18:15:17:WU01:FS01:0x24:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
18:15:17:WU01:FS01:0x24:        Org: Cauldron Development LLC
18:15:17:WU01:FS01:0x24:  Copyright: Cauldron Development LLC, 2003-2024
18:15:17:WU01:FS01:0x24:   Homepage: https://cauldrondevelopment.com/
18:15:17:WU01:FS01:0x24:    License: LGPL-2.1-or-later
18:15:17:WU01:FS01:0x24:       Date: Jul 25 2024
18:15:17:WU01:FS01:0x24:       Time: 05:22:43
18:15:17:WU01:FS01:0x24:   Revision: f1cd4c791e8c40a35dcfeab3ab85d910949cc0cb
18:15:17:WU01:FS01:0x24:     Branch: HEAD
18:15:17:WU01:FS01:0x24:   Compiler: Visual C++
18:15:17:WU01:FS01:0x24:    Options: $( /TP $) /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
18:15:17:WU01:FS01:0x24:   Platform: win32 10
18:15:17:WU01:FS01:0x24:       Bits: 64
18:15:17:WU01:FS01:0x24:       Mode: Release
18:15:17:WU01:FS01:0x24:************************************ System ************************************
18:15:17:WU01:FS01:0x24:        CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
18:15:17:WU01:FS01:0x24:     CPU ID: GenuineIntel Family 6 Model 94 Stepping 3
18:15:17:WU01:FS01:0x24:       CPUs: 8
18:15:17:WU01:FS01:0x24:     Memory: 15.94GiB
18:15:17:WU01:FS01:0x24:Free Memory: 12.05GiB
18:15:17:WU01:FS01:0x24: OS Version: 10.0
18:15:17:WU01:FS01:0x24:Has Battery: false
18:15:17:WU01:FS01:0x24: On Battery: false
18:15:17:WU01:FS01:0x24:   Hostname: AndrePC
18:15:17:WU01:FS01:0x24: UTC Offset: 3
18:15:17:WU01:FS01:0x24:        PID: 2052
18:15:17:WU01:FS01:0x24:        CWD: C:\\ProgramData\\FAHClient\\work
18:15:17:WU01:FS01:0x24:       Exec: C:\\ProgramData\\FAHClient\\cores\\cores.foldingathome.org\\openmm-core-24\\windows-10-64bit\\release\\0x24-8.1.4\\Core_24.fah\\FahCore_24.exe
18:15:17:WU01:FS01:0x24:************************************ OpenMM ************************************
18:15:17:WU01:FS01:0x24:    Version: 8.1.1
18:15:17:WU01:FS01:0x24:********************************************************************************
18:15:17:WU01:FS01:0x24:Project: 18237 (Run 904, Clone 1, Gen 32)
18:15:17:WU01:FS01:0x24:Reading tar file core.xml
18:15:17:WU01:FS01:0x24:Reading tar file integrator.xml
18:15:17:WU01:FS01:0x24:Reading tar file state.xml.bz2
18:15:17:WU01:FS01:0x24:Reading tar file system.xml.bz2
18:15:17:WU01:FS01:0x24:Digital signatures verified
18:15:17:WU01:FS01:0x24:Folding@home GPU Core24 Folding@home Core
18:15:17:WU01:FS01:0x24:Version 8.1.4
18:15:17:WU01:FS01:0x24:  Checkpoint write interval: 50000 steps (2%) [50 total]
18:15:17:WU01:FS01:0x24:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
18:15:17:WU01:FS01:0x24:  XTC frame write interval: 10000 steps (0.4%) [250 total]
18:15:17:WU01:FS01:0x24:  TRR frame write interval: disabled
18:15:17:WU01:FS01:0x24:  Global context and integrator variables write interval: disabled
18:15:17:WU01:FS01:0x24:There are 4 platforms available.
18:15:17:WU01:FS01:0x24:Platform 0: Reference
18:15:17:WU01:FS01:0x24:Platform 1: CPU
18:15:17:WU01:FS01:0x24:Platform 2: OpenCL
18:15:17:WU01:FS01:0x24:  opencl-device 0 specified
18:15:17:WU01:FS01:0x24:Platform 3: CUDA
18:15:17:WU01:FS01:0x24:  cuda-device 0 specified
18:15:25:WU00:FS01:Upload 26.91%
18:15:27:WU01:FS01:0x24:Attempting to create CUDA context:
18:15:27:WU01:FS01:0x24:  Configuring platform CUDA
18:15:30:WU01:FS01:0x24:ERROR:Potential energy error of 425.18, threshold of 20
18:15:30:WU01:FS01:0x24:ERROR:Reference Potential Energy: -1.94968e+06 | Given Potential Energy: -1.95011e+06
18:15:30:WU01:FS01:0x24:Saving result file ..\\logfile_01.txt
18:15:30:WU01:FS01:0x24:Saving result file science.log
18:15:30:WU01:FS01:0x24:Saving result file state.xml.bz2
18:15:30:WU01:FS01:0x24:Folding@home Core Shutdown: BAD_WORK_UNIT
18:15:31:WU00:FS01:Upload 46.34%
18:15:31:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:15:31:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:18237 run:904 clone:1 gen:32 core:0x24 unit:0x00000001000000200000473d00000388
18:15:31:WU01:FS01:Uploading 9.70MiB to 158.130.118.23
18:15:31:WU01:FS01:Connecting to 158.130.118.23:8080
18:15:31:WU02:FS01:Connecting to assign1.foldingathome.org:80
18:15:32:WU02:FS01:Assigned to work server 158.130.118.26
18:15:32:WU02:FS01:Requesting new work unit for slot 01: gpu:1:0 GA104 [GeForce RTX 3070] from 158.130.118.26
18:15:32:WU02:FS01:Connecting to 158.130.118.26:8080
18:15:32:WU02:FS01:Downloading 23.42MiB
18:15:37:WU00:FS01:Upload 79.72%
18:15:38:WU02:FS01:Download 37.36%
18:15:38:WU01:FS01:Upload 47.69%
18:15:44:WU02:FS01:Download 88.07%
18:15:44:WU00:FS01:Upload complete
18:15:44:WU00:FS01:Server responded WORK_ACK (400)
18:15:44:WU00:FS01:Final credit estimate, 467700.00 points
18:15:44:WU00:FS01:Cleaning up
18:15:45:WU02:FS01:Download complete
Last edited by Andre_Ti on Fri Nov 08, 2024 7:52 pm, edited 1 time in total.
Andre_Ti
Posts: 35
Joined: Sat Mar 21, 2020 7:51 am

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by Andre_Ti »

Project: 18237 (Run 899, Clone 0, Gen 44)
Unbelievable, this has been going on since 2024-09-29 15:45:01.

Code: Select all

15:28:23:WU00:FS01:Assigned to work server 158.130.118.23
15:28:23:WU00:FS01:Requesting new work unit for slot 01: gpu:1:0 AD102 [GeForce RTX 4090] from 158.130.118.23
15:28:23:WU00:FS01:Connecting to 158.130.118.23:8080
15:28:24:WU00:FS01:Downloading 10.84MiB
15:28:27:WU00:FS01:Download complete
15:28:28:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:18237 run:899 clone:0 gen:44 core:0x24 unit:0x000000000000002c0000473d00000383
15:28:37:WU01:FS01:0x24:Saving result file ..\\logfile_01.txt
15:28:37:WU01:FS01:0x24:Saving result file checkpointIntegrator.xml
15:28:37:WU01:FS01:0x24:Saving result file checkpointState.xml.bz2
15:28:37:WU01:FS01:0x24:Saving result file positions.xtc
15:28:37:WU01:FS01:0x24:Saving result file science.log
15:28:37:WU01:FS01:0x24:Saving result file xtcAtoms.csv.bz2
15:28:37:WU01:FS01:0x24:Folding@home Core Shutdown: FINISHED_UNIT
15:28:37:WU01:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
15:28:38:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:18230 run:108 clone:3 gen:85 core:0x24 unit:0x0000000300000055000047360000006c
15:28:38:WU01:FS01:Uploading 35.76MiB to 158.130.118.25
15:28:38:WU00:FS01:Starting
15:28:38:WU01:FS01:Connecting to 158.130.118.25:8080
15:28:38:WU00:FS01:Running FahCore: \"C:\\Program Files (x86)\\FAHClient/FAHCoreWrapper.exe\" C:\\ProgramData\\FAHClient\\cores/cores.foldingathome.org/openmm-core-24/windows-10-64bit/release/0x24-8.1.4/Core_24.fah/FahCore_24.exe -dir 00 -suffix 01 -version 706 -lifeline 9060 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
15:28:38:WU00:FS01:Started FahCore on PID 7600
15:28:38:WU00:FS01:Core PID:10232
15:28:38:WU00:FS01:FahCore 0x24 started
15:28:38:WU00:FS01:0x24:*********************** Log Started 2024-11-06T15:28:38Z ***********************
15:28:38:WU00:FS01:0x24:*************************** Core24 Folding@home Core ***************************
15:28:38:WU00:FS01:0x24:       Core: Core24
15:28:38:WU00:FS01:0x24:       Type: 0x24
15:28:38:WU00:FS01:0x24:    Version: 8.1.4
15:28:38:WU00:FS01:0x24:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:28:38:WU00:FS01:0x24:  Copyright: 2022 foldingathome.org
15:28:38:WU00:FS01:0x24:   Homepage: https://foldingathome.org/
15:28:38:WU00:FS01:0x24:       Date: Jul 25 2024
15:28:38:WU00:FS01:0x24:       Time: 05:42:49
15:28:38:WU00:FS01:0x24:   Revision: cf9f0139862b8945a2091772770e4631aac37792
15:28:38:WU00:FS01:0x24:     Branch: HEAD
15:28:38:WU00:FS01:0x24:   Compiler: Visual C++
15:28:38:WU00:FS01:0x24:    Options: $( /TP $) /std:c++14 /nologo /EHa /wd4297 /wd4103 /O2
15:28:38:WU00:FS01:0x24:             /Zc:throwingNew /MT -DOPENMM_VERSION=\"\\\"8.1.1\\\"\" /Ox /std:c++14
15:28:38:WU00:FS01:0x24:   Platform: win32 10
15:28:38:WU00:FS01:0x24:       Bits: 64
15:28:38:WU00:FS01:0x24:       Mode: Release
15:28:38:WU00:FS01:0x24:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
15:28:38:WU00:FS01:0x24:             <peastman@stanford.edu>
15:28:38:WU00:FS01:0x24:       Args: -dir 00 -suffix 01 -version 706 -lifeline 7600 -checkpoint 15
15:28:38:WU00:FS01:0x24:             -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor
15:28:38:WU00:FS01:0x24:             nvidia -gpu 0 -gpu-usage 100
15:28:38:WU00:FS01:0x24:************************************ libFAH ************************************
15:28:38:WU00:FS01:0x24:       Date: Jul 25 2024
15:28:38:WU00:FS01:0x24:       Time: 05:23:50
15:28:38:WU00:FS01:0x24:   Revision: c7d2824a47eb025fa8cda8968c7a5e971585d90c
15:28:38:WU00:FS01:0x24:     Branch: HEAD
15:28:38:WU00:FS01:0x24:   Compiler: Visual C++
15:28:38:WU00:FS01:0x24:    Options: $( /TP $) /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
15:28:38:WU00:FS01:0x24:   Platform: win32 10
15:28:38:WU00:FS01:0x24:       Bits: 64
15:28:38:WU00:FS01:0x24:       Mode: Release
15:28:38:WU00:FS01:0x24:************************************ CBang *************************************
15:28:38:WU00:FS01:0x24:    Version: 1.7.2
15:28:38:WU00:FS01:0x24:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
15:28:38:WU00:FS01:0x24:        Org: Cauldron Development LLC
15:28:38:WU00:FS01:0x24:  Copyright: Cauldron Development LLC, 2003-2024
15:28:38:WU00:FS01:0x24:   Homepage: https://cauldrondevelopment.com/
15:28:38:WU00:FS01:0x24:    License: LGPL-2.1-or-later
15:28:38:WU00:FS01:0x24:       Date: Jul 25 2024
15:28:38:WU00:FS01:0x24:       Time: 05:22:43
15:28:38:WU00:FS01:0x24:   Revision: f1cd4c791e8c40a35dcfeab3ab85d910949cc0cb
15:28:38:WU00:FS01:0x24:     Branch: HEAD
15:28:38:WU00:FS01:0x24:   Compiler: Visual C++
15:28:38:WU00:FS01:0x24:    Options: $( /TP $) /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
15:28:38:WU00:FS01:0x24:   Platform: win32 10
15:28:38:WU00:FS01:0x24:       Bits: 64
15:28:38:WU00:FS01:0x24:       Mode: Release
15:28:38:WU00:FS01:0x24:************************************ System ************************************
15:28:38:WU00:FS01:0x24:        CPU: 12th Gen Intel(R) Core(TM) i7-12700F
15:28:38:WU00:FS01:0x24:     CPU ID: GenuineIntel Family 6 Model 151 Stepping 2
15:28:38:WU00:FS01:0x24:       CPUs: 16
15:28:38:WU00:FS01:0x24:     Memory: 63.89GiB
15:28:38:WU00:FS01:0x24:Free Memory: 57.40GiB
15:28:38:WU00:FS01:0x24: OS Version: 10.0
15:28:38:WU00:FS01:0x24:Has Battery: false
15:28:38:WU00:FS01:0x24: On Battery: false
15:28:38:WU00:FS01:0x24:   Hostname: AndrePK
15:28:38:WU00:FS01:0x24: UTC Offset: 3
15:28:38:WU00:FS01:0x24:        PID: 10232
15:28:38:WU00:FS01:0x24:        CWD: C:\\ProgramData\\FAHClient\\work
15:28:38:WU00:FS01:0x24:       Exec: C:\\ProgramData\\FAHClient\\cores\\cores.foldingathome.org\\openmm-core-24\\windows-10-64bit\\release\\0x24-8.1.4\\Core_24.fah\\FahCore_24.exe
15:28:38:WU00:FS01:0x24:************************************ OpenMM ************************************
15:28:38:WU00:FS01:0x24:    Version: 8.1.1
15:28:38:WU00:FS01:0x24:********************************************************************************
15:28:38:WU00:FS01:0x24:Project: 18237 (Run 899, Clone 0, Gen 44)
15:28:38:WU00:FS01:0x24:Reading tar file core.xml
15:28:38:WU00:FS01:0x24:Reading tar file integrator.xml
15:28:38:WU00:FS01:0x24:Reading tar file state.xml.bz2
15:28:38:WU00:FS01:0x24:Reading tar file system.xml.bz2
15:28:38:WU00:FS01:0x24:Digital signatures verified
15:28:38:WU00:FS01:0x24:Folding@home GPU Core24 Folding@home Core
15:28:38:WU00:FS01:0x24:Version 8.1.4
15:28:38:WU00:FS01:0x24:  Checkpoint write interval: 50000 steps (2%) [50 total]
15:28:38:WU00:FS01:0x24:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
15:28:38:WU00:FS01:0x24:  XTC frame write interval: 10000 steps (0.4%) [250 total]
15:28:38:WU00:FS01:0x24:  TRR frame write interval: disabled
15:28:38:WU00:FS01:0x24:  Global context and integrator variables write interval: disabled
15:28:38:WU00:FS01:0x24:There are 4 platforms available.
15:28:38:WU00:FS01:0x24:Platform 0: Reference
15:28:38:WU00:FS01:0x24:Platform 1: CPU
15:28:38:WU00:FS01:0x24:Platform 2: OpenCL
15:28:38:WU00:FS01:0x24:  opencl-device 0 specified
15:28:38:WU00:FS01:0x24:Platform 3: CUDA
15:28:38:WU00:FS01:0x24:  cuda-device 0 specified
15:28:44:WU01:FS01:Upload 6.12%
15:28:45:WU00:FS01:0x24:Attempting to create CUDA context:
15:28:45:WU00:FS01:0x24:  Configuring platform CUDA
15:28:46:WU00:FS01:0x24:ERROR:Potential energy error of 248.278, threshold of 20
15:28:46:WU00:FS01:0x24:ERROR:Reference Potential Energy: -1.94865e+06 | Given Potential Energy: -1.9489e+06
15:28:46:WU00:FS01:0x24:Saving result file ..\\logfile_01.txt
15:28:46:WU00:FS01:0x24:Saving result file science.log
15:28:46:WU00:FS01:0x24:Saving result file state.xml.bz2
15:28:46:WU00:FS01:0x24:Folding@home Core Shutdown: BAD_WORK_UNIT
15:28:47:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
15:28:47:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:18237 run:899 clone:0 gen:44 core:0x24 unit:0x000000000000002c0000473d00000383
15:28:47:WU00:FS01:Uploading 9.69MiB to 158.130.118.23
15:28:47:WU00:FS01:Connecting to 158.130.118.23:8080
15:28:47:WU02:FS01:Connecting to assign1.foldingathome.org:80
15:28:48:WU02:FS01:Assigned to work server 158.130.118.26
15:28:48:WU02:FS01:Requesting new work unit for slot 01: gpu:1:0 AD102 [GeForce RTX 4090] from 158.130.118.26
15:28:48:WU02:FS01:Connecting to 158.130.118.26:8080
15:28:49:WU02:FS01:Downloading 29.64MiB
15:28:50:WU01:FS01:Upload 11.36%
15:28:53:WU00:FS01:Upload 30.95%
15:28:55:WU02:FS01:Download 68.11%
15:28:56:WU01:FS01:Upload 27.79%
15:28:59:WU02:FS01:Download complete
15:28:59:WU00:FS01:Upload 52.23%
Last edited by Andre_Ti on Fri Nov 08, 2024 7:59 pm, edited 2 times in total.
Andre_Ti
Posts: 35
Joined: Sat Mar 21, 2020 7:51 am

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by Andre_Ti »

Project: 18237 (Run 648, Clone 1, Gen 0)
Even earlier: 2024-09-25 14:22:01.
Are you sure this project is important to you?

Code: Select all

16:13:21:WU00:FS01:Assigned to work server 158.130.118.23
16:13:21:WU00:FS01:Requesting new work unit for slot 01: gpu:1:0 AD102 [GeForce RTX 4090] from 158.130.118.23
16:13:21:WU00:FS01:Connecting to 158.130.118.23:8080
16:13:21:WU00:FS01:Downloading 10.72MiB
16:13:26:WU00:FS01:Download complete
16:13:26:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:18237 run:648 clone:1 gen:0 core:0x24 unit:0x00000001000000000000473d00000288
16:13:30:WU02:FS01:0x23:Saving result file ..\\logfile_01.txt
16:13:30:WU02:FS01:0x23:Saving result file checkpointIntegrator.xml
16:13:30:WU02:FS01:0x23:Saving result file checkpointState.xml.bz2
16:13:31:WU02:FS01:0x23:Saving result file positions.xtc
16:13:31:WU02:FS01:0x23:Saving result file science.log
16:13:31:WU02:FS01:0x23:Saving result file xtcAtoms.csv.bz2
16:13:31:WU02:FS01:0x23:Folding@home Core Shutdown: FINISHED_UNIT
16:13:31:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
16:13:31:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:18228 run:429 clone:2 gen:80 core:0x23 unit:0x000000020000005000004734000001ad
16:13:31:WU02:FS01:Uploading 30.28MiB to 158.130.118.26
16:13:31:WU00:FS01:Starting
16:13:31:WU02:FS01:Connecting to 158.130.118.26:8080
16:13:31:WU00:FS01:Running FahCore: \"C:\\Program Files (x86)\\FAHClient/FAHCoreWrapper.exe\" C:\\ProgramData\\FAHClient\\cores/cores.foldingathome.org/openmm-core-24/windows-10-64bit/release/0x24-8.1.4/Core_24.fah/FahCore_24.exe -dir 00 -suffix 01 -version 706 -lifeline 9060 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
16:13:31:WU00:FS01:Started FahCore on PID 6888
16:13:31:WU00:FS01:Core PID:5892
16:13:31:WU00:FS01:FahCore 0x24 started
16:13:32:WU00:FS01:0x24:*********************** Log Started 2024-11-06T16:13:31Z ***********************
16:13:32:WU00:FS01:0x24:*************************** Core24 Folding@home Core ***************************
16:13:32:WU00:FS01:0x24:       Core: Core24
16:13:32:WU00:FS01:0x24:       Type: 0x24
16:13:32:WU00:FS01:0x24:    Version: 8.1.4
16:13:32:WU00:FS01:0x24:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
16:13:32:WU00:FS01:0x24:  Copyright: 2022 foldingathome.org
16:13:32:WU00:FS01:0x24:   Homepage: https://foldingathome.org/
16:13:32:WU00:FS01:0x24:       Date: Jul 25 2024
16:13:32:WU00:FS01:0x24:       Time: 05:42:49
16:13:32:WU00:FS01:0x24:   Revision: cf9f0139862b8945a2091772770e4631aac37792
16:13:32:WU00:FS01:0x24:     Branch: HEAD
16:13:32:WU00:FS01:0x24:   Compiler: Visual C++
16:13:32:WU00:FS01:0x24:    Options: $( /TP $) /std:c++14 /nologo /EHa /wd4297 /wd4103 /O2
16:13:32:WU00:FS01:0x24:             /Zc:throwingNew /MT -DOPENMM_VERSION=\"\\\"8.1.1\\\"\" /Ox /std:c++14
16:13:32:WU00:FS01:0x24:   Platform: win32 10
16:13:32:WU00:FS01:0x24:       Bits: 64
16:13:32:WU00:FS01:0x24:       Mode: Release
16:13:32:WU00:FS01:0x24:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
16:13:32:WU00:FS01:0x24:             <peastman@stanford.edu>
16:13:32:WU00:FS01:0x24:       Args: -dir 00 -suffix 01 -version 706 -lifeline 6888 -checkpoint 15
16:13:32:WU00:FS01:0x24:             -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor
16:13:32:WU00:FS01:0x24:             nvidia -gpu 0 -gpu-usage 100
16:13:32:WU00:FS01:0x24:************************************ libFAH ************************************
16:13:32:WU00:FS01:0x24:       Date: Jul 25 2024
16:13:32:WU00:FS01:0x24:       Time: 05:23:50
16:13:32:WU00:FS01:0x24:   Revision: c7d2824a47eb025fa8cda8968c7a5e971585d90c
16:13:32:WU00:FS01:0x24:     Branch: HEAD
16:13:32:WU00:FS01:0x24:   Compiler: Visual C++
16:13:32:WU00:FS01:0x24:    Options: $( /TP $) /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
16:13:32:WU00:FS01:0x24:   Platform: win32 10
16:13:32:WU00:FS01:0x24:       Bits: 64
16:13:32:WU00:FS01:0x24:       Mode: Release
16:13:32:WU00:FS01:0x24:************************************ CBang *************************************
16:13:32:WU00:FS01:0x24:    Version: 1.7.2
16:13:32:WU00:FS01:0x24:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
16:13:32:WU00:FS01:0x24:        Org: Cauldron Development LLC
16:13:32:WU00:FS01:0x24:  Copyright: Cauldron Development LLC, 2003-2024
16:13:32:WU00:FS01:0x24:   Homepage: https://cauldrondevelopment.com/
16:13:32:WU00:FS01:0x24:    License: LGPL-2.1-or-later
16:13:32:WU00:FS01:0x24:       Date: Jul 25 2024
16:13:32:WU00:FS01:0x24:       Time: 05:22:43
16:13:32:WU00:FS01:0x24:   Revision: f1cd4c791e8c40a35dcfeab3ab85d910949cc0cb
16:13:32:WU00:FS01:0x24:     Branch: HEAD
16:13:32:WU00:FS01:0x24:   Compiler: Visual C++
16:13:32:WU00:FS01:0x24:    Options: $( /TP $) /nologo /EHa /wd4297 /wd4103 /O2 /Zc:throwingNew /MT
16:13:32:WU00:FS01:0x24:   Platform: win32 10
16:13:32:WU00:FS01:0x24:       Bits: 64
16:13:32:WU00:FS01:0x24:       Mode: Release
16:13:32:WU00:FS01:0x24:************************************ System ************************************
16:13:32:WU00:FS01:0x24:        CPU: 12th Gen Intel(R) Core(TM) i7-12700F
16:13:32:WU00:FS01:0x24:     CPU ID: GenuineIntel Family 6 Model 151 Stepping 2
16:13:32:WU00:FS01:0x24:       CPUs: 16
16:13:32:WU00:FS01:0x24:     Memory: 63.89GiB
16:13:32:WU00:FS01:0x24:Free Memory: 57.44GiB
16:13:32:WU00:FS01:0x24: OS Version: 10.0
16:13:32:WU00:FS01:0x24:Has Battery: false
16:13:32:WU00:FS01:0x24: On Battery: false
16:13:32:WU00:FS01:0x24:   Hostname: AndrePK
16:13:32:WU00:FS01:0x24: UTC Offset: 3
16:13:32:WU00:FS01:0x24:        PID: 5892
16:13:32:WU00:FS01:0x24:        CWD: C:\\ProgramData\\FAHClient\\work
16:13:32:WU00:FS01:0x24:       Exec: C:\\ProgramData\\FAHClient\\cores\\cores.foldingathome.org\\openmm-core-24\\windows-10-64bit\\release\\0x24-8.1.4\\Core_24.fah\\FahCore_24.exe
16:13:32:WU00:FS01:0x24:************************************ OpenMM ************************************
16:13:32:WU00:FS01:0x24:    Version: 8.1.1
16:13:32:WU00:FS01:0x24:********************************************************************************
16:13:32:WU00:FS01:0x24:Project: 18237 (Run 648, Clone 1, Gen 0)
16:13:32:WU00:FS01:0x24:Reading tar file core.xml
16:13:32:WU00:FS01:0x24:Reading tar file integrator.xml
16:13:32:WU00:FS01:0x24:Reading tar file state.xml.bz2
16:13:32:WU00:FS01:0x24:Reading tar file system.xml.bz2
16:13:32:WU00:FS01:0x24:Digital signatures verified
16:13:32:WU00:FS01:0x24:Folding@home GPU Core24 Folding@home Core
16:13:32:WU00:FS01:0x24:Version 8.1.4
16:13:32:WU00:FS01:0x24:  Checkpoint write interval: 50000 steps (2%) [50 total]
16:13:32:WU00:FS01:0x24:  JSON viewer frame write interval: 25000 steps (1%) [100 total]
16:13:32:WU00:FS01:0x24:  XTC frame write interval: 10000 steps (0.4%) [250 total]
16:13:32:WU00:FS01:0x24:  TRR frame write interval: disabled
16:13:32:WU00:FS01:0x24:  Global context and integrator variables write interval: disabled
16:13:32:WU00:FS01:0x24:There are 4 platforms available.
16:13:32:WU00:FS01:0x24:Platform 0: Reference
16:13:32:WU00:FS01:0x24:Platform 1: CPU
16:13:32:WU00:FS01:0x24:Platform 2: OpenCL
16:13:32:WU00:FS01:0x24:  opencl-device 0 specified
16:13:32:WU00:FS01:0x24:Platform 3: CUDA
16:13:32:WU00:FS01:0x24:  cuda-device 0 specified
16:13:38:WU00:FS01:0x24:Attempting to create CUDA context:
16:13:38:WU00:FS01:0x24:  Configuring platform CUDA
16:13:39:WU00:FS01:0x24:ERROR:Potential energy error of 423.075, threshold of 20
16:13:39:WU00:FS01:0x24:ERROR:Reference Potential Energy: -1.9487e+06 | Given Potential Energy: -1.94913e+06
16:13:39:WU00:FS01:0x24:Saving result file ..\\logfile_01.txt
16:13:39:WU00:FS01:0x24:Saving result file science.log
16:13:39:WU00:FS01:0x24:Saving result file state.xml.bz2
16:13:40:WU00:FS01:0x24:Folding@home Core Shutdown: BAD_WORK_UNIT
16:13:40:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
16:13:40:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:18237 run:648 clone:1 gen:0 core:0x24 unit:0x00000001000000000000473d00000288
16:13:40:WU00:FS01:Uploading 9.57MiB to 158.130.118.23
16:13:40:WU00:FS01:Connecting to 158.130.118.23:8080
16:13:40:WU01:FS01:Connecting to assign1.foldingathome.org:80
16:13:41:WU01:FS01:Assigned to work server 158.130.118.26
16:13:41:WU01:FS01:Requesting new work unit for slot 01: gpu:1:0 AD102 [GeForce RTX 4090] from 158.130.118.26
16:13:41:WU01:FS01:Connecting to 158.130.118.26:8080
16:13:42:WU01:FS01:Downloading 23.42MiB
16:13:46:WU00:FS01:Upload 54.85%
16:13:48:WU01:FS01:Download 38.17%
16:13:48:WU02:FS01:Upload 15.27%
16:13:51:WU00:FS01:Upload complete
16:13:51:WU00:FS01:Server responded WORK_ACK (400)
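
For anyone who wants to compare these failures across logs, the two ERROR lines can be pulled out with a few lines of Python. This is only a rough sketch: it assumes the exact line wording seen in the logs in this thread, and the name `energy_errors` is made up for illustration. The energies are printed to about six significant digits, so their difference only approximates the reported error.

```python
ERR_PREFIX = "ERROR:Potential energy error of "
REF_PREFIX = "ERROR:Reference Potential Energy: "

def energy_errors(log_text):
    """Return one dict per potential-energy failure found in log_text."""
    results = []
    current = None
    for line in log_text.splitlines():
        if ERR_PREFIX in line:
            # e.g. "... error of 423.075, threshold of 20"
            tail = line.split(ERR_PREFIX, 1)[1]
            error, threshold = tail.split(", threshold of ")
            current = {"error": float(error), "threshold": float(threshold)}
            results.append(current)
        elif REF_PREFIX in line and current is not None:
            # e.g. "... -1.9487e+06 | Given Potential Energy: -1.94913e+06"
            tail = line.split(REF_PREFIX, 1)[1]
            reference, given = tail.split(" | Given Potential Energy: ")
            current["reference"] = float(reference)
            current["given"] = float(given)
            current = None
    return results

# Two lines copied from the log above.
sample = "\n".join([
    "16:13:39:WU00:FS01:0x24:ERROR:Potential energy error of 423.075, threshold of 20",
    "16:13:39:WU00:FS01:0x24:ERROR:Reference Potential Energy: -1.9487e+06 | Given Potential Energy: -1.94913e+06",
])

for e in energy_errors(sample):
    # The printed energies are rounded, so this difference is approximate.
    approx = abs(e["reference"] - e["given"])
    print(e["error"], e["threshold"], approx)
```

In all three logs here the reported error is more than ten times the threshold of 20, and the failure happens within seconds of creating the CUDA context, before any steps are run.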
BobWilliams757
Posts: 519
Joined: Fri Apr 03, 2020 2:22 pm
Hardware configuration: ASRock X370M PRO4
Ryzen 2400G APU
16 GB DDR4-3200
MSI GTX 1660 Super Gaming X

Re: Corrupted / bad job 18237/1069/0/71 (failing for all users)

Post by BobWilliams757 »

Nicolas_orleans wrote: Thu Nov 07, 2024 8:34 pm Hi Paul,

It's a brand new card and, again, it only happens with this particular project. I would be more hardware-focused if it happened with the 16 other projects I am currently being assigned, but it only happens with this one.

Regarding Core24, it runs without any error on all P18230 WUs received so far, but not with P18237 on my rig.

I don't want to hijack this thread, but here is a candid question: Core22 scales great on my 4080 Super (about 92-94% GPU utilization), Core23 scales fantastically (96-100% GPU utilization), but Core24 does not (under 90% GPU utilization most of the time). Is there a reason for that? Is it a "regression" in recent OpenMM versions? Do you see it with your 4090 as well?
The different loads applied by various projects sometimes play with things that are otherwise stable. I had one project on my (then) new GPU that had me scratching my head for quite a while. Everything else was stable, and the only thing that seemed to help was raising the power limit. I had been running it at a 53% power limit in MSI Afterburner, but raising the limit helped... with that one specific project.

You might want to look at max power draw, clocks, memory, etc. as compared to other projects which are stable for you. It might give you a hint as to what is going on.
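
To check whether failures really cluster on one project, a client log can be tallied by project and result. The sketch below is only an illustration: it assumes the "Sending unit results" line format seen in the logs in this thread, and `tally_results` is a made-up name. If the client logs that line again when it retries an upload, the same unit would be counted twice.

```python
from collections import Counter

MARKER = "Sending unit results:"

def tally_results(log_text):
    """Count (project, error) pairs from 'Sending unit results' lines."""
    counts = Counter()
    for line in log_text.splitlines():
        if MARKER not in line:
            continue
        # The rest of the line is space-separated key:value tokens,
        # e.g. "id:01 state:SEND error:FAULTY project:18237 run:904 ..."
        tokens = line.split(MARKER, 1)[1].split()
        fields = dict(t.split(":", 1) for t in tokens if ":" in t)
        counts[(fields["project"], fields["error"])] += 1
    return counts

# Three lines copied from the logs posted earlier in this thread.
sample_log = "\n".join([
    "18:15:14:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:18237 run:1026 clone:0 gen:94 core:0x24 unit:0x000000000000005e0000473d00000402",
    "18:15:31:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:18237 run:904 clone:1 gen:32 core:0x24 unit:0x00000001000000200000473d00000388",
    "15:28:38:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:18230 run:108 clone:3 gen:85 core:0x24 unit:0x0000000300000055000047360000006c",
])

for (project, error), n in sorted(tally_results(sample_log).items()):
    print(project, error, n)
```

A project that shows FAULTY results while others on the same card show only NO_ERROR would be the one to compare power draw and clocks against.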
Fold them if you get them!