Server did not like results, dumping

Posted: Thu Mar 26, 2020 10:57 am
by CrazyRyan
My computer spent 10 hours completing this work unit: Project 14182 (Run 15, Clone 227, Gen 19).

However, the result was dumped. :shock:
Does anyone have any idea what happened?

Code:

01:01:39:WU03:FS00:Starting
01:01:39:WU03:FS00:Running FahCore: "D:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" "D:\Temp files\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe" -dir 03 -suffix 01 -version 705 -lifeline 20592 -checkpoint 7 -np 11
01:01:39:WU03:FS00:Started FahCore on PID 114308
01:01:39:WU03:FS00:Core PID:95292
01:01:39:WU03:FS00:FahCore 0xa7 started
01:01:40:WU03:FS00:0xa7:*********************** Log Started 2020-03-26T01:01:39Z ***********************
01:01:40:WU03:FS00:0xa7:************************** Gromacs Folding@home Core ***************************
01:01:40:WU03:FS00:0xa7:       Type: 0xa7
01:01:40:WU03:FS00:0xa7:       Core: Gromacs
01:01:40:WU03:FS00:0xa7:       Args: -dir 03 -suffix 01 -version 705 -lifeline 114308 -checkpoint 7 -np
01:01:40:WU03:FS00:0xa7:             11
01:01:40:WU03:FS00:0xa7:************************************ CBang *************************************
01:01:40:WU03:FS00:0xa7:       Date: Oct 26 2019
01:01:40:WU03:FS00:0xa7:       Time: 01:38:25
01:01:40:WU03:FS00:0xa7:   Revision: c46a1a011a24143739ac7218c5a435f66777f62f
01:01:40:WU03:FS00:0xa7:     Branch: master
01:01:40:WU03:FS00:0xa7:   Compiler: Visual C++ 2008
01:01:40:WU03:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
01:01:40:WU03:FS00:0xa7:   Platform: win32 10
01:01:40:WU03:FS00:0xa7:       Bits: 64
01:01:40:WU03:FS00:0xa7:       Mode: Release
01:01:40:WU03:FS00:0xa7:************************************ System ************************************
01:01:40:WU03:FS00:0xa7:        CPU: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
01:01:40:WU03:FS00:0xa7:     CPU ID: GenuineIntel Family 6 Model 158 Stepping 10
01:01:40:WU03:FS00:0xa7:       CPUs: 12
01:01:40:WU03:FS00:0xa7:     Memory: 15.85GiB
01:01:34:WU03:FS00:Assigned to work server 155.247.166.219
01:01:34:WU03:FS00:Requesting new work unit for slot 00: READY cpu:11 from 155.247.166.219
01:01:34:WU03:FS00:Connecting to 155.247.166.219:8080
01:01:35:WU03:FS00:Downloading 1.72MiB
01:01:39:WU03:FS00:Download complete
01:01:39:WU03:FS00:Received Unit: id:03 state:DOWNLOAD error:NO_ERROR project:14182 run:15 clone:227 gen:19 core:0xa7 unit:0x000000150002894b5dae08925ed097c8

01:01:40:WU03:FS00:0xa7:Free Memory: 4.00GiB
01:01:40:WU03:FS00:0xa7:    Threads: WINDOWS_THREADS
01:01:40:WU03:FS00:0xa7: OS Version: 6.2
01:01:40:WU03:FS00:0xa7:Has Battery: true
01:01:40:WU03:FS00:0xa7: On Battery: false
01:01:40:WU03:FS00:0xa7: UTC Offset: 8
01:01:40:WU03:FS00:0xa7:        PID: 95292
01:01:40:WU03:FS00:0xa7:        CWD: D:\Temp files\FAHClient\work
01:01:40:WU03:FS00:0xa7:******************************** Build - libFAH ********************************
01:01:40:WU03:FS00:0xa7:    Version: 0.0.18
01:01:40:WU03:FS00:0xa7:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
01:01:40:WU03:FS00:0xa7:  Copyright: 2019 foldingathome.org
01:01:40:WU03:FS00:0xa7:   Homepage: https://foldingathome.org/
01:01:40:WU03:FS00:0xa7:       Date: Oct 26 2019
01:01:40:WU03:FS00:0xa7:       Time: 01:52:30
01:01:40:WU03:FS00:0xa7:   Revision: c1e3513b1bc0c16013668f2173ee969e5995b38e
01:01:40:WU03:FS00:0xa7:     Branch: master
01:01:40:WU03:FS00:0xa7:   Compiler: Visual C++ 2008
01:01:40:WU03:FS00:0xa7:    Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
01:01:40:WU03:FS00:0xa7:   Platform: win32 10
01:01:40:WU03:FS00:0xa7:       Bits: 64
01:01:40:WU03:FS00:0xa7:       Mode: Release
01:01:40:WU03:FS00:0xa7:************************************ Build *************************************
01:01:40:WU03:FS00:0xa7:       SIMD: avx_256
01:01:40:WU03:FS00:0xa7:********************************************************************************
01:01:40:WU03:FS00:0xa7:Project: 14182 (Run 15, Clone 227, Gen 19)
01:01:40:WU03:FS00:0xa7:Unit: 0x000000150002894b5dae08925ed097c8
01:01:40:WU03:FS00:0xa7:Reading tar file core.xml
01:01:40:WU03:FS00:0xa7:Reading tar file frame19.tpr
01:01:40:WU03:FS00:0xa7:Digital signatures verified
01:01:40:WU03:FS00:0xa7:Reducing thread count from 11 to 10 to avoid domain decomposition by a prime number > 3
01:01:40:WU03:FS00:0xa7:Calling: mdrun -s frame19.tpr -o frame19.trr -cpt 7 -nt 10
01:01:40:WU03:FS00:0xa7:Steps: first=47500000 total=2500000
01:01:43:WU03:FS00:0xa7:Completed 1 out of 2500000 steps (0%)
……
10:25:31:WU03:FS00:0xa7:Completed 2500000 out of 2500000 steps (100%)
10:25:34:WU03:FS00:0xa7:Saving result file ..\logfile_01.txt
10:25:34:WU03:FS00:0xa7:Saving result file frame19.trr
10:25:34:WU03:FS00:0xa7:Saving result file md.log
10:25:34:WU03:FS00:0xa7:Saving result file science.log
10:25:34:WU03:FS00:0xa7:Saving result file traj_comp.xtc
10:25:34:WU03:FS00:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
10:25:34:WU03:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
10:25:34:WU03:FS00:Sending unit results: id:03 state:SEND error:NO_ERROR project:14182 run:15 clone:227 gen:19 core:0xa7 unit:0x000000150002894b5dae08925ed097c8
10:25:34:WU03:FS00:Uploading 5.43MiB to 155.247.166.219
10:25:34:WU03:FS00:Connecting to 155.247.166.219:8080
10:25:40:WU03:FS00:Upload 21.85%
10:25:46:WU03:FS00:Upload 44.86%
10:25:52:WU03:FS00:Upload 72.46%
10:25:58:WU03:FS00:Upload 98.91%
10:25:58:WU03:FS00:Upload complete
10:25:58:WU03:FS00:Server responded WORK_QUIT (404)
10:25:58:WARNING:WU03:FS00:Server did not like results, dumping
10:25:58:WU03:FS00:Cleaning up

Re: Server did not like results, dumping

Posted: Thu Mar 26, 2020 12:04 pm
by lazyacevw
It looks like the server didn't like your completed work. If you are overclocking, revert to factory settings to test, as overclocking can result in failed calculations. An insufficient power supply or poor power regulation can also result in failed calculations. On Windows, you can check your voltages pretty easily with HWMonitor or a similar tool. viewtopic.php?f=19&t=25200&start=15 Low voltages on the rails can cause problems.

Also, check out here: viewtopic.php?f=19&t=16526

Re: Server did not like results, dumping

Posted: Thu Mar 26, 2020 12:48 pm
by CrazyRyan
lazyacevw wrote: It looks like the server didn't like your completed work. If you are overclocking, revert to factory settings to test, as overclocking can result in failed calculations. An insufficient power supply or poor power regulation can also result in failed calculations. On Windows, you can check your voltages pretty easily with HWMonitor or a similar tool. viewtopic.php?f=19&t=25200&start=15 Low voltages on the rails can cause problems.

Also, check out here: viewtopic.php?f=19&t=16526
Thanks for your reply!
I think the cause might be that I let my laptop run for more than 24 hours and its voltage is not stable enough.

Re: Server did not like results, dumping

Posted: Thu Mar 26, 2020 6:25 pm
by toTOW
Let us know if the problem occurs again. If it does, you'll have to suspect your network connection ...

Re: Server did not like results, dumping

Posted: Thu Mar 26, 2020 6:42 pm
by Empie
I've had the same thing on the same IP address: 155.247.166.219. My system is/seems to be stable, no overclocks. On the Dutch Power Cows forum, some people also mention this server.
My failed WU:

Code:

14:59:16:WU02:FS01:0x22:Project: 11764 (Run 0, Clone 6502, Gen 29)
17:27:48:WU02:FS01:Uploading 55.24MiB to 155.247.166.219
17:29:28:WU02:FS01:Server responded WORK_QUIT (404)
17:29:28:WARNING:WU02:FS01:Server did not like results, dumping
It seems the 55.24MiB WUs are now accepted for upload, but someone also mentioned another WU size.
I haven't got many detailed reports yet, but it might be worth checking out.

Edit: check here and judge for yourself (you might need Google Translate since it is all in Dutch, though the Dutch speak English fine, btw): https://gathering.tweakers.net/forum/li ... 80390/last

Re: Server did not like results, dumping

Posted: Thu Mar 26, 2020 10:07 pm
by MarcusTral
Same problem with my WU and the 155.247.166.219 server.
Like Empie, I have no modifications and I'm connected directly to power.

Code:

12:41:12:WU02:FS00:Uploading 10.65MiB to 128.252.203.9
12:41:12:WU02:FS00:Connecting to 128.252.203.9:8080
12:41:21:WU02:FS00:Upload 0.59%
12:41:45:39:127.0.0.1:New Web connection
12:42:34:WU02:FS00:Upload 1.17%
12:42:34:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
12:42:34:WU02:FS00:Trying to send results to collection server
12:42:34:WU02:FS00:Uploading 10.65MiB to 155.247.166.219
12:42:34:WU02:FS00:Connecting to 155.247.166.219:8080
12:42:40:WU02:FS00:Upload 26.99%
12:42:46:WU02:FS00:Upload 56.32%
12:42:52:WU02:FS00:Upload 85.65%
12:42:55:WU02:FS00:Upload complete
12:42:55:WU02:FS00:Server responded WORK_QUIT (404)
12:42:55:WARNING:WU02:FS00:Server did not like results, dumping

Re: Server did not like results, dumping

Posted: Fri Mar 27, 2020 7:21 am
by lazyacevw
I've done over 199 WU's in the past two weeks so I've looked back at my logs over the past two days to see if that server is responding the same way to my client. No dumps occurred for that IP but my client has tried to upload to 155.247.166.219 about 5 times unsuccessfully (probably overloaded). Each time, the server is unavailable with "HTTP_SERVICE_UNAVAILABLE" and another server picks it up. Probably for the best!

However, over the past 3 days, I have had three dumps:

14328 (Run 8, Clone 4395, Gen 13) - 0x0000000f9bf7a4d65e6d0c040380eca1 - 155.247.166.220:8080. No processing or obvious errors.
14576 (Run 0, Clone 103, Gen 0) - 0x00000000287234c95e7924259835e046 - ERROR:There is no domain decomposition for 25 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
14576 (Run 0, Clone 73, Gen 0) - 0x00000001287234c95e792433e86fa7d3 - ERROR:There is no domain decomposition for 25 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm

It looks like one poorly crafted WU and one unspecified fault. I opened a thread on the last one: viewtopic.php?f=19&t=33215
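
The log check described above (looking back through the client log to see how each server answered an upload) can be sketched roughly as follows. This is only a minimal sketch: the regular expressions are assumptions based on the log lines quoted in this thread, the function name `count_server_responses` is made up for illustration, and the sample lines are copied from MarcusTral's excerpt above rather than from a real log file.

```python
import re
from collections import Counter

# Patterns assumed from the log lines quoted in this thread, not from any
# official description of the client's log format.
UPLOAD_RE = re.compile(
    r"^\d\d:\d\d:\d\d:(WU\d+):FS\d+:Uploading \S+ to (\d+\.\d+\.\d+\.\d+)")
RESPONSE_RE = re.compile(
    r"^\d\d:\d\d:\d\d:(WU\d+):FS\d+:Server responded (\w+) \((\d+)\)")


def count_server_responses(lines):
    """Count (server IP, response name) pairs seen in the log lines.

    A response is attributed to the most recent "Uploading ... to <ip>"
    line logged for the same WU slot.
    """
    last_upload = {}  # WU slot (e.g. "WU02") -> IP of the latest upload target
    counts = Counter()
    for line in lines:
        m = UPLOAD_RE.match(line)
        if m:
            last_upload[m.group(1)] = m.group(2)
            continue
        m = RESPONSE_RE.match(line)
        if m:
            server = last_upload.get(m.group(1), "unknown")
            counts[(server, m.group(2))] += 1
    return counts


# Sample lines copied from the excerpt posted earlier in this thread.
SAMPLE = [
    "12:41:12:WU02:FS00:Uploading 10.65MiB to 128.252.203.9",
    "12:42:34:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed",
    "12:42:34:WU02:FS00:Uploading 10.65MiB to 155.247.166.219",
    "12:42:55:WU02:FS00:Upload complete",
    "12:42:55:WU02:FS00:Server responded WORK_QUIT (404)",
    "12:42:55:WARNING:WU02:FS00:Server did not like results, dumping",
]

counts = count_server_responses(SAMPLE)
for (server, response), n in counts.items():
    print(server, response, n)
```

To run it over a real log, pass the open file object instead of `SAMPLE`; the function only needs an iterable of lines.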

Re: Server did not like results, dumping

Posted: Fri Mar 27, 2020 11:00 pm
by bruce
I wouldn't call it an unspecified fault ... but let me try to explain.

GROMACS was originally designed to run on the researcher's own server, where he could adjust the parameters and restart the simulation easily if it didn't happen to run. When it was adapted to run on all sorts of computers with no opportunity for the researcher to think about what he could do to get a successful run, that imposed unplanned constraints ... and since the re-run would be on a different computer with different characteristics, there are likely to be cases where the error message doesn't really help in real-time.

Suppose you have a protein in a cube of solvent that's X0 units long and you're analyzing it with one CPU. It'll probably run. Now suppose you want it to run faster, so you split up the cube into two slices and assign some atoms to different CPUs. Obviously you have to be able to recombine those two slices, but each slice will compute in half the time. You still have to account for atoms which have a neighboring atom that was separated by the slicing operation, so there has to be some kind of overlap at the edges where the cube was parted.

Now suppose you want to cut up the cube in 5 slices in each of the X, Y, and Z directions (also known as 25 ranks). Chances are there will be all kinds of situations where a small group of atoms (in a cell) finds that the cell has been sliced, making it virtually impossible to reassemble the atomic forces, and the atomic motions are too complex to proceed. Now (without drawing any pictures) try to explain to a programmer what he has to do to properly split, analyze, and recombine this cell and still get the same answer that would have been computed if the analysis was never split across 25 different CPUs. ... or as an alternative, try to explain to the scientist who constructed the protein and the box and the cell in what way they "poorly crafted this WU" and what they could have done differently.

Simplest work-around: Reconfigure the project so it cannot be sliced up in 5 x 5 x 5 slices and hope whoever gets the next WU doesn't run into the same problem simply by not cutting at the same places.
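
To make the slicing argument a bit more concrete, here is a much simplified sketch of the kind of check behind the "no domain decomposition for 25 ranks" error quoted earlier in the thread. It assumes a rectangular box and only asks whether some grid of nx x ny x nz cells, with nx * ny * nz equal to the rank count, keeps every cell at least the minimum cell size along each axis. The 6.0 nm box edge is a made-up value for illustration (the thread does not give the real box), the function names are hypothetical, and the actual GROMACS logic takes more into account than this.

```python
def grids(ranks):
    """All ordered (nx, ny, nz) grids whose cell count equals `ranks`."""
    return [
        (nx, ny, ranks // (nx * ny))
        for nx in range(1, ranks + 1) if ranks % nx == 0
        for ny in range(1, ranks // nx + 1) if (ranks // nx) % ny == 0
    ]


def feasible_grids(ranks, box_nm, min_cell_nm):
    """Grids in which every cell edge is at least `min_cell_nm` long."""
    return [
        g for g in grids(ranks)
        if all(length / n >= min_cell_nm for length, n in zip(box_nm, g))
    ]


BOX_NM = (6.0, 6.0, 6.0)   # hypothetical box size, NOT taken from the WU
MIN_CELL_NM = 1.37225      # minimum cell size from the quoted error message

for ranks in (24, 25):
    print(ranks, feasible_grids(ranks, BOX_NM, MIN_CELL_NM))
```

With these assumed numbers, every grid for 25 ranks needs at least 5 slices along some axis, which would make the cells narrower than the minimum, while 24 ranks can be arranged as 2 x 3 x 4. That is one way to picture why changing the rank count can sidestep the problem.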

Re: Server did not like results, dumping

Posted: Sat Mar 28, 2020 6:59 am
by lazyacevw
Great background info! I knew there was a good reason why smaller projects like Rosetta@home only have CPU-based workloads. Making the programming side of it work is very complex and will forever need tending to keep running smoothly. The company I work for has some custom-built software and it works OK, but over the years slight changes have been made to its operating environment, and now it has hiccups every now and again if you push it too hard. The problem is that the company no longer has the resources to rework the program to keep it operating smoothly. So now we, as sysadmins, have to do the best we can to work around the issue and keep the environment in the state that the software needs to stay running error-free.

I remember reading that F@H luckily got assistance from NVIDIA and a few other companies to get their GPU processing routines up and running. I don't envy those who manage F@H and keep it running, but I certainly would like to do everything I can to help! I'm sure that is, in some small way at least, why everybody is donating resources.