My computer spent 10 hours completing this work unit: Project: 14182 (Run 15, Clone 227, Gen 19).
However, the result was dumped.
Does anyone have any idea what happened?
It looks like the server didn't like your completed work. If you are overclocking, revert to factory settings to test, as overclocking can result in failed calculations. An insufficient power supply or poor power regulation can also result in failed calculations. On Windows, you can check your voltages pretty easily with HWMonitor or a similar tool. viewtopic.php?f=19&t=25200&start=15 Low voltages on the rails can cause problems.
Also, check out here: viewtopic.php?f=19&t=16526
Thanks for your reply!
I think it might be because I let my laptop run for more than 24 hours and its voltage is not stable enough.
I've had the same thing with the same IP address: 155.247.166.219. My system seems to be stable, with no overclocks. On the Dutch Power Cows forum some people also mention this server.
My failed WU:
14:59:16:WU02:FS01:0x22:Project: 11764 (Run 0, Clone 6502, Gen 29)
17:27:48:WU02:FS01:Uploading 55.24MiB to 155.247.166.219
17:29:28:WU02:FS01:Server responded WORK_QUIT (404)
17:29:28:WARNING:WU02:FS01:Server did not like results, dumping
It seems the 55.24MiB WUs are now being accepted for upload, but someone also mentioned another WU size.
I haven't got many detailed reports yet, but it might be something to check out.
Edit: check here and judge for yourself (you might need Google Translate, since it is all in Dutch, though the Dutch speak English fine, by the way): https://gathering.tweakers.net/forum/li ... 80390/last
12:41:12:WU02:FS00:Uploading 10.65MiB to 128.252.203.9
12:41:12:WU02:FS00:Connecting to 128.252.203.9:8080
12:41:21:WU02:FS00:Upload 0.59%
12:41:45:39:127.0.0.1:New Web connection
12:42:34:WU02:FS00:Upload 1.17%
12:42:34:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
12:42:34:WU02:FS00:Trying to send results to collection server
12:42:34:WU02:FS00:Uploading 10.65MiB to 155.247.166.219
12:42:34:WU02:FS00:Connecting to 155.247.166.219:8080
12:42:40:WU02:FS00:Upload 26.99%
12:42:46:WU02:FS00:Upload 56.32%
12:42:52:WU02:FS00:Upload 85.65%
12:42:55:WU02:FS00:Upload complete
12:42:55:WU02:FS00:Server responded WORK_QUIT (404)
12:42:55:WARNING:WU02:FS00:Server did not like results, dumping
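For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch. It assumes the log lines look like the excerpts posted in this thread (an "Uploading ... to <IP>" line followed later by "Server did not like results, dumping"). The function name find_dumps and the sample lines are only illustrative, and other client versions may format their logs differently.

```python
import re
from collections import Counter

# Patterns modelled on the log excerpts in this thread (an assumption about the format).
UPLOAD_RE = re.compile(
    r"^(\d\d:\d\d:\d\d):(WU\d+):(FS\d+):Uploading \S+ to (\d+\.\d+\.\d+\.\d+)"
)
DUMP_RE = re.compile(
    r"^(\d\d:\d\d:\d\d):WARNING:(WU\d+):(FS\d+):Server did not like results, dumping"
)


def find_dumps(lines):
    """Return (time, wu, slot, server_ip) for every dumped result."""
    last_upload = {}  # (wu, slot) -> IP of the most recent upload target
    dumps = []
    for line in lines:
        line = line.strip()
        m = UPLOAD_RE.match(line)
        if m:
            _, wu, slot, ip = m.groups()
            last_upload[(wu, slot)] = ip
            continue
        m = DUMP_RE.match(line)
        if m:
            time, wu, slot = m.groups()
            dumps.append((time, wu, slot, last_upload.get((wu, slot), "unknown")))
    return dumps


# Sample lines copied from the log excerpt above.
SAMPLE_LOG = """\
12:41:12:WU02:FS00:Uploading 10.65MiB to 128.252.203.9
12:42:34:WARNING:WU02:FS00:Exception: Failed to send results to work server: Transfer failed
12:42:34:WU02:FS00:Uploading 10.65MiB to 155.247.166.219
12:42:55:WU02:FS00:Server responded WORK_QUIT (404)
12:42:55:WARNING:WU02:FS00:Server did not like results, dumping
""".splitlines()

dumps = find_dumps(SAMPLE_LOG)
per_server = Counter(ip for _, _, _, ip in dumps)
for time, wu, slot, ip in dumps:
    print(f"{time} {wu}/{slot} dumped after upload to {ip}")
```

To run it against a real log, pass the lines of the log file to find_dumps instead of SAMPLE_LOG. The per_server counter then shows how many dumps each server was responsible for.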
I've done over 199 WUs in the past two weeks, so I looked back through my logs from the past two days to see whether that server is responding the same way to my client. No dumps occurred for that IP, but my client has tried to upload to 155.247.166.219 about 5 times unsuccessfully (it is probably overloaded). Each time, the server was unavailable with "HTTP_SERVICE_UNAVAILABLE" and another server picked up the upload. Probably for the best!
However, over the past 3 days, I have had three dumps:
14328 (Run 8, Clone 4395, Gen 13) - 0x0000000f9bf7a4d65e6d0c040380eca1 - 155.247.166.220:8080. No processing or obvious errors.
14576 (Run 0, Clone 103, Gen 0) - 0x00000000287234c95e7924259835e046 - ERROR:There is no domain decomposition for 25 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
14576 (Run 0, Clone 73, Gen 0) - 0x00000001287234c95e792433e86fa7d3 - ERROR:There is no domain decomposition for 25 ranks that is compatible with the given box and a minimum cell size of 1.37225 nm
It looks like one poorly crafted WU and one unspecified fault. I opened a thread on the last one: viewtopic.php?f=19&t=33215
I wouldn't call it an unspecified fault ... but let me try to explain.
GROMACS was originally designed to run on the researcher's own server, where he could adjust the parameters and restart the simulation easily if it didn't happen to run. When it was adapted to run on all sorts of computers with no opportunity for the researcher to think about what he could do to get a successful run, that imposed unplanned constraints ... and since the re-run would be on a different computer with different characteristics, there are likely to be cases where the error message doesn't really help in real-time.
Suppose you have a protein in a cube of solvent that's X0 units long and you're analyzing it with one CPU. It'll probably run. Now suppose you want it to run faster, so you split the cube into two slices and assign the atoms in each slice to a different CPU. Obviously you have to be able to recombine those two slices, but each slice will compute in half the time. You still have to account for atoms whose neighboring atoms were separated from them by the slicing operation, so there has to be some kind of overlap at the edges where the cube was parted.
Now suppose you want to cut up the cube into 25 pieces, for example 5 slices in each of two of the X, Y, and Z directions (5 x 5 = 25 domains, one per rank, hence "25 ranks"). Chances are there will be all kinds of situations where a small group of atoms (in a cell) finds that the cell has been sliced, making it virtually impossible to reassemble the atomic forces and atomic motions, so the problem becomes too complex to proceed. Now (without drawing any pictures) try to explain to a programmer what he has to do to properly split, analyze, and recombine this cell and still get the same answer that would have been computed if the analysis had never been split across 25 different CPUs ... or, as an alternative, try to explain to the scientist who constructed the protein and the box and the cell in what way they "poorly crafted this WU" and what they could have done differently.
Simplest work-around: Reconfigure the project so it cannot be sliced up into 25 ranks, and hope whoever gets the next WU doesn't run into the same problem, simply by not cutting at the same places.
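To make the "no domain decomposition for 25 ranks" message a bit more concrete, here is a rough Python sketch of the underlying arithmetic. It is not how GROMACS actually decides: the real code also deals with separate PME ranks, load-balancing margins, and non-rectangular boxes. The box sizes below are made up for illustration; only the minimum cell size of 1.37225 nm comes from the error message quoted earlier in the thread. The function names grids and compatible_grids are hypothetical.

```python
def grids(n_ranks):
    """All (nx, ny, nz) grids with nx * ny * nz == n_ranks."""
    out = []
    for nx in range(1, n_ranks + 1):
        if n_ranks % nx:
            continue
        rest = n_ranks // nx
        for ny in range(1, rest + 1):
            if rest % ny == 0:
                out.append((nx, ny, rest // ny))
    return out


def compatible_grids(box_nm, n_ranks, min_cell_nm):
    """Grids whose cells are at least min_cell_nm long along every axis."""
    return [
        g for g in grids(n_ranks)
        if all(length / n >= min_cell_nm for length, n in zip(box_nm, g))
    ]


MIN_CELL_NM = 1.37225          # from the error message in this thread
SMALL_BOX = (6.0, 6.0, 6.0)    # hypothetical box, in nm
LARGER_BOX = (7.0, 7.0, 7.0)   # hypothetical box, in nm

# 25 only factors as 25 x 1 x 1 or 5 x 5 x 1 (and permutations), so a
# 6 nm box would need cells of 1.2 nm or smaller: nothing fits.
print(compatible_grids(SMALL_BOX, 25, MIN_CELL_NM))
# 24 ranks can use a 4 x 3 x 2 grid, whose smallest cell is 1.5 nm.
print(compatible_grids(SMALL_BOX, 24, MIN_CELL_NM))
# A slightly larger box makes 5 x 5 x 1 possible (1.4 nm cells).
print(compatible_grids(LARGER_BOX, 25, MIN_CELL_NM))
```

Under these simplified assumptions, a rank count like 25 has very few ways to be factored into a grid, so it is easy to end up with no layout whose cells are large enough, while a nearby count such as 24 has more options.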
Great background info! I knew there was a good reason why smaller projects like Rosetta@home only have CPU-based workloads. Making the programming side of it work is genuinely complex and will always need tending to keep it running smoothly. The company I work for has some custom-built software that works OK, but over the years slight changes have been made to its operating environment, and now it has hiccups every now and again if you push it too hard. The problem is that the company no longer has the resources to rework the program to keep it operating smoothly. So now we sysadmins have to do the best we can to work around the issue and keep the environment in the state the software needs to stay running error-free.
I remember reading that F@H luckily got assistance from NVIDIA and a few other companies to get their GPU processing routines up and running. I don't envy those who manage F@H and keep it running but I certainly would like to do everything I can to help! I'm sure that is in some small way at least, why everybody is donating resources.