Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Mon Jan 26, 2009 10:34 pm
by alpha754293
Error/bombed out.

Here's the fahlog.txt:

Code:

[15:47:02] - Preparing to get new work unit...
[15:47:02] + Attempting to get work packet
[15:47:02] - Connecting to assignment server
[15:47:02] - Successful: assigned to (171.67.108.24).
[15:47:02] + News From Folding@Home: Welcome to Folding@Home
[15:47:02] Loaded queue successfully.
[15:47:19] + Closed connections
[15:47:19] 
[15:47:19] + Processing work unit
[15:47:19] Core required: FahCore_a2.exe
[15:47:19] Core found.
[15:47:19] Working on queue slot 04 [January 26 15:47:19 UTC]
[15:47:19] + Working ...
[15:47:19] 
[15:47:19] *------------------------------*
[15:47:19] Folding@Home Gromacs SMP Core
[15:47:19] Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)
[15:47:19] 
[15:47:19] Preparing to commence simulation
[15:47:19] - Ensuring status. Please wait.
[15:47:20] Called DecompressByteArray: compressed_data_size=4840724 data_size=24028493, decompressed_data_size=24028493 diff=0
[15:47:20] - Digital signature verified
[15:47:20] 
[15:47:20] Project: 2671 (Run 36, Clone 26, Gen 73)
[15:47:20] 
[15:47:21] Assembly optimizations on if available.
[15:47:21] Entering M.D.
[15:47:30] Run 36, Clone 26, Gen 73)
[15:47:30] 
[15:47:30] Entering M.D.
[15:56:26] Completed 5008 out of 250000 steps  (2%)
[16:00:49] Completed 7508 out of 250000 steps  (3%)
[16:05:13] Completed 10008 out of 250000 steps  (4%)
[16:09:37] Completed 12508 out of 250000 steps  (5%)
[16:14:01] Completed 15008 out of 250000 steps  (6%)
[16:18:25] Completed 17508 out of 250000 steps  (7%)
[16:22:49] Completed 20008 out of 250000 steps  (8%)
[16:27:13] Completed 22508 out of 250000 steps  (9%)
[16:31:37] Completed 25008 out of 250000 steps  (10%)
[16:36:02] Completed 27508 out of 250000 steps  (11%)
[16:40:26] Completed 30008 out of 250000 steps  (12%)
[16:44:50] Completed 32508 out of 250000 steps  (13%)
[16:49:15] Completed 35008 out of 250000 steps  (14%)
[16:53:39] Completed 37508 out of 250000 steps  (15%)
[16:58:08] Completed 40008 out of 250000 steps  (16%)
[17:02:37] Completed 42508 out of 250000 steps  (17%)
[17:07:05] Completed 45008 out of 250000 steps  (18%)

Here's what it says in the console:

Code:

------------------------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3. will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 8

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MP
------------------------------------------------------------------------
Program mdrun. VERSION 3.3.99_development_200800503
Source code file: nsgrid.c , line: 358

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +- Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483649. It should have been within [ 0 .. 1540 ]
------------------------------------------------------------------------

Thanx for using GROMACS - Have a Nice Day

Error on node 5, will try to stop all the nodes
Halting parallel program mdrun on CPU 5 out of 8

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_5]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
I_Abort(MPI_COMM_WORLD, -1) - process 3
I_Abort(MPI_COMM_WORLD, -1) - process 5
Run stopped. No prompt. F@H Halted. Abnormal program termination.
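
For anyone unfamiliar with the range check quoted in the console output, here is a minimal sketch of the idea, assuming a one-dimensional box split into equal cells. The function name `cell_index` and its parameters are hypothetical and are not taken from the GROMACS source; the point is only that a coordinate that has become NaN or infinite cannot be mapped to a valid cell, so the check fires at the grid assignment rather than where the bad value was first produced.

```python
import math


def cell_index(coord, box_length, n_cells):
    """Map a coordinate in [0, box_length) to a grid cell, with a range check.

    Hypothetical helper for illustration; not the GROMACS implementation.
    """
    if not math.isfinite(coord):
        # NaN or +-Infinity: there is no meaningful cell for this particle.
        raise ValueError(f"coordinate is {coord}; cannot place it on the grid")
    ci = int(math.floor(coord / box_length * n_cells))
    if not 0 <= ci < n_cells:
        raise ValueError(
            f"ci has value {ci}, should be within [0 .. {n_cells - 1}]"
        )
    return ci


if __name__ == "__main__":
    print(cell_index(2.5, 10.0, 20))  # a well-behaved particle
    for bad in (float("nan"), float("inf"), 1e30):
        try:
            cell_index(bad, 10.0, 20)
        except ValueError as err:
            print("range check failed:", err)
```

In a compiled language the conversion of a NaN to an integer gives a meaningless value instead of raising an error, which would be consistent with the absurd `ci` reported above.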

Suggestions?

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Mon Jan 26, 2009 11:58 pm
by toTOW
Isn't this the second time you've gotten this kind of error?

There's no data for this WU in the DB yet.

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Tue Jan 27, 2009 12:05 am
by alpha754293
toTOW wrote:Isn't this the second time you've gotten this kind of error?

There's no data for this WU in the DB yet.

Uh... honestly, I don't know.

The first time it said that it was because the molecule was unstable.

This time, I think it's ever so slightly different, in the sense that I may have encountered a diverging solution which caused the velocities of the molecules to go to inf./NaN.

So they're computationally different (if I understand what it's reporting/saying correctly, or at least interpreting it correctly).

Instabilities can be detected via the FFT (I'm like...making a wild guess here) within the code.

Velocity is usually the first time derivative of position, so while the two failures may end up with similar errors, the causes can be very different and mean very different things altogether.

I couldn't scroll back through the console output because I was running a text-only console.

I miss the good old days of the DEC/Alpha/VT100 terminals where you can scroll back up and then copy&paste. *sigh*

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Tue Jan 27, 2009 12:20 am
by bruce
alpha754293 wrote:Instabilities can be detected via the FFT (I'm like...making a wild guess here) within the code.

Only if FFT is part of the FahCore you're running. Different cores use different computational methods.

alpha754293 wrote:Velocity is usually the first time derivative of position, so while the two failures may end up with similar errors, the causes can be very different and mean very different things altogether.

Not to belabor a point, but positions and velocities are both obtained by numerical integrals, not numerical derivatives -- with suitable adjustments for Brownian Motion.
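
To make the integration point concrete, here is a minimal sketch of a leapfrog-style update for a single particle on a spring. It is only an illustration: the names `leapfrog` and `spring_force` are made up for this example, the Brownian/thermostat adjustments are left out, and whether the FahCore uses exactly this scheme is an assumption. Velocity is advanced from the force, then position is advanced from the velocity; nothing is differentiated. With too large a time step the same loop diverges and ends up at inf/NaN.

```python
import math


def leapfrog(x, v_half, force, mass, dt, steps):
    """Integrate one particle in 1D.

    v_half is the velocity half a step behind x, as in a leapfrog scheme.
    Returns a list of (position, velocity) pairs, one per step.
    """
    trajectory = []
    for _ in range(steps):
        a = force(x) / mass
        v_half = v_half + a * dt  # velocity: integral of acceleration
        x = x + v_half * dt       # position: integral of velocity
        trajectory.append((x, v_half))
    return trajectory


def spring_force(x, k=1.0):
    """Hypothetical test force: a harmonic spring pulling toward x = 0."""
    return -k * x


# Small time step: the oscillation stays bounded.
stable = leapfrog(1.0, 0.0, spring_force, mass=1.0, dt=0.1, steps=1000)

# Time step beyond the stability limit: the amplitude grows every step
# until the floating-point values overflow.
unstable = leapfrog(1.0, 0.0, spring_force, mass=1.0, dt=2.5, steps=1000)

if __name__ == "__main__":
    print("stable max |x|:", max(abs(x) for x, _ in stable))
    print("unstable final x is finite:", math.isfinite(unstable[-1][0]))
```

A real core has many safeguards this toy lacks, but the failure mode is the same in kind: once one value overflows, every later step inherits it.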

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Tue Jan 27, 2009 12:56 am
by alpha754293
bruce wrote:
alpha754293 wrote:Instabilities can be detected via the FFT (I'm like...making a wild guess here) within the code.

Only if FFT is part of the FahCore you're running. Different cores use different computational methods.

alpha754293 wrote:Velocity is usually the first time derivative of position, so while the two failures may end up with similar errors, the causes can be very different and mean very different things altogether.

Not to belabor a point, but positions and velocities are both obtained by numerical integrals, not numerical derivatives -- with suitable adjustments for Brownian Motion.

Integral ONLY if velocities are calculated first.

But considering that it's supposed to match up to some sort of grid, I would think that the program's probably tracking the time-dependent positions, and taking the derivative in order to obtain the velocity.

On the other hand, if it is calculating the velocities first, then yes, you are absolutely correct. I have no idea how they would solve the momentum equations (if that's indeed what they're using) to obtain the velocities.

I would think that FFTs would be one of the quicker ways (again, wild guess here) to determine if there are any vibrational characteristics. You track the molecule's position as a function of time, and given that we're talking time scales of 10^-12, I would think that it wouldn't take long to get FFT results.
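
As a rough sketch of that idea (and only of the idea; nothing here reflects how the FahCore or GROMACS actually works), the following finds the dominant frequency in a series of positions sampled at a fixed interval. It uses a naive O(n^2) discrete Fourier transform from the standard library in place of a real FFT, and the names `dft_magnitudes` and `dominant_frequency`, the sample interval and the test signal are all made up for illustration.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of the DFT bins 0 .. n/2 (naive O(n^2) transform)."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(total))
    return mags


def dominant_frequency(samples, dt):
    """Frequency of the strongest non-constant component in the samples."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]  # remove the constant offset
    mags = dft_magnitudes(centered)
    k = max(range(1, len(mags)), key=mags.__getitem__)
    return k / (len(samples) * dt)


# Hypothetical test signal: a position oscillating 5 times per unit time,
# sampled every 0.01 time units.
dt = 0.01
positions = [math.sin(2 * math.pi * 5.0 * i * dt) for i in range(200)]
freq = dominant_frequency(positions, dt)

if __name__ == "__main__":
    print("dominant frequency:", freq)
```

This only picks out a vibration frequency from a recorded trajectory. Spotting a run that is about to blow up is a different problem.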

I have no idea if F@H even took the FFT part out of the GROMACS core. *shrug* Who knows. The GROMACS v4 user's manual is all I have to go on. That, and apparently they're still working on implementing FFTW (3D decomposition for FFTs, needed for the PME algorithm). Source: wiki.gromacs.org

*shrug*

I'm a mechanical engineer by training, so this stuff pertaining to MD and F@H and biochemistry/computational chemistry/coding/programming -- it's over my head.

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Tue Jan 27, 2009 12:57 am
by alpha754293
Getting back on topic:

Do I keep the WU? Send it back to Pande Group? Purge it? Please let me know whenever you can. Thanks.

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Tue Jan 27, 2009 1:48 am
by 7im
No specific user intervention is needed. The client is designed to handle errors as appropriate. Restart the client. It will either dump the WU, and download a new WU, or it will upload partial results, and then get a new WU. In either case, the server sees that you requested a new WU, and that is noted in the server logs. That's enough for Pande Group to act on that WU if they so choose.

Speaking from past experience, NaN errors are typically related to hardware problems in the computer. It doesn't mean the hardware is bad, but it might. A loose DIMM is just as problematic as having the incorrect RAM voltage set in the BIOS, or having the RAM timings set too aggressively.

And if another user completes these work units to 100 percent, that would be another indication there is a system problem. If others error out at the same place, then it's likely a bad WU. Be nice to the Mods and Admins, and they might check the WU logs for you again in a few days to see which way it went. :twisted:

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Tue Jan 27, 2009 4:03 am
by alpha754293
7im wrote:No specific user intervention is needed. The client is designed to handle errors as appropriate. Restart the client. It will either dump the WU, and download a new WU, or it will upload partial results, and then get a new WU. In either case, the server sees that you requested a new WU, and that is noted in the server logs. That's enough for Pande Group to act on that WU if they so choose.

Speaking from past experience, NaN errors are typically related to hardware problems in the computer. It doesn't mean the hardware is bad, but it might. A loose DIMM is just as problematic as having the incorrect RAM voltage set in the BIOS, or having the RAM timings set too aggressively.

And if another user completes these work units to 100 percent, that would be another indication there is a system problem. If others error out at the same place, then it's likely a bad WU. Be nice to the Mods and Admins, and they might check the WU logs for you again in a few days to see which way it went. :twisted:
I am nice. :D lol. j/k. sorta.

Interestingly enough, I had ONE case where I was running the client and stopped it. I re-ran it and it bombed out. Then I re-ran it again, and it worked.

I'm so used to freezing everything on a job that has failed/bombed out computationally, in case there's something that can be read/processed from whatever failed in order to try and pinpoint the cause of failure.

It's like computational forensics after the program has died, you know?

*edit*
Here's something for your reading pleasure -- I just restarted that client right now (without purging the failed WU) AND so far it's running. Granted, it's only been 15 minutes or so, but....*shrug* *sigh* who knows what's going on there.

*throws arms up in air* I have NOOOO idea.

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Tue Jan 27, 2009 11:41 am
by alpha754293
It's been running for 9 hours since I restarted the client. No further hitches so far.

I wonder what the heck happened originally.

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Tue Jan 27, 2009 3:11 pm
by alpha754293
WU completed successfully. Anybody here as confused as I am?

Re: Project: 2671 (Run 36, Clone 26, Gen 73)

Posted: Tue Jan 27, 2009 3:15 pm
by uncle_fungus
It could have been a random computational error. These things happen ;)