Re: Various Project: 601x problems
Posted: Thu Apr 01, 2010 7:00 pm
by Aardvark
@curby.net:
I think you make a very interesting observation in that you have been able to run the v2.17 a3core successfully on a Mac using Leopard. I believe that every Mac user who has reported on this problem has indicated that they are using Snow Leopard. I no longer have a machine with Leopard installed, so I can't run any trials. I do know that my Mac Mini ran the Snow Leopard/v2.15 core combination very well. That combination ran the 10 WUs that I needed to qualify for the Bonus Point Program without a failure.
It is just possible that we are experiencing some unanticipated interaction between OSX10.6.2 and the v2.17 a3core resulting in marginal folding (aka Failure).
I made an earlier suggestion that it might be worthwhile to identify the Macs having the "troubles" and allow them to return to the v2.15 a3core. That suggestion did not seem to gain any traction.
I certainly agree that we would be better off with some solution that allowed us to run at 80-90% of max capacity than to be caught in the "failure mire" we are presently in.
It would be of interest if any OSX10.6.2 users running the v2.17 a3core and NOT experiencing these early failures would post in this thread and describe what hardware they are using. It might add to the solution, while we are at it....
Re: Various Project: 601x problems
Posted: Thu Apr 01, 2010 7:17 pm
by 7im
The newer cores contain new scientific processes needed by the project, so going backwards isn't really an option. However, a quick resolution going forward is needed. And as with any problem, the better we can describe the problem, the easier it is to fix it.
Yes, please do post more info about hardware and software versions. We'll find the pattern and then have PG work on a solution.
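For anyone gathering the hardware and software details requested above, here is a minimal Python sketch that prints the OS version and CPU architecture in a form that can be pasted into a reply. It uses only the standard `platform` module; the function name `system_report` and the field labels are illustrative assumptions, and the a3core version still has to be copied from the client log by hand.

```python
import platform


def system_report():
    """Collect basic OS and hardware identifiers as a dict of strings."""
    # mac_ver() returns (release, versioninfo, machine); release is ''
    # when not running on a Mac, so fall back to a placeholder.
    mac_release, _, _ = platform.mac_ver()
    return {
        "os": platform.system(),
        "kernel_release": platform.release(),
        "macos_version": mac_release or "n/a",
        "machine": platform.machine(),
        "processor": platform.processor() or "n/a",
    }


if __name__ == "__main__":
    for key, value in system_report().items():
        print(f"{key}: {value}")
```

Posting this output alongside the core version shown in the log should make it easier to spot a pattern across machines.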
Project: 6012 (Run 1, Clone 290, Gen 25)
Posted: Sat Apr 03, 2010 11:33 am
by Aardvark
Subject WU resulted in an early failure. WU failed before the 2% mark.
Client assigned an UNSTABLE_MACHINE category to WU.
Work file seems to be suitable for satisfactory return to Stanford.
Immediately prior to this failed WU, I successfully processed Project: 6020 (Run 0, Clone 45, Gen 103).
Log file for failed WU follows:
Code:
[10:18:04]
[10:18:04] + Processing work unit
[10:18:04] Core required: FahCore_a3.exe
[10:18:04] Core found.
[10:18:04] Working on queue slot 08 [April 3 10:18:04 UTC]
[10:18:04] + Working ...
[10:18:04] - Calling './FahCore_a3.exe -dir work/ -nice 19 -suffix 08 -np 2 -checkpoint 15 -verbose -lifeline 65033 -version 629'
[10:18:04]
[10:18:04] *------------------------------*
[10:18:04] Folding@Home Gromacs SMP Core
[10:18:04] Version 2.17 (Mar 7 2010)
[10:18:04]
[10:18:04] Preparing to commence simulation
[10:18:04] - Looking at optimizations...
[10:18:04] - Created dyn
[10:18:04] - Files status OK
[10:18:05] - Expanded 1797617 -> 2078149 (decompressed 115.6 percent)
[10:18:05] Called DecompressByteArray: compressed_data_size=1797617 data_size=2078149, decompressed_data_size=2078149 diff=0
[10:18:05] - Digital signature verified
[10:18:05]
[10:18:05] Project: 6012 (Run 1, Clone 290, Gen 25)
[10:18:05]
[10:18:05] Assembly optimizations on if available.
[10:18:05] Entering M.D.
Starting 2 threads
NNODES=2, MYRANK=0, HOSTNAME=thread #0
NNODES=2, MYRANK=1, HOSTNAME=thread #1
Reading file work/wudata_08.tpr, VERSION 4.0.99_development_20090605 (single precision)
Note: tpx file_version 68, software version 70
Making 1D domain decomposition 2 x 1 x 1
starting mdrun 'Protein in POPC'
13000001 steps, 26000.0 ps (continuing from step 12500001, 25000.0 ps).
[10:18:12] Completed 0 out of 500000 steps (0%)
[10:39:59] Completed 5000 out of 500000 steps (1%)
-------------------------------------------------------
Program mdrun, VERSION 4.0.99-dev-20100305
Source code file: /Users/kasson/a3_devnew/gromacs/src/mdlib/pme.c, line: 563
Fatal error:
13 particles communicated to PME node 1 are more than a cell length out of the domain decomposition cell of their charge group in dimension x
For more information and tips for trouble shooting please check the GROMACS website at
http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Thanx for Using GROMACS - Have a Nice Day
[10:46:15] mdrun returned 255
[10:46:15] Going to send back what have done -- stepsTotalG=500000
[10:46:15] Work fraction=0.0129 steps=500000.
[10:46:19] logfile size=12500 infoLength=12500 edr=0 trr=25
[10:46:19] logfile size: 12500 info=12500 bed=0 hdr=25
[10:46:19] - Writing 13038 bytes of core data to disk...
[10:46:19] ... Done.
[10:46:19]
[10:46:19] Folding@home Core Shutdown: UNSTABLE_MACHINE
[10:46:20] CoreStatus = 7A (122)
[10:46:20] Sending work to server
[10:46:20] Project: 6012 (Run 1, Clone 290, Gen 25)
[10:46:20] + Attempting to send results [April 3 10:46:20 UTC]
[10:46:20] - Reading file work/wuresults_08.dat from core
[10:46:20] (Read 13038 bytes from disk)
[10:46:21] > Press "c" to connect to the server to upload results
Folding is Fun......
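When comparing many failure reports like the one above, it can help to pull the same few fields out of each log. The following is a rough Python sketch, not part of the Folding@home client: the function name `summarize_log` and the regular expressions are assumptions based only on the line formats visible in this log, and the sample text is copied from it.

```python
import re

# A few lines copied from the log above, used as sample input.
SAMPLE_LOG = """\
[10:18:05] Project: 6012 (Run 1, Clone 290, Gen 25)
[10:18:12] Completed 0 out of 500000 steps (0%)
[10:39:59] Completed 5000 out of 500000 steps (1%)
[10:46:15] mdrun returned 255
[10:46:15] Work fraction=0.0129 steps=500000.
[10:46:19] Folding@home Core Shutdown: UNSTABLE_MACHINE
"""

# Patterns assumed from the line formats seen in this one log.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
PROGRESS_RE = re.compile(r"Completed (\d+) out of (\d+) steps")
FRACTION_RE = re.compile(r"Work fraction=([0-9]*\.?[0-9]+)")
SHUTDOWN_RE = re.compile(r"Core Shutdown: (\w+)")


def summarize_log(text):
    """Return the project ID, last reported step, work fraction, and shutdown status."""
    summary = {
        "project": None,        # (project, run, clone, gen)
        "last_step": None,      # last "Completed N" value seen
        "total_steps": None,
        "work_fraction": None,
        "shutdown": None,
    }
    for line in text.splitlines():
        if (m := PROJECT_RE.search(line)):
            summary["project"] = tuple(int(g) for g in m.groups())
        if (m := PROGRESS_RE.search(line)):
            summary["last_step"] = int(m.group(1))
            summary["total_steps"] = int(m.group(2))
        if (m := FRACTION_RE.search(line)):
            summary["work_fraction"] = float(m.group(1))
        if (m := SHUTDOWN_RE.search(line)):
            summary["shutdown"] = m.group(1)
    return summary


if __name__ == "__main__":
    info = summarize_log(SAMPLE_LOG)
    print(info)
    # Approximate step at which the run stopped, from the reported fraction.
    print(round(info["work_fraction"] * info["total_steps"]))
```

For this log, the reported work fraction of 0.0129 over 500000 steps corresponds to roughly step 6450, which is consistent with the failure occurring before the 2% mark.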
Re: Various Project: 601x problems
Posted: Sat Apr 03, 2010 2:14 pm
by kasson
FYI, our build system for 10.5+ is Snow Leopard; that system hasn't had any problems with any of these cores. But I'm not sure it's running 10.6.2; it's always possible there was a nasty change to some of the shared libraries or something.
Re: Various Project: 601x problems
Posted: Sat Apr 03, 2010 2:53 pm
by Aardvark
And just to make the game more interesting, 10.6.3 is available and entering the Folding mix. I haven't done the upgrade yet but am anticipating doing so. Is there anything suspicious enough here to suggest that delaying the move to 10.6.3 is advisable?
Re: Various Project: 601x problems
Posted: Sat Apr 03, 2010 5:48 pm
by curby.net
Aardvark wrote: I certainly agree that we would be better off with some solution that allowed us to run at 80-90% of max capacity than to be caught in the "failure mire" we are presently in.
I did a throttling test, as posted before, and it didn't seem to help with the errors even though it reduced the temperatures and average CPU loads. That test, combined with stresscpu performance and performance with other folding cores, makes me think that the issue isn't bad RAM, insufficient cooling, or otherwise hardware related. That Aardvark and other users are experiencing corroborating results only strengthens this hunch.
Aardvark wrote: It would be of interest if any OSX10.6.2 users running the v2.17 a3core and NOT experiencing these early failures would post in this thread and describe what hardware they are using. It might add to the solution, while we are at it....
Great idea, but I never would have visited this thread if I hadn't had problems. Perhaps a thread elsewhere soliciting reports from users successfully folding with this combo would get more eyes.
Thanks for all the help, folks! We'll knock this out yet.