Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 11:24 am
by PantherX
AFAICT, SMP:7 and SMP:10 are problematic when it comes to the decomposition of small projects, so these slots will no longer be assigned these projects. I am unsure what SMP:9 would do in this case. If you can test it and report back, that would be nice.

Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 11:58 am
by PantherX
Humanoid1 wrote:...I was trusting this new 7.3.6 client to take/(work with) only WU's that were happy with an odd number of SMP cores and being a prime number at that....Obviously I am left with the conclusion that this newer 7.3.6 client has not been coded to fix this issue meaning anyone running with default settings could be failing all such odd/prime number sensitive WU's...
The FAHClient doesn't have this feature, nor will it in the foreseeable future. The reason is that FAHClient doesn't do the folding; the FahCore does. Thus, the FahCore is responsible for spawning the "correct" number of threads once it has been told what the SMP Slot is configured as. Moreover, the assignment of WUs isn't done by the FAHClient; it is done by the Assignment Server once it gets the required information from FAHClient.
Humanoid1 wrote:...Other experienced long time folders from my home OCF forum confirmed this suspicion.
Now having restarted running SMP10 I just have to wait to receive another of these number sensitive WU's to be 100% sure...
Do note that while you have configured your Slot as SMP:11, it is, in fact, running as SMP:10, since the FahCore automatically remaps 11 to 10, as shown in your log:
08:55:27:WU02:FS01:0xa3:Mapping NT from 11 to 10

The reason for the remapping is that a rather significant number of projects would fail on SMP:11, so it was blacklisted at the FahCore level. SMP:5 and SMP:7 would fail on a minority of projects, specifically those which are small in size (atom count).

As stated, SMP:6, SMP:8 and SMP:12 will work. SMP:7 and SMP:10 will fail. However, I am unsure of what SMP:9 would do (would be nice to test it as previously mentioned).
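To check what a slot is really running as, you can scan the client log for the FahCore's "Mapping NT" line. The following is only a minimal sketch in Python, assuming the log line format matches the one quoted above; the function name `find_nt_remaps` is made up for illustration and is not part of any FAH tool.

```python
import re

# Matches the FahCore remap line, e.g.
# "08:55:27:WU02:FS01:0xa3:Mapping NT from 11 to 10"
NT_MAP = re.compile(r"Mapping NT from (\d+) to (\d+)")

def find_nt_remaps(log_lines):
    """Return (configured, actual) thread counts for each remap line found."""
    remaps = []
    for line in log_lines:
        match = NT_MAP.search(line)
        if match:
            remaps.append((int(match.group(1)), int(match.group(2))))
    return remaps

# Sample input: the log line quoted in this post.
sample = ["08:55:27:WU02:FS01:0xa3:Mapping NT from 11 to 10"]

for configured, actual in find_nt_remaps(sample):
    status = "remapped" if configured != actual else "unchanged"
    print(f"configured SMP:{configured} -> running SMP:{actual} ({status})")
```

To run it against a real log, pass the lines of your own log file instead of `sample`.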

Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 1:02 pm
by Humanoid1
Thanks for the responses (and redirect to this thread/forum section I had missed in my search ;))

Some great information there, cheers PantherX. I try to stay well informed and it is good to get a few points clarified.

After I successfully clear a few SMP WU's and repair my % to ensure the QRB during this last day or so of Chimp Challenge I will give SMP9 a try and update the results in this thread.

I get the impression you don't need the details of the failed projects I had with SMP11(corrected to 10) now we know what is going on.
If you would like them anyways, I will dig them out and post them for you.

Cheers,

Humanoid1

Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 2:36 pm
by Jesse_V
Just to clarify, it's not the client's fault. The problem has to do with how the WU is decomposed and spread across N cores. Some WUs are susceptible to this problem, others aren't. It'd take some more complex server logic to assign WUs such that this problem could be avoided. An easier workaround is to tweak the SMP:N setting client-side.
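As a rough illustration of why some values of N are awkward: the work has to be split into a grid of cells across N threads, and a thread count with a large prime factor leaves few ways to build that grid. That explanation is an assumption on my part, not something confirmed in this thread, and the sketch below is not the FahCore's actual logic. It only computes the largest prime factor of the SMP values discussed here so you can compare them against the reports.

```python
def largest_prime_factor(n):
    """Return the largest prime factor of n (for n >= 2)."""
    largest = 1
    factor = 2
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor
        factor += 1
    if n > 1:
        # Whatever remains is itself a prime larger than any factor found.
        largest = n
    return largest

# SMP values mentioned in this thread.
for smp in (5, 6, 7, 8, 9, 10, 11, 12):
    print(f"SMP:{smp} largest prime factor = {largest_prime_factor(smp)}")
```

This says nothing definite about any particular WU; whether a given project fails still depends on its size and how it decomposes.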

What you did there, I see it. Apollo 13 is a great movie. :)

Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 3:52 pm
by bruce
Also, there's really no way for the Pande Group to predict which projects will fail with, say SMP:10 or even which WUs from a particular project. Once somebody reports problems with, say 10, the assignments can be restricted so that the project will avoid being given to machines configured with that number, but if nobody happened to run 10 while beta testing or if the WUs that were beta tested didn't happen to encounter this problem, the problem might not be discovered until later. Specific numbers, like SMP:11, are known to fail with a much higher probability, so the client will automatically map a setting of 11 to 10, but lots of projects are successful with 10 so it is not automatically remapped.

Has anybody tried smp:9?

Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 4:18 pm
by Yasgur
Had several failures on the 10140's with default settings on v7.3.6. Am running a 3960x and a pair of 6970's, and after setting SMP to 8, it seems to be okay.

Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 4:38 pm
by bruce
Yasgur wrote:Had several failures on the 10140's with default settings on v7.3.6. Am running a 3960x and a pair of 6970's, and after setting SMP to 8, it seems to be okay.
Default settings vary depending on your hardware. You need to describe it in more detail. Do you have an 8-way system that defaults to smp:7 or something else? (Your hardware configuration is not listed in your profile.)

Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 6:12 pm
by Yasgur
The default is SMP 12. It's a 6 core cpu (Intel 3960X) and two gpu's in SLI (Radeon 6970's). I'll update my profile.

Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 6:32 pm
by PantherX
With default settings, it might have been SMP:11 but the FahCore would have remapped it to SMP:10 which is problematic.

However, the assignment settings have been tweaked so SMP:8 (excluding SMP:7) is the maximum now.

Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 6:42 pm
by Yasgur
Aye, you're right, SMP 11. I was looking at my nVidia folder running a 6 core 3930k on v7.2.9 and it shows SMP 12. Sorry about that.

Re: 10138/10139/10140 EUE

Posted: Mon Apr 22, 2013 8:10 pm
by Humanoid1
Thanks for the update PantherX

was good to get this sorted out so fast!

Re: 10138/10139/10140 EUE

Posted: Sun Apr 28, 2013 11:28 pm
by DexterThorphan
I did in fact experience problems with 10140 and core a3 just this past week as well. I was assigned it on a core2 duo. WU was Project 10140, Run 51, Clone 6, Gen 11 downloaded April 21 2013.

17:25:17:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:10140 run:51 clone:6 gen:11 core:0xa3 unit:0x0000000f0a3b1e6f5149edd521856ba3
17:25:18:WU01:FS00:Downloading project 10140 description
17:25:18:WU01:FS00:Connecting to fah-web.stanford.edu:80
17:25:18:WU01:FS00:Project 10140 description downloaded successfully
Some time later this core starts as the last WU finishes and begins uploading.

18:25:22:WU01:FS00:Starting
18:25:22:WU01:FS00:Running FahCore: "(FAH_WRAPPER)" "(CORE_A3)" -dir 01 -suffix 01 -version 702 -lifeline 4080 -checkpoint 15 -np 2
18:25:23:WU01:FS00:Started FahCore on PID 3768
18:25:24:WU01:FS00:Core PID:3220
18:25:24:WU01:FS00:FahCore 0xa3 started
18:25:24:WU01:FS00:0xa3:
18:25:24:WU01:FS00:0xa3:*------------------------------*
18:25:24:WU01:FS00:0xa3:Folding@Home Gromacs SMP Core
18:25:24:WU01:FS00:0xa3:Version 2.27 (Dec. 15, 2010)
18:25:24:WU01:FS00:0xa3:
18:25:24:WU01:FS00:0xa3:Preparing to commence simulation
18:25:24:WU01:FS00:0xa3:- Looking at optimizations...
18:25:24:WU01:FS00:0xa3:- Created dyn
18:25:24:WU01:FS00:0xa3:- Files status OK
18:25:24:WU01:FS00:0xa3:- Expanded 969985 -> 2021624 (decompressed 208.4 percent)
18:25:24:WU01:FS00:0xa3:Called DecompressByteArray: compressed_data_size=969985 data_size=2021624, decompressed_data_size=2021624 diff=0
18:25:24:WU01:FS00:0xa3:- Digital signature verified
18:25:24:WU01:FS00:0xa3:
18:25:24:WU01:FS00:0xa3:Project: 10140 (Run 51, Clone 6, Gen 11)
18:25:24:WU01:FS00:0xa3:
18:25:24:WU01:FS00:0xa3:Assembly optimizations on if available.
18:25:24:WU01:FS00:0xa3:Entering M.D.
So far so good, I guess? But then...

18:25:30:WU01:FS00:0xa3:Mapping NT from 2 to 2 
18:25:31:WU01:FS00:0xa3:Completed 0 out of 2000000 steps  (0%)
19:29:53:WU01:FS00:0xa3:Completed 20000 out of 2000000 steps  (1%)
(Many progress iterations roughly every 60-63 min)
******************************** Date: 21/04/13 ********************************
******************************** Date: 22/04/13 ********************************
******************************** Date: 23/04/13 ********************************
******************************** Date: 24/04/13 ********************************
******************************** Date: 25/04/13 ********************************
00:54:35:WU01:FS00:0xa3:Completed 1500000 out of 2000000 steps  (75%)
01:58:22:WU01:FS00:0xa3:Completed 1520000 out of 2000000 steps  (76%)
03:02:23:WU01:FS00:0xa3:Completed 1540000 out of 2000000 steps  (77%)
04:06:52:WU01:FS00:0xa3:Completed 1560000 out of 2000000 steps  (78%)
05:12:10:WU01:FS00:0xa3:Completed 1580000 out of 2000000 steps  (79%)
06:18:06:WU01:FS00:0xa3:Completed 1600000 out of 2000000 steps  (80%)
******************************** Date: 25/04/13 ********************************
07:24:03:WU01:FS00:0xa3:Completed 1620000 out of 2000000 steps  (81%)
08:29:53:WU01:FS00:0xa3:Completed 1640000 out of 2000000 steps  (82%)
09:35:11:WU01:FS00:0xa3:Completed 1660000 out of 2000000 steps  (83%)
10:40:07:WU01:FS00:0xa3:Completed 1680000 out of 2000000 steps  (84%)
11:45:31:WU01:FS00:0xa3:Completed 1700000 out of 2000000 steps  (85%)
...when suddenly...

12:19:50:ERROR:Exception: Accessing './work/01/wuinfo_01.dat': Not enough quota is available to process this command.
.....
...
..
.
...every 1 to 5 seconds, for around 36 hours, generating 10+ megs log file. Then disaster.

22:20:56:WARNING:WU01:FS00:FahCore returned an unknown error code which probably indicates that it crashed
22:20:56:WARNING:WU01:FS00:FahCore returned: UNKNOWN_ENUM (-1073740777 = 0xc0000417)
22:21:02:WU01:FS00:Starting
22:21:02:WU01:FS00:Running FahCore:  "(FAH_WRAPPER)" "(CORE_A3)" -dir 01 -suffix 01 -version 702 -lifeline 4080 -checkpoint 15 -np 2
22:21:19:WU01:FS00:Started FahCore on PID 2172
22:21:27:WU01:FS00:Core PID:3028
22:21:27:WU01:FS00:FahCore 0xa3 started
22:21:28:ERROR:WU01:FS00:
22:21:28:ERROR:WU01:FS00:-------------------------------------------------------
22:21:28:ERROR:WU01:FS00:Program Folding@home, VERSION 4.5.4
22:21:28:ERROR:WU01:FS00:Source code file: gromacs-4.5.4\src\gmxlib\gmxfio.c, line: 519
22:21:28:ERROR:WU01:FS00:
22:21:28:ERROR:WU01:FS00:Can not open file:
22:21:28:ERROR:WU01:FS00:./work/01/wudata_01.tpr
22:21:28:ERROR:WU01:FS00:For more information and tips for troubleshooting, please check the GROMACS
22:21:28:ERROR:WU01:FS00:website at http://www.gromacs.org/Documentation/Errors
22:21:28:ERROR:WU01:FS00:-------------------------------------------------------
22:21:28:ERROR:WU01:FS00:
22:21:28:ERROR:WU01:FS00:Thanx for Using GROMACS - Have a Nice Day
22:21:28:ERROR:Exception: Accessing './work/01/wuinfo_01.dat': Not enough quota is available to process this command.
22:21:28:Server connection id=2 ended
22:21:28:Server connection id=20 ended
22:21:28:Server connection id=21 ended
22:21:28:Server connection id=22 ended
22:21:31:WARNING:WU01:FS00:FahCore returned an unknown error code which probably indicates that it crashed
22:21:32:WARNING:WU01:FS00:FahCore returned: UNKNOWN_ENUM (-1073741502 = 0xc0000142)
22:22:02:WU01:FS00:Starting
22:22:02:WU01:FS00:Running FahCore:  "(FAH_WRAPPER)" "(CORE_A3)" -dir 01 -suffix 01 -version 702 -lifeline 4080 -checkpoint 15 -np 2
22:22:02:WU01:FS00:Started FahCore on PID 2896
22:22:03:WU01:FS00:Core PID:2900
22:22:03:WU01:FS00:FahCore 0xa3 started
22:22:05:WARNING:WU01:FS00:FahCore returned an unknown error code which probably indicates that it crashed
22:22:05:WARNING:WU01:FS00:FahCore returned: UNKNOWN_ENUM (-1073741502 = 0xc0000142)
22:23:40:WU01:FS00:Starting
22:23:40:WU01:FS00:Running FahCore:  "(FAH_WRAPPER)" "(CORE_A3)" -dir 01 -suffix 01 -version 702 -lifeline 4080 -checkpoint 15 -np 2
22:23:40:WU01:FS00:Started FahCore on PID 220
This all happened totally unexpectedly, and only upon checking the machine did I see that FAH appeared to be crashing with Windows exceptions. Strange things were also going on with CoreTemp, which was running at the time. What actually tipped me off was an overtemp alarm that tripped: for some reason one core sensor had pegged at 100C (tjMax for the processor). I attempted a couple of restarts and saw the same problems, with the temp reading instantly spiking from 45-50 to "100C(?)" soon after starting the core, which soon crashed. I finally left that box shut down for a couple of days until I could get to addressing it, and then found the problem had "fixed itself": the WU was apparently dumped and I am now running an 8089, with no problems from a4 or CoreTemp.
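For anyone reading the log, the two "UNKNOWN_ENUM" values are just Windows status codes printed as signed 32-bit integers. A minimal sketch in Python of the conversion follows; the helper name `to_ntstatus_hex` is made up for illustration, and the status names are the ones I believe Microsoft's NTSTATUS reference lists for these values, so treat them as a pointer for further lookup rather than a diagnosis.

```python
def to_ntstatus_hex(signed_code):
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    return f"0x{signed_code & 0xFFFFFFFF:08x}"

# Names as I understand them from Microsoft's NTSTATUS reference (assumption).
KNOWN_STATUS = {
    "0xc0000417": "STATUS_INVALID_CRUNTIME_PARAMETER",
    "0xc0000142": "STATUS_DLL_INIT_FAILED",
}

# The two return codes reported by FahCore in the log.
for code in (-1073740777, -1073741502):
    hex_code = to_ntstatus_hex(code)
    print(code, hex_code, KNOWN_STATUS.get(hex_code, "unknown"))
```

The hex values produced match the ones the client already printed next to each code.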

In summation WTF? Sorry for the long post but I hope the above can be of use in diagnosing a possible problem lurking in a3.

Re: 10138/10139/10140 EUE

Posted: Mon Apr 29, 2013 6:08 am
by bruce
DexterThorphan wrote:In summation WTF? Sorry for the long post but I hope the above can be of use in diagnosing a possible problem lurking in a3.
I'd say the probability that it's a problem lurking in a3 is terribly close to zero. A fast rise to excessive temperatures cannot be caused by software, but rather by some kind of failure in the cooling system. Maybe the fan stalled. Maybe the HS came loose. Maybe the VRs allowed the voltage to reach improper levels. There probably are other things that might have happened, but the cooling system has to be designed to keep the chip from overheating, even if the software manages to get the hardware to 99.99% utilization.