Project: 2652 (Run 0 Clone 236 Gen 23) Again

Moderators: Site Moderators, FAHC Science Team

ArVee
Posts: 121
Joined: Sun Dec 02, 2007 9:25 am

Project: 2652 (Run 0 Clone 236 Gen 23) Again

Post by ArVee »

2652(0,236,23) is a bad WU. It EUE's ("Gromacs cannot continue further") at the identical spot just after 13 frames complete, even after changing to a sharply reduced OC for the third attempt. Further, qfix doesn't seem to serve the purpose of getting the partial results u/l'd so that the unit can be identified and removed and the next person doesn't have to sit through three more instances of the same thing. It SAID it had done some fixing, but if it did, nothing reflected in the points, so I don't think anything made it in to Stanford.

There seems to be a lot of this with 2652. I know it taxes the system, but identical failure points say to me at least that it's another bad WU. If correct, why so many on 2652?
Last edited by 7im on Mon Dec 31, 2007 5:17 am, edited 2 times in total.
Reason: Changed the topic 1652->2652
ChelseaOilman
Posts: 1037
Joined: Sun Dec 02, 2007 3:47 pm
Location: Colorado @ 10,000 feet

Re: 2652 Again

Post by ChelseaOilman »

It does seem to be a bad WU. Multiple people have received partial credit for their effort.

You're among them:

Hi ArVee (team 328),
Your WU (P2652 R0 C236 G23) was added to the stats database on 2007-12-30 00:57:55 for 668.2 points of credit.
ArVee
Posts: 121
Joined: Sun Dec 02, 2007 9:25 am

Re: 2652 Again

Post by ArVee »

Thank you for checking that. I just noticed the points showing up late at EOC before I checked back here. It's good they made it in, so at least there's a record of the problem WU.
Qeldroma
Posts: 4
Joined: Mon Dec 31, 2007 4:34 pm

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Post by Qeldroma »

I'd like to confirm this. It's happening to our team too: the WUs fail three times at the same point, but where it fails is different for each of us. Team Link

Thanks for the qfix info.
al2
Posts: 10
Joined: Tue Jan 01, 2008 3:48 pm
Location: U.K.

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Post by al2 »

Well, I've just had Project: 2652 *but* Run 0, Clone 430, Gen 44, and my system is all stock and likely very stable with the Windows SMP client, since I've never had any issues like this before (that I can remember) since I started folding last summer. (I occasionally get hanging clients associated with the net connection, I think, but this isn't a problem with regular monitoring.)

Here's my FAHlog, should it be of any use:
[14:08:08] Completed 550000 out of 1000000 steps (55 percent)
[14:23:49] Writing local files
[14:23:49] Completed 560000 out of 1000000 steps (56 percent)
[14:39:29] Writing local files
[14:39:29] Completed 570000 out of 1000000 steps (57 percent)
[14:51:10] Warning: long 1-4 interactions
[14:51:10] Gromacs cannot continue further.
[14:51:10] Going to send back what have done.
[14:51:10] logfile size: 353037
[14:51:10] - Writing 353573 bytes of core data to disk...
[14:51:11] ... Done.
[14:51:11] - Failed to delete work/wudata_06.arc
[14:51:11] No C.P. to delete.
[14:51:11] - Failed to delete work/wudata_06.dyn
[14:51:11] - Failed to delete work/wudata_06.chk
[14:51:11] - Failed to delete work/wudata_06.sas
[14:51:11] - Failed to delete work/wudata_06.goe
[14:51:11] - Failed to delete work/wudata_06.xvg
[14:51:11] Warning: check for stray files
[14:51:11]
[14:51:11] Folding@home Core Shutdown: EARLY_UNIT_END
[14:51:11]
[14:51:11] Folding@home Core Shutdown: EARLY_UNIT_END
[14:51:17] CoreStatus = 7B (123)
[14:51:17] Client-core communications error: ERROR 0x7b
[14:51:17] Deleting current work unit & continuing...
[14:53:21] - Preparing to get new work unit...
[14:53:21] + Attempting to get work packet
[14:53:21] - Connecting to assignment server
[14:53:22] - Successful: assigned to (171.64.65.64).
[14:53:22] + News From Folding@Home: Welcome to Folding@Home
[14:53:22] Loaded queue successfully.
[14:53:27] + Closed connections
[14:53:32]
[14:53:32] + Processing work unit
[14:53:32] Core required: FahCore_a1.exe
[14:53:32] Core found.
[14:53:32] Working on Unit 07 [January 1 14:53:32]
[14:53:32] + Working ...
[14:53:33]
[14:53:33] *------------------------------*
[14:53:33] Folding@Home Gromacs SMP Core
[14:53:33] Version 1.74 (March 10, 2007)
[14:53:33]
[14:53:33] Preparing to commence simulation
[14:53:33] - Ensuring status. Please wait.
[14:53:33] Created dyn
[14:53:33] - Files status OK
[14:53:33] this execution.
[14:53:33] - Files status OK
[14:53:34] mpressed 507.5 percent)
[14:53:34] - Starting from initial work packet
[14:53:34]
[14:53:34] Project: 2652 (Run 0, Clone 430, Gen 44)
[14:53:34]
[14:53:34] : 2652 (Run 0, Clone 430, Gen 44)
[14:53:34]
[14:53:34] ble.
[14:53:34] Entering M.D.
[14:53:51] al work pa- Starting from initial work packet
[14:53:51]
[14:53:51] Project: 2652 (Run 0, Clone 430, Gen 44)
[14:53:51]
[14:53:51] Entering M.D.
[14:53:58] rotein
[14:53:58] Writing local files
[14:53:58] cal files
[14:53:58] boost OK.
[14:53:58] Writing local files
[14:53:58] Completed 0 out of 1000000 steps (0 percent)
[15:09:39] Writing local files
[15:09:39] Completed 10000 out of 1000000 steps (1 percent)
[15:28:27] Writing local files
[15:28:27] Completed 20000 out of 1000000 steps (2 percent)
[15:49:30] Writing local files
Folding on XP MCE 32-bit in a Dell 9200 machine (stock) with:

C2D E6600

2GB DDR 533MHz (Kingston)
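
For anyone comparing failure points across machines, here is a minimal Python sketch that pulls the last completed step and the CoreStatus code out of a FAHlog like the one above. It assumes only the line shapes visible in that excerpt ("Completed N out of M steps", "Gromacs cannot continue further", "CoreStatus = XX (n)", "Project: ... (Run ..., Clone ..., Gen ...)"). The function name and the dictionary keys are made up for illustration and are not part of any F@h tool.

```python
import re

# Patterns follow the line shapes seen in the posted log excerpt.
STEP_RE = re.compile(r"Completed (\d+) out of (\d+) steps")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
FAIL_TEXT = "Gromacs cannot continue further"


def summarize_failures(log_text):
    """Return one record per 'Gromacs cannot continue further' event.

    Hypothetical helper: each record holds the (project, run, clone, gen)
    tuple if a Project line was seen, the last completed step, and the
    CoreStatus value reported after the failure.
    """
    failures = []
    unit = None
    last_step = None
    for line in log_text.splitlines():
        m = PROJECT_RE.search(line)
        if m:
            # A new work unit starts; forget progress from the previous one.
            unit = tuple(int(g) for g in m.groups())
            last_step = None
            continue
        m = STEP_RE.search(line)
        if m:
            last_step = int(m.group(1))
            continue
        if FAIL_TEXT in line:
            failures.append(
                {"unit": unit, "last_step": last_step, "core_status": None}
            )
            continue
        m = STATUS_RE.search(line)
        if m and failures and failures[-1]["core_status"] is None:
            # The log prints the status in hex, e.g. 7B, which is 123 decimal.
            failures[-1]["core_status"] = int(m.group(1), 16)
    return failures
```

Because the excerpt above starts after the Project line of the failing unit, the record for that failure would carry no unit tuple; a full log would be needed to fill it in.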
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Post by bruce »

Please see this recent post: viewtopic.php?f=8&t=571&start=0&st=0&sk=t&sd=a

I only see one log posted. Are all the errors 0x7b?

Can anyone explain why the WU fails at the same point for user A but repeatedly fails at a different point for User B?
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Post by 7im »

bruce wrote:Can anyone explain why the WU fails at the same point for user A but repeatedly fails at a different point for User B?
A fails at same point, and B fails at same "other" point?
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Qeldroma
Posts: 4
Joined: Mon Dec 31, 2007 4:34 pm

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Post by Qeldroma »

7im wrote:
bruce wrote:Can anyone explain why the WU fails at the same point for user A but repeatedly fails at a different point for User B?
A fails at same point, and B fails at same "other" point?
On the same step. Some crap out 3 times at step 0; another I know of and I both hung 3 times on step 3; another reported step 5. It's consistently 3 times on the same step, but not always at the same step number.

The client finally gave up on it, reloaded the executable, then swimmingly started and finished a 2653. Haven't seen it again in a while, but will post a log if it happens again. And yes, Bruce, they're all 7Bs.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 2652 (Run 0 Clone 236 Gen 23) Again

Post by bruce »

Qeldroma wrote:And yes, Bruce, they're all 7Bs.
I'm really disturbed about this whole 0x7b situation. It appears to be a catch-all category with multiple causes and I don't know any good way to isolate them in a way that allows them to be fixed. Some are certainly issues that could be handled by the software; some are not. Virtually none of them are reproducible on different hardware.

The other overlapping issue that bothers me is the situation where many people have an EUE (often 3x or 4x) and then somebody manages to complete the WU successfully, taking all the pressure off of fixing whatever was wrong (and providing the all-too-easy answer: "It was just due to unstable hardware", even though that may also be one of the possible causes).

I wrote a response to this thread where it looked like we had captured a repeatable EUE and then somebody managed to complete it. Unfortunately due to a sql error, the 20 minutes I spent composing that response was wasted when the data was lost somewhere between my machine and the forum. (I'll probably write it again soon, but not now.)

I'd welcome any input on how to debug these issues and divide them into manageable categories. Unfortunately, both overlapping issues are too big to be tackled without a realistic plan that involves a list of specific known issues plus a way to divide the issues into categories small enough that a single good programmer can attack them. (I know how difficult this can be. Several of us worked hard for many, many months to isolate an erratum on certain AMD CPUs, and once it was documented and fixed, several of us got nice certificates, and one still hangs on my wall.)
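
One possible first cut at such categories, as a rough Python sketch only: group EUE reports by work unit and by user, then check whether each user fails at a single repeatable step and whether that step agrees across users. The report layout (user, unit, step) and the three label strings are assumptions made up for this example; nothing here reflects how the servers actually track failures.

```python
from collections import defaultdict


def classify_units(reports):
    """Label each work unit by how its failure steps repeat.

    reports: iterable of (user, unit, step) tuples, where unit is any
    hashable identifier such as (project, run, clone, gen).
    Hypothetical helper; the labels are illustrative, not official.
    """
    steps_by_unit = defaultdict(lambda: defaultdict(list))
    for user, unit, step in reports:
        steps_by_unit[unit][user].append(step)

    labels = {}
    for unit, per_user in steps_by_unit.items():
        # Does every user fail at one and only one step?
        repeatable = all(len(set(steps)) == 1 for steps in per_user.values())
        if not repeatable:
            labels[unit] = "varies within a user"
            continue
        distinct_steps = {steps[0] for steps in per_user.values()}
        if len(distinct_steps) == 1:
            labels[unit] = "same step for everyone"
        else:
            labels[unit] = "repeatable per user, differs between users"
    return labels
```

A unit labeled "same step for everyone" would be the strongest candidate for a bad WU, while "varies within a user" looks more like unstable hardware. The middle category is the puzzling one described earlier in this thread. Note that a unit with only one reporting user also lands in the first category, so the count of users matters when reading the result.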