Project 2499: Run 191, Clone 3, Gen 1

Moderators: Site Moderators, FAHC Science Team

bapriebe
Posts: 44
Joined: Sun Apr 20, 2008 8:33 am
Hardware configuration: HP xw4600 workstation (4GB)+Q9650+Sapphire Vapor-X HD4890,
HP Z600 workstation (4GB)+2xXEON E5540+Sapphire HD5770,
HP ML350 server (4GB)+2xXEON E5520+Diamond HD3850
Location: Ottawa, Ontario

Project 2499: Run 191, Clone 3, Gen 1

Post by bapriebe »

This WU fails reproducibly. At last count it has aborted with error 0x77 (UNKNOWN_ERROR) six times in a row, each time about 20 minutes past the 7% completion mark. (The other 7 CPU clients running on the same machine show no problems at all.)

[17:44:57] Working on queue slot 08 [July 5 17:44:57 UTC]
[17:44:57] + Working ...
[17:44:57] - Calling '.\FahCore_78.exe -dir work/ -suffix 08 -nocpulock -checkpoint 15 -verbose -lifeline 5780 -version 623'

[17:44:58] 
[17:44:58] *------------------------------*
[17:44:58] Folding@Home Gromacs Core
[17:44:58] Version 1.90 (March 8, 2006)
[17:44:58] 
[17:44:58] Preparing to commence simulation
[17:44:58] - Looking at optimizations...
[17:44:58] - Files status OK
[17:45:02] - Expanded 2772266 -> 15008001 (decompressed 541.3 percent)
[17:45:35] 
[17:45:35] Project: 2499 (Run 191, Clone 3, Gen 1)
[17:45:35] 
[17:45:35] Assembly optimizations on if available.
[17:45:35] Entering M.D.
[17:45:56] (Starting from checkpoint)
[17:45:56] Protein: Translocon_ALX2
[17:45:56] 
[17:45:56] Writing local files
[17:45:56] Completed 8800 out of 500000 steps  (2%)
[17:45:57] Extra SSE boost OK.
[17:51:24] Writing local files
[17:51:24] Completed 10000 out of 500000 steps  (2%)
[18:15:48] Writing local files
[18:15:48] Completed 15000 out of 500000 steps  (3%)
[18:42:25] Writing local files
[18:42:25] Completed 20000 out of 500000 steps  (4%)
[19:10:43] Writing local files
[19:10:43] Completed 25000 out of 500000 steps  (5%)
[19:41:17] Writing local files
[19:41:17] Completed 30000 out of 500000 steps  (6%)
[20:13:25] Writing local files
[20:13:25] Completed 35000 out of 500000 steps  (7%)
[20:32:39] 
[20:32:39] Folding@home Core Shutdown: UNKNOWN_ERROR
[20:32:42] CoreStatus = 77 (119)
[20:32:42] Client-core communications error: ERROR 0x77
[20:32:42] This is a sign of more serious problems, shutting down.
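
For anyone who wants to check their own log for the same pattern, here is a minimal sketch in Python. It is not part of the Folding@home client; the `summarize` helper and its regular expressions are assumptions based only on the line formats visible in the log above. It pulls out the last reported progress line and the core status, and confirms that the two numbers on the CoreStatus line are the same value (hex 77 is decimal 119).

```python
import re

# Hypothetical helper, not part of the Folding@home client. The patterns are
# guesses based on the line formats seen in the log posted above.
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")
STATUS = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")

def summarize(log_text):
    """Return (last_step, total_steps, status_hex, status_dec); missing values are None."""
    last_step = total = status_hex = status_dec = None
    for line in log_text.splitlines():
        m = PROGRESS.search(line)
        if m:
            # Keep overwriting so we end up with the last progress report.
            last_step, total = int(m.group(1)), int(m.group(2))
        m = STATUS.search(line)
        if m:
            status_hex, status_dec = m.group(1), int(m.group(2))
    return last_step, total, status_hex, status_dec

sample = """\
[20:13:25] Completed 35000 out of 500000 steps  (7%)
[20:32:39] Folding@home Core Shutdown: UNKNOWN_ERROR
[20:32:42] CoreStatus = 77 (119)
"""

step, total, code_hex, code_dec = summarize(sample)
# The first number reads as hexadecimal, the one in parentheses as decimal.
assert int(code_hex, 16) == code_dec == 0x77
print(f"Last checkpoint: step {step} of {total} "
      f"({100 * step / total:.0f}%), status 0x{code_hex}")
```

Running it against a full log file would just mean passing the file contents to `summarize` instead of the three-line sample.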
susato
Site Moderator
Posts: 511
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX

Re: Project 2499: Run 191, Clone 3, Gen 1

Post by susato »

Thanks for reporting the problem. If you haven't deleted it already, go ahead (run the client with the -delete xx flag, where xx is its queue position) and try for another one.
DeeGee
Posts: 61
Joined: Thu Oct 02, 2008 1:15 pm
Hardware configuration: Asus Crosshair Hero VIII, AMD Ryzen 3950x, 2x8GB 3600MHz DDR4, Radeon VII, Win10
Asus Crosshair Hero VII, Amd Ryzen 3900x, 2x16GB 3200MHz DDR4, GeForce 980 TI, Kubuntu 19.10
Location: Finland

Re: Project 2499: Run 191, Clone 3, Gen 1

Post by DeeGee »

I'm confirming this. The same unit has been haunting me for some time now. If I delete the queue/workunit files I still keep getting it... :evil:

[edit] Yep, I have deleted the same workunit several times now from the queue and the server still tries to give it to me. Annoying.
[edit] After several tries the server finally gave me something else to crunch. Let's see if the same unit comes to haunt me again later.
Teddy
Posts: 134
Joined: Tue Feb 12, 2008 3:05 am
Location: Canberra, Australia

Re: Project 2499: Run 191, Clone 3, Gen 1

Post by Teddy »

I wish Stanford would delete this rogue unit. One of our members has suffered multiple failures at 7% with it.

[16:40:54] Project: 2499 (Run 191, Clone 3, Gen 1)
[16:40:54]
[16:40:55] Assembly optimizations on if available.
[16:40:55] Entering M.D.
[16:41:02] Protein: Translocon_ALX2
[16:41:02]
[16:41:03] Writing local files
[16:41:05] Extra SSE boost OK.
[16:41:06] Writing local files
[16:41:06] Completed 0 out of 500000 steps (0%)
[17:26:37] Writing local files
[17:26:37] Completed 5000 out of 500000 steps (1%)
[18:14:30] Writing local files
[18:14:30] Completed 10000 out of 500000 steps (2%)
[19:04:46] Writing local files
[19:04:46] Completed 15000 out of 500000 steps (3%)
[20:07:10] Writing local files
[20:07:10] Completed 20000 out of 500000 steps (4%)
[21:22:23] Writing local files
[21:22:24] Completed 25000 out of 500000 steps (5%)
[22:42:57] Writing local files
[22:42:59] Completed 30000 out of 500000 steps (6%)
[00:08:07] Writing local files
[00:08:08] Completed 35000 out of 500000 steps (7%)
[01:04:42]
[01:04:43] Folding@home Core Shutdown: UNKNOWN_ERROR
[01:04:46] CoreStatus = 77 (119)
[01:04:46] Client-core communications error: ERROR 0x77
[01:04:46] This is a sign of more serious problems, shutting down.

It always fails at 7%, and if he stops it early and restarts it, it throws the same error.
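
As a rough cross-check, the timestamps in the two logs in this thread can be used to estimate which step each run had reached when the core shut down. The sketch below is Python with hypothetical helper names (`elapsed`, `estimate_failure_step`), and it assumes the step rate after the 7% mark matched the rate over the preceding 5000 steps, which is only an approximation since the core may also report the error some time after it occurs.

```python
from datetime import datetime, timedelta

def elapsed(start, end):
    """Seconds between two HH:MM:SS log stamps, allowing one wrap past midnight."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta.total_seconds()

def estimate_failure_step(prev_stamp, last_stamp, fail_stamp, last_step, interval=5000):
    """Extrapolate the step reached at shutdown.

    Assumption: the run kept the same step rate it had over the last
    completed interval (prev_stamp to last_stamp).
    """
    rate = interval / elapsed(prev_stamp, last_stamp)  # steps per second
    return last_step + rate * elapsed(last_stamp, fail_stamp)

# Timestamps copied from the two logs in this thread:
# 6% line, 7% line, and the "Core Shutdown" line.
first_log = estimate_failure_step("19:41:17", "20:13:25", "20:32:39", 35000)
second_log = estimate_failure_step("22:42:59", "00:08:08", "01:04:43", 35000)

print(f"First log:  shutdown near step {first_log:.0f}")
print(f"Second log: shutdown near step {second_log:.0f}")
```

Under that assumption both runs come out close to step 38000 despite very different machine speeds, which would be consistent with a problem in the work unit itself rather than in either machine.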

This "rogue" unit has been in circulation for OVER 2 months now. Can't it be REMOVED from the server instead of being handed out again and again, only to fail repeatedly and cause immense frustration? The user eventually removed the -advmethods flag on my advice and grabbed a different unit.

Teddy
ppetrone
Pande Group Member
Posts: 115
Joined: Wed Dec 12, 2007 6:20 pm
Location: Stanford

Re: Project 2499: Run 191, Clone 3, Gen 1

Post by ppetrone »

The unit has been deleted. I apologize for the delay and the inconvenience.

Paula
Teddy
Posts: 134
Joined: Tue Feb 12, 2008 3:05 am
Location: Canberra, Australia

Re: Project 2499: Run 191, Clone 3, Gen 1

Post by Teddy »

Thanks, Paula.

I'll let my team member know. He'll put the -advmethods flag back on and keep a close eye on things.

Teddy