Project: 6601(Run6, Gen201, Clone22)

Moderators: Site Moderators, FAHC Science Team

Mr. Scary
Posts: 35
Joined: Fri Jul 04, 2008 7:13 pm
Hardware configuration: The Rock-XPS630i/QX6850@3.67 GHz/OCZ Plat 1066 2x2 GB/(2)8800GT AlphaDogs/(2)10k RPM Raptors RAID 0 & (2)WD HDD 500 GB
Nekid-CoolerMaster CM690/MSI-P6N Diamond/Q6600@3.0 GHz/Samsung 2x2 GB 800 MHz/(2)9800GTs/TX2 HSF/Corsair TX850W PSU/(2)WD 160 GB HDDs RAID 0, (1)500 GB WD & (1)1.5 TB WD HDD
Bare Nekid-MSI-P6N Diamond/E7200@3.2 GHz/OCZ ReaperX 1000 MHz 2x2 GB/(2)8800GTs/Corsair TX750W PSU/160 GB WD HDD
Buck Nekid-Xion XON-303/Dell'd 650i mobo/Q6600@3.15 GHz/Naya 4x1 GB 667 MHz/(2)8800GTs SLI/750W PSU/320 GB WD HDD

Project: 6601(Run6, Gen201, Clone22)

Post by Mr. Scary »

This WU runs anywhere from 8% to 18% and then fails with UNSTABLE_MACHINE, over and over, until the client hits EUE and pauses for 24 hours.

The machine is running two 9800GX2s with four GPU clients.
It is standard and stable, and it has been running for over 3 weeks without a hiccup,
crunching anything and everything.
I deleted the FAH log(s), MyFolding, the queue, and all work files.
The server then sends the same d@mn WU back.

Pull this WU, please.
Given the technology that Stanford and we have, and the fact that the server(s) can recognize a particular folder (actually, everyone who is folding), why oh why do we get assigned the same d@mn WU?

:::::::Equipment/Server Conversation::::::

Equipment: "hey, I love folding, but this work unit has errors"
Server: "hey equipment, thanks for folding with us, let me send you some more work"
Equipment: "thanks server, I'm ready, send me something"
Server: "ok equipment, here you go" <<<sendsendsend>>>
Equipment: <<<receivereceivereceive>>> (talking to itself) hey, this is the same work unit I just had trouble with, well, guess I'll try it again.
Result: same error
Equipment: "hey server, here's what I finished, send me a different WU, as either this one is bad or I just can't run it"
CURRENTLY:
Server: "here ya go, equipment, I'm sending you the same work unit, oh well if it doesn't run!"
FUTURE/SHOULD BE:
Server: "hey equipment, thanks for the work you've finished. Either this WU is bad or it's just not gonna run on you, sooooo, since I have the capability to identify you, here's a different WU. Keep up the good work, and thanks again for folding!"


Result: More science continues to get done!

Hmmmmmmmmmmm!!!!!

M$
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 6601(Run6, Gen201, Clone22)

Post by bruce »

Obviously you didn't copy/paste the PRCG values into your title. Perhaps you mean Project: 6601 (Run 6, Clone 22, Gen 201), or perhaps you mean Project: 6601 (Run 6, Clone 201, Gen 22).

Project: 6601 (Run 6, Clone 22, Gen 201) has not yet been completed, but I doubt it has even been issued, since Project: 6601 (Run 6, Clone 22, Gen 200) has not been completed either, and Gen 201 is customarily issued from the results of Gen 200.

Project: 6601 (Run 6, Clone 201, Gen 22) has been successfully completed by a number of people. I suspect this is the one you had trouble with, and each time you failed to complete it, it was assigned to someone else. How many times would that be?

Of course, there's one more option for your suggested dialog that you didn't consider:
FUTURE/SHOULD BE:
Server: "hey equipment, thanks for the work you tried to finish. Either this WU is bad or your hardware has a problem, sooooo, we have already sent this WU to someone else, and we'll wait to see what they can do with it. I have the capability to identify you, and if they complete it successfully, we'll mark your hardware as bad and shut down your system until you figure out how to fix it. Thanks again for folding! But we really need reliable hardware if we're going to get good scientific results."

Result: More science continues to get done with fewer errors!

Of course that answer might not be popular with the donors, but there is a certain truth to it.
Mr. Scary

Re: Project: 6601(Run6, Gen201, Clone22)

Post by Mr. Scary »

Thanks for the response, I think.
You are correct that I didn't copy/paste into the title, and I stand corrected on flip-flopping Clone and Gen. Your second guess was correct:
Project: 6601 (Run 6, Clone 201, Gen 22)
[14:55:13]
[14:55:13] Assembly optimizations on if available.
[14:55:13] Entering M.D.
[14:55:18] Tpr hash work/wudata_01.tpr: 1553726238 4085261526 2586447857 1828628838 865053125
[14:55:18]
[14:55:18] Calling fah_main args: 14 usage=100
[14:55:18]
[14:55:19] Working on Protein
[14:55:22] Client config found, loading data.
[14:55:22] Starting GUI Server
[14:56:45] Completed 1%
[14:58:00] Completed 2%
[14:59:14] Completed 3%
[15:00:28] Completed 4%
[15:01:42] Completed 5%
[15:02:56] Completed 6%
[15:04:10] Completed 7%
[15:05:24] Completed 8%
[15:06:38] Completed 9%
[15:07:52] Completed 10%
[15:09:06] Completed 11%
[15:10:21] Completed 12%
[15:11:35] Completed 13%
[15:12:49] Completed 14%
[15:14:03] Completed 15%
[15:15:17] Completed 16%
[15:16:29] Completed 17%
[15:16:29] mdrun_gpu returned
[15:16:29] NANs detected on GPU

Send it to someone else, great. The server probably should have done that instead of sending it to me 4-5 times in a row. And that was just this session, not counting the times before the previous 24-hour pause.

As for marking my hardware bad? Kind of assumptive/arrogant on the server's part is how I take that comment. I guess the part about running 3+ weeks without a hiccup doesn't matter?
Great, and congrats to the others who are running it. Wouldn't it be just as easy to flag the hardware as DON'T SEND THIS WU (or THESE WUs) HERE,
as opposed to trashing the other, possibly RELIABLE, science being produced by that rig/user/DONOR?
No problem, I'll shut that one down, along with my 10-12 other clients, if you think it's unreliable, as well as the other 2 builds/FOLDERS currently in assembly. I doubt my puny contribution will be missed; my teams can pick up the 100K PPD elsewhere, I have no doubt. With this economy, I'm sure I can find other uses for the time, $$, and electricity.

Then again, who does that benefit?

Thanks again for the response.
Takin' a deep breath now and pausing.

M$