Failing units, low ppd, and returned units.

If you're new to FAH and need help getting started or you have very basic questions, start here.

Moderators: Site Moderators, FAHC Science Team

Scarlet-Tech
Posts: 37
Joined: Tue Nov 10, 2015 9:54 pm

Failing units, low ppd, and returned units.

Post by Scarlet-Tech »

I am not new to folding, but I am new to the forums. I have been on the EVGA forums and folding team for 2 years now.

My system upgrades have always been about bettering the work units completed and building a beautiful system. When I set it up, I usually make sure that my system is folding and working well when I am not gaming.

Well, I am currently out of state, and I am watching my PPD crash and work units fail constantly while my system continues to burn electricity like there is no tomorrow. Normally, I get 1.42-1.54m PPD and 20-30 work units completed per day. Lately, I am getting 14 or so work units completed per day, and the PPD has dropped to 1.2m at the high point. This all started after the EVGA folding challenge for November began. There are lots of failed units reported, and I was wondering if Stanford has any update as to what is going on.

I will be frank. I have avoided these forums since I started folding, having heard that Stanford loves to blame Nvidia for failed units. Since I have been using the same system and the problem arose suddenly without my changing anything, especially nothing from Nvidia, it would seem that Stanford is sending out tons of bad work units. I have never had more than a couple of failed units on the drivers I am using now, but it would seem that Stanford has still been pointing the finger away from themselves.

Could Stanford please look into this issue? It's obviously not Nvidia, since nothing on my system changed, only the work units received.

Since Stanford is getting free use of thousands of computers, looking into this would be beneficial so that they can get more research completed. This is for everyone's benefit, but Stanford gets to use our hardware for free. They can remove the obviously bad and failing units to find what is causing them to fail, and send out units with fewer issues.

Also, I hear the mods like to send warnings for threads like this on these forums. If you feel a warning needs to be sent, please send a good explanation as to what I have done wrong. I will copy and post this on the other forums that deal with folding and see if they can provide insight as well.


If Stanford wants free hardware to do the work for them, they should invest the time to make sure everything is smooth for those that are helping them.

System (everything is 100% stock. No overclocks, thanks to units failing constantly):

I7 5960x, Rampage V Extreme
Four 980 K|ngp|ns driver 355 (not new, downgraded from 357)
FAH Client 4.4
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failing units, low ppd, and returned units.

Post by bruce »

I don't see any evidence that Stanford is "sending out tons of bad work units" -- only that something has happened to your system that we can only guess about. (Without the information requested in my sig, all we can do is guess.)

I don't know if this information will be useful, but scarlet_tech apparently is running 30 slots. (Counting twice for recent reinstalls.) The last WUs returned from 27 of them all seem to have earned reasonable points. The last WUs from three of them received 0 points.
2015-09-05 22:05:50 p10495 r30 c2 g33
2015-09-28 04:07:00 p9835 r68 c6 g9
2015-10-06 18:07:55 p9430 r56 c2 g111

Only the last one appears to be recent. That particular WU was reassigned and successfully completed by someone else about 9.6 hours later so it's not a bad WU.

What information can you provide to explain which work units or systems are failing "constantly" and what kind of failures are they?
Scarlet-Tech
Posts: 37
Joined: Tue Nov 10, 2015 9:54 pm

Re: Failing units, low ppd, and returned units.

Post by Scarlet-Tech »

Hi Bruce,

Currently I am in Arizona (PC in Delaware).

Folding 4 slots, all 980's.

My issues come from checking here: http://folding.extremeoverclocking.com/ ... =&u=654307

Notice that 1 week ago, I was completing many more units than I am now. I am trying to determine if there is a hardware failure or if this is the work units themselves.

I would post more links, but they are limited for new users, so I won't be able to just yet.

Currently, EVGA folding team is trying to break 3 billion points in one month, and the forums have been bustling with many users getting lots of failed units. When they started posting this information, my ppd started dropping as mentioned above.

I can see, just at this latest update, that 3 work units completed over the 3 hour time frame. This is a good thing, but going from 20+ completed work units per day to 10-14 is concerning for me.

Once I get home, 11 days from now, I will be able to post a long log of errors, as my roommate is unable to figure out how to do it. I apologize for not being able to provide the logs, but since there are many users on the EVGA forums with the same hardware who are providing the information, I will try to keep an eye open for the exact work units they are experiencing the errors with.


*edit/addition* could you provide the link where you are able to see completed and reassigned units so I would be able to watch that?
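
For anyone who wants to put rough numbers on the drop described in this thread, here is a minimal Python sketch. It only uses the figures quoted in the posts above; taking the midpoint of each quoted range is an assumption made for illustration, not something measured.

```python
# Figures quoted in the posts above. Using range midpoints is an
# illustrative assumption, not measured data.
normal_ppd = (1.42e6 + 1.54e6) / 2   # midpoint of the usual 1.42-1.54m PPD
current_ppd = 1.2e6                  # recent high point

normal_wus = (20 + 30) / 2           # midpoint of the usual 20-30 WUs per day
current_wus = (10 + 14) / 2          # midpoint of the recent 10-14 WUs per day


def percent_drop(before, after):
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


ppd_drop = percent_drop(normal_ppd, current_ppd)
wu_drop = percent_drop(normal_wus, current_wus)

# 3 WUs completed in one 3-hour stats update, extrapolated to 24 hours.
extrapolated_wus_per_day = 3 / 3 * 24

print(f"PPD drop: {ppd_drop:.1f}%")
print(f"WUs/day drop: {wu_drop:.1f}%")
print(f"WUs/day if the latest update rate held: {extrapolated_wus_per_day:.0f}")
```

The two measures need not move together, since credit per WU varies by project. A single 3-hour window is also a very small sample, so the extrapolated rate is only a rough indication.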
Ricky
Posts: 474
Joined: Sat Aug 01, 2015 1:34 am
Hardware configuration: 1. 2 each E5-2630 V3 processors, 64 GB RAM, GTX980SC GPU, and GTX980 GPU running on windows 8.1 operating system.
2. I7-6950X V3 processor, 32 GB RAM, 1 GTX980tiFTW, and 2 each GTX1080FTW GPUs running on windows 8.1 operating system.
Location: New Mexico

Re: Failing units, low ppd, and returned units.

Post by Ricky »

Scarlet_tech,

I went down to driver 347.88 and have had fewer problems. I have not detected a bad WU in the 10 days that I have been folding with this driver.
Scarlet-Tech
Posts: 37
Joined: Tue Nov 10, 2015 9:54 pm

Re: Failing units, low ppd, and returned units.

Post by Scarlet-Tech »

Ricky, I am on Windows 10, so 352 is the earliest driver I can go back to, unfortunately. I may try to get a new win 7 key when I get home if that will provide stability to the folding system.
Scarlet-Tech
Posts: 37
Joined: Tue Nov 10, 2015 9:54 pm

Re: Failing units, low ppd, and returned units.

Post by Scarlet-Tech »

bruce wrote:I don't see any evidence that Stanford is "sending out tons of bad work units" -- only that something has happened to your system that we can only guess about. (Without the information requested in my sig, all we can do is guess.)

I don't know if this information will be useful, but scarlet_tech apparently is running 30 slots. (Counting twice for recent reinstalls.) The last WU returned from 27 of them all seem to have earned reasonable points. The last WU from three of them have received 0 points.
2015-09-05 22:05:50 p10495 r30 c2 g33
2015-09-28 04:07:00 p9835 r68 c6 g9
2015-10-06 18:07:55 p9430 r56 c2 g111

Only the last one appears to be recent. That particular WU was reassigned and successfully completed by someone else about 9.6 hours later so it's not a bad WU

Could you please provide a link where you see this information, please?
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failing units, low ppd, and returned units.

Post by bruce »

scarlet_tech wrote:Could you please provide a link where you see this information, please?
Sorry, no. The Pande Group has restricted that data to forum Moderators only.
Scarlet-Tech
Posts: 37
Joined: Tue Nov 10, 2015 9:54 pm

Re: Failing units, low ppd, and returned units.

Post by Scarlet-Tech »

bruce wrote:
scarlet_tech wrote:Could you please provide a link where you see this information, please?
Sorry, no. The Pande Group has restricted that data to forum Moderators only.
So, it is OK to post stuff like that, but not share the link to it. Makes sense. Wouldn't want the truth out there I guess.

I will share some posts from EVGA, since they aren't hidden and may be helpful.

Since forum Moderators can look up information, my name here does not match my folding name. My folding name is Scarlet-Tech user 654307 according to extreme over clocking. The results you pulled were for scarlet_tech. I mistyped when entering my forum name.

Mekhed wrote: I'm gonna say that you're not wrong. Both of my machines have been rock solid for months folding. I had what I expected the first 3 days of the challenge and on day 4 I also dropped about 300k ppd. The last 7 days have been a struggle just to get WU's to finish and not be returned as "bad work units". I've changed drivers and lowered video card memory speeds and still having problems. You're not wrong Scarlet, something changed on day 4
Scott over at bjorn3d is even reporting having to go back multiple drivers in an attempt to find stable drivers.

Here is another user with the same hardware as me, who can report his errors that are occurring.

*********************** Log Started 2015-11-09T20:53:12Z ***********************
20:54:09:WU00:FS01:0x21:ERROR:Potential energy error of 805.531, threshold of 10
20:54:09:WU00:FS01:0x21:ERROR:Reference Potential Energy: -1.23368e+006 | Given Potential Energy: -1.23287e+006
20:54:10:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)


I can go through 30 pages of conversation and copy and paste all of the information, but I can not post links, or I would just link the article and which pages to view.

The information is available to show that it isn't just one or two people experiencing the issue; our team's numbers dropped substantially with the same number of folders pushing out units. This started on November 4th, and has been continuous since then.

I understand you are a moderator, and that I can not provide my own stats, but the evidence is overwhelming and comes from more than one team. I am just raising the issue, as it needs to be corrected. Since we have 20 people continuously trying to find an actual solution, it would be good to have the support of Stanford in this venture.


I will continue to post edits to this thread and provide more failed units from other members as they post them, so that it can't be ignored since our entire team and other teams are experiencing this issue:

11:25:26:WU02:FS01:0x21:Completed 1300000 out of 2000000 steps (65%)
11:25:33:WU02:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint
11:25:33:WU02:FS01:0x21:Max number of retries reached. Aborting.
11:25:33:WU02:FS01:0x21:ERROR:Max Retries Reached
11:25:33:WU02:FS01:0x21:Saving result file logfile_01.txt
11:25:33:WU02:FS01:0x21:Saving result file log.txt
11:25:33:WU02:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
11:25:34:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
11:25:34:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9625 run:1 clone:1 gen:44




21:39:06:WU00:FS01:0x21:ERROR:Bad platformId size.
21:39:07:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)



23:11:11:WU02:FS03:0x21:ERROR:exception: Error downloading array velm: clEnqueueReadBuffer (-5)
23:11:11:WU02:FS03:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
23:11:12:WARNING:WU02:FS03:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:11:12:WU02:FS03:Sending unit results: id:02 state:SEND error:FAULTY project:9704 run:64 clone:18 gen:72 core:0x21 unit:0x00000063ab404162553ec5d398a809a5



980HC
*********************** Log Started 2015-11-10T01:01:06Z ***********************
******************************* Date: 2015-11-10 *******************************
10:20:52:WARNING:WU02:FS03:Server did not like results, dumping
******************************* Date: 2015-11-10 *******************************
10:20:52:WU02:FS03:Upload 99.93%
10:20:52:WU02:FS03:Upload complete
10:20:52:WU02:FS03:Server responded WORK_QUIT (404)
10:20:52:WARNING:WU02:FS03:Server did not like results, dumping
10:20:52:WU02:FS03:Cleaning up


980HB
*********************** Log Started 2015-11-10T01:01:47Z ***********************
******************************* Date: 2015-11-10 *******************************
11:54:03:WARNING:WU00:FS01:Server did not like results, dumping
12:21:14:WU02:FS01:0x21:ERROR:Max Retries Reached
12:21:14:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:22:07:WU00:FS00:0x21:ERROR:Max Retries Reached
12:22:08:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
******************************* Date: 2015-11-10 *******************************
15:03:28:WU00:FS00:0x21:ERROR:Max Retries Reached
15:03:29:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)

11:53:58:WU00:FS01:Upload 95.92%
11:54:03:WU00:FS01:Upload complete
11:54:03:WU00:FS01:Server responded WORK_QUIT (404)
11:54:03:WARNING:WU00:FS01:Server did not like results, dumping
11:54:03:WU00:FS01:Cleaning up

12:14:03:WU02:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint
12:15:27:WU02:FS01:0x21:Completed 120000 out of 2000000 steps (6%)
12:16:52:WU02:FS01:0x21:Completed 140000 out of 2000000 steps (7%)
12:18:16:WU02:FS01:0x21:Completed 160000 out of 2000000 steps (8%)
12:19:41:WU02:FS01:0x21:Completed 180000 out of 2000000 steps (9%)
12:21:06:WU02:FS01:0x21:Completed 200000 out of 2000000 steps (10%)
12:21:14:WU02:FS01:0x21:Bad State detected... attempting to resume from last good checkpoint
12:21:14:WU02:FS01:0x21:Max number of retries reached. Aborting.
12:21:14:WU02:FS01:0x21:ERROR:Max Retries Reached
12:21:14:WU02:FS01:0x21:Saving result file logfile_01.txt
12:21:14:WU02:FS01:0x21:Saving result file log.txt
12:21:14:WU02:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
12:21:14:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:21:14:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9630 run:1 clone:23 gen:46 core:0x21 unit:0x00000041ab436c9b5609bee22119aafb
12:21:14:WU02:FS01:Uploading 9.50KiB to 171.67.108.155
12:21:14:WU02:FS01:Connecting to 171.67.108.155:8080
12:21:14:WU02:FS01:Upload complete
12:21:14:WU02:FS01:Server responded WORK_ACK (400)
12:21:14:WU02:FS01:Cleaning up

12:22:00:WU00:FS00:0x21:Completed 100000 out of 2000000 steps (5%)
12:22:07:WU00:FS00:0x21:Bad State detected... attempting to resume from last good checkpoint
12:22:07:WU00:FS00:0x21:Max number of retries reached. Aborting.
12:22:07:WU00:FS00:0x21:ERROR:Max Retries Reached
12:22:07:WU00:FS00:0x21:Saving result file logfile_01.txt
12:22:07:WU00:FS00:0x21:Saving result file log.txt
12:22:07:WU00:FS00:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
12:22:08:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:22:08:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:9629 run:0 clone:23 gen:37 core:0x21 unit:0x0000002fab436c9b5609bee23824a870
12:22:08:WU00:FS00:Uploading 8.50KiB to 171.67.108.155
12:22:08:WU00:FS00:Connecting to 171.67.108.155:8080
12:22:08:WU00:FS00:Upload complete
12:22:08:WU00:FS00:Server responded WORK_ACK (400)
12:22:08:WU00:FS00:Cleaning up

15:03:21:WU00:FS00:0x21:Completed 300000 out of 2000000 steps (15%)
15:03:28:WU00:FS00:0x21:Bad State detected... attempting to resume from last good checkpoint
15:03:28:WU00:FS00:0x21:Max number of retries reached. Aborting.
15:03:28:WU00:FS00:0x21:ERROR:Max Retries Reached
15:03:28:WU00:FS00:0x21:Saving result file logfile_01.txt
15:03:28:WU00:FS00:0x21:Saving result file log.txt
15:03:28:WU00:FS00:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
15:03:29:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
15:03:29:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:9634 run:1 clone:40 gen:14 core:0x21 unit:0x00000015ab436c9b5609bee3eb2b8f6f
15:03:29:WU00:FS00:Uploading 10.00KiB to 171.67.108.155
15:03:29:WU00:FS00:Connecting to 171.67.108.155:8080
15:03:29:WU00:FS00:Upload complete
15:03:29:WU00:FS00:Server responded WORK_ACK (400)
15:03:29:WU00:FS00:Cleaning up






These are all just a tiny example of errors that are occurring now, and I am trying to get all EVGA folders on board to post every single bad unit that is received across all platforms. The above listed platforms are nearly identical to my system.
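
As a quick sanity check on the potential-energy lines in the logs above, the sketch below recomputes the difference from the two printed energies. It assumes the core compares the absolute difference between the given and reference energies against the threshold; the log does not say exactly how the reported figure is derived, so treat that as an assumption.

```python
# Values copied from the quoted log lines (note the Windows-style exponent).
reference = float("-1.23368e+006")
given = float("-1.23287e+006")
reported_error = 805.531
threshold = 10.0

# Assumption: the reported error is the absolute difference between the two
# energies. The printed energies are rounded to six significant figures, so
# the recomputed value can only be expected to agree roughly.
recomputed = abs(given - reference)

print(f"recomputed difference: {recomputed:.0f}")
print(f"reported error exceeds threshold: {reported_error > threshold}")
print(f"reported error / threshold: {reported_error / threshold:.1f}")
```

Each printed energy could be off by about 5 units from rounding, so a recomputed difference within about 10 of the reported 805.531 is consistent with that reading. Either way, the reported error is far above the threshold of 10.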
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failing units, low ppd, and returned units.

Post by bruce »

scarlet_tech wrote:I will share some posts from EVGA, since they aren't hidden and may be helpful.
Some are helpful, many are not.
scarlet_tech wrote:Since forum Moderators can look up information, my name here does not match my folding name. My folding name is Scarlet-Tech user 654307 according to extreme over clocking. The results you pulled were for scarlet_tech. I mistyped when entering my forum name.
There's no requirement that your names match, but when I found numerous reports under the name you gave me, I made a (reasonable?) assumption. My bad.

If you want me to correct your mis-typed name, send me a PM.
Mekhed wrote: I'm gonna say that you're not wrong. Both of my machines have been rock solid for months folding. I had what I expected the first 3 days of the challenge and on day 4 I also dropped about 300k ppd. The last 7 days have been a struggle just to get WU's to finish and not be returned as "bad work units". I've changed drivers and lowered video card memory speeds and still having problems. You're not wrong Scarlet, something changed on day 4
"rock solid for months" is NOT the same as not overclocked, and words to that effect suggest that the machine is overclocked but has been stable on previous assignments. If the GPU is overclocked you're responsible for it, not Stanford, as they do not support overclocking. Some of the new projects do use the hardware more effectively, leading to a higher than normal failure rate for machines which are overclocked -- especially overclocked VRAM on Maxwell hardware.

Several of the new projects have been intentionally restricted to client-type=beta. [The reason they're identified as beta is because they're more likely to encounter instabilities. That warning has not changed, even though the failure rate can change.] If the projects that you're reporting are beta WUs and they happen to be unstable, remove beta from your configuration until you figure out how to keep your hardware stable.
*********************** Log Started 2015-11-09T20:53:12Z ***********************
20:54:09:WU00:FS01:0x21:ERROR:Potential energy error of 805.531, threshold of 10
20:54:09:WU00:FS01:0x21:ERROR:Reference Potential Energy: -1.23368e+006 | Given Potential Energy: -1.23287e+006
20:54:10:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
Reports like this are essentially meaningless. I can only guess at the missing information since no WU is identified, no FahCore is identified, and the system being used is not documented per my previous instructions. Yes, an instability has been encountered but I hesitate to speculate about the cause beyond what I've already said.
scarlet_tech wrote:The information is available to show that isn't just one of two people experiencing the issue, but our team number dropped substantially with the same number of folders pushing out units. This started on November 4th,and has been continuous since then.

I will continue to post edits to this thread and provide more failed units from other members as they post them, so that it can't be ignored since our entire team and other teams are experiencing this issue.
If there's a competition between one team's hardware which is overclocked and another team's hardware which is stable, guess which one will win the competition.

I have discarded all reports where I can't even determine which project or server is associated with the report (let alone having any information about the system being used).
11:25:34:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
11:25:34:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9625 run:1 clone:1 gen:44

23:11:11:WU02:FS03:0x21:ERROR:exception: Error downloading array velm: clEnqueueReadBuffer (-5)
23:11:12:WARNING:WU02:FS03:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
23:11:12:WU02:FS03:Sending unit results: id:02 state:SEND error:FAULTY project:9704 run:64 clone:18 gen:72 core:0x21
11:53:58:WU00:FS01:Upload 95.92%
11:54:03:WU00:FS01:Upload complete
11:54:03:WU00:FS01:Server responded WORK_QUIT (404)
11:54:03:WARNING:WU00:FS01:Server did not like results, dumping
11:54:03:WU00:FS01:Cleaning up
There are reports of this sort of error with 171.64.65.56. Assignments from the server have been suspended until the problem can be resolved. (I'm assuming this was the server involved ... if that's not true, then I have no explanation.)
12:21:14:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:21:14:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9630 run:1 clone:23 gen:46 core:0x21
12:21:14:WU02:FS01:Uploading 9.50KiB to 171.67.108.155
12:21:14:WU02:FS01:Connecting to 171.67.108.155:8080

12:22:08:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:22:08:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:9629 run:0 clone:23 gen:37 core:0x21

15:03:29:WARNING:WU00:FS00:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
15:03:29:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:9634 run:1 clone:40 gen:14 core:0x21

scarlet_tech wrote:These are all just a tiny example of errors that are occurring now, and I am trying to get all EVGA folders on board to post every single bad unit that is received across all platforms.. The above listed platforms are nearly identical to my system.
Summary:
FAH has no way to deal with unstable hardware except to reassign the WU to someone else. When a project is aborted, the error is noted and often partial credit is awarded. When the reissued WU is successfully completed, full credit is granted and FAH moves on to the next WU (assuming that the first machine was unstable and the one who completed it was stable.) Here's what I see for the WUs mentioned above.

project:9625 run:1 clone:1 gen:44
Error reported by Team: 111065 (partial credit) and successfully completed by (team 86565) for full credit

project:9704 run:64 clone:18 gen:72
Partial credit (error) to Team: 111065 and Team: 13531. Full credit (no error) awarded to Team: 161747 (Third try)

project:9630 run:1 clone:23 gen:46
Partial points awarded to Team: 111065 and to Team: 111065 and full points awarded to Team: 32

project:9629 run:0 clone:23 gen:37
Partial points awarded to Team: 37651 and full points awarded to Team: 111065

project:9634 run:1 clone:40 gen:14
Partial points awarded to Team: 37651 and to Team: 111065 and full points awarded to Team: 224497

I've only reported the team numbers. Some of the names associated with the failures seem to be repeated, but I won't report that unless the person themselves asks.

This research has taken me almost an hour, but it does seem to indicate that several machines are marginally stable and they can't handle the increased utilization that these projects are seeking. This certainly is not the first time that FAH has created a more stressful benchmark than the benchmark routines commonly used by overclockers.
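
Since a report is only traceable when it carries the project, run, clone and gen, here is a small Python sketch for pulling those out of a pasted log before posting. The regular expression is based only on the "Sending unit results" lines quoted in this thread, and the function name `faulty_units` is made up for illustration. Other client versions may format the line differently.

```python
import re
from collections import Counter

# Pattern inferred from the "Sending unit results" lines quoted in this
# thread; it is an assumption that other client versions look the same.
FAULTY_RE = re.compile(
    r"error:FAULTY project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)"
)


def faulty_units(log_text):
    """Return (project, run, clone, gen) for every FAULTY result line."""
    return [
        tuple(int(n) for n in match.groups())
        for match in FAULTY_RE.finditer(log_text)
    ]


# Lines copied from the logs quoted above.
sample = """\
11:25:34:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9625 run:1 clone:1 gen:44
23:11:12:WU02:FS03:Sending unit results: id:02 state:SEND error:FAULTY project:9704 run:64 clone:18 gen:72 core:0x21
12:21:14:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9630 run:1 clone:23 gen:46 core:0x21
12:22:08:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:9629 run:0 clone:23 gen:37 core:0x21
15:03:29:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:9634 run:1 clone:40 gen:14 core:0x21
"""

units = faulty_units(sample)
per_project = Counter(project for project, _run, _clone, _gen in units)

for project, run, clone, gen in units:
    print(f"project:{project} run:{run} clone:{clone} gen:{gen}")
```

A list like this, posted together with the system description, gives a moderator something to look up instead of a bare BAD_WORK_UNIT line.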
Scarlet-Tech
Posts: 37
Joined: Tue Nov 10, 2015 9:54 pm

Re: Failing units, low ppd, and returned units.

Post by Scarlet-Tech »

bruce wrote: This research has taken me almost an hour, but it does seem to indicate that several machines are marginally stable and they can't handle the increased utilization that these projects are seeking. This certainly is not the first time that FAH has created a more stressful benchmark than the benchmark routines commonly used by overclockers.

None of my hardware is overclocked.

I think the one with the most information was the one that had previously been overclocked, but they have lowered everything back to stock, so I will talk to them and see if they can lower them even more.

Again, the only thing I had overclocked on my system was the CPU, as there is little or no point in overclocking 4 GPUs on a daily system. That overclock has already been removed. I do understand that overclocking causes system instability.

I have requested the users to post log files that we can update, as this is the only public information they have passed on at this time.


I know links are limited for new members, so I am breaking this up so it will come through to a Google Drive link.

drive(.)google(.)com/folderview?id=0BylHzRH2Ab3FTUtXN0tfeGRPYXM&usp=sharing
Scarlet-Tech
Posts: 37
Joined: Tue Nov 10, 2015 9:54 pm

Re: Failing units, low ppd, and returned units.

Post by Scarlet-Tech »

15:13:02:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
15:13:02:WU02:FS01:0x21:Version 0.0.12
15:13:04:WU00:FS01:Upload 5.07%
15:13:10:WU00:FS01:Upload 10.15%
15:13:16:WU00:FS01:Upload 15.22%
15:13:23:WU00:FS01:Upload 21.31%
15:13:29:WU00:FS01:Upload 27.40%
15:13:35:WU00:FS01:Upload 32.48%
15:13:41:WU00:FS01:Upload 37.55%
15:13:41:WU02:FS01:0x21:ERROR:exception: bad allocation
15:13:41:WU02:FS01:0x21:Saving result file logfile_01.txt
15:13:41:WU02:FS01:0x21:Saving result file log.txt
15:13:41:WU02:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
15:13:42:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
15:13:42:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9205 run:16 clone:52 gen:7 core:0x21 unit:0x00000026664f2dd055d4d0238de55484
15:13:42:WU02:FS01:Uploading 2.27KiB to 171.64.65.104
15:13:42:WU02:FS01:Connecting to 171.64.65.104:8080
15:13:42:WU03:FS01:Connecting to 171.67.108.45:80
15:13:43:WU02:FS01:Upload complete
15:13:43:WU02:FS01:Server responded WORK_ACK (400)

15:13:59:WU03:FS01:0x21:Folding@home GPU Core21 Folding@home Core
15:13:59:WU03:FS01:0x21:Version 0.0.12
15:14:01:WU00:FS01:Upload 55.82%
15:14:07:WU00:FS01:Upload 60.89%
15:14:13:WU00:FS01:Upload 65.97%
15:14:19:WU00:FS01:Upload 71.04%
15:14:25:WU00:FS01:Upload 76.11%
15:14:32:WU00:FS01:Upload 82.20%
15:14:39:WU00:FS01:Upload 88.29%
15:14:39:WU03:FS01:0x21:ERROR:exception: bad allocation
15:14:39:WU03:FS01:0x21:Saving result file logfile_01.txt
15:14:39:WU03:FS01:0x21:Saving result file log.txt
15:14:39:WU03:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
15:14:40:WARNING:WU03:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
15:14:40:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:9205 run:3 clone:33 gen:9 core:0x21 unit:0x00000039664f2dd055d4c97cfcd240bc
15:14:40:WU03:FS01:Uploading 2.27KiB to 171.64.65.104
15:14:40:WU03:FS01:Connecting to 171.64.65.104:8080
15:14:40:WU02:FS01:Connecting to 171.67.108.45:80
15:14:40:WU03:FS01:Upload complete
15:14:41:WU03:FS01:Server responded WORK_ACK (400)

20:36:46:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
20:36:46:WU02:FS02:0x21:Version 0.0.12
20:36:49:WU01:FS02:Upload 5.03%
20:36:55:WU01:FS02:Upload 9.34%
20:37:01:WU01:FS02:Upload 14.37%
20:37:07:WU01:FS02:Upload 19.40%
20:37:13:WU01:FS02:Upload 23.71%
20:37:19:WU01:FS02:Upload 28.02%
20:37:20:WU02:FS02:0x21:ERROR:exception: bad allocation
20:37:20:WU02:FS02:0x21:Saving result file logfile_01.txt
20:37:20:WU02:FS02:0x21:Saving result file log.txt
20:37:20:WU02:FS02:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
20:37:20:WARNING:WU02:FS02:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
20:37:20:WU02:FS02:Sending unit results: id:02 state:SEND error:FAULTY project:9206 run:0 clone:1351 gen:11 core:0x21 unit:0x00000040664f2dd056202ac9970f0f5c
20:37:20:WU02:FS02:Uploading 2.27KiB to 171.64.65.104
20:37:20:WU02:FS02:Connecting to 171.64.65.104:8080
20:37:20:WU03:FS02:Connecting to 171.67.108.45:80
20:37:21:WU03:FS02:Assigned to work server 171.64.65.58
20:37:21:WU03:FS02:Requesting new work unit for slot 02: READY gpu:1:GK104 [GeForce GTX 770] from 171.64.65.58
20:37:21:WU03:FS02:Connecting to 171.64.65.58:8080
20:37:22:WU03:FS02:Downloading 883.89KiB
20:37:24:WU02:FS02:Upload complete
20:37:24:WU02:FS02:Server responded WORK_ACK (400)

06:26:04:WU00:FS01:0x21:Folding@home GPU Core21 Folding@home Core
06:26:04:WU00:FS01:0x21:Version 0.0.12
06:26:05:WU01:FS01:Upload 6.31%
06:26:11:WU01:FS01:Upload 12.62%
06:26:17:WU01:FS01:Upload 18.93%
06:26:23:WU01:FS01:Upload 25.24%
06:26:29:WU01:FS01:Upload 31.55%
06:26:35:WU01:FS01:Upload 37.87%
06:26:41:WU01:FS01:Upload 44.18%
06:26:47:WU01:FS01:Upload 50.49%
06:26:53:WU01:FS01:Upload 57.85%
06:26:59:WU01:FS01:Upload 64.16%
06:27:05:WU01:FS01:Upload 70.47%
06:27:10:WU00:FS01:0x21:ERROR:exception: bad allocation
06:27:10:WU00:FS01:0x21:Saving result file logfile_01.txt
06:27:10:WU00:FS01:0x21:Saving result file log.txt
06:27:10:WU00:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
06:27:10:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
06:27:10:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:9207 run:0 clone:22 gen:32 core:0x21 unit:0x00000038664f2dd055e91e2ca7f835bb
06:27:11:WU00:FS01:Uploading 2.28KiB to 171.64.65.104
06:27:11:WU00:FS01:Connecting to 171.64.65.104:8080
06:27:11:WU02:FS01:Connecting to 171.67.108.45:80
06:27:11:WU01:FS01:Upload 76.78%
06:27:11:WU00:FS01:Upload complete
06:27:11:WU00:FS01:Server responded WORK_ACK (400)

03:54:56:WU02:FS01:0x21:Folding@home GPU Core21 Folding@home Core
03:54:56:WU02:FS01:0x21:Version 0.0.12
03:55:01:WU00:FS01:Upload 12.89%
03:55:07:WU00:FS01:Upload 19.33%
03:55:13:WU00:FS01:Upload 25.78%
03:55:19:WU00:FS01:Upload 32.22%
03:55:25:WU00:FS01:Upload 38.67%
03:55:31:WU00:FS01:Upload 46.19%
03:55:37:WU00:FS01:Upload 52.63%
03:55:43:WU00:FS01:Upload 59.08%
03:55:49:WU00:FS01:Upload 65.52%
03:55:55:WU00:FS01:Upload 71.97%
03:56:00:WU02:FS01:0x21:ERROR:exception: bad allocation
03:56:00:WU02:FS01:0x21:Saving result file logfile_01.txt
03:56:00:WU02:FS01:0x21:Saving result file log.txt
03:56:00:WU02:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
03:56:00:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
03:56:00:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9209 run:0 clone:50 gen:15 core:0x21 unit:0x00000025664f2dd055edef6ef0fe2de2
03:56:00:WU02:FS01:Uploading 2.27KiB to 171.64.65.104
03:56:00:WU02:FS01:Connecting to 171.64.65.104:8080
03:56:01:WU03:FS01:Connecting to 171.67.108.45:80
03:56:01:WU02:FS01:Upload complete
03:56:01:WU02:FS01:Server responded WORK_ACK (400)

13:52:20:WU02:FS01:0x18:Folding@home Core Shutdown: FINISHED_UNIT
13:52:21:WU02:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
13:52:21:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:9430 run:212 clone:9 gen:20 core:0x18 unit:0x00000016ab40413855475025d9ebb6d9
13:52:21:WU02:FS01:Uploading 24.02MiB to 171.64.65.56
13:52:21:WU02:FS01:Connecting to 171.64.65.56:8080
13:52:21:WU00:FS01:Starting
13:52:21:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Download/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_18.fah/FahCore_18.exe -dir 00 -suffix 01 -version 704 -lifeline 2016 -checkpoint 3 -gpu 0 -gpu-vendor nvidia
13:52:21:WU00:FS01:Started FahCore on PID 9584
13:52:21:WU00:FS01:Core PID:7240
13:52:21:WU00:FS01:FahCore 0x18 started
13:52:22:WU00:FS01:0x18:*********************** Log Started 2015-11-10T13:52:22Z ***********************
13:52:22:WU00:FS01:0x18:Project: 10486 (Run 0, Clone 22, Gen 56)
13:52:22:WU00:FS01:0x18:Unit: 0x0000005c538b3dbb54aec97e35fdfd8b
13:52:22:WU00:FS01:0x18:CPU: 0x00000000000000000000000000000000
13:52:22:WU00:FS01:0x18:Machine: 1
13:52:22:WU00:FS01:0x18:Reading tar file state.xml
13:52:23:WU00:FS01:0x18:Reading tar file system.xml
13:52:24:WU00:FS01:0x18:Reading tar file integrator.xml
13:52:24:WU00:FS01:0x18:Reading tar file core.xml
13:52:24:WU00:FS01:0x18:Digital signatures verified
13:52:24:WU00:FS01:0x18:Folding@home GPU core18
13:52:24:WU00:FS01:0x18:Version 0.0.4
13:52:27:WU02:FS01:Upload 23.15%
13:52:33:WU02:FS01:Upload 58.01%
13:52:39:WU02:FS01:Upload 94.96%
13:52:40:WU02:FS01:Upload complete
13:52:40:WU02:FS01:Server responded WORK_QUIT (404)
13:52:40:WARNING:WU02:FS01:Server did not like results, dumping
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failing units, low ppd, and returned units.

Post by bruce »

The error message "bad allocation" is new to me so I'd like to gather as much information as possible to pass on to the developers.
Here's an edited transcript of what you posted.
scarlet_tech wrote:
15:13:02:WU02:FS01:0x21:Version 0.0.12
15:13:41:WU02:FS01:0x21:ERROR:exception: bad allocation
15:13:41:WU02:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
15:13:42:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9205 run:16 clone:52 gen:7 core:0x21

15:14:39:WU03:FS01:0x21:ERROR:exception: bad allocation
15:14:39:WU03:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
15:14:40:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:9205 run:3 clone:33 gen:9 core:0x21

20:37:20:WU02:FS02:0x21:ERROR:exception: bad allocation
20:37:20:WU02:FS02:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
20:37:20:WU02:FS02:Sending unit results: id:02 state:SEND error:FAULTY project:9206 run:0 clone:1351 gen:11 core:0x21

06:27:10:WU00:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
06:27:10:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
06:27:10:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:9207 run:0 clone:22 gen:32 core:0x21

03:56:00:WU02:FS01:0x21:ERROR:exception: bad allocation
03:56:00:WU02:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
03:56:00:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9209 run:0 clone:50 gen:15 core:0x21
13:52:40:WU02:FS01:Upload complete
13:52:40:WU02:FS01:Server responded WORK_QUIT (404)
13:52:40:WARNING:WU02:FS01:Server did not like results, dumping

13:52:20:WU02:FS01:0x18:Folding@home Core Shutdown: FINISHED_UNIT
13:52:21:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR project:9430 run:212 clone:9 gen:20 core:0x18

13:52:22:WU00:FS01:0x18:Project: 10486 (Run 0, Clone 22, Gen 56)
13:52:24:WU00:FS01:0x18:Version 0.0.4
A) An unknown issue that repeats: exception: bad allocation
1: FAULTY project:9205 run:16 clone:52 gen:7 core:0x21
2: FAULTY project:9205 run:3 clone:33 gen:9 core:0x21
3: FAULTY project:9206 run:0 clone:1351 gen:11 core:0x21
4: FAULTY project:9207 run:0 clone:22 gen:32 core:0x21
5: FAULTY project:9209 run:0 clone:50 gen:15 core:0x21

B) Some other unknown issue
03:56:00:WU02:FS01:0x21:ERROR:exception: bad allocation
FAULTY project:9209 run:0 clone:50 gen:15 core:0x21
Server did not like results, dumping

C) Some WUs do complete successfully
NO_ERROR project:9430 run:212 clone:9 gen:20 core:0x18

D) A new WU has started and may or may not complete successfully.
Project: 10486 (Run 0, Clone 22, Gen 56)

Questions (you may have already answered some of these, but please confirm them in one place):
What OS are you running ... including 32-bit or 64-bit?
What else is running?
Provide a detailed description of the hardware being used, including clock rates.
Is there any possibility of unexpected conditions (like disk-full)?
Which drivers are being used?

A1:Several failures. One Completion.
A2:Several failures. Still being redistributed.
A3:Several failures. One Completion.
A4:Several failures. One Completion.
A5:Several failures. Still being redistributed.
Same or different?
B1:Several failures. Still being redistributed

These WUs all show too high a failure rate. Development will be interested in attempting to reproduce and diagnose the failures.

C & D: something is strange here. More investigation is needed.
Scarlet-Tech
Posts: 37
Joined: Tue Nov 10, 2015 9:54 pm

Re: Failing units, low ppd, and returned units.

Post by Scarlet-Tech »

I am trying to make sure I get these onto Google Docs and such so that I don't completely clog the forum. I appreciate you passing along the info.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Failing units, low ppd, and returned units.

Post by bruce »

Scarlet-Tech wrote:
So, it is OK to post stuff like that, but not share the link to it. Makes sense. Wouldn't want the truth out there I guess.
You must spend a lot of your time looking for conspiracies under every bush. As you can see from my posts, there's no problem getting the truth from a server that's overburdened.

If it were opened to everyone, it would be swamped with a multitude of requests from a multitude of Donors. Collectively, the Mods submit perhaps 30 transactions a week. If it were open to everybody, it would be getting perhaps 30 hits per hour and it simply isn't designed to handle that kind of load. Even so, I've seen it take several minutes to respond to a fairly simple request -- but I don't choose to gripe about the minor inconveniences in life.
Kebast
Posts: 386
Joined: Thu Aug 06, 2015 5:21 pm

Re: Failing units, low ppd, and returned units.

Post by Kebast »

Is there an easy tool available to parse log files? I don't have the time to check every WU on my machines every day, but I'd gladly drop the logs into a parsing tool and submit those results. I've not caught any WU failures that weren't my fault, but that's not to say I haven't missed any. I take it from the discussion above (and the fact that beta testers exist) that even though the information might be available on the server, it's not being reviewed or reported for analysis? In any case, if there's an easy way I can help with that, I'd be glad to.
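As a rough illustration of what such a parsing tool could look like, here is a minimal sketch that pulls the "Sending unit results" lines out of a client log and lists the units that did not report NO_ERROR. It assumes the line format seen in the logs quoted earlier in this thread; the names `UnitResult`, `parse_results`, and `faulty_units` are made up for this example and are not part of any FAH tool.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumed format, taken from the log lines quoted in this thread:
# HH:MM:SS:WUnn:FSnn:Sending unit results: id:.. state:.. error:X project:P run:R clone:C gen:G core:0xNN ...
RESULT_RE = re.compile(
    r"(\d\d:\d\d:\d\d):WU\d+:FS\d+:Sending unit results: "
    r".*?error:(\w+) project:(\d+) run:(\d+) clone:(\d+) gen:(\d+) "
    r"core:(0x[0-9a-fA-F]+)"
)


class UnitResult(NamedTuple):
    """One 'Sending unit results' record (hypothetical helper type)."""
    time: str
    error: str
    project: int
    run: int
    clone: int
    gen: int
    core: str


def parse_results(lines: Iterable[str]) -> List[UnitResult]:
    """Return one UnitResult per 'Sending unit results' line; skip all other lines."""
    results = []
    for line in lines:
        m = RESULT_RE.search(line)
        if m:
            time, error, project, run, clone, gen, core = m.groups()
            results.append(
                UnitResult(time, error, int(project), int(run), int(clone), int(gen), core)
            )
    return results


def faulty_units(lines: Iterable[str]) -> List[UnitResult]:
    """Return only the results whose error field is not NO_ERROR."""
    return [r for r in parse_results(lines) if r.error != "NO_ERROR"]


# Sample lines copied from the log posted in this thread.
sample = [
    "03:56:00:WU02:FS01:0x21:ERROR:exception: bad allocation",
    "03:56:00:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY "
    "project:9209 run:0 clone:50 gen:15 core:0x21 "
    "unit:0x00000025664f2dd055edef6ef0fe2de2",
    "13:52:21:WU02:FS01:Sending unit results: id:02 state:SEND error:NO_ERROR "
    "project:9430 run:212 clone:9 gen:20 core:0x18 "
    "unit:0x00000016ab40413855475025d9ebb6d9",
]

for r in faulty_units(sample):
    print(f"{r.error} project:{r.project} run:{r.run} clone:{r.clone} gen:{r.gen} core:{r.core}")
```

To run it against a real log, the same functions could be fed the lines of the log file (for example `faulty_units(open("log.txt"))`, where the file name is whatever your client writes). The output is just the project/run/clone/gen identifiers, which is the form the Mods asked for above.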
Ryzen 5900x 12T - RTX 4070 TI