GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 4:56 pm
by s/j
I still have the issue from a prior closed thread in the Drivers posts as follows: (Does this still seem to be an AS issue? Should I just continue to wait for Stanford to refine the AS? Thoughts?):

I have a new build that folded great for a week, but now one of the GPUs (Zotac GTX 780) fails folding every time. The GTX 970 folds great.
I have tried swapping PCIE16 positions as well as pausing folding, deleting work folder and re-booting.
Using MSI Afterburner, the GPUs have been underclocked by 120 MHz since day 1, and GPU temps are consistently in the low 70s °C.
Build is AMD FX-8350, 16 GB RAM, Gigabyte 990FXA-UD3 motherboard, 1000 W Rosewill PSU. GPU 1 is EVGA GTX 970, GPU 2 is Zotac GTX 780 OC. Driver version is 9.18.13.4411 (344.16).
Here is some log on a GPU failure:
Log Started 2014-10-06T19:46:01Z ***********************
19:46:02:WU03:FS01:0x17:Project: 13001 (Run 236, Clone 3, Gen 11)
19:46:02:WU03:FS01:0x17:Unit: 0x0000001d538b3db7532892a3432b10e4
19:46:02:WU03:FS01:0x17:CPU: 0x00000000000000000000000000000000
19:46:02:WU03:FS01:0x17:Machine: 1
19:46:02:WU03:FS01:0x17:Reading tar file state.xml
19:46:02:WU03:FS01:0x17:Reading tar file system.xml
19:46:03:WU03:FS01:0x17:Reading tar file integrator.xml
19:46:03:WU03:FS01:0x17:Reading tar file core.xml
19:46:03:WU03:FS01:0x17:Digital signatures verified
19:46:03:WU03:FS01:0x17:Folding@home GPU core17
19:46:03:WU03:FS01:0x17:Version 0.0.52
19:49:10:WU02:FS00:0xa4:Completed 390000 out of 500000 steps (78%)
19:49:55:WU03:FS01:0x17:ERROR:exception: Force RMSE error of 453.966 with threshold of 5
19:49:55:WU03:FS01:0x17:Saving result file logfile_01.txt
19:49:55:WU03:FS01:0x17:Saving result file log.txt
19:49:55:WU03:FS01:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
19:49:56:WARNING:WU03:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
19:49:56:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:13001 run:236 clone:3 gen:11 core:0x17 unit:0x0000001d538b3db7532892a3432b10e4
19:49:56:WU03:FS01:Uploading 2.26KiB to 140.163.4.231
19:49:56:WU03:FS01:Connecting to 140.163.4.231:8080
19:49:56:WU03:FS01:Upload complete
19:49:56:WU03:FS01:Server responded WORK_ACK (400)
Any thoughts?

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 6:25 pm
by rwh202
Are you sure which GPU is folding and which is failing?
That failure is typical of Maxwell on core_17, which would be the 970. The client often mixes up the slots on a mixed-GPU system, so it might be worth checking the temps and usage of each GPU to be sure which one is running. I had that problem with a 660 and a 750 Ti in the same system.
If that's the issue, then it will be a case of manually setting the slot/CUDA/OpenCL IDs.

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 6:29 pm
by bruce
That WU has been reissued to 5 different people and in each case, it failed. There's always a chance of bad WUs and the only way to identify them is to process them. They're reissued a few times and either completed or taken out of circulation. The assignments of this WU were all on 2014-10-05 and it was withdrawn from circulation.

It truly is a BAD_WORK_UNIT and this has NOTHING to do with the AS. Asking about a single failure after this much time has elapsed doesn't help anybody identify a problem which can be solved.

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 6:44 pm
by s/j
RWH, I wondered the same thing so I un-installed F@H and re-installed with one GPU to verify, as I also thought perhaps it was actually the 970 failing. I confirmed it is the 780.

Bruce, as stated, I have tried deleting the work folder and re-booting, and I try this multiple times per day. I get that it is not a single bad WU, I'm just saying that is the log message. I could post multiple logs from today with the exact same message, but I don't see how that contributes anything either. I thought my post made it apparent this is not a single failure.

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 6:49 pm
by bruce
Please list the project/run/clone/gen of several more WUs that are failing.
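
For anyone who wants to collect that list without scrolling through the log by hand, here is a minimal Python sketch. It assumes the log contains "Sending unit results" lines with `error:FAULTY` in the same format as the one posted above; the function name `faulty_units` is only illustrative and not part of the client.

```python
import re

# Matches the project/run/clone/gen fields on a FAULTY "Sending unit results" line.
FAULTY_RE = re.compile(
    r"error:FAULTY project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)"
)


def faulty_units(log_lines):
    """Return unique (project, run, clone, gen) tuples in order of appearance."""
    units = []
    for line in log_lines:
        match = FAULTY_RE.search(line)
        if match:
            prcg = tuple(int(group) for group in match.groups())
            if prcg not in units:
                units.append(prcg)
    return units


# Sample lines copied from the log posted earlier in this thread.
sample = [
    "19:49:56:WARNING:WU03:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
    "19:49:56:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY "
    "project:13001 run:236 clone:3 gen:11 core:0x17 "
    "unit:0x0000001d538b3db7532892a3432b10e4",
]

for project, run, clone, gen in faulty_units(sample):
    print(f"project:{project} run:{run} clone:{clone} gen:{gen}")
```

To run it against a real log, pass the lines of the log file instead of `sample`, for example `faulty_units(open(path))` with `path` pointing at wherever your client keeps its log.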

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 7:02 pm
by Breach
13001 had a similar failure on Maxwells, only that there all WUs would fail instantly.

viewtopic.php?f=18&t=26807&start=60#p269635

This failure looks more similar to the failures of 10470-10473:

viewtopic.php?f=66&t=26528&start=60#p269314

There the fault would happen during folding (with only some WUs, in my experience).

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 7:10 pm
by s/j
Will do Bruce. (It's going to suck if this GTX 780 has died after one week).

Here are a few from today:
18:56:41:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:56:41:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13001 run:69 clone:3 gen:42 core:0x17 unit:0x0000006c538b3db75328634d9a354a91

17:24:53:WARNING:WU03:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
17:24:53:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:13001 run:15 clone:8 gen:31 core:0x17 unit:0x00000040538b3db753285433c0ce9452

17:13:35:WARNING:WU00:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
17:13:35:WU00:FS01:Sending unit results: id:00 state:SEND error:FAULTY project:13000 run:960 clone:1 gen:19 core:0x17 unit:0x00000025538b3db75310aabe24976893

As RWH suggested, if I didn't know better I would swear the system has the 970 and 780 confused.

Thanks.

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 7:22 pm
by bruce
project:13001 run:69 clone:3 gen:42. Bad WU. Failed repeatedly.
project:13001 run:15 clone:8 gen:31. Indeterminate until later. Failed only for you.
project:13000 run:960 clone:1 gen:19. Bad WU. Failed repeatedly.

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 7:22 pm
by Kjetil
It is the same error I had on Maxwell. Now they are running, but very slowly: 980-PPD 17H, 750Ti 1D 12H on core 18 P1047x.

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 7:43 pm
by bruce
This may or may not help, but it can't hurt.

In a part of the log just before the part that you posted, you'll find a message something like
Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" xyz/FahCore_17.exe ...(arguments)

Carefully copy the entire colored directory including whatever actually appears where I've abbreviated a long path as xyz.
Stop your client.
Delete the file xyz/FahCore_17.exe
Restart the client.
(A new copy of xyz/FahCore_17.exe will download and work should resume.)

Let me know if this helps.
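
If copying the path by hand is error-prone, the following Python sketch pulls it out of a "Running FahCore:" log line. It assumes the core path appears in double quotes and ends in `FahCore_<nn>.exe`, which is how the line looks in a log posted later in this thread; the function name `core_path` is illustrative. It only prints the path and deliberately does not delete anything.

```python
import re

# The wrapper is FAHCoreWrapper.exe, so a case-sensitive "FahCore_" prefix
# picks out the core executable rather than the wrapper.
CORE_RE = re.compile(r'"([^"]*FahCore_\w+\.exe)"')


def core_path(log_line):
    """Return the quoted FahCore executable path from a log line, or None."""
    match = CORE_RE.search(log_line)
    return match.group(1) if match else None


# Sample line copied from a log posted in this thread.
line = (
    r'20:06:04:WU00:FS01:Running FahCore: '
    r'"C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" '
    r'"C:/Users/Eds Sled/AppData/Roaming/FAHClient/cores/web.stanford.edu/'
    r'~pande/Win32/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17.exe" '
    r'-dir 00 -suffix 01 -version 704 -lifeline 3276 -checkpoint 15 '
    r'-gpu 0 -gpu-vendor nvidia'
)

print(core_path(line))
```

Stop the client before deleting the file that this prints, as described in the steps above.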

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 8:11 pm
by s/j
Bruce, no joy. I verified new core download:
20:06:00:WU00:FS01:Downloading core from http://web.stanford.edu/~pande/Win32/AM ... ore_17.fah

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 8:16 pm
by 7im
What version was downloaded, v52 or v55?

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 8:32 pm
by s/j
7im: Where do I get that? I don't see it in the core file properties.
Here is what log says:
20:06:04:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" "C:/Users/Eds Sled/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17.exe" -dir 00 -suffix 01 -version 704 -lifeline 3276 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
20:06:04:WU00:FS01:Started FahCore on PID 3788

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 8:41 pm
by bollix47
Navigate to the directory with FahCore_17.exe in it and open a command prompt.

Type:

FahCore_17.exe --info

The version should be just under the ******Build****** line.

EDIT: The following is much easier.

Re: GTX 780 BAd Work Units?

Posted: Fri Oct 10, 2014 9:03 pm
by 7im
Or in the log file when the FAHCore and work unit starts up...
19:46:02:WU03:FS01:0x17:Unit: 0x0000001d538b3db7532892a3432b10e4
19:46:02:WU03:FS01:0x17:CPU: 0x00000000000000000000000000000000
19:46:02:WU03:FS01:0x17:Machine: 1
19:46:02:WU03:FS01:0x17:Reading tar file state.xml
19:46:02:WU03:FS01:0x17:Reading tar file system.xml
19:46:03:WU03:FS01:0x17:Reading tar file integrator.xml
19:46:03:WU03:FS01:0x17:Reading tar file core.xml
19:46:03:WU03:FS01:0x17:Digital signatures verified
19:46:03:WU03:FS01:0x17:Folding@home GPU core17
19:46:03:WU03:FS01:0x17:Version 0.0.52
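
The same lookup can be scripted. This is a minimal Python sketch, assuming the core writes its version on a line of the form `...:0x17:Version 0.0.52` as in the excerpt above; the function name `core_versions` is illustrative.

```python
import re

# Matches a core startup line such as "...:0x17:Version 0.0.52".
VERSION_RE = re.compile(r":0x[0-9a-fA-F]+:Version (\d+(?:\.\d+)*)\s*$")


def core_versions(log_lines):
    """Return every core version string reported in the log, in order."""
    versions = []
    for line in log_lines:
        match = VERSION_RE.search(line)
        if match:
            versions.append(match.group(1))
    return versions


# Sample lines copied from the log excerpt above.
sample = [
    "19:46:03:WU03:FS01:0x17:Digital signatures verified",
    "19:46:03:WU03:FS01:0x17:Folding@home GPU core17",
    "19:46:03:WU03:FS01:0x17:Version 0.0.52",
]

print(core_versions(sample))  # prints ['0.0.52']
```

For this log the reported version is 0.0.52, which appears to be the v52 asked about earlier in the thread.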