Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 5:07 am
by Scarlet-Tech
7im wrote:Yes, The problem persists. Lowering the memory clocks was never a solution, simply a workaround that has helped some people finish more work units. This is while we wait for Stanford to revise and improve the core.
Stanford is able to see which GPUs are processing work units, as far as I understand. Wouldn't it be possible to have work units that aren't compatible with a specific architecture (i.e. Maxwell) not assigned?
I mean, they did give a bonus once the Maxwell cards were introduced in 2014, and the Maxwell cards were receiving better ppd after that bonus.
I don't know how Stanford sets the work units. I know AMD can get different units than Nvidia, and some newer cards receive bonuses compared to older even if they fold at the same TPF, so this is all an assumption and a probe for possible information.
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 5:25 am
by Scarlet-Tech
We also had members that are running these tests designate specific flags and specific diseases. Neither did anything.
With the advanced flags enabled, the same work units were received as the users without the advanced flag. The only difference was lower ppd. All units were:
Disease Type Unknown (or Unspecified), Malfunction of the CLC chloride ion channel.
Designating work units to fold for a specific disease gave the same results: disease type unknown or unspecified, except the description was for Cystic Fibrosis, even though the specified disease is not being picked up.
I think there are 4 or 5 designators; forgive me for not having a computer at hand to get the exact list through Google.
I know of Cancer, Parkinson's, Alzheimer's, and a couple of others I can't remember off the top of my head.
So, if these designators are set and flags are not set, why are the same units picked up in every configuration, with only the point amount changing? The same failures are present and adjusting settings isn't working. Is it maybe a client issue, since the client isn't requesting the preferred disease or client-type?
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 6:34 am
by bruce
Scarlet-Tech wrote:With the advanced flags enabled, the same work units were received as the users without the advanced flag. The only difference was lower ppd
This is a myth. It's not a true statement. How can we combat all the false information that's floating around the various team forums?
If you're assigned, say, a WU from P9712 with the Adv flag set, you'll earn EXACTLY the same points if you're assigned a P9712 WU without that flag set.
Scarlet-Tech wrote:Wouldn't it be possible to have work units that aren't compatible with a specific architecture (i.e. Maxwell) not assigned?
In a general sense, they're working toward such a temporary solution. Unfortunately none fit the binary characterization of "compatible" vs. "not-compatible" but rather have an error rate that varies. Let's say that for some specific project on Maxwell the error rate varies between 15% and 90%, depending on the particular system the WU is assigned to, and that for Fermi it varies between 2% and 15%. It probably makes sense to stop assigning those WUs to the folks who are dumping 90% of the WUs, but then they also wouldn't be assigned to those people who are completing 85% of their assignments. [I'm making all these numbers up. I have no data that would allow me to evaluate the range of error rates for specific projects or for specific hardware.]
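To make the trade-off above concrete, here is a minimal Python sketch. Everything in it is hypothetical: the system names, the error rates (invented, like the figures in the post), and both rule functions. It is not how the assignment servers work; it only contrasts a binary per-family block with a per-system error-rate cutoff.

```python
# All error rates below are invented for illustration, like the figures in the post.
SYSTEMS = [
    {"name": "maxwell-a", "family": "Maxwell", "error_rate": 0.15},
    {"name": "maxwell-b", "family": "Maxwell", "error_rate": 0.90},
    {"name": "fermi-a", "family": "Fermi", "error_rate": 0.02},
    {"name": "fermi-b", "family": "Fermi", "error_rate": 0.15},
]

def allow_excluding_family(systems, family):
    """Binary rule: stop assigning the project to every GPU of one family."""
    return [s["name"] for s in systems if s["family"] != family]

def allow_under_rate(systems, max_error_rate):
    """Per-system rule (hypothetical): keep any system at or below the cutoff."""
    return [s["name"] for s in systems if s["error_rate"] <= max_error_rate]

# Blocking the whole family also drops the Maxwell system that completes 85%.
print(allow_excluding_family(SYSTEMS, "Maxwell"))

# A per-system cutoff keeps that system and drops only the one failing 90%.
print(allow_under_rate(SYSTEMS, 0.50))
```

The first rule loses the Maxwell system that finishes most of its work, which is the cost described in the post. Whether a per-system rule is even possible on the real servers is an open assumption here.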
The Pande Group is working on what I believe is a better temporary fix.
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 6:39 am
by Scarlet-Tech
Ok. So, why is the advanced flag even available if you are going to get the same work units either way?
Advanced flags used to deliver post beta work units, before they are released as regular units. So, if they are just regular units, what is the purpose of the advanced flag?
I would rather have full answers than cherry-picked portions of my posts. Again, we over at EVGA are trying to figure out what to do to complete more units and get more data back to Stanford rather than corrupted data. Focusing on the least helpful portion of a post and not answering the rest is going to end up pushing several of our dedicated folders over to crunching, which we are trying to avoid.
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 7:20 am
by bruce
I didn't say you're going to get the same WUs, I said if you do get the same WU, the points will not change.
WUs are first tested internally by PG, then moved to beta, then moved to Adv, then moved to full FAH. Supposedly each move only takes place when the error rates are lower than some number. Now suppose a project that's having trouble on Maxwell happens to be tested on a lot of Fermi so that the average error rate is acceptable. After it's moved, more Maxwells join the testing and the average error rate goes up. I don't know if that's what happened here, but it might have. {And there are other factors that influence the average error rate, not just the GPU type.}
Many Donors (like yourself) focus almost exclusively on points. Others do not. The Pande Group focuses most of their attention on the amount of science being completed. Points are roughly correlated with science, but not perfectly.
People who join the beta team or people who run Adv are willing to accept higher percentages of errors with a secondary goal of helping isolate problems, report them, and get them fixed. At the same time, those Donors are exposed to the newer WUs before others get them. That can be either an advantage or a disadvantage. Their overall PPD is supposed to be similar to those who run full-FAH but it can vary if the error rate is too high or if the initial points setting needs to be adjusted (either up or down). If adjustments are needed, they generally happen quickly and then remain the same for the life of the project as it proceeds toward full released status.
Again, suppose there are only a few projects in full-FAH that can be processed on your system. Suppose there's a shortage of WUs for those projects so your system has to wait until a WU is available. Adding the Adv flag will give you access to a wider variety of projects so sometimes you'll get an assignment sooner rather than later. This can be a good choice if all of the Adv projects already have a low error rate.
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 7:21 am
by billford
Scarlet-Tech wrote:Ok. So, why is the advanced flag even available if you are going to get the same work units either way?
Advanced flags used to deliver post beta work units, before they are released as regular units. So, if they are just regular units, what is the purpose of the advanced flag?
You'll only get advanced WUs (ie post-beta but pre-release) if any projects happen to be at that stage- at the moment I don't think there are any (for GPUs, not sure about smp) so the advanced flag doesn't do anything. This is subject to change at any time, the progress of a project from beta => advanced => full (and occasionally in the opposite direction) is given in the Announcements section.
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 11:59 am
by Scarlet-Tech
bruce wrote:I didn't say you're going to get the same WUs, I said if you do get the same WU, the points will not change.
Many Donors (like yourself) focus almost exclusively on points. Others do not. The Pande Group focuses most of their attention on the amount of science being completed. Points are roughly correlated with science, but not perfectly.
That post is very helpful. I focus more on completed work units. Notice that the week I started this thread, I was getting 14 completed work units per day on average. This week, I am up to 34. That is a huge improvement, meaning that Stanford is getting 20 extra completed units back. Why do we use ppd when referring to it? Because most people care about their ppd.
Before the errors, I was getting 34 units completed on average at 1.5m ppd. Then it dropped to between 12-15 units completed and 880k to 1.2m.
Now, this week I averaged 34 work units at 1.9m ppd on the same hardware after requesting help and sending up the units that we were all experiencing the most failures on. (Mine are best guesses, as I am obviously not near my computer.)
This week has been better for many of our folders, but many are also still having the issues, so we would love to help get this entire thing solved.
With, or without, the advanced flag the same units are currently being picked up by the users that are experiencing failures, even when a disease is selected. So, that is slightly concerning for most of our guys right now.
billford wrote:
You'll only get advanced WUs (ie post-beta but pre-release) if any projects happen to be at that stage- at the moment I don't think there are any (for GPUs, not sure about smp) so the advanced flag doesn't do anything. This is subject to change at any time, the progress of a project from beta => advanced => full (and occasionally in the opposite direction) is given in the Announcements section.
I will favorite that section right now. Thank you sir.
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 4:37 pm
by 7im
FYI, the disease preference, while configurable in the client, is currently only a placeholder for a feature to be implemented at the server level at a later time. The preference currently has no effect on work unit assignment.
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 5:04 pm
by Rel25917
Scarlet-Tech wrote:
That post is very helpful. I focus more on completed work units. Notice that the week I started this thread, I was getting 14 completed work units per day on average. This week, I am up to 34. That is a huge improvement, meaning that Stanford is getting 20 extra completed units back. Why do we use ppd when referring to it? Because most people care about their ppd.
We go by ppd as there can be a very large difference in the time taken to complete units from different projects. If you get a streak of small units you may get 8 units from a single card a day, but the largest units may only get you 2 completed, and in both cases ppd will be similar. Also, in both cases you've really done the same amount of work; unit size is largely a function of how much detail the researcher wants, so smaller units are used. (That's how I understand it from comments on this site anyway.)
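As a rough illustration of the point above, here is a short Python sketch. The per-unit point values are hypothetical, picked only so the two cases line up, and the function ignores the quick return bonus that real credit includes.

```python
def points_per_day(units_per_day, points_per_unit):
    """PPD as completed units per day times credit per unit (bonus ignored)."""
    return units_per_day * points_per_unit

# Hypothetical credit values: a small unit worth 25,000, a large one 100,000.
small_units = points_per_day(units_per_day=8, points_per_unit=25_000)
large_units = points_per_day(units_per_day=2, points_per_unit=100_000)

print(small_units, large_units)  # both come to 200000
```

Counting completed units alone would make the first day look four times as productive, while ppd treats the two days as equal.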
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 10:30 pm
by bcavnaugh
Round Two, First Error on the GTX 980 HC Rig.
12:28:50:WU02:FS01:Starting
12:28:50:WU02:FS01:Running FahCore: "D:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" D:/ProgramData/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_21.fah/FahCore_21.exe -dir 02 -suffix 01 -version 704 -lifeline 5792 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
12:28:50:WU02:FS01:Started FahCore on PID 884
12:28:50:WU02:FS01:Core PID:5868
12:28:50:WU02:FS01:FahCore 0x21 started
12:28:50:WU02:FS01:0x21:*********************** Log Started 2015-11-19T12:28:50Z ***********************
12:28:50:WU02:FS01:0x21:Project: 9205 (Run 28, Clone 19, Gen 11)
12:28:50:WU02:FS01:0x21:Unit: 0x0000003e664f2dd055d4d6bd3df7234a
12:28:50:WU02:FS01:0x21:CPU: 0x00000000000000000000000000000000
12:28:50:WU02:FS01:0x21:Machine: 1
12:28:50:WU02:FS01:0x21:Reading tar file core.xml
12:28:50:WU02:FS01:0x21:Reading tar file system.xml
12:28:51:WU02:FS01:0x21:Reading tar file integrator.xml
12:28:51:WU02:FS01:0x21:Reading tar file state.xml
12:28:52:WU02:FS01:0x21:Digital signatures verified
12:28:52:WU02:FS01:0x21:Folding@home GPU Core21 Folding @ home Core
12:28:52:WU02:FS01:0x21:Version 0.0.12
12:29:36:WU02:FS01:0x21:ERROR:Potential energy error of 712.638, threshold of 10
12:29:36:WU02:FS01:0x21:ERROR:Reference Potential Energy: -1.24026e+006 | Given Potential Energy: -1.23955e+006
12:29:36:WU02:FS01:0x21:Saving result file logfile_01.txt
12:29:36:WU02:FS01:0x21:Saving result file log.txt
12:29:36:WU02:FS01:0x21:Folding@home Core Shutdown: BAD_WORK_UNIT
12:29:37:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
12:29:37:WU02:FS01:Sending unit results: id:02 state:SEND error:FAULTY project:9205 run:28 clone:19 gen:11 core:0x21 unit:0x0000003e664f2dd055d4d6bd3df7234a
12:29:37:WU02:FS01:Uploading 2.27KiB to 171.64.65.104
12:29:37:WU02:FS01:Connecting to 171.64.65.104:8080
12:29:37:WU02:FS01:Upload complete
12:29:37:WU02:FS01:Server responded WORK_ACK (400)
12:29:37:WU02:FS01:Cleaning up
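For anyone comparing many failures like the one above, here is a minimal Python sketch that pulls the key numbers out of such log lines. The regular expressions are written against the three lines copied from this log only; whether other core versions word these messages the same way is an assumption, and `summarize` is a made-up helper, not part of the client.

```python
import re

# Three lines copied verbatim from the log above.
LOG_LINES = [
    "12:29:36:WU02:FS01:0x21:ERROR:Potential energy error of 712.638, threshold of 10",
    "12:29:36:WU02:FS01:0x21:ERROR:Reference Potential Energy: -1.24026e+006 | Given Potential Energy: -1.23955e+006",
    "12:29:37:WARNING:WU02:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
]

NUMBER = r"([-+\d.eE]+)"
ERROR_RE = re.compile(rf"Potential energy error of {NUMBER}, threshold of {NUMBER}")
ENERGY_RE = re.compile(
    rf"Reference Potential Energy: {NUMBER} \| Given Potential Energy: {NUMBER}"
)

def summarize(lines):
    """Collect the reported error, threshold, energy gap, and failure flag."""
    summary = {"bad_work_unit": False}
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            summary["reported_error"] = float(match.group(1))
            summary["threshold"] = float(match.group(2))
        match = ENERGY_RE.search(line)
        if match:
            reference, given = float(match.group(1)), float(match.group(2))
            summary["energy_gap"] = abs(given - reference)
        if "BAD_WORK_UNIT" in line:
            summary["bad_work_unit"] = True
    return summary

print(summarize(LOG_LINES))
```

The gap computed from the two logged energies is 710, close to the reported 712.638; the difference is presumably because the log prints the energies rounded to six significant digits. Either way the value is far above the threshold of 10.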
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 10:32 pm
by bcavnaugh
Good news: with no flags set I am no longer getting any Core 21 P96xx Projects.
Core 18 ZETA_DEV is all I see now!
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 10:33 pm
by bcavnaugh
Rel25917 wrote:Scarlet-Tech wrote:
That post is very helpful. I focus more on completed work units. Notice that the week I started this thread, I was getting 14 completed work units per day on average. This week, I am up to 34. That is a huge improvement, meaning that Stanford is getting 20 extra completed units back. Why do we use ppd when referring to it? Because most people care about their ppd.
We go by ppd as there can be a very large difference in the time taken to complete units from different projects. If you get a streak of small units you may get 8 units from a single card a day, but the largest units may only get you 2 completed, and in both cases ppd will be similar. Also, in both cases you've really done the same amount of work; unit size is largely a function of how much detail the researcher wants, so smaller units are used. (That's how I understand it from comments on this site anyway.)
This is not really so true if half of the small projects fail.
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 10:34 pm
by bcavnaugh
7im wrote:FYI, the disease preference, while configurable in the client, is currently only a placeholder for a feature to be implemented at the server level at a later time. The preference currently has no effect on work unit assignment.
If I recall you said this two years ago as well!
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 10:37 pm
by bcavnaugh
Round Two Updates
Re: Failing units, low ppd, and returned units.
Posted: Thu Nov 19, 2015 11:01 pm
by mmonnin
'Stock' as in from EVGA or actual stock from NV? Anything over what NV recommends is an overclock.
No failed core 21 WUs on my 970.