Re: New Assignment Server feedback/problem
Posted: Sat Nov 01, 2014 9:02 pm
by bruce
heikosch wrote:When the new AS was activated (just overnight for me!) my GTX 750Ti began to throw errors with P1300x. The Core 0x17 version didn't change, and I didn't change the nVidia driver or install other software or updates.
Maybe they changed not only the AS but, independently, the content of the P1300x WUs. Shortly after that they stopped assigning P1300x to Maxwell GPUs.
I'd question whether all of that is true or not. A given project cannot be changed without disrupting the validity of the science. If drivers and the FahCore version did not change, then either the percentage of WUs failing was higher than you think and you were lucky to not see the failures or maybe you're thinking of some other project that was assigned to use Core_18.
As has already been said, the Assignment Server is responsible for assigning your client to a specific Work Server that can supply work for your system. During the first exposure of the Assignment Server to the world, some projects and/or cores were assigned to systems that could not process them. Those bugs seem to have been eradicated, so now Core_17, Core_18, and Core_b0 are only assigned to systems that can process them. But either your system could process a WU before and (if it's still assigned) can still process it, or your system had problems with WUs from specific projects before and now they're not being assigned.
Re: New Assignment Server feedback/problem
Posted: Sat Nov 01, 2014 9:50 pm
by JimF
It certainly happened to me at the same time. I was surprised that PG did not catch it themselves then, and even more surprised that it is questioned now. It may be a coincidence, but that is another question.
viewtopic.php?f=18&t=26807&p=269497&hilit=GTX#p269497
viewtopic.php?f=80&t=25887&start=135
Re: New Assignment Server feedback/problem
Posted: Sat Nov 01, 2014 10:07 pm
by heikosch
OK, I looked it up in my log files. I am referring to the "Force RMSE error" that occurred in a lot of different projects, not only P1300x.
The first occurrence in my logs is at 2014-10-02T23:38:09Z, which is in fact after the new AS was activated. But Prof. Pande announced "Upgraded Maxwell support for Core17" that day. Maybe the changes for Maxwell support and the assignment configuration errors got mixed up.
Heiko
Re: New Assignment Server feedback/problem
Posted: Sat Nov 01, 2014 11:15 pm
by Breach
heikosch wrote:OK, I looked it up in my log files. I am referring to the "Force RMSE error" that occurred in a lot of different projects, not only P1300x.
The first occurrence in my logs is at 2014-10-02T23:38:09Z, which is in fact after the new AS was activated. But Prof. Pande announced "Upgraded Maxwell support for Core17" that day. Maybe the changes for Maxwell support and the assignment configuration errors got mixed up.
Heiko
AFAIK, Maxwell is currently still "partially supported", in the sense that both current core 17 versions, 0.0.52 and 0.0.55, work just fine when folding WUs from some projects, e.g. P9201/9202/7814 (non-exhaustive list). With other projects (e.g. 1300x) the core apparently calls a function which is "buggy" (I guess at the level of the nvidia drivers/hardware?), hence the RMSE problems. Although this was reported as a problem months ago, it only became more manifest when the new AS came online, as it was initially giving out far more "problematic" projects to Maxwells than before. This issue has been discussed at length already, and the "solution" was indeed not to assign WUs from known problematic projects to Maxwells. AFAIK this is still the case today, so you shouldn't be seeing these errors now (and if you do, please report it).
This will hopefully change soon, because all "newer" core 17 projects apparently use the problematic function, and as much as I have learned to enjoy my fan at 100% when folding Core 15 WUs, it's getting a tad old and 9201 will end at some point. I guess we're already running low on 9201 WUs, and no, I have no idea why we're not getting WUs from P9202 and P7814: either the projects are over or they have been rightly/mistakenly blacklisted for Maxwells(?). By the way, this problem has been worked around in core 18 on beta at a (huge) PPD expense; hopefully a more permanent solution will be found soon.
Re: New Assignment Server feedback/problem
Posted: Sun Nov 02, 2014 12:50 am
by bruce
I think what happened was that when the AS code wasn't working right, everything got an increased degree of scrutiny. The "Force RMSE error" problem was already a low-frequency problem, but once they recognized it, they set out to fix it, and by not assigning those projects to certain GPUs, the success rate went up.
I'm certain that some developers, somewhere, are examining the trade-offs and in time, there will be a permanent fix for the problem.
Re: New Assignment Server feedback/problem
Posted: Sun Nov 02, 2014 2:21 am
by JimF
Great, that is all I, and probably a lot of others, need to know.
Re: New Assignment Server feedback/problem
Posted: Sun Nov 02, 2014 6:51 am
by billford
bruce wrote:The "Force RMSE error" problem was already a low-frequency problem, but once they recognized it, they set out to fix it, and by not assigning those projects to certain GPUs, the success rate went up.
That certainly makes sense, but I wonder: did the overall rate of completed WUs (i.e. the amount of science getting done in a given time) also go up? It's quite possible to get a higher overall return (not quite the same thing as success rate) by accepting a certain failure rate in some areas, especially when the "unreliable" platform is much faster than the others.
In the matter under discussion this will certainly be true for Linux clients when P9201 goes EOL, and may even still be true when the slower Core_18 comes online.
Certainly some donors won't like it, but if the "marginal" WUs were set to advanced for that platform (or even beta, with the acceptance that no support would be given) they would have the choice, and setting the advanced flag implies acceptance of a certain degree of risk anyway.
(And setting the beta flag virtually guarantees it!)
It may well be that PG went through these calculations and the results pointed to the same decision; I'm just curious. And also fairly certain that my curiosity won't be satisfied.
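For anyone who wants to play with the numbers, here is a minimal sketch of the kind of calculation meant above. All figures are made up for illustration (they are not measured FAH data), the function names are hypothetical, and the model is deliberately crude: it assumes a failed WU costs the same time as a successful one and ignores the cost of reassigning it to another donor.

```python
def completed_wus_per_day(attempted_per_day: float, failure_rate: float) -> float:
    """Expected number of WUs successfully returned per day."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return attempted_per_day * (1.0 - failure_rate)


def break_even_failure_rate(fast_attempted: float, slow_completed: float) -> float:
    """Failure rate at which the fast platform only matches the slow one."""
    if fast_attempted <= 0:
        raise ValueError("fast_attempted must be positive")
    return 1.0 - slow_completed / fast_attempted


# Hypothetical platforms: a fast but "unreliable" one and a slow, reliable one.
fast_unreliable = completed_wus_per_day(10.0, 0.25)  # 25% of WUs fail
slow_reliable = completed_wus_per_day(6.0, 0.0)      # no failures

print(f"fast/unreliable: {fast_unreliable} WUs/day")
print(f"slow/reliable:   {slow_reliable} WUs/day")
print(f"break-even failure rate: {break_even_failure_rate(10.0, slow_reliable):.0%}")
```

With these invented numbers the faster platform still returns more completed work despite failing a quarter of its WUs, and it only drops to parity at a 40% failure rate. Whether the real numbers look anything like this is exactly what I don't know.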
bruce wrote:I'm certain that some developers, somewhere, are examining the trade-offs and in time, there will be a permanent fix for the problem.
The operative words there are "in time"
Re: New Assignment Server feedback/problem
Posted: Mon Nov 03, 2014 11:26 pm
by Kjetil
My 5 970s and 6 980s are off, sorry.
Re: New Assignment Server feedback/problem
Posted: Thu Nov 13, 2014 5:55 pm
by heikosch
Breach wrote:According to that, 171.67.108.52 is 'full' (in full operation, should be giving out WUs), but is then marked as 'Blue' ("Blue - if the AS has decided not to assign to that machine, eg. the AS thinks it is down or out of jobs (blue means iced)"). The WU stats for this WS are null, so I guess it's either out of work or there's another reason the AS considers it not available...
Over the last few days I watched the server status of 171.67.108.52, and my computer got WUs from 171.67.108.52 regardless of the blue color in the "% Ass" column and the numbers in the three columns before it.
=> I've no idea how to interpret the server status correctly.
Maybe the server status changes so often that the log is always outdated.
Heiko