Re: Re: New AS testing

Posted: Wed Oct 07, 2015 9:08 pm
by Joe_H
Some of the projects on 128.143.199.97 have been restricted for assignment because WUs were being created with too many steps. There are topics on Projects 7520 & 7528 connected to that problem. Dr. Kasson is aware of the problem, but has not been able to get it fixed yet.

Re: New AS testing

Posted: Wed Oct 07, 2015 9:13 pm
by arkaine23
Have over a hundred i5s running 6.34 SMP. Seeing a production drop of ~20% over the last few days. Not able to babysit all of these clients, but I know it's affecting many/most.... Figured it was just low/no SMP WU availability, then saw this thread.

Log from one:

[22:01:47] Folding@home Core Shutdown: FINISHED_UNIT
[22:01:50] CoreStatus = 64 (100)
[22:01:50] Sending work to server
[22:01:50] Project: 9752 (Run 2021, Clone 0, Gen 246)


[22:01:50] + Attempting to send results [October 5 22:01:50 UTC]
[22:02:03] + Results successfully sent
[22:02:03] Thank you for your contribution to Folding@Home.
[22:02:03] + Number of Units Completed: 1044

[22:02:07] - Preparing to get new work unit...
[22:02:07] Cleaning up work directory
[22:02:07] + Attempting to get work packet
[22:02:07] Passkey found
[22:02:07] - Connecting to assignment server
[22:02:08] + No appropriate work server was available; will try again in a bit.
[22:02:08] + Couldn't get work instructions.
[22:02:08] - Attempt #1 to get work failed, and no other work to do.
Waiting before retry.
[22:02:23] + Attempting to get work packet
[22:02:23] Passkey found
[22:02:23] - Connecting to assignment server
[22:02:23] + No appropriate work server was available; will try again in a bit.
[22:02:23] + Couldn't get work instructions.
[22:02:23] - Attempt #2 to get work failed, and no other work to do.
Waiting before retry.
[22:02:36] + Attempting to get work packet
[22:02:36] Passkey found
[22:02:36] - Connecting to assignment server
[22:02:37] + No appropriate work server was available; will try again in a bit.
[22:02:37] + Couldn't get work instructions.
[22:02:37] - Attempt #3 to get work failed, and no other work to do.
Waiting before retry.


It's up to 67 retries.
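
For anyone who can't watch every client by hand, a rough Python sketch along these lines could tally the failed fetches in a v6 log. The regex is an assumption based only on the lines pasted above, and `failed_attempts` and `SAMPLE` are made-up names for illustration, not part of any FAH tool.

```python
import re

# Matches lines such as:
# "[22:02:08] - Attempt #1 to get work failed, and no other work to do."
ATTEMPT_RE = re.compile(
    r"^\[(\d{2}):(\d{2}):(\d{2})\] - Attempt #(\d+) to get work failed"
)


def failed_attempts(log_text):
    """Return (seconds_since_midnight, attempt_number) for each failed fetch."""
    results = []
    for line in log_text.splitlines():
        match = ATTEMPT_RE.match(line.strip())
        if match:
            hours, minutes, seconds, number = (int(g) for g in match.groups())
            results.append((hours * 3600 + minutes * 60 + seconds, number))
    return results


# A few lines copied from the log above.
SAMPLE = """\
[22:02:08] - Attempt #1 to get work failed, and no other work to do.
Waiting before retry.
[22:02:23] - Attempt #2 to get work failed, and no other work to do.
Waiting before retry.
[22:02:37] - Attempt #3 to get work failed, and no other work to do.
"""

attempts = failed_attempts(SAMPLE)
last_attempt = attempts[-1][1]
# Seconds between consecutive failures (ignores midnight rollover).
gaps = [later[0] - earlier[0] for earlier, later in zip(attempts, attempts[1:])]
print("last failed attempt:", last_attempt)
print("seconds between failures:", gaps)
```

Run against a full log file, the last attempt number gives the retry count without scrolling through the whole thing.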

Re: Re: New AS testing

Posted: Wed Oct 07, 2015 9:32 pm
by Joe_H
Joe Coffland did post - viewtopic.php?f=24&p=279775#p279772 - that there were some issues with a new AS working with V6 clients and that he hoped that would be resolved soon. Possibly additional issues still exist; I will ask that he check on that.

P.S. Depending on your systems, they may also have been affected by one WS improperly handling returns part of the day yesterday - viewtopic.php?f=18&t=28169. That server does have a 6.34 minimum version allowed for assignment.

Re: Re: New AS testing

Posted: Wed Oct 07, 2015 11:41 pm
by bruce
Let's assume that WUs for CPUs with higher numbers of threads are currently in limited circulation for any of the reasons given above. Let's also assume that there are plenty of WUs that can run on 12 threads (a semi-arbitrary number -- choose your own value).

It's not difficult to reconfigure a 48-way system into 4 slots using CPU:12 -- but that's assuming V7. It's quite a bit more challenging if you're running V6.
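
As a small illustration of the arithmetic only, here is a Python sketch that splits a machine's threads into slots of a chosen size. `split_threads` is a hypothetical helper, not part of the V7 client, and it says nothing about how the slots are actually entered in the client configuration.

```python
def split_threads(total_threads, threads_per_slot):
    """Split total_threads into slots of at most threads_per_slot threads."""
    if total_threads <= 0 or threads_per_slot <= 0:
        raise ValueError("thread counts must be positive")
    full_slots, remainder = divmod(total_threads, threads_per_slot)
    slots = [threads_per_slot] * full_slots
    if remainder:
        # Leftover threads go into one smaller slot.
        slots.append(remainder)
    return slots


print(split_threads(48, 12))  # four slots of 12 threads
print(split_threads(64, 12))  # five slots of 12 plus one of 4
```

When the thread count does not divide evenly, the leftover slot may be too small to be worth running, so picking a slot size that divides the total is probably the simpler choice.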

Ordinarily we do not recommend splitting a CPU up to run concurrent WUs with fewer threads, but this may be the exception.

All I can say is that the high-thread-count projects will come back online soon™ and you'll be able to switch back.

Re: Re: New AS testing

Posted: Thu Oct 08, 2015 10:24 pm
by toTOW
I don't know if it is related to the recent AS upgrade or to WS updates, but the psummary ( http://fah-web.stanford.edu/new/psummaryC.html ) is broken ... projects with blank fields and a "NaN" string instead of the deadline value.

Joe, can you look at this?

Re: New AS testing

Posted: Sat Oct 10, 2015 7:40 pm
by Dead Things
Just wanted to ask if someone would kindly make an announcement when SMP projects are available again. No point keeping the machines going doing nothing, so I've shut them all down.

Re: Re: New AS testing

Posted: Sat Oct 10, 2015 8:03 pm
by toTOW
The quickest fix is to update your client with v7, many SMP projects are available for it ... and it's the safest way to keep contributing.

Re: Re: New AS testing

Posted: Sat Oct 10, 2015 9:02 pm
by Dead Things
Thanks - it's a holiday weekend here, so if we're still running dry come Tuesday, I'll look into upgrading the clients.

Re: Re: New AS testing

Posted: Sun Oct 11, 2015 5:52 am
by Grandpa_01
toTOW wrote:The quickest fix is to update your client with v7, many SMP projects are available for it ... and it's the safest way to keep contributing.
That is not necessarily true. For those of us who have multi-socket, multi-core rigs, v7 is not the answer, since there is a limited supply of SMP WUs that will run on more than 24 cores. We can do as bruce suggested and run multiple WUs at 24 cores or fewer, but even then you will still be assigned to the same server, which has a limited supply of WUs, and if you do get 3 or more you will face a very large deficit in PPD. The only viable option I have found for multi-socket rigs with 48 or more cores is v6 running the bigadv flag.

Re: Re: New AS testing

Posted: Sun Oct 11, 2015 11:44 am
by Nathan_P
toTOW wrote:The quickest fix is to update your client with v7, many SMP projects are available for it ... and it's the safest way to keep contributing.
This "upgrade to v7" solution is starting to get boring. v6 is perfectly good for CPU work; no core updates have been sent out in years. The problem is a lack of SMP work for anything over 24 cores - that is what needs fixing, and the bigadv replacement WUs are not and should not be the only answer. People with multi-socket rigs are starting to get interested in FAH again; let's not send all that hardware back to BOINC or WCG.

Re: New AS testing

Posted: Mon Oct 12, 2015 12:15 am
by 7im
No core updates in years, yet AVX was just recently mentioned again. But there have been assignment server updates, and V6 will never work with those newer servers.

Bollix proved V7 could run just as fast as V6 on multi socket multi core servers. So not upgrading is what's getting boring.

Re: Re: New AS testing

Posted: Mon Oct 12, 2015 1:31 am
by bruce
Nathan_P wrote:...the problem is a lack of SMP work for anything over 24 cores - that is what needs fixing...
Large proteins don't run well on machines with a few CPUs and small proteins simply cannot run on machines with lots of cores. Nobody disputes that ">24" needs fixing, but that's an issue for the Pande Group, not the support forum. We can't do anything about the corruption that occurred or about the amount of time involved in fixing it. We can only suggest ways that YOU might get around the problem until it's fixed. If you don't choose to accept any of those suggestions, that's on you.

... or, you could be helpful and come up with some other suggestions of things that are within your power to fix. This is a community support forum, and you're knowledgeable enough to offer constructive support, too.

Re: New AS testing

Posted: Mon Oct 12, 2015 1:59 am
by Grandpa_01
7im wrote:No core updates in years, yet AVX was just recently mentioned again. But there have been assignment server updates, and V6 will never work with those newer servers.

Bollix proved V7 could run just as fast as V6 on multi socket multi core servers. So not upgrading is what's getting boring.
Below are the HFM logs from both the {H} v6 and the PG v7; they are not even close.

 Project ID: 8106
 Core: GRO_A5
 Credit: 5856
 Frames: 100


 Name: Core32 Slot 00
 Path: 10.0.0.10-36330  (v7 of FAH)
 Number of Frames Observed: 209

 Min. Time / Frame : 00:08:41 - 162,365.0 PPD
 Avg. Time / Frame : 00:09:09 - 150,103.4 PPD


 Name: Core321 (v6 of FAH)
 Path: \\CORE32\fah\  (v6 of FAH)
 Number of Frames Observed: 300

 Min. Time / Frame : 00:05:20 - 337,305.9 PPD
 Avg. Time / Frame : 00:05:36 - 313,501.8 PPD
 Cur. Time / Frame : 00:05:41 - 308,110.1 PPD
 R3F. Time / Frame : 00:05:38 - 310,969.1 PPD
 All  Time / Frame : 00:05:37 - 311,933.6 PPD
 Eff. Time / Frame : 00:05:37 - 311,933.6 PPD


 Name: Musky1
 Path: \\SCOTTY\fah\  (v6 of FAH)
 Number of Frames Observed: 300

 Min. Time / Frame : 00:05:04 - 364,282.7 PPD
 Avg. Time / Frame : 00:05:26 - 328,036.7 PPD
 Cur. Time / Frame : 00:05:29 - 322,385.0 PPD
 R3F. Time / Frame : 00:05:26 - 326,625.6 PPD
 All  Time / Frame : 00:05:26 - 326,625.6 PPD
 Eff. Time / Frame : 00:05:46 - 299,999.4 PPD


 Name: Patriot Slot 00
 Path: 10.0.0.17-36330  (v7 of FAH)
 Number of Frames Observed: 108

 Min. Time / Frame : 00:08:23 - 171,157.9 PPD
 Avg. Time / Frame : 00:08:36 - 164,730.6 PPD


 Name: Patriot1
 Path: \\PATRIOT\fah\  (v6 of FAH)
 Number of Frames Observed: 300

 Min. Time / Frame : 00:05:15 - 345,368.8 PPD
 Avg. Time / Frame : 00:05:28 - 325,041.1 PPD
 Cur. Time / Frame : 00:05:33 - 313,440.6 PPD
 R3F. Time / Frame : 00:05:51 - 293,306.5 PPD
 All  Time / Frame : 00:05:46 - 298,672.2 PPD
 Eff. Time / Frame : 00:05:52 - 292,253.2 PPD


 Name: tear1
 Path: \\TEAR\fah\ (v6 of FAH)
 Number of Frames Observed: 210

 Min. Time / Frame : 00:05:23 - 332,617.5 PPD
 Avg. Time / Frame : 00:05:30 - 322,090.5 PPD
 Cur. Time / Frame : 00:05:30 - 320,818.1 PPD
 R3F. Time / Frame : 00:05:29 - 322,219.7 PPD
 All  Time / Frame : 00:05:29 - 322,219.7 PPD
 Eff. Time / Frame : 00:05:51 - 293,578.7 PPD

Project ID: 8108
 Core: GRO_A5
 Credit: 7349
 Frames: 100


 Name: Core321
 Path: \\CORE32\fah\ (v6 of FAH)
 Number of Frames Observed: 300

 Min. Time / Frame : 00:07:13 - 300,990.1 PPD
 Avg. Time / Frame : 00:07:27 - 286,961.0 PPD


 Name: Musky Slot 00
 Path: 10.0.0.11-36330  (v7 of FAH)
 Number of Frames Observed: 200

 Min. Time / Frame : 00:11:19 - 153,277.9 PPD
 Avg. Time / Frame : 00:11:40 - 146,432.4 PPD


 Name: Musky Slot 02
 Path: 10.0.0.11-36330  (v7 of FAH)
 Number of Frames Observed: 2

 Min. Time / Frame : 00:12:10 - 137,499.1 PPD
 Avg. Time / Frame : 00:12:14 - 136,376.6 PPD


 Name: Musky1
 Path: \\SCOTTY\fah\  (v6 of FAH)
 Number of Frames Observed: 300

 Min. Time / Frame : 00:06:45 - 332,737.3 PPD
 Avg. Time / Frame : 00:07:16 - 297,888.9 PPD


 Name: Patriot Slot 00
 Path: 10.0.0.17-36330  (v7 of FAH)
 Number of Frames Observed: 100

 Min. Time / Frame : 00:11:25 - 151,268.4 PPD
 Avg. Time / Frame : 00:11:33 - 148,656.7 PPD


 Name: Patriot1
 Path: \\PATRIOT\fah\  (v6 of FAH)
 Number of Frames Observed: 300

 Min. Time / Frame : 00:06:47 - 330,287.7 PPD
 Avg. Time / Frame : 00:07:07 - 307,356.5 PPD


 Name: tear1
 Path: \\TEAR\fah\  (v6 of FAH) 
 Number of Frames Observed: 300

 Min. Time / Frame : 00:07:04 - 310,624.2 PPD
 Avg. Time / Frame : 00:07:12 - 302,035.8 PPD
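
To put a number on "not even close", here is a rough Python sketch that compares the average figures for two v6/v7 pairs copied from the logs above. It assumes the similarly named entries (Core321 and Core32 Slot 00, Musky1 and Musky Slot 00) are the same machines, and it also checks the PPD gap against TPF raised to the 1.5 power, which is what one would expect if the bonus grows with the square root of return speed. That exponent is an assumption, not something stated in this thread, and `tpf_seconds`, `compare` and `SAMPLES` are illustrative names.

```python
def tpf_seconds(hms):
    """Convert an HFM 'hh:mm:ss' time-per-frame string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (project, v6 avg TPF, v6 avg PPD, v7 avg TPF, v7 avg PPD), from the logs above.
SAMPLES = [
    (8106, "00:05:36", 313501.8, "00:09:09", 150103.4),  # Core321 vs Core32 Slot 00
    (8108, "00:07:16", 297888.9, "00:11:40", 146432.4),  # Musky1 vs Musky Slot 00
]


def compare(samples):
    """Return (project, TPF ratio, PPD ratio, TPF ratio ** 1.5) per sample."""
    rows = []
    for project, v6_tpf, v6_ppd, v7_tpf, v7_ppd in samples:
        tpf_ratio = tpf_seconds(v7_tpf) / tpf_seconds(v6_tpf)
        ppd_ratio = v6_ppd / v7_ppd
        rows.append((project, tpf_ratio, ppd_ratio, tpf_ratio ** 1.5))
    return rows


for project, tpf_ratio, ppd_ratio, predicted in compare(SAMPLES):
    print(
        f"Project {project}: v7 frames take {tpf_ratio:.2f}x as long, "
        f"v6 earns {ppd_ratio:.2f}x the PPD (TPF ratio ^ 1.5 = {predicted:.2f})"
    )
```

On these numbers the v6 entries earn roughly twice the PPD of their v7 counterparts, even though the frame times differ by only about 1.6x.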

Re: Re: New AS testing

Posted: Mon Oct 12, 2015 2:08 am
by Grandpa_01
bruce wrote:
Nathan_P wrote:...the problem is a lack of SMP work for anything over 24 cores - that is what needs fixing...
Large proteins don't run well on machines with a few CPUs and small proteins simply cannot run on machines with lots of cores. Nobody disputes that ">24" needs fixing, but that's an issue for the Pande Group, not the support forum. We can't do anything about the corruption that occurred or about the amount of time involved in fixing it. We can only suggest ways that YOU might get around the problem until it's fixed. If you don't choose to accept any of those suggestions, that's on you.

... or, you could be helpful and come up with some other suggestions of things that are within your power to fix. This is a community support forum, and you're knowledgeable enough to offer constructive support, too.
bruce, there is no cure at this time. If you run 12 cores on a 64-core box, you still get assigned to the same server the large proteins are on, so there is still a shortage of work. And that does not even address the points deficit that comes with running them at a slower TPF: on the large boxes using v7, 2 SMP WUs = more PPD than 1, but running 3 SMP WUs = around 30% less and 4 = around 1/2. The only viable option at this time is v6 with the bigadv flag.

Re: Re: New AS testing

Posted: Mon Oct 12, 2015 4:17 am
by bruce
Grandpa_01 wrote:bruce, there is no cure at this time. If you run 12 cores on a 64-core box, you still get assigned to the same server the large proteins are on, so there is still a shortage of work. And that does not even address the points deficit that comes with running them at a slower TPF: on the large boxes using v7, 2 SMP WUs = more PPD than 1, but running 3 SMP WUs = around 30% less and 4 = around 1/2. The only viable option at this time is v6 with the bigadv flag.
No doubt, but as I just said, it's not a community support issue. Only PG can do anything about it.

Once again, discussing it here will not get you anywhere because the PG members almost never read topics in this forum. They have moved their support to reddit.

I can't manufacture new projects or fix broken ones and neither can anybody else on this forum.