Big loss of efficiency.

If you're new to FAH and need help getting started or you have very basic questions, start here.

Moderators: Site Moderators, FAHC Science Team

Gennady
Posts: 13
Joined: Tue Mar 24, 2020 9:20 am

Big loss of efficiency.

Post by Gennady »

Hello there,
Thank you for your great work!

I use the client under Windows. I don't quite understand the logic of how the client connects to the servers.
I have tried other distributed computing projects before and never ran into similar problems.
What specifically is unclear to me / raises doubts:
Say a task runs from 04:12:14 to 05:35:52 (a little over an hour). Getting a new task sometimes takes far longer than that.
On top of that, the client doubles the waiting time after every error, and both failed connections and "No WUs available for this configuration" count as errors.
In other words, my computer is idle more than it is doing work. The client says the next attempt to connect to the server can only be made after 2 hours.

Here are the shortened logs:

Code:

04:12:11:WU01:FS00:0xa7:Project: 14328 (Run 8, Clone 4136, Gen 7)
04:12:11:WU01:FS00:0xa7:Unit: 0x0000000a9bf7a4d65e6d0c0f8ecbac28
04:12:11:WU01:FS00:0xa7:Reading tar file core.xml
04:12:11:WU01:FS00:0xa7:Reading tar file frame7.tpr
04:12:11:WU01:FS00:0xa7:Digital signatures verified
04:12:11:WU01:FS00:0xa7:Calling: mdrun -s frame7.tpr -o frame7.trr -cpt 15 -nt 6
04:12:12:WU01:FS00:0xa7:Steps: first=1750000 total=250000
04:12:14:WU01:FS00:0xa7:Completed 1 out of 250000 steps (0%)
04:13:04:WU01:FS00:0xa7:Completed 2500 out of 250000 steps (1%)
04:13:54:WU01:FS00:0xa7:Completed 5000 out of 250000 steps (2%)
04:14:45:WU01:FS00:0xa7:Completed 7500 out of 250000 steps (3%)
04:15:35:WU01:FS00:0xa7:Completed 10000 out of 250000 steps (4%)
04:16:24:WU01:FS00:0xa7:Completed 12500 out of 250000 steps (5%)
04:17:13:WU01:FS00:0xa7:Completed 15000 out of 250000 steps (6%)
04:18:03:WU01:FS00:0xa7:Completed 17500 out of 250000 steps (7%)
04:18:52:WU00:FS01:0x22:Completed 300000 out of 1000000 steps (30%)
04:18:53:WU01:FS00:0xa7:Completed 20000 out of 250000 steps (8%)
04:19:43:WU01:FS00:0xa7:Completed 22500 out of 250000 steps (9%)
04:20:32:WU01:FS00:0xa7:Completed 25000 out of 250000 steps (10%)
04:21:23:WU01:FS00:0xa7:Completed 27500 out of 250000 steps (11%)
04:22:13:WU01:FS00:0xa7:Completed 30000 out of 250000 steps (12%)
04:23:02:WU01:FS00:0xa7:Completed 32500 out of 250000 steps (13%)
04:23:52:WU01:FS00:0xa7:Completed 35000 out of 250000 steps (14%)
//.....SKIP.....
05:34:59:WU01:FS00:0xa7:Completed 247500 out of 250000 steps (99%)
05:34:59:WU02:FS00:Connecting to 65.254.110.245:8080
05:35:00:WARNING:WU02:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
05:35:00:WU02:FS00:Connecting to 18.218.241.186:80
05:35:00:WARNING:WU02:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
05:35:00:ERROR:WU02:FS00:Exception: Could not get an assignment
05:35:00:WU02:FS00:Connecting to 65.254.110.245:8080
05:35:01:WARNING:WU02:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
05:35:01:WU02:FS00:Connecting to 18.218.241.186:80
05:35:01:WARNING:WU02:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
05:35:01:ERROR:WU02:FS00:Exception: Could not get an assignment
05:35:48:WU01:FS00:0xa7:Completed 250000 out of 250000 steps (100%)
05:35:51:WU01:FS00:0xa7:Saving result file ..\logfile_01.txt
05:35:51:WU01:FS00:0xa7:Saving result file frame7.trr
05:35:51:WU01:FS00:0xa7:Saving result file md.log
05:35:52:WU01:FS00:0xa7:Saving result file science.log
05:35:52:WU01:FS00:0xa7:Saving result file traj_comp.xtc
05:35:52:WU01:FS00:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
05:35:52:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
//...SKIP...
05:36:34:WU01:FS00:Upload 74.47%
05:36:40:WU01:FS00:Upload 84.57%
05:36:46:WU01:FS00:Upload 94.67%
05:36:50:WU01:FS00:Upload complete
05:36:50:WU01:FS00:Server responded WORK_ACK (400)
//...SKIP...
08:52:25:WARNING:WU02:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
08:52:25:WU02:FS00:Connecting to 18.218.241.186:80
08:52:26:WARNING:WU02:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
08:52:26:ERROR:WU02:FS00:Exception: Could not get an assignment
08:57:05:WU00:FS01:0x22:Completed 660000 out of 1000000 steps (66%)
09:04:47:WU00:FS01:0x22:Completed 670000 out of 1000000 steps (67%)
09:12:32:WU00:FS01:0x22:Completed 680000 out of 1000000 steps (68%)
09:20:25:WU00:FS01:0x22:Completed 690000 out of 1000000 steps (69%)
As you can see, the computer finished the unit during the night. It sent the results but still has not received a new task.
It's now 12:00 and the client says it will make the next attempt to get a task in 59 mins 02 secs.
And that attempt may well fail again (!)

My suggestions:
1. Let the client reset the reconnect timeout, or instead of doubling the interval, use some other averaging algorithm;
2. Adjust the connection timeout according to the type of error (see the sketch after this list);
3. Perhaps it is worth giving users the option to change servers.
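
To make suggestion 2 concrete, here is a minimal sketch in Python of what an error-type-aware retry delay might look like. This is purely my own illustration, not the actual FAHClient code; the error names and numbers are made up:

Code:

import random

# Hypothetical sketch of suggestion 2: choose the retry delay based on the
# kind of failure instead of blindly doubling the previous wait.
BASE_DELAY = {
    "NO_WUS_AVAILABLE": 300,   # server reachable but empty: retry fairly soon
    "CONNECTION_FAILED": 600,  # network or server problem: wait longer
}
MAX_DELAY = 3600               # never wait more than an hour

def next_delay(error_kind, attempt):
    """Seconds to wait before the next assignment request."""
    base = BASE_DELAY.get(error_kind, 600)
    delay = min(base * (2 ** attempt), MAX_DELAY)
    # a little random jitter so thousands of clients do not retry in lockstep
    return delay + random.uniform(0, 30)

# Example: third consecutive "No WUs available" reply
print(next_delay("NO_WUS_AVAILABLE", attempt=2))  # about 1200 s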

Thanks again for your great work!
jonault
Posts: 216
Joined: Fri Dec 14, 2007 9:53 pm

Re: Big loss of efficiency.

Post by jonault »

Running out of work units is a problem the project has never had before (that I can recall) so it's not surprising that the software is handling it in a less than ideal fashion. AIUI there is a new client in development; I suspect the lessons currently being learned will be applied.
Jesse_V
Site Moderator
Posts: 2850
Joined: Mon Jul 18, 2011 4:44 am
Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4
Location: Western Washington

Re: Big loss of efficiency.

Post by Jesse_V »

"I don’t quite understand the logic of connecting the client to the servers." The clients connect to the servers to fetch workunits and return completed work. This is functionally similar to BOINC or SETI@home except on a much more massive scale.

"the client each time an error doubles the waiting time" this is because the servers are currently flooded from the massive numbers of new users that joined F@h over the past month or so. The clients will back off a little more every time to avoid a collective DDoS of the server from the constant pings. Your computer should pick up work when it's available. More information is here: viewtopic.php?f=61&t=33193
F@h is now the top computing platform on the planet and nothing unites people like a dedicated fight against a common enemy. This virus affects all of us. Let's end it together.
Gennady
Posts: 13
Joined: Tue Mar 24, 2020 9:20 am

Re: Big loss of efficiency.

Post by Gennady »

Jesse_V wrote:"I don’t quite understand the logic of connecting the client to the servers." The clients connect to the servers to fetch workunits and return completed work. This is functionally similar to BOINC or SETI@home except on a much more massive scale.
I apologize for my English; perhaps I did not explain it clearly. What I do not understand is the client's decision to increase the timeout so much: it grows beyond the time it takes to compute a whole task, and some of the waits grew to more than 4 hours.
Jesse_V wrote: "the client each time an error doubles the waiting time" this is because the servers are currently flooded from the massive numbers of new users that joined F@h over the past month or so. The clients will back off a little more every time to avoid a collective DDoS of the server from the constant pings. Your computer should pick up work when it's available. More information is here: viewtopic.php?f=61&t=33193
Sorry, I do not know the general architecture of the program, but perhaps, for now, you could let participants store results locally without sending them to the server, or turn the client into something like a repository.
Jesse_V
Site Moderator
Posts: 2850
Joined: Mon Jul 18, 2011 4:44 am
Hardware configuration: OS: Windows 10, Kubuntu 19.04
CPU: i7-6700k
GPU: GTX 970, GTX 1080 TI
RAM: 24 GB DDR4
Location: Western Washington

Re: Big loss of efficiency.

Post by Jesse_V »

Gennady wrote:I apologize for my English; perhaps I did not explain it clearly. What I do not understand is the client's decision to increase the timeout so much: it grows beyond the time it takes to compute a whole task, and some of the waits grew to more than 4 hours.
The client sees the error message and assumes, correctly, that the server is overloaded. It then increases the timeout between polls to avoid putting more load on the server. There are hundreds of thousands of people in the network; if each machine kept polling every couple of seconds, the server would never be able to escape the load.
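
To illustrate the idea, here is a sketch in Python (not the real FAHClient logic; the function and numbers are placeholders): each failed poll doubles the wait up to a cap, which keeps the total request rate on an overloaded server bounded no matter how many clients are waiting.

Code:

import random
import time

def poll_for_assignment(try_get_work, initial_wait=60, max_wait=4 * 3600):
    """Keep asking for work, doubling the wait after each failure.

    Sketch only: try_get_work is a placeholder for whatever actually
    contacts the assignment server; it returns a work unit or None.
    """
    wait = initial_wait
    while True:
        wu = try_get_work()
        if wu is not None:
            return wu  # got an assignment, back to folding
        # back off: wait (capped) plus jitter, so an overloaded server is
        # not hit by synchronized retries from many clients at once
        time.sleep(min(wait, max_wait) + random.uniform(0, 0.1 * wait))
        wait *= 2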
Gennady wrote:Sorry, I do not know the general architecture of the program, but perhaps, for now, you could let participants store results locally without sending them to the server, or turn the client into something like a repository.
The results have to go back because there's a workunit that's next in line from yours. Each unit has identifiers like 1234 (1, 2, 3). The last number is called the Generation. If I complete 3, I have to send it back because the server has to build a workunit for Generation 4 that you might pick up. The first three numbers are parallelized, but the final generation number is sequential. That's why people can't hold onto the workunits for a while, because it would delay the project.
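
A toy sketch of that dependency, in Python (my own illustration, not project code): Runs and Clones are independent and can be folded in parallel, but within one trajectory the server cannot even build Generation N+1 until the result for Generation N has been returned, so a result sitting on a volunteer's disk stalls that whole trajectory.

Code:

# Toy model of the Run/Clone/Gen scheme described above (illustration only).
completed = {}  # (run, clone, gen) -> result returned by a volunteer

def can_build_next_gen(run, clone, gen):
    """Gen 0 needs nothing; Gen N needs the returned result of Gen N-1."""
    return gen == 0 or (run, clone, gen - 1) in completed

def return_result(run, clone, gen, result):
    completed[(run, clone, gen)] = result

return_result(8, 4136, 7, "frame7.trr")    # e.g. the WU from the log above
print(can_build_next_gen(8, 4136, 8))      # True: Gen 8 can now be built
print(can_build_next_gen(8, 4136, 9))      # False: still waiting on Gen 8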
F@h is now the top computing platform on the planet and nothing unites people like a dedicated fight against a common enemy. This virus affects all of us. Let's end it together.
Gennady
Posts: 13
Joined: Tue Mar 24, 2020 9:20 am

Re: Big loss of efficiency.

Post by Gennady »

Jesse_V wrote:
Gennady wrote:I apologize for my English; perhaps I did not explain it clearly. What I do not understand is the client's decision to increase the timeout so much: it grows beyond the time it takes to compute a whole task, and some of the waits grew to more than 4 hours.
The client sees the error message and assumes, correctly, that the server is overloaded. It then increases the timeout between polls to avoid putting more load on the server. There are hundreds of thousands of people in the network; if each machine kept polling every couple of seconds, the server would never be able to escape the load.
Yes, I understand that. I can more or less guess what problems you are facing right now, and I really want to help.
I'll try to make another suggestion, don't kill me :oops: : perhaps the server could report its load level in its error message, so that the client can put that server on a stop list, or could tell the client when to make the next attempt.
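
Roughly what I mean, as a sketch in Python (hypothetical, of course - I do not know how the real assignment servers actually reply; these fields are made up):

Code:

# Hypothetical: the "no work" reply carries a load figure and a retry hint.
server_reply = {
    "error": "No WUs available for this configuration",
    "load_percent": 97,     # made-up field: how busy the server is
    "retry_after_s": 1800,  # made-up field: when the server wants us back
}

def handle_no_work(reply, stop_list, server):
    """Use the server's own hint instead of blind doubling."""
    if reply.get("load_percent", 0) >= 95:
        stop_list.add(server)               # skip this server for a while
    return reply.get("retry_after_s", 600)  # fall back to 10 min if no hint

stop_list = set()
print(handle_no_work(server_reply, stop_list, "65.254.110.245:8080"))  # 1800
print(stop_list)  # {'65.254.110.245:8080'}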
Jesse_V wrote:
Gennady wrote:Sorry, I do not know the general architecture of the program, but perhaps, for now, you could let participants store results locally without sending them to the server, or turn the client into something like a repository.
The results have to go back because there's a workunit that's next in line from yours. Each unit has identifiers like 1234 (1, 2, 3). The last number is called the Generation. If I complete 3, I have to send it back because the server has to build a workunit for Generation 4 that you might pick up. The first three numbers are parallelized, but the final generation number is sequential. That's why people can't hold onto the workunits for a while, because it would delay the project.
But it may happen that a client never returns its result for some reason. If building the next generation depends so completely on the results of everyone who computed generation 3, that too could paralyze the project.
Perhaps it is wasteful, but some results could also be handed out for rechecking; that would make the foundation for later generations more reliable.
Neil-B
Posts: 1996
Joined: Sun Mar 22, 2020 5:52 pm
Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
Location: UK

Re: Big loss of efficiency.

Post by Neil-B »

There are "Timeout" and "Expiry" date/times attached to each WU when assigned - iirc if not WU not returned within the expiry it is reissued … rather than race machines to get the quickest result by issuing the same WU many times F@H (again iirc) issues just once to maximise work that can be done.

The problem is that, with the current rapidly growing number of folders, there is a lag in ramping up both the production of WUs and the infrastructure to assign and manage them … At the moment it seems "baffling/odd/even mad", but in the longer term, as the infrastructure and process catch up, much more will be achievable by not reissuing WUs multiple times (if at all possible) … and the increasing wait times make sense under normal circumstances and are not an issue … but right now they are very painful for folders trying to invest/help get WUs done - basically we need to bear with the team and let them catch up and get the process running at a new norm :)

caveat - just a folder, not part of the F@H team - any mistakes above are my own lack of understanding
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070

(Green/Bold = Active)