Thank you for your great work!
I use the client on Windows, and I don’t quite understand the logic it uses to connect to the servers.
I have tried other distributed computing projects before and never had similar problems.
What specifically is unclear to me or raises doubts:
Say a task runs from 04:12:14 to 05:35:52 (a little over an hour). Getting a new task sometimes takes far longer than that.
On top of that, the client doubles the waiting time after every error. Both failed connections and "No WUs available for this configuration" count as errors.
That is, my computer spends more time idle than working. The client says that the next attempt to connect to the server can be made only after 2 hours.
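To show what I mean by doubling, here is a small sketch of how I understand the behaviour. It is not the client's actual code: the function name `doubling_delays` and the 60-second starting delay are my own assumptions, chosen only to illustrate how quickly the wait grows.

```python
def doubling_delays(initial_s, attempts):
    """Wait (in seconds) before each retry if the delay doubles after every failure.

    initial_s is an assumed starting delay, not a value taken from the client.
    """
    return [initial_s * 2 ** i for i in range(attempts)]


delays = doubling_delays(60, 8)
print(delays)       # [60, 120, 240, 480, 960, 1920, 3840, 7680]
print(sum(delays))  # 15300 seconds, i.e. 4 h 15 min of waiting in total
```

With these assumed numbers, the eighth failed attempt alone already means a wait of more than two hours.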
Here are the shortened logs:
04:12:11:WU01:FS00:0xa7:Project: 14328 (Run 8, Clone 4136, Gen 7)
04:12:11:WU01:FS00:0xa7:Unit: 0x0000000a9bf7a4d65e6d0c0f8ecbac28
04:12:11:WU01:FS00:0xa7:Reading tar file core.xml
04:12:11:WU01:FS00:0xa7:Reading tar file frame7.tpr
04:12:11:WU01:FS00:0xa7:Digital signatures verified
04:12:11:WU01:FS00:0xa7:Calling: mdrun -s frame7.tpr -o frame7.trr -cpt 15 -nt 6
04:12:12:WU01:FS00:0xa7:Steps: first=1750000 total=250000
04:12:14:WU01:FS00:0xa7:Completed 1 out of 250000 steps (0%)
04:13:04:WU01:FS00:0xa7:Completed 2500 out of 250000 steps (1%)
04:13:54:WU01:FS00:0xa7:Completed 5000 out of 250000 steps (2%)
04:14:45:WU01:FS00:0xa7:Completed 7500 out of 250000 steps (3%)
04:15:35:WU01:FS00:0xa7:Completed 10000 out of 250000 steps (4%)
04:16:24:WU01:FS00:0xa7:Completed 12500 out of 250000 steps (5%)
04:17:13:WU01:FS00:0xa7:Completed 15000 out of 250000 steps (6%)
04:18:03:WU01:FS00:0xa7:Completed 17500 out of 250000 steps (7%)
04:18:52:WU00:FS01:0x22:Completed 300000 out of 1000000 steps (30%)
04:18:53:WU01:FS00:0xa7:Completed 20000 out of 250000 steps (8%)
04:19:43:WU01:FS00:0xa7:Completed 22500 out of 250000 steps (9%)
04:20:32:WU01:FS00:0xa7:Completed 25000 out of 250000 steps (10%)
04:21:23:WU01:FS00:0xa7:Completed 27500 out of 250000 steps (11%)
04:22:13:WU01:FS00:0xa7:Completed 30000 out of 250000 steps (12%)
04:23:02:WU01:FS00:0xa7:Completed 32500 out of 250000 steps (13%)
04:23:52:WU01:FS00:0xa7:Completed 35000 out of 250000 steps (14%)
//.....SKIP.....
05:34:59:WU01:FS00:0xa7:Completed 247500 out of 250000 steps (99%)
05:34:59:WU02:FS00:Connecting to 65.254.110.245:8080
05:35:00:WARNING:WU02:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
05:35:00:WU02:FS00:Connecting to 18.218.241.186:80
05:35:00:WARNING:WU02:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
05:35:00:ERROR:WU02:FS00:Exception: Could not get an assignment
05:35:00:WU02:FS00:Connecting to 65.254.110.245:8080
05:35:01:WARNING:WU02:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
05:35:01:WU02:FS00:Connecting to 18.218.241.186:80
05:35:01:WARNING:WU02:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
05:35:01:ERROR:WU02:FS00:Exception: Could not get an assignment
05:35:48:WU01:FS00:0xa7:Completed 250000 out of 250000 steps (100%)
05:35:51:WU01:FS00:0xa7:Saving result file ..\logfile_01.txt
05:35:51:WU01:FS00:0xa7:Saving result file frame7.trr
05:35:51:WU01:FS00:0xa7:Saving result file md.log
05:35:52:WU01:FS00:0xa7:Saving result file science.log
05:35:52:WU01:FS00:0xa7:Saving result file traj_comp.xtc
05:35:52:WU01:FS00:0xa7:Folding@home Core Shutdown: FINISHED_UNIT
05:35:52:WU01:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
//...SKIP...
05:36:34:WU01:FS00:Upload 74.47%
05:36:40:WU01:FS00:Upload 84.57%
05:36:46:WU01:FS00:Upload 94.67%
05:36:50:WU01:FS00:Upload complete
05:36:50:WU01:FS00:Server responded WORK_ACK (400)
//...SKIP...
08:52:25:WARNING:WU02:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
08:52:25:WU02:FS00:Connecting to 18.218.241.186:80
08:52:26:WARNING:WU02:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
08:52:26:ERROR:WU02:FS00:Exception: Could not get an assignment
08:57:05:WU00:FS01:0x22:Completed 660000 out of 1000000 steps (66%)
09:04:47:WU00:FS01:0x22:Completed 670000 out of 1000000 steps (67%)
09:12:32:WU00:FS01:0x22:Completed 680000 out of 1000000 steps (68%)
09:20:25:WU00:FS01:0x22:Completed 690000 out of 1000000 steps (69%)
It’s now 12:00, and the client says it will make its next attempt to get a task in 59 mins 02 secs.
And that attempt may well fail too (!)
My suggestions:
1. Let the user reset the reconnect timeout. Or, instead of doubling the intervals, develop some other, more moderate algorithm;
2. Adjust the retry delay according to the type of error;
3. Perhaps it is worth giving the user the option to change the server.
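A rough sketch of what suggestions 1 and 2 could look like together. Everything here is hypothetical: the class `RetryTimer`, the error names `no_work` and `connect_failed`, and all the delay values are my own illustrative assumptions, not anything taken from the client.

```python
# Hypothetical per-error settings (seconds); all values are illustrative assumptions.
BASE_DELAY_S = {"no_work": 60, "connect_failed": 30}
MAX_DELAY_S = {"no_work": 15 * 60, "connect_failed": 60 * 60}


class RetryTimer:
    """Doubling delay with a per-error cap and a manual reset."""

    def __init__(self):
        self.failures = 0

    def reset(self):
        # Suggestion 1: the user (or a successful assignment) clears the history.
        self.failures = 0

    def next_delay(self, error_kind):
        # Suggestion 2: the starting delay and the cap depend on the error type.
        base = BASE_DELAY_S[error_kind]
        cap = MAX_DELAY_S[error_kind]
        delay = min(cap, base * 2 ** self.failures)
        self.failures += 1
        return delay


timer = RetryTimer()
print([timer.next_delay("no_work") for _ in range(6)])
# [60, 120, 240, 480, 900, 900]
timer.reset()
print(timer.next_delay("connect_failed"))  # 30
```

With a cap like this, "No WUs available" would never keep the slot idle for more than 15 minutes between attempts, while real connection failures could still back off further.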
Thanks again for your great work!