I'm trying to do some backfilling on a GPU farm, i.e. starting some GPU load if work is available and exiting if no work units are available. I am using FAHClient on Ubuntu 18.04. config.xml looks like this:
The full command line looks like this:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=0 --smp=false --exit-when-done=true
The program hasn't received any work units and is sitting idle for hours, blocking the GPU on the farm. I would have expected that the --exit-when-done option would make FAHClient actually exit if no WUs are assigned.
20:52:32:WU00:FS00:Connecting to 18.218.241.186:80
20:52:33:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:52:33:ERROR:WU00:FS00:Exception: Could not get an assignment
20:59:23:WU00:FS00:Connecting to 65.254.110.245:8080
20:59:24:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
20:59:24:WU00:FS00:Connecting to 18.218.241.186:80
20:59:24:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
20:59:24:ERROR:WU00:FS00:Exception: Could not get an assignment
21:10:29:WU00:FS00:Connecting to 65.254.110.245:8080
21:10:29:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:10:29:WU00:FS00:Connecting to 18.218.241.186:80
21:10:30:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:10:30:ERROR:WU00:FS00:Exception: Could not get an assignment
21:28:25:WU00:FS00:Connecting to 65.254.110.245:8080
21:28:26:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:28:26:WU00:FS00:Connecting to 18.218.241.186:80
21:28:26:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:28:26:ERROR:WU00:FS00:Exception: Could not get an assignment
21:57:28:WU00:FS00:Connecting to 65.254.110.245:8080
21:57:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:57:28:WU00:FS00:Connecting to 18.218.241.186:80
21:57:29:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:57:29:ERROR:WU00:FS00:Exception: Could not get an assignment
22:44:26:WU00:FS00:Connecting to 65.254.110.245:8080
22:44:27:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
22:44:27:WU00:FS00:Connecting to 18.218.241.186:80
22:44:27:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
22:44:27:ERROR:WU00:FS00:Exception: Could not get an assignment
00:00:28:WU00:FS00:Connecting to 65.254.110.245:8080
00:00:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
00:00:28:WU00:FS00:Connecting to 18.218.241.186:80
00:00:29:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
00:00:29:ERROR:WU00:FS00:Exception: Could not get an assignment
02:03:27:WU00:FS00:Connecting to 65.254.110.245:8080
02:03:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
02:03:28:WU00:FS00:Connecting to 18.218.241.186:80
02:03:28:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
02:03:28:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2020-04-15 *******************************
05:22:28:WU00:FS00:Connecting to 65.254.110.245:8080
05:22:28:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
05:22:28:WU00:FS00:Connecting to 18.218.241.186:80
05:22:29:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
05:22:29:ERROR:WU00:FS00:Exception: Could not get an assignment
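The timestamps of the failed attempts in the log above show how the idle time builds up. The following is a minimal sketch that extracts the gaps between consecutive "Could not get an assignment" errors. The helper names `failure_times` and `gaps_in_minutes` are made up for illustration, and the midnight rollover handling assumes the log timestamps only ever move forward.

```python
from datetime import datetime, timedelta

# Marker text copied from the ERROR lines in the log above.
FAILURE_MARKER = "Could not get an assignment"


def failure_times(lines):
    """Return a datetime for each failed attempt, rolling over at midnight."""
    times = []
    offset = timedelta(0)
    for line in lines:
        if FAILURE_MARKER not in line:
            continue
        # Log lines start with HH:MM:SS; the date is not part of the line.
        stamp = datetime.strptime(line[:8], "%H:%M:%S") + offset
        if times and stamp < times[-1]:
            # The clock went backwards, so assume a new day has started.
            offset += timedelta(days=1)
            stamp += timedelta(days=1)
        times.append(stamp)
    return times


def gaps_in_minutes(times):
    """Return the gap between consecutive attempts, rounded to whole minutes."""
    return [round((b - a).total_seconds() / 60) for a, b in zip(times, times[1:])]


# The nine ERROR lines from the log above.
log = [
    "20:52:33:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "20:59:24:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "21:10:30:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "21:28:26:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "21:57:29:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "22:44:27:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "00:00:29:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "02:03:28:ERROR:WU00:FS00:Exception: Could not get an assignment",
    "05:22:29:ERROR:WU00:FS00:Exception: Could not get an assignment",
]

print(gaps_in_minutes(failure_times(log)))
# prints [7, 11, 18, 29, 47, 76, 123, 199]
```

In this log each wait is roughly 1.6 times the previous one, so after eight failed attempts the client is already waiting more than three hours between tries. This is only what the log shows; the thread does not say whether the growth is capped or whether the client ever gives up.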
It seems that you haven't been assigned a WU since you started the client, so it hasn't exited because it never finished a WU. There's a known issue where the demand for GPU WUs is significantly higher than the supply. There's work in the pipeline to resolve this issue.
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
21:35:20:WU01:FS00:Upload 70.73%
21:35:26:WU01:FS00:Upload 81.37%
21:35:30:WU00:FS00:Connecting to 65.254.110.245:8080
21:35:30:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:35:30:WU00:FS00:Connecting to 18.218.241.186:80
21:35:31:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:35:31:ERROR:WU00:FS00:Exception: Could not get an assignment
21:35:32:WU01:FS00:Upload 95.01%
21:35:35:WU01:FS00:Upload complete
21:35:35:WU01:FS00:Server responded WORK_ACK (400)
21:35:35:WU01:FS00:Final credit estimate, 156113.00 points
21:35:35:WU01:FS00:Cleaning up
21:38:07:WU00:FS00:Connecting to 65.254.110.245:8080
21:38:08:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
21:38:08:WU00:FS00:Connecting to 18.218.241.186:80
21:38:08:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
21:38:08:ERROR:WU00:FS00:Exception: Could not get an assignment
02:53:17:WU00:FS00:Connecting to 65.254.110.245:8080
02:53:18:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
02:53:18:WU00:FS00:Connecting to 18.218.241.186:80
02:53:18:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
02:53:18:ERROR:WU00:FS00:Exception: Could not get an assignment
******************************* Date: 2020-04-15 *******************************
06:12:17:WU00:FS00:Connecting to 65.254.110.245:8080
06:12:18:WARNING:WU00:FS00:Failed to get assignment from '65.254.110.245:8080': No WUs available for this configuration
06:12:18:WU00:FS00:Connecting to 18.218.241.186:80
06:12:19:WARNING:WU00:FS00:Failed to get assignment from '18.218.241.186:80': No WUs available for this configuration
06:12:19:ERROR:WU00:FS00:Exception: Could not get an assignment
--finish
Finish all current work units, send the results, then exit.
I think --finish is for exiting an already running instance of Folding; if you start a new instance of Folding with --finish, it will never do anything. That's why I suspect that the --exit-when-done=true option doesn't work as intended. Maybe my idling slot is never paused?
max-units <integer=0>
Process at most this number of units, then pause.
This might be how you run it:
/usr/bin/FAHClient --user=ANALY_MANC_GPU --team=38188 --gpu=true --cuda-index=0 --smp=false --max-units=1 --exit-when-done=true
I'll give that a try. Any idea on how long F@H will try to get that one unit?
This command is working fine for me, the job takes between 1 and 2:15 hours processing one WU, that makes it perfect for backfilling.
Glad to hear that it works as per your expectations! You can always change the number from 1 to 2, or to whatever you think you can successfully fold within that time. Please note that the folding time for WUs varies from project to project, so you may need to keep an eye on it.
I'm not yet happy, there have been several cases where the program was idle for 4 hours without getting a WU. I would've preferred the program to exit in that case.
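One possible workaround, not confirmed anywhere in this thread, is to wrap the client in a small watchdog that stops it after a number of failed assignment attempts. The sketch below assumes FAHClient writes its log lines to standard output and that the "Could not get an assignment" text from the logs above is a reliable marker. The names `failed_attempts` and `run_with_watchdog` and the limit of 3 are made up for illustration.

```python
import subprocess

# Marker text copied from the ERROR lines in the logs above.
FAILURE_MARKER = "Could not get an assignment"


def failed_attempts(lines):
    """Yield a running count each time a log line reports a failed assignment."""
    count = 0
    for line in lines:
        if FAILURE_MARKER in line:
            count += 1
            yield count


def run_with_watchdog(cmd, max_failures=3):
    """Run cmd and terminate it once max_failures failed attempts were logged.

    Assumption: the client prints its log to stdout. Returns the exit code
    of the process (negative on POSIX if it was terminated by a signal).
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    for count in failed_attempts(proc.stdout):
        if count >= max_failures:
            proc.terminate()  # ask the client to shut down
            break
    return proc.wait()


# Example call (not executed here), using the command line from this thread:
# run_with_watchdog([
#     "/usr/bin/FAHClient", "--user=ANALY_MANC_GPU", "--team=38188",
#     "--gpu=true", "--cuda-index=0", "--smp=false",
#     "--max-units=1", "--exit-when-done=true",
# ])
```

This sketch counts every failed attempt, including any that happen after a WU has already been received, so the limit would need tuning before it could be trusted on a production farm.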