Hello,
I am currently spending some budget on folding with AWS EC2. Sometimes the instance gets killed, and I would like to back up the work directory so that I can resume the work unit when I start another EC2 instance.
I tried backing up the /var/lib/fahclient/work directory, but I get errors after syncing it back and starting FAH.
Any help would be appreciated. Thank you!
AWS EC2 - backup checkpoints directory
Re: AWS EC2 - backup checkpoints directory
Are you backing up the /var/lib/fahclient/cores directory and /var/lib/fahclient/GPUs.txt too?
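For illustration, a minimal sketch of archiving the three items mentioned here (work, cores and GPUs.txt) in one go. The function name `backup_fah` and the archive location are hypothetical, and it assumes a default Linux install under /var/lib/fahclient. It is probably safest to run it while the client is paused or stopped so the checkpoint files are not changing mid-copy.

```python
import tarfile
from pathlib import Path

# Items named in this thread, relative to the FAH data directory.
ITEMS = ("work", "cores", "GPUs.txt")

def backup_fah(data_dir, archive_path, items=ITEMS):
    """Archive the selected items from data_dir; return the names that were found."""
    data_dir = Path(data_dir)
    saved = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in items:
            src = data_dir / name
            if src.exists():
                # arcname keeps paths relative, so the archive can be
                # unpacked into any data directory later.
                tar.add(src, arcname=name)
                saved.append(name)
    return saved
```

A call such as `backup_fah("/var/lib/fahclient", "/mnt/backup/fah.tar.gz")` would then write one archive; the destination path is only an example.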
Online: GTX 1660 Super + occasional CPU folding in the cold.
Offline: Radeon HD 7770, GTX 1050 Ti 4G OC, RX580
Re: AWS EC2 - backup checkpoints directory
gunnarre wrote: Are you backing up the /var/lib/fahclient/cores directory and /var/lib/fahclient/GPUs.txt too?
I've backed up the entire /var/lib/fahclient in the end and it still doesn't work. This is the error I am getting:
9:03:01:Trying to access database...
09:03:01:Successfully acquired database lock
^[[93m09:03:01:WARNING:FS01:Guessing ambiguous GPU to OpenCL device mapping for 01: gpu:0:30 TU104GL [Tesla T4]. Consider upgrading your graphics driver or manually setting ``opencl-index`` in this slot's configuration.^[[0m
09:03:01:FS01:Initialized folding slot 01: gpu:0:30 TU104GL [Tesla T4]
09:03:01:WU00:FS01:Starting
09:03:01:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.20/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 4262 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
09:03:01:WU00:FS01:Started FahCore on PID 4274
09:03:01:WU00:FS01:Core PID:4278
09:03:01:WU00:FS01:FahCore 0x22 started
^[[93m09:03:01:WARNING:WU00:FS01:FahCore returned: FAILED_3 (255 = 0xff)^[[0m
09:03:01:WU00:FS01:Starting
09:03:01:WU00:FS01:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/cores.foldingathome.org/lin/64bit/22-0.0.20/Core_22.fah/FahCore_22 -dir 00 -suffix 01 -version 706 -lifeline 4262 -checkpoint 15 -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu-vendor nvidia -gpu 0 -gpu-usage 100
09:03:01:WU00:FS01:Started FahCore on PID 4279
09:03:01:WU00:FS01:Core PID:4283
09:03:01:WU00:FS01:FahCore 0x22 started
^[[93m09:03:02:WARNING:WU00:FS01:FahCore returned: FAILED_3 (255 = 0xff)^[[0m
Re: AWS EC2 - backup checkpoints directory
The PCI ID of the GPU likely changes between instances, so you could try overwriting config.xml with a fresh one, letting the client re-discover the GPU and add it as a slot with the correct OpenCL IDs, provided this happens before the WU is dumped. If doing it that way dumps the WU, you might have to do it differently:
1: Start FAH with a config that has no GPU slot, but has gpu set to true (default) for auto-configuring the GPU, and pause-on-start set to true to avoid picking a new WU.
2: After the client has added the GPU slot successfully, stop fahclient.
3: Sync the work and cores folder from the backup, but do not over-write config.xml
4: Start Fahclient
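Steps 2 to 4 could be scripted roughly as below. This is only a sketch: the service name `FAHClient`, the backup location and the helper names `restore_plan` and `run_plan` are assumptions, not something confirmed in this thread. Only work and cores are synced, so config.xml on the new instance is left alone.

```python
import subprocess

def restore_plan(backup_dir, data_dir="/var/lib/fahclient", service="FAHClient"):
    """Return steps 2 to 4 as a list of argv lists (step 1 is the first start)."""
    backup = backup_dir.rstrip("/")
    data = data_dir.rstrip("/")
    return [
        # Step 2: stop the client once it has added the GPU slot.
        ["service", service, "stop"],
        # Step 3: restore work and cores only; config.xml is never touched.
        ["rsync", "-a", f"{backup}/work/", f"{data}/work/"],
        ["rsync", "-a", f"{backup}/cores/", f"{data}/cores/"],
        # Step 4: start the client again.
        ["service", service, "start"],
    ]

def run_plan(plan, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in plan:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Running `run_plan(restore_plan("/mnt/backup"))` first only prints the commands, which makes it easy to check the paths before running them for real with `dry_run=False`.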
Re: AWS EC2 - backup checkpoints directory
I don't know AWS, but maybe you can find some inspiration in these tutorials (GCP/Azure): https://github.com/gitHu6-newb/FoldingAtAltitude
It's easier to use the older client (7.6.13) than the latest one (7.6.21) to avoid automatic GPU detection and the new config scheme with pci-bus and pci-slot settings ...
Also, you might need some persistent storage associated with your AWS instance ...
edit : also, be careful with folder permissions after restoring it.
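On the permissions point, a restore done as root can leave the files owned by root, which the client may then be unable to write. A small sketch of resetting ownership after a restore follows; the helper name `chown_tree` is hypothetical, and the account name `fahclient` used in the note below is an assumption about what the package creates.

```python
import os
import shutil

def chown_tree(root, user, group=None):
    """Give root and everything under it to user (and group, if given).

    Returns the number of paths whose ownership was set.
    """
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        # os.walk visits every directory as dirpath, so handling dirpath
        # plus its files covers the whole tree.
        paths = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in paths:
            shutil.chown(path, user=user, group=group)
            changed += 1
    return changed
```

Something like `chown_tree("/var/lib/fahclient", "fahclient")` would be run as root after the restore and before starting the client.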
Re: AWS EC2 - backup checkpoints directory
One person who folds on cloud similar to this reports that turning off automatic GPU detection, by setting "gpu" to "false" in the config helps. So the opposite of what I suggested at first.
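As a sketch of what that change could look like when provisioning an instance automatically: the snippet below assumes the v7 config.xml layout, where options are child elements of `<config>` carrying a `v` attribute (for example `<gpu v='false'/>`). The helper name `set_option` is hypothetical.

```python
import xml.etree.ElementTree as ET

def set_option(config_xml, name, value):
    """Return config_xml with the top-level option <name v='value'/> set."""
    root = ET.fromstring(config_xml)
    elem = root.find(name)
    if elem is None:
        # Option not present yet: add it under the root element.
        elem = ET.SubElement(root, name)
    elem.set("v", value)
    return ET.tostring(root, encoding="unicode")
```

Calling `set_option(text, "gpu", "false")` on the contents of config.xml leaves the other options as they are and only adds or updates the gpu entry. With auto-detection off, the GPU slot itself presumably still has to be present in the config.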
Re: AWS EC2 - backup checkpoints directory
gunnarre wrote: One person who folds on cloud similar to this reports that turning off automatic GPU detection, by setting "gpu" to "false" in the config helps. So the opposite of what I suggested at first.
Thank you for the feedback. I am currently testing multiple scenarios and am still not able to make it work. I will report back here when I have something.
Appreciate the help though, thanks!
Re: AWS EC2 - backup checkpoints directory
I, too, recommend persistent storage. I never ran GCP or Azure without it, and it will likely be your easiest path to success. That's about all I can think of, good luck!