"Hung download" bug still present in v7.6.13
Moderators: Site Moderators, FAHC Science Team
FAH Team,
I've found that the "stuck download" bug is still present in v7.6.13, in which a stalled download will never reset, locking up the slot until FAHClient is forcibly restarted. I understand this is a known issue going back several releases.
Unfortunately, with the 9 CPU and 12 GPU slots I maintain, I run into this problem quite often. In Linux, pausing and unpausing the slot does not help, and the client becomes quite difficult to terminate during this - systemctl cannot do it, and it takes several attempts of 'killall FAHClient' before it yields. In Windows, right-clicking on the taskbar icon and selecting Quit causes the icon to disappear, but then one must go into Task Manager and forcibly close the process.
If there is still client development work planned, I'd appreciate it if a fix for this made it in. Notwithstanding my lack of white-box knowledge of the client, from the surface I think it would be fairly easy to do by hooking into whatever code in the client produces the "Download xx%" messages in the log: whenever the issue comes up, these messages stop being printed. Perhaps a watchdog could be implemented that closes the TCP connection and restarts the work unit acquisition process if more than a minute or so elapses between calls to whatever prints these messages.
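A minimal sketch of the watchdog idea above, not FAHClient's actual code. It assumes the transfer runs over a plain TCP socket and uses a per-read socket timeout in place of a separate watchdog thread: if no data arrives for `stall_timeout_s` seconds, the connection is closed and the transfer restarts on a fresh one. The names `read_with_stall_timeout`, `fetch_with_retries`, and the `connect` callable are hypothetical.

```python
import socket


class StalledTransfer(Exception):
    """Raised when no data arrives within the stall timeout."""


def read_with_stall_timeout(sock, total_bytes, stall_timeout_s=60.0, chunk_size=65536):
    """Read total_bytes from sock, failing if any single read waits too long."""
    # The timeout applies to each recv() call, not to the whole transfer,
    # so a slow but steady download is never interrupted.
    sock.settimeout(stall_timeout_s)
    received = bytearray()
    while len(received) < total_bytes:
        try:
            chunk = sock.recv(min(chunk_size, total_bytes - len(received)))
        except socket.timeout:
            raise StalledTransfer(
                f"no data for {stall_timeout_s}s at {len(received)}/{total_bytes} bytes"
            ) from None
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        received.extend(chunk)
    return bytes(received)


def fetch_with_retries(connect, total_bytes, attempts=3, stall_timeout_s=60.0):
    """Restart the whole transfer on a fresh connection whenever it stalls.

    `connect` is a hypothetical callable that returns a connected socket.
    """
    last_error = None
    for _ in range(attempts):
        sock = connect()
        try:
            return read_with_stall_timeout(sock, total_bytes, stall_timeout_s)
        except (StalledTransfer, ConnectionError) as err:
            last_error = err
        finally:
            sock.close()  # always drop the old TCP connection before retrying
    raise last_error
```

The per-read timeout is chosen here because a hung `recv()` never returns on its own, so a check placed inside the same loop would not fire.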
Whatever happens, I will still be online to support your work. I realize you are dealing with a lot right now, and am impressed with the issues you've been able to tackle so far.
Cheers-
Sam (aka k2cc_amateur_radio_kc2lrc)
-
- Site Admin
- Posts: 7927
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
- Location: W. MA
Re: "Hung download" bug still present in v7.6.13
There is already a sort of "watchdog" implemented; most of the time the client does detect a stalled or hung download within 15-30 minutes. It then retries the download or upload, as the bug does show up for both at times.
It is a long-standing issue. What I can say is that the code does a much better job of detecting and retrying a stalled connection than it did prior to version 7.5.
As for client development work, I don't know what the current plans are. I do know that some volunteer developers are working on moving the FAHControl portion to Python 3. As to the other components, I have no idea for the short term. Long term, there were plans in place before COVID-19 for a major rewrite, but that was put on hold because of COVID-19.
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Re: "Hung download" bug still present in v7.6.13
Interesting! I do think it has improved since v7.4.4, but I got the issue 5 times on one system last week, and twice on another. Perhaps it's a factor of the first system having a faster GPU and turning over more work units, but I'm not sure. I'll keep an eye on it, and post logs if it happens again.
Cheers -
Sam
-
- Site Moderator
- Posts: 6986
- Joined: Wed Dec 23, 2009 9:33 am
- Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB
Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
- Location: Land Of The Long White Cloud
Re: "Hung download" bug still present in v7.6.13
Welcome to the F@H Forum kc2lrc,
Generally speaking, a fast system will likely encounter the issue more often than a slower system: it makes more network connections, so there is a higher probability of hitting a network-related issue.
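To put rough numbers on that reasoning, here is a small sketch. It assumes every transfer hangs independently with the same probability, which is a simplification, and the per-transfer probability and work unit counts below are made-up illustration values, not measurements.

```python
def chance_of_at_least_one_hang(p_per_transfer, transfers):
    """Probability that at least one of `transfers` independent transfers hangs."""
    return 1.0 - (1.0 - p_per_transfer) ** transfers


# Hypothetical figures: a 1% hang chance per transfer, two transfers
# (one download, one upload) per work unit, over one week.
p = 0.01
slow_slot = chance_of_at_least_one_hang(p, transfers=2 * 2 * 7)  # 2 WUs per day
fast_slot = chance_of_at_least_one_hang(p, transfers=6 * 2 * 7)  # 6 WUs per day
print(f"slow slot: {slow_slot:.0%}, fast slot: {fast_slot:.0%}")
```

With the same per-transfer odds, the slot that turns over more work units sees a hang sooner simply because it rolls the dice more often.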
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
Re: "Hung download" bug still present in v7.6.13
Makes sense - the slot it usually affects is a GTX 1080, which turns over more work units per day than any of my other GPU slots. I've had 5 lockups on that one, and 2 on a GTX 970 slot, so figured I'd rattle the cages a bit. On the other hand, there have been no lockups in the past few days, so maybe I jinxed it in the right direction by posting this? No problem, if so!
I'll post here about how it's going if it becomes an issue again.
Cheers -
Sam
Re: "Hung download" bug still present in v7.6.13
Hello
I also came across this today. Just one CPU slot and one GPU slot; the GPU slot hung on download, stuck at 19.72%. That was over 2 hours ago, so any watchdog doesn't seem to be working. The concurrent upload of the completed WU took three tries, so perhaps the download is less able to recover from network glitches than the upload.
Code:
......
16:17:40:WU00:FS01:0x22:Completed 980000 out of 1000000 steps (98%)
16:21:03:WU00:FS01:0x22:Completed 990000 out of 1000000 steps (99%)
16:21:04:WU01:FS01:Connecting to assign1.foldingathome.org:80
16:21:05:WU01:FS01:Assigned to work server 140.163.4.241
16:21:05:WU01:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GP106 [GeForce GTX 1060 6GB] 4372 from 140.163.4.241
16:21:05:WU01:FS01:Connecting to 140.163.4.241:8080
16:21:27:WU01:FS01:Downloading 7.92MiB
16:21:35:WU01:FS01:Download 19.72%
16:24:24:WU00:FS01:0x22:Completed 1000000 out of 1000000 steps (100%)
16:24:31:WU00:FS01:0x22:Saving result file ..\logfile_01.txt
16:24:31:WU00:FS01:0x22:Saving result file checkpointState.xml
16:24:37:WU00:FS01:0x22:Saving result file checkpt.crc
16:24:37:WU00:FS01:0x22:Saving result file positions.xtc
16:24:40:WU00:FS01:0x22:Saving result file science.log
16:24:40:WU00:FS01:0x22:Folding@home Core Shutdown: FINISHED_UNIT
16:24:41:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
16:24:41:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11752 run:0 clone:5118 gen:39 core:0x22 unit:0x0000004d8ca304e75e6bbd9f646093eb
16:24:41:WU00:FS01:Uploading 24.34MiB to 140.163.4.231
16:24:41:WU00:FS01:Connecting to 140.163.4.231:8080
16:24:55:WU00:FS01:Upload 0.77%
16:24:55:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
16:24:55:WU00:FS01:Trying to send results to collection server
16:24:55:WU00:FS01:Uploading 24.34MiB to 52.224.109.74
16:24:55:WU00:FS01:Connecting to 52.224.109.74:8080
16:25:21:WU00:FS01:Upload 1.80%
16:25:29:WU00:FS01:Upload 3.34%
16:25:35:WU00:FS01:Upload 10.53%
16:25:41:WU00:FS01:Upload 13.61%
16:25:49:WU00:FS01:Upload 13.87%
16:26:45:WU00:FS01:Upload 19.00%
16:27:06:WU00:FS01:Upload 20.54%
16:27:06:ERROR:WU00:FS01:Exception: Transfer failed
16:27:06:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:11752 run:0 clone:5118 gen:39 core:0x22 unit:0x0000004d8ca304e75e6bbd9f646093eb
16:27:06:WU00:FS01:Uploading 24.34MiB to 140.163.4.231
16:27:06:WU00:FS01:Connecting to 140.163.4.231:8080
16:27:12:WU00:FS01:Upload 0.51%
16:27:33:WU00:FS01:Upload 0.77%
16:27:39:WU00:FS01:Upload 1.54%
16:27:47:WU00:FS01:Upload 3.08%
[ .. snip .. ]
16:29:37:WU00:FS01:Upload 92.45%
16:29:53:WU00:FS01:Upload complete
16:29:53:WU00:FS01:Server responded WORK_ACK (400)
16:29:53:WU00:FS01:Final credit estimate, 83862.00 points
16:29:53:WU00:FS01:Cleaning up
[ no more entries ]
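A log like the one above can be checked for this pattern automatically. The sketch below is a hypothetical helper, not part of FAHClient: it scans log lines and reports any work unit whose last download message was never followed by a completion line. It assumes the line layout seen in this thread (`HH:MM:SS:WUxx:FSxx:message`) and that a successful download logs `Download complete`; a download that failed and was abandoned would also be reported.

```python
import re

# HH:MM:SS, an optional WARNING/ERROR level, then the WU and slot ids.
LINE_RE = re.compile(
    r"^(?P<time>\d\d:\d\d:\d\d):(?:(?:WARNING|ERROR):)?"
    r"(?P<wu>WU\d+):(?P<fs>FS\d+):(?P<msg>.*)$"
)


def find_unfinished_downloads(log_lines):
    """Return {(wu, slot): (time, message)} for downloads with no completion line."""
    pending = {}
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        key = (match["wu"], match["fs"])
        msg = match["msg"]
        if msg.startswith("Downloading ") or re.match(r"Download \d", msg):
            # Remember the most recent progress report for this download.
            pending[key] = (match["time"], msg)
        elif msg.startswith("Download complete"):
            pending.pop(key, None)
    return pending
```

Run over the excerpt above, this would flag WU01 on FS01 with its last message at 16:21:35; comparing that timestamp with the current time shows how long the download has been silent.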
Re: "Hung download" bug still present in v7.6.13
Got a hang today: as of now (15:19Z), the watchdog has not reset this download, which last reported progress at 03:48:07. The system is Windows 10 64-bit with dual GTX 960s. I had to use Task Manager to kill FAHClient to resolve this.
The log excerpt and system info follow.
Cheers -
Sam
Code:
******************************* Date: 2020-05-25 *******************************
03:07:43:WU00:FS02:0x22:Completed 470000 out of 500000 steps (94%)
03:14:44:WU00:FS02:0x22:Completed 475000 out of 500000 steps (95%)
03:21:46:WU00:FS02:0x22:Completed 480000 out of 500000 steps (96%)
03:29:09:WU00:FS02:0x22:Completed 485000 out of 500000 steps (97%)
03:36:09:WU00:FS02:0x22:Completed 490000 out of 500000 steps (98%)
03:43:11:WU00:FS02:0x22:Completed 495000 out of 500000 steps (99%)
03:43:12:WU02:FS02:Connecting to assign1.foldingathome.org:80
03:43:12:WU02:FS02:Assigned to work server 155.247.166.220
03:43:12:WU02:FS02:Requesting new work unit for slot 02: RUNNING gpu:1:GM206 [GeForce GTX 960] 2308 from 155.247.166.220
03:43:12:WU02:FS02:Connecting to 155.247.166.220:8080
03:43:12:WU02:FS02:Downloading 5.12MiB
03:43:20:WU02:FS02:Download 4.89%
03:43:43:WU02:FS02:Download 6.11%
03:44:00:WU02:FS02:Download 7.33%
03:48:07:WU02:FS02:Download 10.99%
03:50:13:WU00:FS02:0x22:Completed 500000 out of 500000 steps (100%)
03:50:33:WU00:FS02:0x22:Saving result file ..\logfile_01.txt
03:50:33:WU00:FS02:0x22:Saving result file checkpointState.xml
03:50:40:WU00:FS02:0x22:Saving result file checkpt.crc
03:50:40:WU00:FS02:0x22:Saving result file positions.xtc
03:50:41:WU00:FS02:0x22:Saving result file science.log
03:50:41:WU00:FS02:0x22:Folding@home Core Shutdown: FINISHED_UNIT
03:50:42:WU00:FS02:FahCore returned: FINISHED_UNIT (100 = 0x64)
03:50:43:WU00:FS02:Sending unit results: id:00 state:SEND error:NO_ERROR project:14201 run:591 clone:1 gen:18 core:0x22 unit:0x00000018cedfaa925eb99bb986cc3f13
03:50:43:WU00:FS02:Uploading 40.91MiB to 206.223.170.146
03:50:43:WU00:FS02:Connecting to 206.223.170.146:8080
03:50:49:WU00:FS02:Upload 24.75%
03:50:55:WU00:FS02:Upload 50.57%
03:51:01:WU00:FS02:Upload 76.23%
03:51:07:WU00:FS02:Upload complete
03:51:07:WU00:FS02:Server responded WORK_ACK (400)
03:51:07:WU00:FS02:Final credit estimate, 96418.00 points
03:51:07:WU00:FS02:Cleaning up
Code:
*********************** Log Started 2020-05-15T22:36:37Z ***********************
22:36:37:Trying to access database...
22:36:37:Successfully acquired database lock
22:36:37:Read GPUs.txt
22:36:37:Enabled folding slot 00: PAUSED cpu:6 (by user)
22:36:37:Enabled folding slot 01: PAUSED gpu:0:GM206 [GeForce GTX 960] 2308 (by user)
22:36:37:Enabled folding slot 02: PAUSED gpu:1:GM206 [GeForce GTX 960] 2308 (by user)
22:36:37:****************************** FAHClient ******************************
22:36:37: Version: 7.6.13
22:36:37: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
22:36:37: Copyright: 2020 foldingathome.org
22:36:37: Homepage: https://foldingathome.org/
22:36:37: Date: Apr 27 2020
22:36:37: Time: 21:21:01
22:36:37: Revision: 5a652817f46116b6e135503af97f18e094414e3b
22:36:37: Branch: master
22:36:37: Compiler: Visual C++ 2008
22:36:37: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
22:36:37: Platform: win32 10
22:36:37: Bits: 32
22:36:37: Mode: Release
22:36:37: Config: G:\Installed Programs\FAHData\config.xml
22:36:37:******************************** CBang ********************************
22:36:37: Date: Apr 24 2020
22:36:37: Time: 17:07:55
22:36:37: Revision: ea081a3b3b0f4a37c4d0440b4f1bc184197c7797
22:36:37: Branch: master
22:36:37: Compiler: Visual C++ 2008
22:36:37: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
22:36:37: Platform: win32 10
22:36:37: Bits: 32
22:36:37: Mode: Release
22:36:37:******************************* System ********************************
22:36:37: CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
22:36:37: CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
22:36:37: CPUs: 8
22:36:37: Memory: 15.91GiB
22:36:37: Free Memory: 12.71GiB
22:36:37: Threads: WINDOWS_THREADS
22:36:37: OS Version: 6.2
22:36:37: Has Battery: false
22:36:37: On Battery: false
22:36:37: UTC Offset: -4
22:36:37: PID: 109492
22:36:37: CWD: G:\Installed Programs\FAHData
22:36:37: Win32 Service: false
22:36:37: OS: Windows 10 Enterprise
22:36:37: OS Arch: AMD64
22:36:37: GPUs: 2
22:36:37: GPU 0: Bus:2 Slot:0 Func:0 NVIDIA:5 GM206 [GeForce GTX 960] 2308
22:36:37: GPU 1: Bus:1 Slot:0 Func:0 NVIDIA:5 GM206 [GeForce GTX 960] 2308
22:36:37: CUDA Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:5.2 Driver:10.2
22:36:37: CUDA Device 1: Platform:0 Device:1 Bus:2 Slot:0 Compute:5.2 Driver:10.2
22:36:37:OpenCL Device 0: Platform:0 Device:0 Bus:1 Slot:0 Compute:1.2 Driver:441.87
22:36:37:OpenCL Device 1: Platform:0 Device:1 Bus:2 Slot:0 Compute:1.2 Driver:441.87
22:36:37:******************************* libFAH ********************************
22:36:37: Date: Apr 15 2020
22:36:37: Time: 14:53:14
22:36:37: Revision: 216968bc7025029c841ed6e36e81a03a316890d3
22:36:37: Branch: master
22:36:37: Compiler: Visual C++ 2008
22:36:37: Options: /TP /nologo /EHa /wd4297 /wd4103 /Ox /MT
22:36:37: Platform: win32 10
22:36:37: Bits: 32
22:36:37: Mode: Release
22:36:37:***********************************************************************
-
- Site Moderator
- Posts: 6986
- Joined: Wed Dec 23, 2009 9:33 am
- Hardware configuration: V7.6.21 -> Multi-purpose 24/7
Windows 10 64-bit
CPU:2/3/4/6 -> Intel i7-6700K
GPU:1 -> Nvidia GTX 1080 Ti
§
Retired:
2x Nvidia GTX 1070
Nvidia GTX 675M
Nvidia GTX 660 Ti
Nvidia GTX 650 SC
Nvidia GTX 260 896 MB SOC
Nvidia 9600GT 1 GB OC
Nvidia 9500M GS
Nvidia 8800GTS 320 MB
Intel Core i7-860
Intel Core i7-3840QM
Intel i3-3240
Intel Core 2 Duo E8200
Intel Core 2 Duo E6550
Intel Core 2 Duo T8300
Intel Pentium E5500
Intel Pentium E5400
- Location: Land Of The Long White Cloud
Re: "Hung download" bug still present in v7.6.13
Thanks for the reports. I have updated the issue with links here: https://github.com/FoldingAtHome/fah-issues/issues/983
ETA:
Now ↞ Very Soon ↔ Soon ↔ Soon-ish ↔ Not Soon ↠ End Of Time
Welcome To The F@H Support Forum Ӂ Troubleshooting Bad WUs Ӂ Troubleshooting Server Connectivity Issues
-
- Posts: 61
- Joined: Sun Mar 22, 2020 10:52 pm
- Hardware configuration: A mishmash of systems little and large, ranging from an Udoo X86 Ultra up to a new beast completed 2020-03-25 with a Ryzen 9 3950X CPU & RTX 2070 GPU. F@H seems to like the 2070!
- Location: Near Penrith, Cumbria, UK
Re: "Hung download" bug still present in v7.6.13
Just a +1 for this problem. It's happened a couple of times recently on different PCs, so I don't think it's a system problem at this end. The only solution I have found is to terminate the client in task manager. Just closing the client and restarting it doesn't work: although the client task bar icon goes away when the client is closed, the restarted client doesn't work (no connection from web control, nothing in advanced control). At that point I terminate the remnants of the client in task manager and then I can restart it OK.
Radio Amateur, light aircraft owner/pilot, computer nerd, mountaineer, organist (sort of) and, now, folder.
-
- Posts: 1996
- Joined: Sun Mar 22, 2020 5:52 pm
- Hardware configuration: 1: 2x Xeon E5-2697v3@2.60GHz, 512GB DDR4 LRDIMM, SSD Raid, Win10 Ent 20H2, Quadro K420 1GB, FAH 7.6.21
2: Xeon E3-1505Mv5@2.80GHz, 32GB DDR4, NVME, Win10 Pro 20H2, Quadro M1000M 2GB, FAH 7.6.21 (actually have two of these)
3: i7-960@3.20GHz, 12GB DDR3, SSD, Win10 Pro 20H2, GTX 750Ti 2GB, GTX 1080Ti 11GB, FAH 7.6.21
- Location: UK
Re: "Hung download" bug still present in v7.6.13
You might want to try downloading TCPView and killing the established connection. This resolved a similar issue for me without any need to stop, pause, or otherwise touch the client.
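To find the right connection before closing it in TCPView, the output of the Windows command `netstat -ano` can be filtered by the FAHClient process ID. The sketch below is only a hypothetical helper: it assumes English-language `netstat` output with five columns per TCP row, and the default remote port 8080 is taken from the work server connections in the logs earlier in this thread.

```python
def established_connections(netstat_output, pid, remote_port=8080):
    """Pick ESTABLISHED TCP rows owned by `pid` from `netstat -ano` text."""
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected row: TCP  local_addr:port  remote_addr:port  STATE  PID
        if len(fields) != 5 or fields[0] != "TCP":
            continue
        _, local, remote, state, owner = fields
        if (state == "ESTABLISHED" and owner == str(pid)
                and remote.endswith(f":{remote_port}")):
            rows.append((local, remote))
    return rows
```

The function only identifies candidate connections; actually closing one still needs a tool such as TCPView.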
2x Xeon E5-2697v3, 512GB DDR4 LRDIMM, SSD Raid, W10-Ent, Quadro K420
Xeon E3-1505Mv5, 32GB DDR4, NVME, W10-Pro, Quadro M1000M
i7-960, 12GB DDR3, SSD, W10-Pro, GTX1080Ti
i9-10850K, 64GB DDR4, NVME, W11-Pro, RTX3070
(Green/Bold = Active)
-
- Posts: 61
- Joined: Sun Mar 22, 2020 10:52 pm
- Hardware configuration: A mishmash of systems little and large, ranging from an Udoo X86 Ultra up to a new beast completed 2020-03-25 with a Ryzen 9 3950X CPU & RTX 2070 GPU. F@H seems to like the 2070!
- Location: Near Penrith, Cumbria, UK
Re: "Hung download" bug still present in v7.6.13
That's an interesting idea, Neil. Easy to do, so I will try to remember next time I get a download hang.
John
Radio Amateur, light aircraft owner/pilot, computer nerd, mountaineer, organist (sort of) and, now, folder.
-
- Posts: 4
- Joined: Sun Apr 19, 2020 2:52 pm
Re: "Hung download" bug still present in v7.6.13
It's a bit annoying to wake up and find out my 2080ti has been idle for 8 hours because of a download that got stuck again.
Happens too often, have to babysit the client...
Really hope this gets fixed.
Re: "Hung download" bug still present in v7.6.13
Same here today, still stuck after 4 hours. Really needs a timeout...
To avoid issues killing the client, just delete the slot and recreate it instead. The stuck download will stay in the queue, but the new slot works normally.
-
- Site Admin
- Posts: 7927
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
- Location: W. MA
Re: "Hung download" bug still present in v7.6.13
Kilrah wrote: Same here today, still stuck after 4 hours. Really needs a timeout...
To avoid issues killing the client just delete the slot and recreate it instead. The stuck download will stay in the queue but the new slot works normally.
I can only recommend that as a temporary fix; reboot or restart the FAHClient process fairly soon. Prior experience is that the stuck download or upload connected with this bug will cause other problems eventually.
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3