Re: GPU WU completes but doesn't

Posted: Fri Dec 11, 2015 1:20 pm
by toTOW
Everything looks normal in your log and files ...

Do you have something that could interfere with the core's disk access (an antivirus program or something like that)?

What is the state of FahCore_18.exe in Task Manager when it's stuck writing results? Does it use any CPU?

Re: GPU WU completes but doesn't

Posted: Fri Dec 11, 2015 2:11 pm
by lyndwyrm
I do have an anti-virus running (Avast) and have made exceptions for FAHClient.exe and FAHControl.exe. Should I add FahCore_18.exe and FahCore_a4.exe to the exceptions list (since I'm asking, I figure I should)?

As for the status of FahCore_18.exe in task manager, it's not using any CPU but still has about 233MB of RAM reserved.

Re: GPU WU completes but doesn't

Posted: Fri Dec 11, 2015 9:38 pm
by toTOW
You could try excluding the whole FAH data folder from antivirus scanning, but I'm not convinced that it will change anything ...

I'm thinking more and more that this is a bad WU ... I think you can safely dump it (pause the GPU slot, remove it and recreate it). The server already has something (which is invalid/incomplete) from your client:

Hi lyndwyrm (team 111065),
Your WU (P10472 R0 C69 G264) was added to the stats database on 2015-12-11 12:12:33 for 0 points of credit.

Re: GPU WU completes but doesn't

Posted: Fri Dec 11, 2015 10:32 pm
by lyndwyrm
Eventually, and this was after I had paused work and added both FahCore_18.exe and FahCore_a4.exe to the exceptions list, the WU you mentioned started and failed.

Code:

14:19:37:WU00:FS01:0x18:Project: 10472 (Run 0, Clone 69, Gen 264)
14:19:37:WU00:FS01:0x18:Unit: 0x00000156538b3dbb53beb4f52046b226
14:19:37:WU00:FS01:0x18:CPU: 0x00000000000000000000000000000000
14:19:37:WU00:FS01:0x18:Machine: 1
14:19:37:WU00:FS01:0x18:Reading tar file state.xml
14:19:37:WU00:FS01:0x18:Reading tar file system.xml
14:19:37:WU00:FS01:0x18:Reading tar file integrator.xml
14:19:37:WU00:FS01:0x18:Reading tar file core.xml
14:19:37:WU00:FS01:0x18:Digital signatures verified
14:19:37:WU00:FS01:0x18:Folding@home GPU core18
14:19:37:WU00:FS01:0x18:Version 0.0.4
14:19:53:WU00:FS01:0x18:Completed 0 out of 5000000 steps (0%)
14:19:53:WU00:FS01:0x18:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
14:22:48:WU00:FS01:0x18:Completed 50000 out of 5000000 steps (1%)
14:25:37:WU00:FS01:0x18:Completed 100000 out of 5000000 steps (2%)
14:37:03:WU01:FS01:Starting
14:37:03:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/joshua/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_18.fah/FahCore_18.exe -dir 01 -suffix 01 -version 704 -lifeline 6832 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
14:37:04:WU01:FS01:Started FahCore on PID 20688
14:37:04:WU01:FS01:Core PID:7116
14:37:04:WU01:FS01:FahCore 0x18 started
14:37:04:WARNING:WU00:FS01:FahCore returned an unknown error code which probably indicates that it crashed
14:37:04:WARNING:WU00:FS01:FahCore returned: UNKNOWN_ENUM (127 = 0x7f)

I didn't think stability would be an issue, but if the core crashed, it may be. If so, I'll put GPU folding on hold until I can investigate in depth. It also looks like WU01 started before WU00 failed, but I'm reading a closed log file now, and the two events are within one second of each other, so I won't read much into the ordering.
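
For anyone who wants to spot this kind of crash without reading the whole log, the return-code lines can be pulled out with a short script. This is only a minimal sketch in Python, assuming the line format seen in the log above. `find_core_returns` is a hypothetical helper name, not part of the FAH client, and the only code it treats as success is `FINISHED_UNIT`, the one reported for a completed unit in this thread.

```python
import re

# Matches lines such as:
#   14:37:04:WARNING:WU00:FS01:FahCore returned: UNKNOWN_ENUM (127 = 0x7f)
# The format is assumed from the log excerpts in this thread.
RETURN_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?:WARNING:)?"
    r"(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"FahCore returned: (?P<name>\w+) "
    r"\((?P<dec>\d+) = (?P<hex>0x[0-9a-fA-F]+)\)"
)


def find_core_returns(lines):
    """Return one dict per 'FahCore returned: NAME (dec = hex)' line."""
    results = []
    for line in lines:
        match = RETURN_RE.match(line.strip())
        if match is None:
            continue
        code = int(match.group("dec"))
        # Skip lines whose decimal and hex values disagree (garbled paste).
        if code != int(match.group("hex"), 16):
            continue
        results.append({
            "time": match.group("time"),
            "wu": match.group("wu"),
            "slot": match.group("slot"),
            "name": match.group("name"),
            "code": code,
            # Assumption: only FINISHED_UNIT counts as a clean exit here.
            "finished": match.group("name") == "FINISHED_UNIT",
        })
    return results


sample = [
    "14:37:04:WU01:FS01:FahCore 0x18 started",
    "14:37:04:WARNING:WU00:FS01:FahCore returned: UNKNOWN_ENUM (127 = 0x7f)",
]
for entry in find_core_returns(sample):
    status = "finished" if entry["finished"] else "did not finish cleanly"
    print(entry["time"], entry["wu"], entry["name"], entry["code"], status)
```

Feeding it the lines of a saved log file (for example `open(path).read().splitlines()`) should list every unit that ended with something other than `FINISHED_UNIT`.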

Code:

14:37:04:WU01:FS01:0x18:Project: 10484 (Run 0, Clone 172, Gen 60)
~~~~
22:01:22:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:10484 run:0 clone:172 gen:60 core:0x18 unit:0x00000054538b3dbb54ac2e44162e31ed
~~~~
22:02:22:WU01:FS01:Upload 97.49%
22:02:23:WU01:FS01:Upload complete
22:02:23:WU01:FS01:Server responded GOT_ALREADY (434)
22:02:23:WARNING:WU01:FS01:Server did not like results, dumping
22:02:23:WU01:FS01:Cleaning up

So the server had already received that work unit (is that right?). On the bright side, the client is running now, so I'll wait and see how the current WU (P9146, R12, C4, G181) goes. Even better, I should be able to see whether it finishes tonight.
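
The server's reply to each upload can also be pulled out of a log with a few lines of Python. This is a rough sketch, assuming the `Server responded NAME (code)` format shown in the log above. `server_responses` is a hypothetical helper name, and it makes no claim about what response codes exist beyond the ones the log itself prints.

```python
import re

# Matches lines such as:
#   22:02:23:WU01:FS01:Server responded GOT_ALREADY (434)
# The format is assumed from the log excerpts in this thread.
RESPONSE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}):(?P<wu>WU\d+):(?P<slot>FS\d+):"
    r"Server responded (?P<name>\w+) \((?P<code>\d+)\)$"
)


def server_responses(lines):
    """Return (work unit, response name, numeric code) for each server reply."""
    replies = []
    for line in lines:
        match = RESPONSE_RE.match(line.strip())
        if match:
            replies.append(
                (match.group("wu"), match.group("name"), int(match.group("code")))
            )
    return replies


sample = [
    "22:02:23:WU01:FS01:Upload complete",
    "22:02:23:WU01:FS01:Server responded GOT_ALREADY (434)",
    "22:02:23:WARNING:WU01:FS01:Server did not like results, dumping",
]
for wu, name, code in server_responses(sample):
    print(wu, name, code)
```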

Re: GPU WU completes but doesn't

Posted: Sat Dec 12, 2015 1:55 am
by bruce
The work files for a CPU assignment and a GPU assignment are laid out differently. The files for a CPU WU are all at the same level. A GPU WU contains a subdirectory 01, and the files you're looking for are inside it.
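
A quick way to see which layout a given work unit uses is to list the files at both levels. This is a minimal sketch in Python. The function name `list_wu_files` and the example path in the comment are hypothetical, and only the `01` subdirectory name comes from the description above.

```python
from pathlib import Path


def list_wu_files(wu_dir):
    """Return (top_level, in_01): sorted file names found directly in
    wu_dir and inside its '01' subdirectory (empty list if there is none)."""
    wu_dir = Path(wu_dir)
    top_level = sorted(p.name for p in wu_dir.iterdir() if p.is_file())
    sub = wu_dir / "01"
    if sub.is_dir():
        in_01 = sorted(p.name for p in sub.iterdir() if p.is_file())
    else:
        in_01 = []
    return top_level, in_01


# Example call (hypothetical path; adjust it to your own data folder):
#   top_level, in_01 = list_wu_files(r"C:\path\to\FAHClient\work\01")
#   print("top level:", top_level)
#   print("inside 01:", in_01)
```

If `in_01` comes back empty for a GPU work unit that has been running for a while, that would suggest the core never got as far as writing its files.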

Re: GPU WU completes but doesn't

Posted: Sat Dec 12, 2015 2:08 am
by lyndwyrm
bruce wrote: The work files for a CPU assignment and a GPU assignment are laid out differently. The files for a CPU WU are all at the same level. A GPU WU contains a subdirectory 01, and the files you're looking for are inside it.

Since I never saw the files in the "01" folder, my guess is that either my anti-virus was preventing the program from functioning properly or that work unit had an issue.

But unless things go awry again, it looks like it is working now:

Code:

23:47:11:WU03:FS01:0x18:Saving result file logfile_01.txt
23:47:11:WU03:FS01:0x18:Saving result file checkpointState.xml
23:47:12:WU03:FS01:0x18:Saving result file checkpt.crc
23:47:12:WU03:FS01:0x18:Saving result file log.txt
23:47:12:WU03:FS01:0x18:Saving result file positions.xtc
23:47:13:WU03:FS01:0x18:Folding@home Core Shutdown: FINISHED_UNIT
23:47:13:WU03:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
23:47:13:WU03:FS01:Sending unit results: id:03 state:SEND error:NO_ERROR project:9146 run:12 clone:4 gen:181 core:0x18 unit:0x000000d00a3b1e6155a850752c2c2879
~~~~
23:48:01:WU03:FS01:Upload complete
23:48:01:WU03:FS01:Server responded WORK_ACK (400)
23:48:01:WU03:FS01:Final credit estimate, 22439.00 points
23:48:01:WU03:FS01:Cleaning up