Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 1:53 am
by PantherX
The only other change in F@H around mid-May was that the FahCore location changed:
Old -> cores/www.stanford.edu/~pande/
New -> cores/web.stanford.edu/~pande/

If your AV application is using the old path, it may explain why the exception(s) won't work. You may have to update the exception(s) to the new path. I can't see any reason why it won't fold. As a test, can you run FAHBench (http://fahbench.com/) successfully on the current setup or not?
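
A quick way to see which of the two core locations actually exists on disk, as a minimal sketch. It assumes the two paths above sit directly under the FAH data directory; the function name `core_dirs` is made up for illustration and is not part of FAH.

```python
from pathlib import Path

OLD_HOST = "www.stanford.edu"  # location used before mid-May
NEW_HOST = "web.stanford.edu"  # location used since the change


def core_dirs(data_dir):
    """Report which of the old/new FahCore directories exist under data_dir."""
    cores = Path(data_dir) / "cores"
    return {
        host: (cores / host / "~pande").is_dir()
        for host in (OLD_HOST, NEW_HOST)
    }


# Example: core_dirs("X:/Folding At Home") returns a dict mapping each
# host name to True/False, depending on what is present on that machine.
```

If only the "web." entry comes back True, any AV exception still pointing at the "www." path no longer matches anything.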

Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 3:08 am
by bruce

Code:

16:23:32:WU00:FS01:Connecting to 171.67.108.201:80
16:23:34:WU00:FS01:Assigned to work server 140.163.4.231
16:23:34:WU00:FS01:Requesting new work unit for slot 01: READY gpu:0:GK110 [GeForce GTX 780] from 140.163.4.231
16:23:34:WU00:FS01:Connecting to 140.163.4.231:8080
16:23:35:WU00:FS01:Downloading 4.83MiB
16:23:40:WU00:FS01:Download complete
16:23:40:WU00:FS01:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:13001 run:434 clone:1 gen:23 core:0x17 unit:0x00000028538b3db75328cabc362813c8
A WU was successfully downloaded.

The anomaly is that it requests the download of a core. That should not be required unless the WU needs a later version of the core than the one you have on disk.

Code:

16:23:40:WU00:FS01:Downloading core from http://web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_17.fah
16:23:40:WU00:FS01:Connecting to web.stanford.edu:80
16:23:41:WU00:FS01:FahCore 17: Downloading 2.55MiB
16:23:47:WU00:FS01:FahCore 17: 36.76%
16:23:53:WU00:FS01:FahCore 17: 75.97%
16:23:56:WU00:FS01:FahCore 17: Download complete
16:23:56:WU00:FS01:Valid core signature
16:23:56:WARNING:WU00:FS01:FahCore has not changed since last download, aborting core update
In spite of needing a different version, the core failed to be updated. Perhaps the user that's running FAH does not have permission to overwrite the old version. I suggest you check the permissions and then manually delete the core so that the one being downloaded can be written to the directory where it needs to be. Since it's not being updated, I don't trust whatever is being started.

Code:

16:23:56:WU00:FS01:Starting
16:23:56:WU00:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" "X:/Folding At Home/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_17.fah/FahCore_17.exe" -dir 00 -suffix 01 -version 704 -lifeline 5640 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
16:23:56:WU00:FS01:Started FahCore on PID 6964
16:23:57:WU00:FS01:Core PID:4132
16:23:57:WU00:FS01:FahCore 0x17 started
Unfortunately you trimmed off the next few messages that look something like this:

Code:

05:24:12:WU02:FS01:0x17:Digital signatures verified
05:24:12:WU02:FS01:0x17:Folding@home GPU core17
05:24:12:WU02:FS01:0x17:Version 0.0.55
I think we've discovered a bug. Following the failure to make a required update to the core, the client should NOT start folding with the obsolete core.
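
The check described above can be automated roughly like this. It is only a sketch: it looks for the exact warning text and the "Version" line format seen in the excerpts in this post, and `check_core_update` is a hypothetical helper, not a FAHClient feature.

```python
import re

# Warning text exactly as it appears in the log excerpt above.
ABORT_MSG = "FahCore has not changed since last download, aborting core update"
# Matches e.g. "05:24:12:WU02:FS01:0x17:Version 0.0.55"
VERSION_RE = re.compile(r":0x17:Version (\d+\.\d+\.\d+)")


def check_core_update(log_lines):
    """Return (update_aborted, versions) for a sequence of log lines."""
    aborted = False
    versions = []
    for line in log_lines:
        if ABORT_MSG in line:
            aborted = True
        match = VERSION_RE.search(line)
        if match:
            versions.append(match.group(1))
    return aborted, versions


sample = [
    "16:23:56:WU00:FS01:Valid core signature",
    "16:23:56:WARNING:WU00:FS01:FahCore has not changed since last download, aborting core update",
    "05:24:12:WU02:FS01:0x17:Version 0.0.55",
]
print(check_core_update(sample))
```

An aborted update followed by a core that still starts is the combination that looks suspicious here.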

Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 8:31 am
by Eagle
PantherX wrote:The only other change in F@H around mid-May was that the FahCore location changed:
Old -> cores/www.stanford.edu/~pande/
New -> cores/web.stanford.edu/~pande/

If your AV application is using the old path, it may explain why the exception(s) won't work. You may have to update the exception(s) to the new path.
I noticed this when I created the exceptions. I removed the "www." one and created rules for its "web." counterpart.
PantherX wrote:I can't see any reason why it won't fold. As a test, can you run FAHBench (http://fahbench.com/) successfully on the current setup or not?
Ran it on the current setup with all FAH slots paused, no options set other than Explicit, Implicit and of course OpenCL - F@H and the result was:

Code:

Explicit Solvent: 36.8386 ns/day
Implicit Solvent: 175.744 ns/day
bruce wrote:(...)
The anomaly is that it requests the download of a core. That should not be required unless the WU needs a later version of the core than the one you have on disk.
Yes, that confused me as well.
bruce wrote:In spite of needing a different version, the core failed to be updated. Perhaps the user that's running FAH does not have permission to overwrite the old version. I suggest you check the permissions and then manually delete the core so that the one being downloaded can be written to the directory where it needs to be. Since it's not being updated, I don't trust whatever is being started.
I can rule that out on two counts: first, my user is in the admin group, and second, write permission to the path FAH runs from (i.e. the data directory) is granted even to the normal user group.
bruce wrote:(...)
Unfortunately you trimmed off the next few messages that look something like this:

Code:

05:24:12:WU02:FS01:0x17:Digital signatures verified
05:24:12:WU02:FS01:0x17:Folding@home GPU core17
05:24:12:WU02:FS01:0x17:Version 0.0.55
Sorry about that! I searched the log files again and found the corresponding file and lines, so here they are (I even left the CPU slot's line in):

Code:

16:23:57:WU00:FS01:Core PID:4132
16:23:57:WU00:FS01:FahCore 0x17 started
16:23:58:WU00:FS01:0x17:*********************** Log Started 2014-06-02T16:23:58Z ***********************
16:23:58:WU00:FS01:0x17:Project: 13001 (Run 434, Clone 1, Gen 23)
16:23:58:WU00:FS01:0x17:Unit: 0x00000028538b3db75328cabc362813c8
16:23:58:WU00:FS01:0x17:CPU: 0x00000000000000000000000000000000
16:23:58:WU00:FS01:0x17:Machine: 1
16:23:58:WU00:FS01:0x17:Reading tar file state.xml
16:23:59:WU00:FS01:0x17:Reading tar file system.xml
16:24:00:WU00:FS01:0x17:Reading tar file integrator.xml
16:24:00:WU00:FS01:0x17:Reading tar file core.xml
16:24:00:WU00:FS01:0x17:Digital signatures verified
16:24:00:WU00:FS01:0x17:Folding@home GPU core17
16:24:00:WU00:FS01:0x17:Version 0.0.52
16:26:38:WU02:FS00:0xa3:Completed 270000 out of 500000 steps  (54%)
16:26:45:Started thread 12 on PID 5640
16:27:53:WU00:FS01:0x17:Completed 0 out of 5000000 steps (0%)
16:27:53:WU00:FS01:0x17:Lost lifeline PID 6964, exiting
16:27:53:WU00:FS01:0x17:Lost lifeline PID 6964, exiting
16:27:53:WU00:FS01:0x17:ERROR:103: Lost client lifeline
16:27:53:WU00:FS01:0x17:Folding@home Core Shutdown: CLIENT_DIED
16:27:54:WARNING:WU00:FS01:FahCore returned an unknown error code which probably indicates that it crashed
16:27:54:WARNING:WU00:FS01:FahCore returned: CLIENT_DIED (103 = 0x67)
bruce wrote:I think we've discovered a bug. Following the failure to make a required update to the core, the client should NOT start folding with the obsolete core.
That makes sense to me. However, I guess this isn't something I can fix myself, right?
How do I get GPU folding working again, then?
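
For anyone digging through their own logs for the same failure, here is a rough sketch that pulls the "FahCore returned" lines apart. The line layout is taken only from the excerpt above; `parse_core_return` and `CoreReturn` are names made up for this example.

```python
import re
from typing import NamedTuple, Optional


class CoreReturn(NamedTuple):
    time: str  # log timestamp, HH:MM:SS
    wu: str    # work unit tag, e.g. "WU00"
    slot: str  # folding slot tag, e.g. "FS01"
    name: str  # symbolic result, e.g. "CLIENT_DIED"
    code: int  # numeric result, e.g. 103


RETURN_RE = re.compile(
    r"^(\d\d:\d\d:\d\d):(?:WARNING:)?(WU\d+):(FS\d+):"
    r"FahCore returned: (\w+) \((\d+) = 0x([0-9a-fA-F]+)\)"
)


def parse_core_return(line: str) -> Optional[CoreReturn]:
    """Parse one log line; return None if it is not a 'FahCore returned' line."""
    match = RETURN_RE.match(line)
    if match is None:
        return None
    time, wu, slot, name, dec, hexa = match.groups()
    # The log prints the code twice (decimal and hex); they should agree.
    if int(dec) != int(hexa, 16):
        raise ValueError(f"inconsistent return code in line: {line!r}")
    return CoreReturn(time, wu, slot, name, int(dec))


line = "16:27:54:WARNING:WU00:FS01:FahCore returned: CLIENT_DIED (103 = 0x67)"
print(parse_core_return(line))
```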

Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 1:08 pm
by 7im
Move (not delete) the fahcore file from the old www location, and let it try to download the replacement again.

Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 3:30 pm
by Eagle
I can't, since the whole "www.stanford.edu" folder was deleted and the recycle bin has already been emptied.

Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 4:41 pm
by 7im
And it still won't download a new fahcore?

Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 4:47 pm
by davidcoton
So it must be using a Core17 version from (and stored in) web.stanford.edu. Try finding that Core17 and renaming it to force a new download without a current core in place.

The only other thing I notice is that your working directory is non-standard. I wonder if that is related to the problem -- some part of the process is looking in the wrong place? Possibly relevant to the bug hunt. Or possibly just noise :(

David
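
The rename step could be scripted along these lines. This is a minimal sketch, assuming the client is paused or stopped first so nothing holds the files open; `set_aside_core` is a hypothetical helper, and the path in the usage comment is the one shown in the log earlier in this thread.

```python
import time
from pathlib import Path


def set_aside_core(core_dir):
    """Rename core_dir to '<name>.bak-<unix time>' and return the new path.

    Renaming rather than deleting keeps the old core available for comparison.
    """
    core_dir = Path(core_dir)
    if not core_dir.is_dir():
        raise FileNotFoundError(core_dir)
    backup = core_dir.with_name(f"{core_dir.name}.bak-{int(time.time())}")
    core_dir.rename(backup)
    return backup


# Usage (not executed here):
# set_aside_core("X:/Folding At Home/cores/web.stanford.edu/~pande/"
#                "Win32/AMD64/NVIDIA/Fermi/Core_17.fah")
```

With the directory out of the way, the client has no current core in place and has to fetch one on the next start of the slot.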

Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 6:22 pm
by Eagle
7im wrote:And it still won't download a new fahcore?
Nope. Even after deleting Core 17, it downloaded version 0.0.52 again..
davidcoton wrote:So it must be using a Core17 version from (and stored in) web.stanford.edu. Try finding that Core17 and renaming it to force a new download without a current core in place.
As written above, that doesn't help. It sticks to the 0.0.52 one. :(
davidcoton wrote:The only other thing I notice is that your working directory is non-standard. I wonder if that is related to the problem -- some part of the process is looking in the wrong place? Possibly relevant to the bug hunt. Or possibly just noise :(

David
That was set up right from the start over a year ago and I never found any log entry stating "Couldn't find [put a filename in here]"..

Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 6:28 pm
by codysluder
Maybe 0.0.52 is the correct core for your system but somehow the project that you're being assigned thinks it is the wrong one. Whether that's true or not, there's nothing you can fix; it's a Stanford problem.

Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 6:31 pm
by Eagle
Alright, I'll hope for a fix soon then. :)

Re: Core 17 has suddenly started crashing

Posted: Wed Jun 04, 2014 9:27 pm
by 7im
YGPM.

Re: Core 17 has suddenly started crashing

Posted: Thu Jun 05, 2014 12:35 am
by Eagle
See my reply.

Re: Core 17 has suddenly started crashing

Posted: Thu Jun 05, 2014 6:03 pm
by Eagle
Alright, I took some steps and can't make sense of the result. Here's exactly what I did:

- Finished CPU slot and afterwards uninstalled FAH via Control panel completely
- Downloaded FAH 7.4.4 x86 freshly from stanford.edu
- Removed (!) all previous exceptions for FAH within ESET
- Launched the installer, took the advanced option
- Chose "Install just for me", left both paths untouched
- Same goes for startup and screensaver option, i.e. at login time and no screensaver
- Entered name, team number and passkey within FAHControl
- Removed extra client option for opening web-control upon launch
- Applied/Saved all options
- Paused GPU slot
- Used Unlocker (http://www.emptyloop.com/unlocker/) to free any handle
- Removed the whole slot-folder within C:\Users\USER\AppData\Roaming\FAHClient\work
- Added "client-type" with value "beta" (both without quotes, of course) to the GPU's extra slot options
- Un-paused GPU slot

So, my config now looks like this:

Code:

11:23:13:<config>
11:23:13:  <!-- Network -->
11:23:13:  <proxy v=':8080'/>
11:23:13:
11:23:13:  <!-- Slot Control -->
11:23:13:  <power v='FULL'/>
11:23:13:
11:23:13:  <!-- User Information -->
11:23:13:  <passkey v='********************************'/>
11:23:13:  <team v='34361'/>
11:23:13:  <user v='Eagle3386'/>
11:23:13:
11:23:13:  <!-- Folding Slots -->
11:23:13:  <slot id='0' type='CPU'/>
11:23:13:  <slot id='1' type='GPU'>
11:23:13:    <client-type v='beta'/>
11:23:13:  </slot>
11:23:13:</config>
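
To double-check that the "client-type" option really landed on the GPU slot, the config dump can be read back with a few lines of Python. This is a sketch that assumes every dumped line carries an "HH:MM:SS:" prefix as above; `parse_config_dump` and `slot_summary` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Timestamp prefix that the log puts in front of every config line.
PREFIX = re.compile(r"^\d\d:\d\d:\d\d:")


def parse_config_dump(log_lines):
    """Strip the timestamp prefixes and parse the remaining XML."""
    xml_text = "\n".join(PREFIX.sub("", line) for line in log_lines)
    return ET.fromstring(xml_text)


def slot_summary(root):
    """Return (id, type, options) for each <slot> element."""
    slots = []
    for slot in root.findall("slot"):
        options = {child.tag: child.get("v") for child in slot}
        slots.append((slot.get("id"), slot.get("type"), options))
    return slots


dump = [
    "11:23:13:<config>",
    "11:23:13:  <!-- Folding Slots -->",
    "11:23:13:  <slot id='0' type='CPU'/>",
    "11:23:13:  <slot id='1' type='GPU'>",
    "11:23:13:    <client-type v='beta'/>",
    "11:23:13:  </slot>",
    "11:23:13:</config>",
]
print(slot_summary(parse_config_dump(dump)))
```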
And the GPU slot's log looks like this:

Code:

11:22:20:FS01:Unpaused
11:22:20:WU01:FS01:Starting
11:22:20:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/USER/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_15.fah/FahCore_15.exe -dir 01 -suffix 01 -version 704 -lifeline 748 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
11:22:20:WU01:FS01:Started FahCore on PID 3992
11:22:20:WU01:FS01:Core PID:7988
11:22:20:WU01:FS01:FahCore 0x15 started
11:22:21:WARNING:WU01:FS01:FahCore returned: MISSING_WORK_FILES (116 = 0x74)
11:22:21:WARNING:WU01:FS01:Fatal error, dumping
11:22:21:WU01:FS01:Sending unit results: id:01 state:SEND error:DUMPED project:7621 run:793 clone:0 gen:228 core:0x15 unit:0x000000fe664f2dd14e43092576c8e4ad
11:22:21:WARNING:WU01:FS01:Missing original Unit data, cannot send dump report
11:22:21:WU01:FS01:Cleaning up
11:22:21:WU01:FS01:Connecting to 171.67.108.201:80
11:22:22:WU01:FS01:Assigned to work server 171.64.65.93
11:22:22:WU01:FS01:Requesting new work unit for slot 01: READY gpu:0:GK110 [GeForce GTX 780] from 171.64.65.93
11:22:22:WU01:FS01:Connecting to 171.64.65.93:8080
11:22:23:WU01:FS01:Downloading 2.93MiB
11:22:28:WU01:FS01:Download complete
11:22:28:WU01:FS01:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:9102 run:2 clone:25 gen:47 core:0x17 unit:0x000000300a3b1e81537c06add5e8b634
11:22:28:WU01:FS01:Downloading core from http://web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/beta/Core_17.fah
11:22:28:WU01:FS01:Connecting to web.stanford.edu:80
11:22:31:WU01:FS01:FahCore 17: Downloading 2.57MiB
11:22:37:WU01:FS01:FahCore 17: 34.05%
11:22:43:WU01:FS01:FahCore 17: 75.40%
11:22:46:WU01:FS01:FahCore 17: Download complete
11:22:47:WU01:FS01:Valid core signature
11:22:47:WU01:FS01:Unpacked 8.68MiB to cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/beta/Core_17.fah/FahCore_17.exe
11:22:47:WU01:FS01:Starting
11:22:47:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/USER/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/beta/Core_17.fah/FahCore_17.exe -dir 01 -suffix 01 -version 704 -lifeline 748 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
11:22:47:WU01:FS01:Started FahCore on PID 6736
11:22:47:WU01:FS01:Core PID:7144
11:22:47:WU01:FS01:FahCore 0x17 started
11:22:48:WU01:FS01:0x17:*********************** Log Started 2014-06-05T11:22:47Z ***********************
11:22:48:WU01:FS01:0x17:Project: 9102 (Run 2, Clone 25, Gen 47)
11:22:48:WU01:FS01:0x17:Unit: 0x000000300a3b1e81537c06add5e8b634
11:22:48:WU01:FS01:0x17:CPU: 0x00000000000000000000000000000000
11:22:48:WU01:FS01:0x17:Machine: 1
11:22:48:WU01:FS01:0x17:Reading tar file state.xml
11:22:48:WU01:FS01:0x17:Reading tar file system.xml
11:22:49:WU01:FS01:0x17:Reading tar file integrator.xml
11:22:49:WU01:FS01:0x17:Reading tar file core.xml
11:22:49:WU01:FS01:0x17:Digital signatures verified
11:22:49:WU01:FS01:0x17:Folding@home GPU core17
11:22:49:WU01:FS01:0x17:Version 0.0.55
11:24:24:WU01:FS01:0x17:Completed 0 out of 2500000 steps (0%)
11:24:25:WU01:FS01:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
11:27:20:WU01:FS01:0x17:Completed 25000 out of 2500000 steps (1%)
11:29:56:WU01:FS01:0x17:Completed 50000 out of 2500000 steps (2%)
11:33:00:WU01:FS01:0x17:Completed 75000 out of 2500000 steps (3%)
11:35:42:WU01:FS01:0x17:Completed 100000 out of 2500000 steps (4%)
11:38:35:WU01:FS01:0x17:Completed 125000 out of 2500000 steps (5%)
11:41:15:WU01:FS01:0x17:Completed 150000 out of 2500000 steps (6%)
11:44:17:WU01:FS01:0x17:Completed 175000 out of 2500000 steps (7%)
11:46:57:WU01:FS01:0x17:Completed 200000 out of 2500000 steps (8%)
11:49:50:WU01:FS01:0x17:Completed 225000 out of 2500000 steps (9%)
11:52:35:WU01:FS01:0x17:Completed 250000 out of 2500000 steps (10%)

(...)

15:59:05:WU01:FS01:0x17:Completed 2500000 out of 2500000 steps (100%)
15:59:06:WU02:FS01:Connecting to 171.67.108.201:80
15:59:07:WU02:FS01:Assigned to work server 171.64.65.93
15:59:07:WU02:FS01:Requesting new work unit for slot 01: RUNNING gpu:0:GK110 [GeForce GTX 780] from 171.64.65.93
15:59:07:WU02:FS01:Connecting to 171.64.65.93:8080
15:59:08:WU02:FS01:Downloading 3.11MiB
15:59:14:WU02:FS01:Download 24.08%
15:59:16:WU01:FS01:0x17:Saving result file logfile_01.txt
15:59:16:WU01:FS01:0x17:Saving result file checkpointState.xml
15:59:18:WU01:FS01:0x17:Saving result file checkpt.crc
15:59:18:WU01:FS01:0x17:Saving result file log.txt
15:59:18:WU01:FS01:0x17:Saving result file positions.xtc
15:59:20:WU02:FS01:Download 44.14%
15:59:20:WU01:FS01:0x17:Folding@home Core Shutdown: FINISHED_UNIT
15:59:21:WU01:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
15:59:21:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:9102 run:2 clone:25 gen:47 core:0x17 unit:0x000000300a3b1e81537c06add5e8b634
15:59:21:WU01:FS01:Uploading 8.73MiB to 171.64.65.93
15:59:21:WU01:FS01:Connecting to 171.64.65.93:8080
15:59:26:WU02:FS01:Download 62.20%
15:59:27:WU01:FS01:Upload 7.16%
15:59:33:WU02:FS01:Download 78.25%
15:59:33:WU01:FS01:Upload 12.89%
15:59:39:WU02:FS01:Download 88.29%
15:59:39:WU01:FS01:Upload 19.33%
15:59:45:WU01:FS01:Upload 25.78%
15:59:46:WU02:FS01:Download 98.32%
15:59:46:WU02:FS01:Download complete
15:59:46:WU02:FS01:Received Unit: id:02 state:DOWNLOAD error:NO_ERROR project:9102 run:7 clone:19 gen:21 core:0x17 unit:0x000000160a3b1e81537c093e471e925c
15:59:46:WU02:FS01:Starting
15:59:46:WU02:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/USER/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/beta/Core_17.fah/FahCore_17.exe -dir 02 -suffix 01 -version 704 -lifeline 748 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
15:59:46:WU02:FS01:Started FahCore on PID 7176
15:59:46:WU02:FS01:Core PID:3344
15:59:46:WU02:FS01:FahCore 0x17 started
15:59:46:WU02:FS01:0x17:*********************** Log Started 2014-06-05T15:59:46Z ***********************
15:59:46:WU02:FS01:0x17:Project: 9102 (Run 7, Clone 19, Gen 21)
15:59:46:WU02:FS01:0x17:Unit: 0x000000160a3b1e81537c093e471e925c
15:59:46:WU02:FS01:0x17:CPU: 0x00000000000000000000000000000000
15:59:46:WU02:FS01:0x17:Machine: 1
15:59:46:WU02:FS01:0x17:Reading tar file state.xml
15:59:47:WU02:FS01:0x17:Reading tar file system.xml
15:59:47:WU02:FS01:0x17:Reading tar file integrator.xml
15:59:47:WU02:FS01:0x17:Reading tar file core.xml
15:59:47:WU02:FS01:0x17:Digital signatures verified
15:59:47:WU02:FS01:0x17:Folding@home GPU core17
15:59:47:WU02:FS01:0x17:Version 0.0.55
15:59:52:WU01:FS01:Upload 31.51%
15:59:58:WU01:FS01:Upload 37.95%
16:00:04:WU01:FS01:Upload 44.40%
16:00:10:WU01:FS01:Upload 50.84%
16:00:16:WU01:FS01:Upload 56.57%
16:00:22:WU01:FS01:Upload 63.02%
16:00:28:WU01:FS01:Upload 69.46%
16:00:35:WU01:FS01:Upload 76.62%
16:00:41:WU01:FS01:Upload 83.07%
16:00:48:WU01:FS01:Upload 90.94%
16:00:54:WU01:FS01:Upload 97.39%
16:01:03:WU01:FS01:Upload complete
16:01:03:WU01:FS01:Server responded WORK_ACK (400)
16:01:03:WU01:FS01:Final credit estimate, 24901.00 points
16:01:03:WU01:FS01:Cleaning up
16:01:11:WU02:FS01:0x17:Completed 0 out of 2500000 steps (0%)
16:01:11:WU02:FS01:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
16:03:53:WU02:FS01:0x17:Completed 25000 out of 2500000 steps (1%)
16:06:27:WU02:FS01:0x17:Completed 50000 out of 2500000 steps (2%)
16:09:13:WU02:FS01:0x17:Completed 75000 out of 2500000 steps (3%)
16:11:52:WU02:FS01:0x17:Completed 100000 out of 2500000 steps (4%)
16:14:36:WU02:FS01:0x17:Completed 125000 out of 2500000 steps (5%)
16:17:18:WU02:FS01:0x17:Completed 150000 out of 2500000 steps (6%)
16:20:06:WU02:FS01:0x17:Completed 175000 out of 2500000 steps (7%)
16:22:40:WU02:FS01:0x17:Completed 200000 out of 2500000 steps (8%)
16:25:26:WU02:FS01:0x17:Completed 225000 out of 2500000 steps (9%)
16:28:01:WU02:FS01:0x17:Completed 250000 out of 2500000 steps (10%)

(...)

17:17:01:WU02:FS01:0x17:Completed 700000 out of 2500000 steps (28%)
******************************* Date: 2014-06-05 *******************************
17:19:46:WU02:FS01:0x17:Completed 725000 out of 2500000 steps (29%)

(...)

17:56:24:WU02:FS01:0x17:Completed 1050000 out of 2500000 steps (42%)
I don't know why that "Date: 2014-06-05" line was written to the log, since the day (and hence the date) was still the same.

The only "new thing" I found is this line:

Code:

11:24:25:WU01:FS01:0x17:Temperature control disabled. Requirements: single Nvidia GPU, tmax must be < 110 and twait >= 900
However, 7im told me that it's unused and hence can be ignored.

So, the only real change is the change of the work-directory from my hard disk to my solid-state drive. Can this really be causing "lost lifeline" and things like that? I can hardly imagine that, but then again, I'm just a passionate FAH user, no insider..

Any further information would be greatly appreciated!

Re: Core 17 has suddenly started crashing

Posted: Thu Jun 05, 2014 7:52 pm
by 7im

Re: Core 17 has suddenly started crashing

Posted: Thu Jun 05, 2014 8:40 pm
by Eagle
Thank you! :)