Error message only for A7 jobs found fah_work.log

Posted: Thu Nov 17, 2016 5:05 pm
by juanmit
Hi all,

We just found some error messages in the log files, but only for GRO A7 jobs.
We double-checked all of our finished trajectories, and they had no problems.
We think the error message is caused by clients. Do we need to worry about it? How can we get the client to fix it?

Thanks.
--Hongbin


16:35:50:I2:#37:Client: 'parkut=0x39b39b1f495bbcf1@***.***.***.***' WORK_REQUEST
16:35:50:I1:#37:Job 0x5824df097a60ad27-28 assigned to parkut=0x39b39b1f495bbcf1
16:38:52:I2:#45:Client: 'parkut=0x39b39b1f495bbcf1@***.***.***.***' WORK_RESULTS
16:38:53:I1:#45:Job 0x57f6f671aecfdd6f-18 accepted from parkut=0x39b39b1f495bbcf1
16:41:51:I2:#46:Client: 'parkut=0x39b39b1f495bbcf1@***.***.***.***' WORK_FAILED
16:41:51:W :#46:Client reported Failed Assignment PRCG: 8678 (1, 0, 28)#35 from parkut=0x39b39b1f495bbcf1

Re: Error message only for A7 jobs found fah_work.log

Posted: Thu Nov 17, 2016 5:19 pm
by bruce
If it's really the client, we need to determine if it only appears in client V7.4.4 or if it also appears in the beta V7.4.15. Open a github ticket against the client if necessary.

It's also possible (maybe even likely) that the verbosity settings of FAHCore_a7 are somewhat different from those of earlier FAHCores. There are a number of GROMACS errors which address the science side of things ... which the Donor can do nothing to correct. Most of those errors have not been reported in the client's log, on the assumption that the scientist will gather that information from the more detailed log uploaded directly from the core.

Re: Error message only for A7 jobs found fah_work.log

Posted: Thu Nov 17, 2016 9:22 pm
by juanmit
Hi Bruce,

Thanks for your reply.

I will wait a couple more days to collect more log files, to see how often and on which clients this error occurs, and then open a GitHub ticket.

--Hongbin
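
A rough sketch of how that tally could be done, not an official work-server tool: the Python below counts `Client reported Failed Assignment` warnings per project and donor. The regular expression is inferred only from the single warning line quoted earlier in this thread, and the names `FAILED_RE` and `tally_failures` are made up for illustration. Reading the numbers in parentheses as run, clone, gen is an assumption based on the "PRCG" label.

```python
import re
from collections import Counter

# Pattern inferred from one quoted warning line; the real log format may vary.
FAILED_RE = re.compile(
    r"Client reported Failed Assignment PRCG: (\d+) \((\d+), (\d+), (\d+)\)"
    r"#\d+ from (\S+)=(0x[0-9a-fA-F]+)"
)


def tally_failures(lines):
    """Count failed assignments per (project, donor name, donor id)."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            # Assumed order inside the parentheses: run, clone, gen.
            project, _run, _clone, _gen, user, ident = match.groups()
            counts[(int(project), user, ident)] += 1
    return counts


# Lines copied from the log excerpt in the first post.
sample = [
    "16:38:53:I1:#45:Job 0x57f6f671aecfdd6f-18 accepted from parkut=0x39b39b1f495bbcf1",
    "16:41:51:W :#46:Client reported Failed Assignment PRCG: 8678 (1, 0, 28)#35 "
    "from parkut=0x39b39b1f495bbcf1",
]

for key, count in tally_failures(sample).most_common():
    print(key, count)
```

To run it over a whole log, pass an open file instead of `sample`, for example `tally_failures(open("fah_work.log"))`. A few donors with many failures each would point at specific machines rather than at the project.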

Re: Error message only for A7 jobs found fah_work.log

Posted: Wed Nov 23, 2016 10:48 am
by toTOW
I'm pretty sure that's an old machine with incompatible OS that can't run the A7 core due to libraries being too old ...

If my old machine got one of your projects, you would find similar immediate failures ... maybe it already did ...
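
If the libraries really are too old, running `ldd` against the core binary on the affected machine would normally show it as "not found" entries. The Python below is only a sketch of that check: `run_ldd` and `unresolved` are hypothetical helper names, and the sample `ldd` output is invented for illustration, not taken from any machine in this thread.

```python
import subprocess


def run_ldd(binary_path):
    """Run `ldd` on a binary and return its combined stdout/stderr as text."""
    result = subprocess.run(
        ["ldd", binary_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # symbol-version complaints go to stderr
        universal_newlines=True,
    )
    return result.stdout


def unresolved(ldd_output):
    """Return the lines of ldd output that report something as not found."""
    return [line.strip() for line in ldd_output.splitlines() if "not found" in line]


# INVENTED sample output, for illustration only (library names and versions
# here are placeholders, not the actual requirements of FahCore_a7).
sample_output = """\
./FahCore_a7: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./FahCore_a7)
\tlibexample.so.1 => not found
\tlibm.so.6 => /lib64/libm.so.6 (0x00002b1c5d000000)
"""

for line in unresolved(sample_output):
    print(line)
```

On a real machine the call would be something like `unresolved(run_ldd(path_to_core))`, with the path to `FahCore_a7` taken from the "Running FahCore:" line of the client log. An empty list would suggest the failure has some other cause.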

Re: Error message only for A7 jobs found fah_work.log

Posted: Wed Nov 23, 2016 11:15 pm
by bruce
toTOW wrote: I'm pretty sure that's an old machine with incompatible OS that can't run the A7 core due to libraries being too old ...
Perhaps.

@juanmit:
You should be able to restrict assignments based on OS type and subtype, though I don't know the specific settings. Please check with Joseph regarding which restrictions should be used to ensure that the proper libraries are available. FAHClient says that it is compiled/linked with Windows XP, but I'm not sure if FAHCore_a7 still works with Windows XP.

A7 jobs failing

Posted: Fri Nov 25, 2016 1:39 pm
by parkut
From the post above, I spotted my user name

6:35:50:I2:#37:Client: 'parkut=0x39b39b1f495bbcf1@69.14.191.138' WORK_REQUEST
It seems I have been getting the following errors for some time now across my "mini-farm":

07:13:01:WARNING:WU01:FS00:FahCore returned: FAILED_2 (1 = 0x1)
07:13:01:WARNING:WU01:FS00:Too many errors, failing

I manage over 20 CPU-only machines, all running CentOS 5.11, and 6 GPU/CPU machines running Ubuntu 12.04.3.

The errors are seen on both OSes, across multiple machines.

This is a typical CentOS machine startup log
*********************** Log Started 2016-11-03T21:04:45Z ***********************
21:04:45:************************* Folding@home Client *************************
21:04:45:    Website: http://folding.stanford.edu/
21:04:45:  Copyright: (c) 2009-2014 Stanford University
21:04:45:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
21:04:45:       Args: --child --lifeline 2933 /etc/fahclient/config.xml --run-as
21:04:45:             fahclient --pid-file=/var/run/fahclient.pid --daemon
21:04:45:     Config: /etc/fahclient/config.xml
21:04:45:******************************** Build ********************************
21:04:45:    Version: 7.4.4
21:04:45:       Date: Mar 4 2014
21:04:45:       Time: 12:01:17
21:04:45:    SVN Rev: 4130
21:04:45:     Branch: fah/trunk/client
21:04:45:   Compiler: GNU 4.1.2 20080704 (Red Hat 4.1.2-46)
21:04:45:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
21:04:45:             -fno-unsafe-math-optimizations -msse2
21:04:45:   Platform: linux2 2.6.18-164.11.1.el5
21:04:45:       Bits: 64
21:04:45:       Mode: Release
21:04:45:******************************* System ********************************
21:04:45:        CPU: Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
21:04:45:     CPU ID: GenuineIntel Family 6 Model 23 Stepping 10
21:04:45:       CPUs: 4
21:04:45:     Memory: 1.95GiB
21:04:45:Free Memory: 1.72GiB
21:04:45:    Threads: POSIX_THREADS
21:04:45: OS Version: 2.6
21:04:45:Has Battery: false
21:04:45: On Battery: false
21:04:45: UTC Offset: -4
21:04:45:        PID: 2935
21:04:45:        CWD: /var/lib/fahclient
21:04:45:         OS: Linux 2.6.18-416.el5 x86_64
21:04:45:    OS Arch: AMD64
21:04:45:       GPUs: 0
21:04:45:       CUDA: Not detected
21:04:45:***********************************************************************
21:04:45:<config>
21:04:45:  <!-- HTTP Server -->
21:04:45:  <allow v='127.0.0.1,192.168.1.1-192.168.1.255'/>
21:04:45:
21:04:45:  <!-- Network -->
21:04:45:  <proxy v='192.168.1.142:3128'/>
21:04:45:
21:04:45:  <!-- Remote Command Server -->
21:04:45:  <command-allow-no-pass v='127.0.0.1,192.168.1.1-192.168.1.255'/>
21:04:45:
21:04:45:  <!-- Slot Control -->
21:04:45:  <power v='full'/>
21:04:45:
21:04:45:  <!-- User Information -->
21:04:45:  <passkey v='********************************'/>
21:04:45:  <team v='4'/>
21:04:45:  <user v='parkut'/>
21:04:45:
21:04:45:  <!-- Folding Slots -->
21:04:45:  <slot id='0' type='CPU'>
21:04:45:    <client-type v='advanced'/>
21:04:45:  </slot>
21:04:45:</config>

This is a typical Ubuntu Startup Log
*********************** Log Started 2016-11-02T13:27:48Z ***********************
13:27:48:************************* Folding@home Client *************************
13:27:48:    Website: http://folding.stanford.edu/
13:27:48:  Copyright: (c) 2009-2014 Stanford University
13:27:48:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
13:27:48:       Args: --child --lifeline 1325 /etc/fahclient/config.xml --run-as
13:27:48:             fahclient --pid-file=/var/run/fahclient.pid --daemon
13:27:48:     Config: /etc/fahclient/config.xml
13:27:48:******************************** Build ********************************
13:27:48:    Version: 7.4.4
13:27:48:       Date: Mar 4 2014
13:27:48:       Time: 12:02:38
13:27:48:    SVN Rev: 4130
13:27:48:     Branch: fah/trunk/client
13:27:48:   Compiler: GNU 4.4.7
13:27:48:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
13:27:48:             -fno-unsafe-math-optimizations -msse2
13:27:48:   Platform: linux2 3.2.0-1-amd64
13:27:48:       Bits: 64
13:27:48:       Mode: Release
13:27:48:******************************* System ********************************
13:27:48:        CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
13:27:48:     CPU ID: GenuineIntel Family 6 Model 58 Stepping 9
13:27:48:       CPUs: 8
13:27:48:     Memory: 3.57GiB
13:27:48:Free Memory: 3.36GiB
13:27:48:    Threads: POSIX_THREADS
13:27:48: OS Version: 3.8
13:27:48:Has Battery: false
13:27:48: On Battery: false
13:27:48: UTC Offset: -4
13:27:48:        PID: 1327
13:27:48:        CWD: /var/lib/fahclient
13:27:48:         OS: Linux 3.8.0-29-generic x86_64
13:27:48:    OS Arch: AMD64
13:27:48:       GPUs: 1
13:27:48:      GPU 0: NVIDIA:5 GM204 [GeForce GTX 970]
13:27:48:       CUDA: 5.2
13:27:48:CUDA Driver: 7050
13:27:48:***********************************************************************
13:27:48:<config>
13:27:48:  <!-- Client Control -->
13:27:48:  <fold-anon v='true'/>
13:27:48:
13:27:48:  <!-- Folding Slot Configuration -->
13:27:48:  <gpu v='false'/>
13:27:48:
13:27:48:  <!-- HTTP Server -->
13:27:48:  <allow v='127.0.0.1,192.168.1.1-192.168.1.255'/>
13:27:48:
13:27:48:  <!-- Network -->
13:27:48:  <proxy v='192.168.1.142:3128'/>
13:27:48:
13:27:48:  <!-- Remote Command Server -->
13:27:48:  <command-allow-no-pass v='127.0.0.1,192.168.1.1-192.168.1.255'/>
13:27:48:
13:27:48:  <!-- Slot Control -->
13:27:48:  <power v='full'/>
13:27:48:
13:27:48:  <!-- User Information -->
13:27:48:  <passkey v='********************************'/>
13:27:48:  <team v='4'/>
13:27:48:  <user v='parkut'/>
13:27:48:
13:27:48:  <!-- Folding Slots -->
13:27:48:  <slot id='0' type='CPU'>
13:27:48:    <cpus v='6'/>
13:27:48:  </slot>
13:27:48:  <slot id='1' type='GPU'>
13:27:48:    <client-type v='advanced'/>
13:27:48:  </slot>
13:27:48:</config>
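
With this many machines, it may help to pull the client version, compiler and kernel out of each startup log so they can be compared against which hosts fail. The following is a minimal sketch, assuming the `HH:MM:SS:  Key: value` layout seen in the two logs above; `FIELD_RE` and `startup_fields` are illustrative names, not part of FAHClient.

```python
import re

# Matches header lines such as "21:04:45:    Version: 7.4.4".
FIELD_RE = re.compile(r"^\d\d:\d\d:\d\d:\s*([A-Za-z0-9 ]+?):\s*(.*)$")


def startup_fields(lines, wanted=("Version", "Compiler", "OS")):
    """Return the first value seen for each wanted key in a startup log."""
    fields = {}
    for line in lines:
        match = FIELD_RE.match(line)
        if match and match.group(1) in wanted:
            fields.setdefault(match.group(1), match.group(2).strip())
    return fields


# Lines copied from the CentOS and Ubuntu startup logs above.
centos = [
    "21:04:45:    Version: 7.4.4",
    "21:04:45:   Compiler: GNU 4.1.2 20080704 (Red Hat 4.1.2-46)",
    "21:04:45: OS Version: 2.6",
    "21:04:45:         OS: Linux 2.6.18-416.el5 x86_64",
]
ubuntu = [
    "13:27:48:    Version: 7.4.4",
    "13:27:48:   Compiler: GNU 4.4.7",
    "13:27:48: OS Version: 3.8",
    "13:27:48:         OS: Linux 3.8.0-29-generic x86_64",
]

for name, log in (("CentOS", centos), ("Ubuntu", ubuntu)):
    print(name, startup_fields(log))
```

Note that the kernel string alone does not say which library versions are installed, so this only narrows down which hosts to inspect further.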

Error message for A7 job

Posted: Fri Nov 25, 2016 1:41 pm
by parkut
Typical failed a7 from CentOS log

07:04:55:WU00:FS00:0xa4:Completed 245000 out of 250000 steps  (98%)
07:07:19:WU00:FS00:0xa4:Completed 247500 out of 250000 steps  (99%)
07:07:20:WU01:FS00:Connecting to 171.67.108.45:8080
07:07:20:WU01:FS00:Assigned to work server 155.247.166.219
07:07:20:WU01:FS00:Requesting new work unit for slot 00: RUNNING cpu:4 from 155.247.166.219
07:07:20:WU01:FS00:Connecting to 155.247.166.219:8080
07:07:21:WU01:FS00:Downloading 375.87KiB
07:07:21:WU01:FS00:Download complete
07:07:21:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8677 run:12 clone:0 gen:14 core:0xa7 unit:0x000000150002894b5824db77281dc6b5
07:09:44:WU00:FS00:0xa4:Completed 250000 out of 250000 steps  (100%)
07:09:44:WU00:FS00:0xa4:DynamicWrapper: Finished Work Unit: sleep=10000
07:09:54:WU00:FS00:0xa4:
07:09:54:WU00:FS00:0xa4:Finished Work Unit:
07:09:54:WU00:FS00:0xa4:- Reading up to 811608 from "00/wudata_01.trr": Read 811608
07:09:54:WU00:FS00:0xa4:trr file hash check passed.
07:09:54:WU00:FS00:0xa4:- Reading up to 746116 from "00/wudata_01.xtc": Read 746116
07:09:54:WU00:FS00:0xa4:xtc file hash check passed.
07:09:54:WU00:FS00:0xa4:edr file hash check passed.
07:09:54:WU00:FS00:0xa4:logfile size: 23427
07:09:54:WU00:FS00:0xa4:Leaving Run
07:09:57:WU00:FS00:0xa4:- Writing 1583639 bytes of core data to disk...
07:09:58:WU00:FS00:0xa4:Done: 1583127 -> 1538106 (compressed to 97.1 percent)
07:09:58:WU00:FS00:0xa4:  ... Done.
07:09:58:WU00:FS00:0xa4:- Shutting down core
07:09:58:WU00:FS00:0xa4:
07:09:58:WU00:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT
07:09:58:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)
07:09:58:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:9035 run:615 clone:2 gen:570 core:0xa4 unit:0x00000276ab436c9e56982e0a4b7b427b
07:09:58:WU00:FS00:Uploading 1.47MiB to 171.67.108.158
07:09:58:WU00:FS00:Connecting to 171.67.108.158:8080
07:09:59:WU01:FS00:Starting
07:09:59:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 704 -lifeline 2935 -checkpoint 15 -np 4
07:09:59:WU01:FS00:Started FahCore on PID 419
07:09:59:WU01:FS00:Core PID:423
07:09:59:WU01:FS00:FahCore 0xa7 started
07:09:59:WARNING:WU01:FS00:FahCore returned: FAILED_2 (1 = 0x1)
07:09:59:WU01:FS00:Starting
07:09:59:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 704 -lifeline 2935 -checkpoint 15 -np 4
07:09:59:WU01:FS00:Started FahCore on PID 424
07:09:59:WU01:FS00:Core PID:428
07:09:59:WU01:FS00:FahCore 0xa7 started
07:10:00:WARNING:WU01:FS00:FahCore returned: FAILED_2 (1 = 0x1)
07:10:01:WU00:FS00:Upload complete
07:10:01:WU00:FS00:Server responded WORK_ACK (400)
07:10:01:WU00:FS00:Final credit estimate, 1258.00 points
07:10:01:WU00:FS00:Cleaning up
07:11:00:WU01:FS00:Starting
07:11:00:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 704 -lifeline 2935 -checkpoint 15 -np 4
07:11:00:WU01:FS00:Started FahCore on PID 431
07:11:00:WU01:FS00:Core PID:435
07:11:00:WU01:FS00:FahCore 0xa7 started
07:11:00:WARNING:WU01:FS00:FahCore returned: FAILED_2 (1 = 0x1)
07:12:00:WU01:FS00:Starting
07:12:00:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 704 -lifeline 2935 -checkpoint 15 -np 4
07:12:00:WU01:FS00:Started FahCore on PID 437
07:12:00:WU01:FS00:Core PID:441
07:12:00:WU01:FS00:FahCore 0xa7 started
07:12:00:WARNING:WU01:FS00:FahCore returned: FAILED_2 (1 = 0x1)
07:13:00:WU01:FS00:Starting
07:13:00:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a7.fah/FahCore_a7 -dir 01 -suffix 01 -version 704 -lifeline 2935 -checkpoint 15 -np 4
07:13:00:WU01:FS00:Started FahCore on PID 617
07:13:00:WU01:FS00:Core PID:621
07:13:00:WU01:FS00:FahCore 0xa7 started
07:13:01:WARNING:WU01:FS00:FahCore returned: FAILED_2 (1 = 0x1)
07:13:01:WARNING:WU01:FS00:Too many errors, failing
07:13:01:WU01:FS00:Sending unit results: id:01 state:SEND error:FAILED project:8677 run:12 clone:0 gen:14 core:0xa7 unit:0x000000150002894b5824db77281dc6b5
07:13:01:WU01:FS00:Connecting to 155.247.166.219:8080
07:13:01:WU01:FS00:Server responded WORK_ACK (400)
07:13:01:WU01:FS00:Cleaning up
07:13:01:WU00:FS00:Connecting to 171.67.108.45:8080
07:13:01:WU00:FS00:Assigned to work server 171.67.108.158
07:13:01:WU00:FS00:Requesting new work unit for slot 00: READY cpu:4 from 171.67.108.158
07:13:01:WU00:FS00:Connecting to 171.67.108.158:8080
07:13:02:WU00:FS00:Downloading 806.90KiB
07:13:03:WU00:FS00:Download complete
07:13:03:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:9036 run:748 clone:2 gen:419 core:0xa4 unit:0x000001dcab436c9e56982c4523a37bfa
07:13:03:WU00:FS00:Starting
07:13:03:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 00 -suffix 01 -version 704 -lifeline 2935 -checkpoint 15 -np 4
07:13:03:WU00:FS00:Started FahCore on PID 622
07:13:03:WU00:FS00:Core PID:626
07:13:03:WU00:FS00:FahCore 0xa4 started
07:13:03:WU00:FS00:0xa4:
07:13:03:WU00:FS00:0xa4:*------------------------------*
07:13:03:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
07:13:03:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
07:13:03:WU00:FS00:0xa4:
07:13:03:WU00:FS00:0xa4:Preparing to commence simulation
07:13:03:WU00:FS00:0xa4:- Looking at optimizations...
07:13:03:WU00:FS00:0xa4:- Created dyn
07:13:03:WU00:FS00:0xa4:- Files status OK
07:13:03:WU00:FS00:0xa4:- Expanded 825756 -> 1403132 (decompressed 169.9 percent)
07:13:03:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=825756 data_size=1403132, decompressed_data_size=1403132 diff=0
07:13:03:WU00:FS00:0xa4:- Digital signature verified
07:13:03:WU00:FS00:0xa4:
07:13:03:WU00:FS00:0xa4:Project: 9036 (Run 748, Clone 2, Gen 419)
07:13:03:WU00:FS00:0xa4:
07:13:03:WU00:FS00:0xa4:Assembly optimizations on if available.
07:13:03:WU00:FS00:0xa4:Entering M.D.
07:13:09:WU00:FS00:0xa4:Completed 0 out of 250000 steps  (0%)
07:15:32:WU00:FS00:0xa4:Completed 2500 out of 250000 steps  (1%)
07:17:54:WU00:FS00:0xa4:Completed 5000 out of 250000 steps  (2%)
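
To check whether the failures are confined to core 0xa7, a client log like the one above can be summarized per core. The sketch below is an assumption-laden illustration: the patterns are inferred from this one log excerpt, and `core_results` is a hypothetical helper name. It remembers the most recent core seen for each work unit and slot, then counts every `FahCore returned:` line against that core.

```python
import re
from collections import Counter

# Splits a log line into work unit, folding slot, and the rest of the message.
SLOT_RE = re.compile(r":(WU\d+):(FS\d+):(.*)$")
# Places where the core id shows up in the excerpt above.
CORE_HINTS = (
    re.compile(r"^(0x[0-9a-f]+):"),                  # core output, e.g. "0xa4:Completed ..."
    re.compile(r"^FahCore (0x[0-9a-f]+) started"),   # "FahCore 0xa7 started"
    re.compile(r"\bcore:(0x[0-9a-f]+)\b"),           # "Received Unit: ... core:0xa7 ..."
)
RETURNED_RE = re.compile(r"^FahCore returned: (\w+)")


def core_results(lines):
    """Count FahCore return statuses per core, keyed by (core, status)."""
    current_core = {}
    counts = Counter()
    for line in lines:
        slot = SLOT_RE.search(line)
        if not slot:
            continue
        wu, fs, rest = slot.groups()
        for hint in CORE_HINTS:
            match = hint.search(rest)
            if match:
                current_core[(wu, fs)] = match.group(1)
                break
        returned = RETURNED_RE.match(rest)
        if returned:
            core = current_core.get((wu, fs), "unknown")
            counts[(core, returned.group(1))] += 1
    return counts


# Lines copied from the CentOS log above.
sample = [
    "07:07:21:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8677 "
    "run:12 clone:0 gen:14 core:0xa7 unit:0x000000150002894b5824db77281dc6b5",
    "07:09:58:WU00:FS00:0xa4:Folding@home Core Shutdown: FINISHED_UNIT",
    "07:09:58:WU00:FS00:FahCore returned: FINISHED_UNIT (100 = 0x64)",
    "07:09:59:WU01:FS00:FahCore 0xa7 started",
    "07:09:59:WARNING:WU01:FS00:FahCore returned: FAILED_2 (1 = 0x1)",
    "07:13:01:WARNING:WU01:FS00:FahCore returned: FAILED_2 (1 = 0x1)",
]

for (core, status), count in sorted(core_results(sample).items()):
    print(core, status, count)
```

Run over the full logs from several machines, this would show whether `FAILED_2` ever appears for a core other than 0xa7.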

Re: Error message only for A7 jobs found fah_work.log

Posted: Sat Nov 26, 2016 6:50 pm
by toTOW
Yes, it is the same issue as mine ... here is also one report from the same Ubuntu version as yours: viewtopic.php?f=105&t=29271

Re: Error message only for A7 jobs found fah_work.log

Posted: Sat Dec 17, 2016 1:45 am
by parkut
Just to report: after installing CentOS 7.3 on one test machine, I was able to run p8676 using core a7 without any problems.