Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Sun Jun 12, 2011 2:42 am
by GreyWhiskers
Mod Edit: Copied from viewtopic.php?p=188716#p188716 - PantherX

...BTW, this Quote section is copied from the CODE posting below, to show what I think may be the more interesting points.
01:55:58:Unit 01:Will resume from checkpoint file
01:55:58:Unit 01:Tpr hash 01/wudata_01.tpr: 2467527010 3895228313 3608002700 918698187 886431637
01:55:59:Unit 00:Resuming from checkpoint
01:55:59:Unit 00:Verified 00/wudata_01.log
01:56:01:Unit 00:Verified 00/wudata_01.trr
01:56:01:Unit 00:Verified 00/wudata_01.xtc
01:56:01:Unit 00:Verified 00/wudata_01.edr
01:56:06:Unit 00:Completed 89000 out of 250000 steps (35%)
01:56:06:Unit 00:Gromacs cannot continue further.
01:56:06:Unit 00:Going to send back what have done -- stepsTotalG=250000
01:56:06:Unit 00:Work fraction=0.3560 steps=250000.

SNIP

01:57:34:FahCore running Unit 00 returned: UNSTABLE_MACHINE (122)
01:57:34:Starting Unit 00
01:57:34:Running core: "C:/Documents and Settings/All Users/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_ ... ore_a6.exe" -dir 00 -suffix 01 -lifeline 5288 -version 701 -checkpoint 15 -forceasm
01:57:34:Started core on PID 4332
01:57:34:FahCore 0xa6 started
01:57:35:FahCore running Unit 00 returned: MISSING_WORK_FILES (116)
01:57:35:WARNING: Unit 00 Fatal error, dumping
01:57:36:Sending unit results: id:00 state:SEND project:3866 run:2597 clone:0 gen:0 core:0xa6 unit:0x00000002000000654dd9e8cc98da84f5
01:57:36:Unit 00: Uploading 9.79KiB
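
As a quick check on the numbers in the excerpt above, the reported work fraction matches the completed step count divided by the total. The sketch below assumes nothing beyond the one log line quoted; the regular expression is written for this example and is not taken from any FAH source.

```python
import re

# Log line copied from the excerpt above.
line = "01:56:06:Unit 00:Completed 89000 out of 250000 steps (35%)"

# Illustrative pattern for this one message (an assumption, not FAHClient code).
m = re.search(r"Completed (\d+) out of (\d+) steps", line)
done, total = int(m.group(1)), int(m.group(2))

# Fraction of the work unit finished before Gromacs stopped.
work_fraction = done / total

print(f"{work_fraction:.4f}")  # 0.3560
```

The 35% shown in the log is consistent with 35.6% being truncated rather than rounded.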

Code:

*********************** Log Started 12/Jun/2011-01:54:45 ***********************
01:54:45:************************* Folding@home Client *************************
01:54:45:      Website: http://folding.stanford.edu/
01:54:45:    Copyright: (c) 2009,2010 Stanford University
01:54:45:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
01:54:45:         Args: --lifeline 4584
01:54:45:       Config: C:/Documents and Settings/All Users/Application
01:54:45:               Data/FAHClient/config.xml
01:54:45:******************************** Build ********************************
01:54:45:      Version: 7.1.21
01:54:45:         Date: Mar 23 2011
01:54:45:         Time: 15:13:48
01:54:45:      SVN Rev: 2883
01:54:45:       Branch: fah/trunk/client
01:54:45:     Compiler: Intel(R) C++ MSVC 1500 mode 1110
01:54:45:      Options: /TP /nologo /EHa /wd4297 /wd4103 /wd1786 /Ox -arch:SSE
01:54:45:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qrestrict /MT
01:54:45:     Platform: win32 XP
01:54:45:         Bits: 32
01:54:45:         Mode: Release
01:54:45:******************************* System ********************************
01:54:45:           OS: Microsoft Windows XP Home Edition
01:54:45:          CPU: Intel(R) Pentium(R) 4 CPU 3.20GHz
01:54:45:       CPU ID: GenuineIntel Family 15 Model 2 Stepping 9
01:54:45:         CPUs: 2
01:54:45:       Memory: 2.00GiB
01:54:45:  Free Memory: 947.41MiB
01:54:45:      Threads: WINDOWS_THREADS
01:54:45:         GPUs: 2
01:54:45:        GPU 0: ATI:2 RV730 Pro AGP [Radeon HD 4600 Series]
01:54:45:        GPU 1: RV710/730
01:54:45:         CUDA: Not detected
01:54:45:   On Battery: false
01:54:45:   UTC offset: -7
01:54:45:          PID: 5288
01:54:45:          CWD: C:/Documents and Settings/All Users/Application Data/FAHClient
01:54:45:Win32 Service: false
01:54:45:***********************************************************************
01:54:45:<config>
01:54:45:  <service-description v='Folding@home Client'/>
01:54:45:  <service-restart v='true'/>
01:54:45:  <service-restart-delay v='5000'/>
01:54:45:
01:54:45:  <!-- Client Control -->
01:54:45:  <cycle-rate v='4'/>
01:54:45:  <cycles v='-1'/>
01:54:45:  <data-directory v='.'/>
01:54:45:  <exec-directory v='C:\Program Files\FAHClient'/>
01:54:45:  <exit-when-done v='false'/>
01:54:45:  <max-delay v='21600'/>
01:54:45:  <min-delay v='60'/>
01:54:45:  <threads v='4'/>
01:54:45:
01:54:45:  <!-- Configuration -->
01:54:45:  <config-rotate v='true'/>
01:54:45:  <config-rotate-dir v='configs'/>
01:54:45:  <config-rotate-max v='16'/>
01:54:45:
01:54:45:  <!-- Debugging -->
01:54:45:  <assignment-servers>
01:54:45:    assign3.stanford.edu:8080 assign4.stanford.edu:80
01:54:45:  </assignment-servers>
01:54:45:  <capture-directory v='capture'/>
01:54:45:  <capture-sockets v='false'/>
01:54:45:  <debug-sockets v='false'/>
01:54:45:  <exception-locations v='true'/>
01:54:45:  <gpu-assignment-servers>
01:54:45:    assign-GPU.stanford.edu:80 assign-GPU.stanford.edu:8080
01:54:45:  </gpu-assignment-servers>
01:54:45:  <stack-traces v='false'/>
01:54:45:
01:54:45:  <!-- Error Handling -->
01:54:45:  <max-slot-errors v='5'/>
01:54:45:  <max-unit-errors v='5'/>
01:54:45:
01:54:45:  <!-- FahCore Control -->
01:54:45:  <checkpoint v='15'/>
01:54:45:  <core-dir v='cores'/>
01:54:45:  <core-priority v='idle'/>
01:54:45:  <cpu-affinity v='false'/>
01:54:45:  <cpu-usage v='100'/>
01:54:45:  <no-assembly v='false'/>
01:54:45:
01:54:45:  <!-- Folding Slot Configuration -->
01:54:45:  <client-subtype v='STDCLI'/>
01:54:45:  <client-type v='normal'/>
01:54:45:  <cpu-species v='UNKNOWN'/>
01:54:45:  <cpu-type v='X86'/>
01:54:45:  <cpus v='2'/>
01:54:45:  <extra-core-args v='-forceasm'/>
01:54:45:  <gpu v='false'/>
01:54:45:  <gpu-id v='0'/>
01:54:45:  <max-packet-size v='normal'/>
01:54:45:  <os-species v='WIN_XP'/>
01:54:45:  <os-type v='WIN32'/>
01:54:45:  <project-key v='0'/>
01:54:45:  <smp v='true'/>
01:54:45:
01:54:45:  <!-- Logging -->
01:54:45:  <log v='log.txt'/>
01:54:45:  <log-color v='false'/>
01:54:45:  <log-crlf v='true'/>
01:54:45:  <log-date v='false'/>
01:54:45:  <log-debug v='true'/>
01:54:45:  <log-domain v='false'/>
01:54:45:  <log-header v='true'/>
01:54:45:  <log-level v='true'/>
01:54:45:  <log-no-info-header v='true'/>
01:54:45:  <log-redirect v='false'/>
01:54:45:  <log-rotate v='true'/>
01:54:45:  <log-rotate-dir v='logs'/>
01:54:45:  <log-rotate-max v='16'/>
01:54:45:  <log-short-level v='false'/>
01:54:45:  <log-simple-domains v='true'/>
01:54:45:  <log-thread-id v='false'/>
01:54:45:  <log-time v='true'/>
01:54:45:  <log-to-screen v='true'/>
01:54:45:  <log-truncate v='false'/>
01:54:45:  <verbosity v='4'/>
01:54:45:
01:54:45:  <!-- Process Control -->
01:54:45:  <child v='false'/>
01:54:45:  <daemon v='false'/>
01:54:45:  <pid v='false'/>
01:54:45:  <pid-file v='Folding@home Client.pid'/>
01:54:45:  <respawn v='false'/>
01:54:45:  <service v='false'/>
01:54:45:
01:54:45:  <!-- Remote Command Server -->
01:54:45:  <command-address v='0.0.0.0'/>
01:54:45:  <command-allow v='127.0.0.1'/>
01:54:45:  <command-allow-no-pass v='127.0.0.1'/>
01:54:45:  <command-deny v='0.0.0.0/0'/>
01:54:45:  <command-deny-no-pass v='0.0.0.0/0'/>
01:54:45:  <command-port v='36330'/>
01:54:45:  <password v=''/>
01:54:45:
01:54:45:  <!-- Slot Control -->
01:54:45:  <max-shutdown-wait v='60'/>
01:54:45:  <pause-on-battery v='false'/>
01:54:45:  <pause-on-start v='false'/>
01:54:45:
01:54:45:  <!-- User Information -->
01:54:45:  <machine-id v='0'/>
01:54:45:  <passkey v='********************************'/>
01:54:45:  <team v='0'/>
01:54:45:  <user v='GreyWhiskers'/>
01:54:45:
01:54:45:  <!-- Work Unit Control -->
01:54:45:  <dump-after-deadline v='true'/>
01:54:45:  <max-queue v='16'/>
01:54:45:  <max-units v='0'/>
01:54:45:  <next-unit-percentage v='99'/>
01:54:45:
01:54:45:  <!-- Folding Slots -->
01:54:45:  <slot id='1' type='GPU'>
01:54:45:    <client-type v='advanced'/>
01:54:45:    <core-priority v='low'/>
01:54:45:  </slot>
01:54:45:  <slot id='0' type='UNIPROCESSOR'/>
01:54:45:</config>
01:54:54:Enabled folding slot 01: READY gpu:0:"RV730 Pro AGP [Radeon HD 4600 Series]"
01:54:54:Enabled folding slot 00: READY uniprocessor
01:54:55:Server connection id=1 on 0.0.0.0:36330 from 127.0.0.1:1106
01:54:55:Starting Unit 00
01:54:56:Running core: "C:/Documents and Settings/All Users/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_a6.fah/FahCore_a6.exe" -dir 00 -suffix 01 -lifeline 5288 -version 701 -checkpoint 15 -forceasm
01:55:01:Server connection id=2 on 0.0.0.0:36330 from 192.168.10.193:2692
01:55:05:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1:1107
01:55:12:Server connection id=4 on 0.0.0.0:36330 from 192.168.10.193:2693
01:55:16:Server connection id=5 on 0.0.0.0:36330 from 127.0.0.1:1108
01:55:20:Started core on PID 628
01:55:20:FahCore 0xa6 started
01:55:22:Unit 00:
01:55:22:Unit 00:*------------------------------*
01:55:22:Unit 00:Folding@Home Gromacs Core
01:55:22:Unit 00:Version 2.28 (Wed Mar 23 13:51:17 PDT 2011)
01:55:22:Unit 00:
01:55:22:Unit 00:Preparing to commence simulation
01:55:22:Unit 00:- Ensuring status. Please wait.
01:55:22:Server connection id=6 on 0.0.0.0:36330 from 192.168.10.193:2694
01:55:22:Starting Unit 01
01:55:22:Running core: "C:/Documents and Settings/All Users/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/ATI/R600/Core_11.fah/FahCore_11.exe" -dir 01 -suffix 01 -lifeline 5288 -version 701 -checkpoint 15 -gpu 0 -forceasm
01:55:26:Server connection id=7 on 0.0.0.0:36330 from 127.0.0.1:1109
01:55:31:Unit 00:- Assembly optimizations manually forced on.
01:55:32:Server connection id=8 on 0.0.0.0:36330 from 192.168.10.193:2695
01:55:33:Unit 00:- Not checking prior termination.
01:55:36:Unit 00:- Expanded 1009800 -> 2412860 (decompressed 238.9 percent)
01:55:36:Unit 00:Called DecompressByteArray: compressed_data_size=1009800 data_size=2412860, decompressed_data_size=2412860 diff=0
01:55:36:Unit 00:- Digital signature verified
01:55:36:Unit 00:
01:55:36:Unit 00:Project: 3866 (Run 2597, Clone 0, Gen 0)
01:55:36:Unit 00:
01:55:36:Server connection id=9 on 0.0.0.0:36330 from 127.0.0.1:1111
01:55:37:Started core on PID 1852
01:55:37:FahCore 0x11 started
01:55:39:Unit 00:Assembly optimizations on if available.
01:55:39:Unit 00:Entering M.D.
01:55:40:Unit 01:
01:55:40:Unit 01:*------------------------------*
01:55:40:Unit 01:Folding@Home GPU Core - Beta
01:55:40:Unit 01:Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
01:55:41:Unit 01:
01:55:41:Unit 01:Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
01:55:41:Unit 01:Build host: amoeba
01:55:41:Unit 01:Board Type: AMD
01:55:41:Unit 01:Core      : 
01:55:41:Unit 01:Preparing to commence simulation
01:55:41:Unit 01:- Ensuring status. Please wait.
01:55:47:Unit 00:Using Gromacs checkpoints
01:55:49:Unit 01:- Assembly optimizations manually forced on.
01:55:49:Unit 01:- Not checking prior termination.
01:55:50:Unit 00:Mapping NT from 1 to 1 
01:55:50:Unit 01:- Expanded 98709 -> 492188 (decompressed 498.6 percent)
01:55:50:Unit 01:Called DecompressByteArray: compressed_data_size=98709 data_size=492188, decompressed_data_size=492188 diff=0
01:55:50:Unit 01:- Digital signature verified
01:55:50:Unit 01:
01:55:50:Unit 01:Project: 5734 (Run 3, Clone 598, Gen 424)
01:55:50:Unit 01:
01:55:50:Unit 01:Assembly optimizations on if available.
01:55:50:Unit 01:Entering M.D.
01:55:58:Unit 01:Will resume from checkpoint file
01:55:58:Unit 01:Tpr hash 01/wudata_01.tpr:  2467527010 3895228313 3608002700 918698187 886431637
01:55:59:Unit 00:Resuming from checkpoint
01:55:59:Unit 00:Verified 00/wudata_01.log
01:56:01:Unit 00:Verified 00/wudata_01.trr
01:56:01:Unit 00:Verified 00/wudata_01.xtc
01:56:01:Unit 00:Verified 00/wudata_01.edr
01:56:06:Unit 00:Completed 89000 out of 250000 steps  (35%)
01:56:06:Unit 00:Gromacs cannot continue further.
01:56:06:Unit 00:Going to send back what have done -- stepsTotalG=250000
01:56:06:Unit 00:Work fraction=0.3560 steps=250000.
01:56:10:Unit 00:logfile size=67847 infoLength=67847 edr=0 trr=23
01:56:10:Unit 00:logfile size: 67847 info=67847 bed=0 hdr=23
01:56:11:Unit 00:- Writing 68383 bytes of core data to disk...
01:56:26:Unit 00:Done: 67871 -> 9517 (compressed to 14.0 percent)
01:56:26:Unit 00:  ... Done.
01:57:04:Unit 01:Working on Protein
01:57:05:Unit 01:Client config unavailable.
01:57:06:Unit 01:Starting GUI Server
01:57:16:Unit 01:Resuming from checkpoint
01:57:16:Unit 01:fcCheckPointResume: retreived and current tpr file hash:
01:57:16:Unit 01:   0   2467527010   2467527010
01:57:16:Unit 01:   1   3895228313   3895228313
01:57:16:Unit 01:   2   3608002700   3608002700
01:57:16:Unit 01:   3    918698187    918698187
01:57:16:Unit 01:   4    886431637    886431637
01:57:16:Unit 01:Verified 01/wudata_01.log
01:57:16:Unit 01:Verified 01/wudata_01.edr
01:57:16:Unit 01:Verified 01/wudata_01.xtc
01:57:16:Unit 01:Completed 20%
01:57:34:FahCore running Unit 00 returned: UNSTABLE_MACHINE (122)
01:57:34:Starting Unit 00
01:57:34:Running core: "C:/Documents and Settings/All Users/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_a6.fah/FahCore_a6.exe" -dir 00 -suffix 01 -lifeline 5288 -version 701 -checkpoint 15 -forceasm
01:57:34:Started core on PID 4332
01:57:34:FahCore 0xa6 started
01:57:35:FahCore running Unit 00 returned: MISSING_WORK_FILES (116)
01:57:35:WARNING: Unit 00 Fatal error, dumping
01:57:36:Sending unit results: id:00 state:SEND project:3866 run:2597 clone:0 gen:0 core:0xa6 unit:0x00000002000000654dd9e8cc98da84f5
01:57:36:Unit 00: Uploading 9.79KiB
01:57:36:Connecting to assign3.stanford.edu:8080
01:57:36:Connecting to 128.143.48.226:8080
01:57:36:News: Welcome to Folding@Home
01:57:36:Assigned to work server 128.143.48.226
01:57:36:Requesting new work unit for slot 00: READY uniprocessor from 128.143.48.226
01:57:36:Connecting to 128.143.48.226:8080
01:57:37:Unit 00: Upload complete
01:57:37:Server responded UNKNOWN_ENUM (575)
01:57:37:WARNING: Failed to send results, will try again later
01:57:37:Sending unit results: id:00 state:SEND project:3866 run:2597 clone:0 gen:0 core:0xa6 unit:0x00000002000000654dd9e8cc98da84f5
01:57:38:Slot 00: Downloading 1.42MiB
01:57:38:Unit 00: Uploading 9.79KiB
01:57:38:Connecting to 128.143.48.226:8080
01:57:38:Unit 00: Upload complete
01:57:38:Server responded WORK_ACK (400)
01:57:39:Cleaning up Unit 00
01:57:41:Slot 00: Download complete
01:57:41:Received Unit: id:02 state:DOWNLOAD project:3865 run:9337 clone:0 gen:1 core:0xa6 unit:0x00000001000000654ddad1eacf71e6f8
01:57:41:Starting Unit 02
01:57:41:Running core: "C:/Documents and Settings/All Users/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_a6.fah/FahCore_a6.exe" -dir 02 -suffix 01 -lifeline 5288 -version 701 -checkpoint 15 -forceasm
01:57:41:Started core on PID 4284
01:57:41:FahCore 0xa6 started
01:57:43:Unit 02:
01:57:43:Unit 02:*------------------------------*
01:57:43:Unit 02:Folding@Home Gromacs Core
01:57:43:Unit 02:Version 2.28 (Wed Mar 23 13:51:17 PDT 2011)
01:57:43:Unit 02:
01:57:43:Unit 02:Preparing to commence simulation
01:57:43:Unit 02:- Assembly optimizations manually forced on.
01:57:43:Unit 02:- Not checking prior termination.
01:57:44:Unit 02:- Expanded 1488276 -> 2415376 (decompressed 162.2 percent)
01:57:44:Unit 02:Called DecompressByteArray: compressed_data_size=1488276 data_size=2415376, decompressed_data_size=2415376 diff=0
01:57:44:Unit 02:- Digital signature verified
01:57:44:Unit 02:
01:57:44:Unit 02:Project: 3865 (Run 9337, Clone 0, Gen 1)
01:57:44:Unit 02:
01:57:45:Unit 02:Assembly optimizations on if available.
01:57:45:Unit 02:Entering M.D.
01:57:51:Unit 02:Mapping NT from 1 to 1 
01:57:56:Unit 02:Completed 0 out of 250000 steps  (0%)
02:06:59:Unit 01:Completed 21%

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Sun Jun 12, 2011 4:38 am
by PantherX
I performed a look-up on the WU Project: 3866 (Run 2597, Clone 0, Gen 0) and there are two reports in the WU Database:
Your WU (P3866 R2597 C0 G0) was added to the stats database on 2011-06-09 15:07:12 for 0 points of credit.
Your WU (P3866 R2597 C0 G0) was added to the stats database on 2011-06-10 08:07:30 for 0 points of credit.
I have marked it for a follow-up.

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Wed Jun 15, 2011 4:20 am
by mrshirts
So, this is dying only on startup, correct?

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Wed Jun 15, 2011 4:20 am
by mrshirts
By "startup" I mean restarting on checkpoint.

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Wed Jun 15, 2011 4:44 am
by GreyWhiskers
That's what happened in my case - it was processing fine until I paused the slot, then exited the FAH Client control so I could reboot my system after almost two weeks of uptime. Then, my report above is what happened when the slot was restarted upon reboot.

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Thu Jun 16, 2011 3:40 am
by mrshirts
So, Client 7, on Windows XP. I don't think we've tested that combination, so that might be the reason. Is this reproducible? If you can get another WU (near the beginning, hopefully!) and see if it crashes, that would be very valuable!

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Thu Jun 16, 2011 6:23 am
by GreyWhiskers
After the "Gromacs cannot continue further." error noted in my log above, Project: 3865 (Run 9337, Clone 0, Gen 1) was started. I'm 51% complete - with a TPF of one hour 25+ minutes. My current projection is that it will not make the 7 day preferred deadline, but I think I will let it run to completion.

So - I don't plan to reboot the system to see what will happen with a new WU yet.

BTW, this WU is getting maybe 41-42 ppd estimates, about half what this CPU gets with other "normal" work units like p6509, p6517, p6521. The only time I've seen the processing ON THIS CPU run so slow is if the SSE Extensions are not enabled. Are we sure that the core A6 enables the "Extra SSE boost OK" like I see reported at the start of each core 78 WU? Regardless of the verbosity level, I've not seen the a6 core report this at the beginning of a WU.

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Thu Jun 16, 2011 3:35 pm
by GreyWhiskers
Unexpected happenings: a normal shutdown and restart of the core a6 WU last night after a reboot.

The computer that has been running this core a6 had another reboot last night - to load some 20 Windows updates. The computer shut itself down this time, including the FAH Control/Client. This differs from the case I describe at the beginning of this thread where I carefully paused the running cores in the FAH GUI, then manually exited the GUI.

Maybe just pulling the plug works better than a careful shutdown.

I've included 3 log snippets

- The end of one log, when Windows forced a reboot (10:15:31:Lost lifeline PID 4584, exiting).
- The beginning of the next log, when Windows came up and the FAH system auto-started from the shortcut in the START menu, with a normal restart of 12:43:27:Unit 02:Project: 3865 (Run 9337, Clone 0, Gen 1).
- A historical log showing the termination prior to the event that aborted Project: 3866 (Run 2597, Clone 0, Gen 0).



End of log, including the Windows-induced shutdown as Windows was preparing to reboot. Note that there was no gentle shutdown.

Code:

09:04:09:Unit 04:Completed 1%
09:15:12:Unit 04:Completed 2%
09:23:50:Unit 02:Completed 132500 out of 250000 steps  (53%)
09:24:49:Unit 04:Completed 3%
09:34:20:Unit 04:Completed 4%
09:44:34:Unit 04:Completed 5%
09:53:28:Unit 04:Completed 6%
10:00:13:Unit 04:Completed 7%
10:06:55:Unit 04:Completed 8%
10:13:36:Unit 04:Completed 9%
10:15:31:Lost lifeline PID 4584, exiting

Log for startup after reboot.

Code:

*********************** Log Started 16/Jun/2011-12:42:45 ***********************
12:42:45:************************* Folding@home Client *************************
12:42:45:      Website: http://folding.stanford.edu/
12:42:45:    Copyright: (c) 2009,2010 Stanford University
12:42:45:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
12:42:45:         Args: --lifeline 4124
12:42:45:       Config: C:/Documents and Settings/All Users/Application
12:42:45:               Data/FAHClient/config.xml
12:42:45:******************************** Build ********************************
12:42:45:      Version: 7.1.21
12:42:45:         Date: Mar 23 2011
12:42:45:         Time: 15:13:48
12:42:45:      SVN Rev: 2883
12:42:45:       Branch: fah/trunk/client
12:42:45:     Compiler: Intel(R) C++ MSVC 1500 mode 1110
12:42:45:      Options: /TP /nologo /EHa /wd4297 /wd4103 /wd1786 /Ox -arch:SSE
12:42:45:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qrestrict /MT
12:42:45:     Platform: win32 XP
12:42:45:         Bits: 32
12:42:45:         Mode: Release
12:42:45:******************************* System ********************************
12:42:45:           OS: Microsoft Windows XP Home Edition
12:42:45:          CPU: Intel(R) Pentium(R) 4 CPU 3.20GHz
12:42:45:       CPU ID: GenuineIntel Family 15 Model 2 Stepping 9
12:42:45:         CPUs: 2
12:42:45:       Memory: 2.00GiB
12:42:45:  Free Memory: 1.04GiB
12:42:45:      Threads: WINDOWS_THREADS
12:42:45:         GPUs: 2
12:42:45:        GPU 0: ATI:2 RV730 Pro AGP [Radeon HD 4600 Series]
12:42:45:        GPU 1: RV710/730
12:42:45:         CUDA: Not detected
12:42:45:   On Battery: false
12:42:45:   UTC offset: -7
12:42:45:          PID: 5276
12:42:45:          CWD: C:/Documents and Settings/All Users/Application Data/FAHClient
12:42:45:Win32 Service: false
12:42:45:***********************************************************************
12:42:45:<config>
12:42:45:  <service-description v='Folding@home Client'/>
12:42:45:  <service-restart v='true'/>
12:42:45:  <service-restart-delay v='5000'/>
12:42:45:
12:42:45:  <!-- Client Control -->
12:42:45:  <cycle-rate v='4'/>
12:42:45:  <cycles v='-1'/>
12:42:45:  <data-directory v='.'/>
12:42:45:  <exec-directory v='C:\Program Files\FAHClient'/>
12:42:45:  <exit-when-done v='false'/>
12:42:45:  <max-delay v='21600'/>
12:42:45:  <min-delay v='60'/>
12:42:45:  <threads v='4'/>
12:42:45:
12:42:45:  <!-- Configuration -->
12:42:45:  <config-rotate v='true'/>
12:42:45:  <config-rotate-dir v='configs'/>
12:42:45:  <config-rotate-max v='16'/>
12:42:45:
12:42:45:  <!-- Debugging -->
12:42:45:  <assignment-servers>
12:42:45:    assign3.stanford.edu:8080 assign4.stanford.edu:80
12:42:45:  </assignment-servers>
12:42:45:  <capture-directory v='capture'/>
12:42:45:  <capture-sockets v='false'/>
12:42:45:  <debug-sockets v='false'/>
12:42:45:  <exception-locations v='true'/>
12:42:45:  <gpu-assignment-servers>
12:42:45:    assign-GPU.stanford.edu:80 assign-GPU.stanford.edu:8080
12:42:45:  </gpu-assignment-servers>
12:42:45:  <stack-traces v='false'/>
12:42:45:
12:42:45:  <!-- Error Handling -->
12:42:45:  <max-slot-errors v='5'/>
12:42:45:  <max-unit-errors v='5'/>
12:42:45:
12:42:45:  <!-- FahCore Control -->
12:42:45:  <checkpoint v='15'/>
12:42:45:  <core-dir v='cores'/>
12:42:45:  <core-priority v='idle'/>
12:42:45:  <cpu-affinity v='false'/>
12:42:45:  <cpu-usage v='100'/>
12:42:45:  <no-assembly v='false'/>
12:42:45:
12:42:45:  <!-- Folding Slot Configuration -->
12:42:45:  <client-subtype v='STDCLI'/>
12:42:45:  <client-type v='normal'/>
12:42:45:  <cpu-species v='UNKNOWN'/>
12:42:45:  <cpu-type v='X86'/>
12:42:45:  <cpus v='2'/>
12:42:45:  <extra-core-args v='-forceasm'/>
12:42:45:  <gpu v='false'/>
12:42:45:  <gpu-id v='0'/>
12:42:45:  <max-packet-size v='normal'/>
12:42:45:  <os-species v='WIN_XP'/>
12:42:45:  <os-type v='WIN32'/>
12:42:45:  <project-key v='0'/>
12:42:45:  <smp v='true'/>
12:42:45:
12:42:45:  <!-- Logging -->
12:42:45:  <log v='log.txt'/>
12:42:45:  <log-color v='false'/>
12:42:45:  <log-crlf v='true'/>
12:42:45:  <log-date v='false'/>
12:42:45:  <log-debug v='true'/>
12:42:45:  <log-domain v='false'/>
12:42:45:  <log-header v='true'/>
12:42:45:  <log-level v='true'/>
12:42:45:  <log-no-info-header v='true'/>
12:42:45:  <log-redirect v='false'/>
12:42:45:  <log-rotate v='true'/>
12:42:45:  <log-rotate-dir v='logs'/>
12:42:45:  <log-rotate-max v='16'/>
12:42:45:  <log-short-level v='false'/>
12:42:45:  <log-simple-domains v='true'/>
12:42:45:  <log-thread-id v='false'/>
12:42:45:  <log-time v='true'/>
12:42:45:  <log-to-screen v='true'/>
12:42:45:  <log-truncate v='false'/>
12:42:45:  <verbosity v='4'/>
12:42:45:
12:42:45:  <!-- Process Control -->
12:42:45:  <child v='false'/>
12:42:45:  <daemon v='false'/>
12:42:45:  <pid v='false'/>
12:42:45:  <pid-file v='Folding@home Client.pid'/>
12:42:45:  <respawn v='false'/>
12:42:45:  <service v='false'/>
12:42:45:
12:42:45:  <!-- Remote Command Server -->
12:42:45:  <command-address v='0.0.0.0'/>
12:42:45:  <command-allow v='127.0.0.1'/>
12:42:45:  <command-allow-no-pass v='127.0.0.1'/>
12:42:45:  <command-deny v='0.0.0.0/0'/>
12:42:45:  <command-deny-no-pass v='0.0.0.0/0'/>
12:42:45:  <command-port v='36330'/>
12:42:45:  <password v=''/>
12:42:45:
12:42:45:  <!-- Slot Control -->
12:42:45:  <max-shutdown-wait v='60'/>
12:42:45:  <pause-on-battery v='false'/>
12:42:45:  <pause-on-start v='false'/>
12:42:45:
12:42:45:  <!-- User Information -->
12:42:45:  <machine-id v='0'/>
12:42:45:  <passkey v='********************************'/>
12:42:45:  <team v='0'/>
12:42:45:  <user v='GreyWhiskers'/>
12:42:45:
12:42:45:  <!-- Work Unit Control -->
12:42:45:  <dump-after-deadline v='true'/>
12:42:45:  <max-queue v='16'/>
12:42:45:  <max-units v='0'/>
12:42:45:  <next-unit-percentage v='99'/>
12:42:45:
12:42:45:  <!-- Folding Slots -->
12:42:45:  <slot id='1' type='GPU'>
12:42:45:    <client-type v='advanced'/>
12:42:45:  </slot>
12:42:45:  <slot id='0' type='UNIPROCESSOR'>
12:42:45:    <core-priority v='low'/>
12:42:45:  </slot>
12:42:45:</config>
12:42:56:Enabled folding slot 01: READY gpu:0:"RV730 Pro AGP [Radeon HD 4600 Series]"
12:42:56:Enabled folding slot 00: READY uniprocessor
12:42:57:Starting Unit 02
12:42:57:Running core: "C:/Documents and Settings/All Users/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_a6.fah/FahCore_a6.exe" -dir 02 -suffix 01 -lifeline 5276 -version 701 -checkpoint 15 -forceasm
12:42:58:Server connection id=1 on 0.0.0.0:36330 from 127.0.0.1:1072
12:43:08:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1:1073
12:43:13:Started core on PID 2804
12:43:13:FahCore 0xa6 started
12:43:15:Starting Unit 04
12:43:15:Running core: "C:/Documents and Settings/All Users/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/ATI/R600/Core_11.fah/FahCore_11.exe" -dir 04 -suffix 01 -lifeline 5276 -version 701 -checkpoint 15 -gpu 0 -forceasm
12:43:15:Unit 02:
12:43:15:Unit 02:*------------------------------*
12:43:15:Unit 02:Folding@Home Gromacs Core
12:43:15:Unit 02:Version 2.28 (Wed Mar 23 13:51:17 PDT 2011)
12:43:15:Unit 02:
12:43:15:Started core on PID 5452
12:43:15:Unit 02:Preparing to commence simulation
12:43:15:FahCore 0x11 started
12:43:15:Unit 02:- Ensuring status. Please wait.
12:43:17:Unit 04:
12:43:17:Unit 04:*------------------------------*
12:43:17:Unit 04:Folding@Home GPU Core - Beta
12:43:18:Unit 04:Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
12:43:18:Unit 04:
12:43:18:Unit 04:Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86 
12:43:18:Unit 04:Build host: amoeba
12:43:18:Unit 04:Board Type: AMD
12:43:18:Unit 04:Core      : 
12:43:18:Unit 04:Preparing to commence simulation
12:43:18:Unit 04:- Assembly optimizations manually forced on.
12:43:18:Unit 04:- Not checking prior termination.
12:43:18:Unit 04:- Expanded 98694 -> 492188 (decompressed 498.7 percent)
12:43:18:Unit 04:Called DecompressByteArray: compressed_data_size=98694 data_size=492188, decompressed_data_size=492188 diff=0
12:43:18:Unit 04:- Digital signature verified
12:43:18:Unit 04:
12:43:18:Unit 04:Project: 5733 (Run 4, Clone 583, Gen 443)
12:43:18:Unit 04:
12:43:19:Unit 04:Assembly optimizations on if available.
12:43:19:Unit 04:Entering M.D.
12:43:24:Unit 02:- Assembly optimizations manually forced on.
12:43:24:Unit 02:- Not checking prior termination.
12:43:25:Unit 04:Will resume from checkpoint file
12:43:25:Unit 04:Tpr hash 04/wudata_01.tpr:  1804990054 398131752 4175671175 1453718808 228416442
12:43:27:Unit 02:- Expanded 1488276 -> 2415376 (decompressed 162.2 percent)
12:43:27:Unit 02:Called DecompressByteArray: compressed_data_size=1488276 data_size=2415376, decompressed_data_size=2415376 diff=0
12:43:27:Unit 02:- Digital signature verified
12:43:27:Unit 02:
12:43:27:Unit 02:Project: 3865 (Run 9337, Clone 0, Gen 1)
12:43:27:Unit 02:
12:43:29:Unit 02:Assembly optimizations on if available.
12:43:29:Unit 02:Entering M.D.
12:43:35:Unit 02:Using Gromacs checkpoints
12:43:37:Unit 02:Mapping NT from 1 to 1 
12:43:39:Unit 04:Working on Protein
12:43:40:Unit 04:Client config unavailable.
12:43:41:Unit 04:Starting GUI Server
12:43:44:Unit 02:Resuming from checkpoint
12:43:45:Unit 02:Verified 02/wudata_01.log
12:43:46:Unit 02:Verified 02/wudata_01.trr
12:43:46:Unit 02:Verified 02/wudata_01.xtc
12:43:47:Unit 02:Verified 02/wudata_01.edr
12:43:53:Unit 02:Completed 133600 out of 250000 steps  (53%)
12:43:54:Unit 04:Resuming from checkpoint
12:43:54:Unit 04:fcCheckPointResume: retreived and current tpr file hash:
12:43:54:Unit 04:   0   1804990054   1804990054
12:43:54:Unit 04:   1    398131752    398131752
12:43:54:Unit 04:   2   4175671175   4175671175
12:43:54:Unit 04:   3   1453718808   1453718808
12:43:54:Unit 04:   4    228416442    228416442
12:43:55:Unit 04:Verified 04/wudata_01.log
12:43:55:Unit 04:Verified 04/wudata_01.edr
12:43:55:Unit 04:Verified 04/wudata_01.xtc
12:43:55:Unit 04:Completed 9%
12:44:05:Server connection id=3 on 0.0.0.0:36330 from 127.0.0.1:1076
12:44:20:Server connection id=4 on 0.0.0.0:36330 from 192.168.10.193:4527
12:50:48:Unit 04:Completed 10%
12:52:31:Server connection id=5 on 0.0.0.0:36330 from 192.168.10.193:4543
12:57:32:Unit 04:Completed 11%
13:04:10:Unit 04:Completed 12%
13:10:50:Unit 04:Completed 13%
13:17:31:Unit 04:Completed 14%
13:24:11:Unit 04:Completed 15%
13:30:51:Unit 04:Completed 16%
13:33:01:Unit 02:Completed 135000 out of 250000 steps  (54%)
13:37:31:Unit 04:Completed 17%
13:44:09:Unit 04:Completed 18%
13:50:49:Unit 04:Completed 19%
13:57:26:Unit 04:Completed 20%
14:03:13:Server connection id=6 on 0.0.0.0:36330 from 192.168.10.193:4621
14:04:07:Unit 04:Completed 21%
14:10:48:Unit 04:Completed 22%
14:17:25:Unit 04:Completed 23%
14:24:10:Unit 04:Completed 24%
14:30:54:Unit 04:Completed 25%
14:37:47:Unit 04:Completed 26%
14:44:28:Unit 04:Completed 27%
14:51:24:Unit 04:Completed 28%

Historical data: shutdown of the system prior to the Project: 3866 (Run 2597, Clone 0, Gen 0) failure at the top of the thread. This was a careful "manual" shutdown through the GUI. Nothing unexpected that I can see.

Code:

00:32:24:Unit 00:Completed 87500 out of 250000 steps  (35%)
00:38:51:Unit 01:Completed 13%
00:45:28:Unit 01:Completed 14%
00:52:08:Unit 01:Completed 15%
00:58:47:Unit 01:Completed 16%
01:05:23:Unit 01:Completed 17%
01:12:03:Unit 01:Completed 18%
01:18:42:Unit 01:Completed 19%
01:25:18:Unit 01:Completed 20%
01:27:02:Slot 01 paused
01:27:02:Slot 00 paused
01:27:02:Slot 01: shutting core down
01:27:02:WARNING: FahCore not accepting gentle shutdown, killing
01:27:02:WARNING: Killing Unit 01
01:27:02:Slot 00: shutting core down
01:27:02:WARNING: FahCore not accepting gentle shutdown, killing
01:27:02:WARNING: Killing Unit 00
01:27:02:FahCore running Unit 00 exited
01:27:03:FahCore running Unit 01 exited
01:27:22:Lost lifeline PID 3280, exiting
01:27:23:Clean exit

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Thu Jun 16, 2011 4:54 pm
by bruce
Project: 3866 (Run 2597, Clone 0, Gen 0) has been successfully completed by someone else. I think it's fair to assume that FahCore_a6 does not resume correctly from a checkpoint. Unfortunately this probably means that the projects should be returned to beta testing until a new version of core a6 can correct this issue. Mrshirts, what do you think?

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Thu Jun 16, 2011 5:05 pm
by mrshirts
Hi, Bruce-

Right now, the problem is that it usually returns correctly from a checkpoint. It appears to occasionally fail, but in beta, we couldn't collect enough information to actually figure out when it fails. It might be best to pull it back to advanced, however - that will presumably allow us to figure out where the failures are occurring.

From GreyWhiskers' data, it looks like it correctly restarted from checkpoint the second time, correct? This makes it even more confusing.

It might end up being best to pull back to advanced; this way, we are more likely to get enough data to catch the rare bugs.

I am concerned about the ppd and the speed. SSE should be working -- it was working on the versions tested locally. As you can see from the log, it was compiled with SSE. I'll have to investigate why that message is not being printed. The message is being printed for V7 client with the core 78 WU, correct?

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Thu Jun 16, 2011 5:24 pm
by bruce
mrshirts wrote:I am concerned about the ppd and the speed. SSE should be working -- it was working on the versions tested locally. As you can see from the log, it was compiled with SSE. I'll have to investigate why that message is not being printed. The message is being printed for V7 client with the core 78 WU, correct?
The logic in Core_78 may be the exact issue here. V7 is shutting down cores in a way that disables SSE in core_78 on the next run. Please read the V7 tickets associated with the keyword forceasm (including closed tickets).

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Thu Jun 16, 2011 6:30 pm
by GreyWhiskers
I think I've made a mess of this thread, but the WU that successfully restarted from checkpoint after a reboot early this morning is Project: 3865 (Run 9337, Clone 0, Gen 1).

It's going very slowly: my last frame took 1 hour 24 minutes, and the WU is now at 57%. At that rate it would finish roughly 9 or 10 hours (conservatively) after the 7-day preferred deadline. :roll:
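
For reference, the remaining time implied by those two numbers can be sketched as below. This is only a rough estimate: it assumes 100 frames of 1% each and that every remaining frame takes as long as the last one; `remaining_hours` is a made-up helper name.

```python
def remaining_hours(percent_done, minutes_per_frame):
    """Estimate hours left, assuming 100 frames of 1% each at a constant rate."""
    frames_left = 100 - percent_done
    return frames_left * minutes_per_frame / 60.0

# 57% done, last frame took 1 hour 24 minutes (84 minutes).
print(round(remaining_hours(57, 84), 1))  # 60.2
```

That is about two and a half more days of folding from the 57% mark.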

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Thu Jun 16, 2011 6:55 pm
by mrshirts
I don't think SSE is the issue here, because -forceasm is being used. The line
01:55:49:Unit 01:- Assembly optimizations manually forced on.
indicates that any problem there is being ignored.

If it IS the issue, deleting the core.dat file should eliminate it; that's the file that stores that information. But I don't think that's the issue.
I'll be looking into the PPD, because the PPD for this WU is not supposed to be low on a uniprocessor.

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Thu Jun 16, 2011 8:24 pm
by GreyWhiskers
mrshirts wrote:Hi, Bruce-

Right now, the problem is that it usually returns correctly from a checkpoint. It appears to occasionally fail, but in beta, we couldn't collect enough information to actually figure out when it fails. It might be best to pull it back to advanced, however - that will presumably allow us to figure out where the failures are occurring.

From GreyWhiskers' data, it looks like it correctly restarted from checkpoint the second time, correct? This makes it even more confusing.

It might end up being best to pull back to advanced; this way, we are more likely to get enough data to catch the rare bugs.

I am concerned about the ppd and the speed. SSE should be working -- it was working on the versions tested locally. As you can see from the log, it was compiled with SSE. I'll have to investigate why that message is not being printed. The message is being printed for V7 client with the core 78 WU, correct?
Here's a snippet for a random Core 78 WU from the log of my other uniprocessor computer. With -forceasm on, both of these lines are always present for Core 78 WUs:
10:57:13:Unit 00:Assembly optimizations on if available.
10:57:20:Unit 00:Extra SSE boost OK.

Code: Select all

10:57:12:Running core: "C:/Documents and Settings/Admin/Application Data/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_78.fah/FahCore_78.exe" -dir 00 -suffix 01 -lifeline 27836 -version 701 -checkpoint 15 -forceasm
10:57:12:Connecting to 171.64.65.62:8080
10:57:12:Started core on PID 37272
10:57:12:FahCore 0x78 started
10:57:13:Unit 00:
10:57:13:Unit 00:*------------------------------*
10:57:13:Unit 00:Folding@Home Gromacs Core
10:57:13:Unit 00:Version 1.90 (March 8, 2006)
10:57:13:Unit 00:
10:57:13:Unit 00:Preparing to commence simulation
10:57:13:Unit 00:- Assembly optimizations manually forced on.
10:57:13:Unit 00:- Not checking prior termination.
10:57:13:Unit 00:- Expanded 499895 -> 2506673 (decompressed 501.4 percent)
10:57:13:Unit 00:- Starting from initial work packet
10:57:13:Unit 00:
10:57:13:Unit 00:Project: 6515 (Run 15, Clone 228, Gen 77)
10:57:13:Unit 00:
10:57:13:Unit 00:Assembly optimizations on if available.
10:57:13:Unit 00:Entering M.D.
10:57:14:Unit 01: Upload complete
10:57:14:Server responded WORK_ACK (400)
10:57:15:Cleaning up Unit 01
10:57:20:Unit 00:Protein: 1CFC_A_16 in water
10:57:20:Unit 00:
10:57:20:Unit 00:Writing local files
10:57:20:Unit 00:Extra SSE boost OK.
10:57:20:Unit 00:Writing local files
10:57:20:Unit 00:Completed 0 out of 250000 steps  (0%)
11:07:01:Unit 00:Writing local files
11:07:01:Unit 00:Completed 2500 out of 250000 steps  (1%)
In the core a6 WUs, even if the extra SSE boost is enabled, the "Extra SSE boost OK." line doesn't seem to be written to the log. [I've edited out the interleaved Core 11 GPU lines to cut down on the confusion.]
12:43:24:Unit 02:- Assembly optimizations manually forced on.
12:43:24:Unit 02:- Not checking prior termination.
12:43:27:Unit 02:- Expanded 1488276 -> 2415376 (decompressed 162.2 percent)
12:43:27:Unit 02:Called DecompressByteArray: compressed_data_size=1488276 data_size=2415376, decompressed_data_size=2415376 diff=0
12:43:27:Unit 02:- Digital signature verified
12:43:27:Unit 02:
12:43:27:Unit 02:Project: 3865 (Run 9337, Clone 0, Gen 1)
12:43:27:Unit 02:
12:43:29:Unit 02:Assembly optimizations on if available.
12:43:29:Unit 02:Entering M.D.
12:43:35:Unit 02:Using Gromacs checkpoints
12:43:37:Unit 02:Mapping NT from 1 to 1
12:43:44:Unit 02:Resuming from checkpoint
12:43:45:Unit 02:Verified 02/wudata_01.log
12:43:46:Unit 02:Verified 02/wudata_01.trr
12:43:46:Unit 02:Verified 02/wudata_01.xtc
12:43:47:Unit 02:Verified 02/wudata_01.edr
12:43:53:Unit 02:Completed 133600 out of 250000 steps (53%)
13:33:01:Unit 02:Completed 135000 out of 250000 steps (54%)
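
The comparison between the two logs can be sketched as a small script that reports which optimization messages appear for a given unit. This is only an illustration: `markers_seen` is a hypothetical helper, the marker strings are copied from the logs above, and a missing line only shows that the core did not print the message, not that SSE is actually off.

```python
# Messages copied verbatim from the Core 78 and core a6 logs above.
MARKERS = (
    "Assembly optimizations manually forced on.",
    "Assembly optimizations on if available.",
    "Extra SSE boost OK.",
)

def markers_seen(log_lines, unit):
    """Map each marker message to True/False for the given unit number."""
    prefix = "Unit %s:" % unit
    seen = dict.fromkeys(MARKERS, False)
    for line in log_lines:
        if prefix not in line:
            continue  # skip lines that belong to other units
        for marker in MARKERS:
            if marker in line:
                seen[marker] = True
    return seen

core78_log = [
    "10:57:13:Unit 00:- Assembly optimizations manually forced on.",
    "10:57:13:Unit 00:Assembly optimizations on if available.",
    "10:57:20:Unit 00:Extra SSE boost OK.",
]
a6_log = [
    "12:43:24:Unit 02:- Assembly optimizations manually forced on.",
    "12:43:29:Unit 02:Assembly optimizations on if available.",
    "12:43:53:Unit 02:Completed 133600 out of 250000 steps (53%)",
]
print(markers_seen(core78_log, "00")["Extra SSE boost OK."])  # True
print(markers_seen(a6_log, "02")["Extra SSE boost OK."])      # False
```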

Re: Project: 3866 (Run 2597, Clone 0, Gen 0)

Posted: Fri Jun 17, 2011 2:43 am
by SomeStones
This series does take a long time - from 44 hours for my 2 GHz dual-core laptop to 77 hours for my 3.0 GHz P4 to 110 hours for my 2.0 GHz Sempron. None report SSE but all report "Assembly optimizations on if available." 333 points seems a little low but better than some others.
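
To put those run times in PPD terms, here is a rough sketch. It assumes each WU in the series is worth a flat 333 points with no bonus and that the quoted hours are wall-clock time per WU on a machine folding around the clock; `points_per_day` is a made-up helper.

```python
def points_per_day(points, hours_per_wu):
    """Points per day for one WU finished in hours_per_wu of continuous folding."""
    return points * 24.0 / hours_per_wu

# Run times quoted in the post above.
for machine, hours in [("2 GHz dual-core laptop", 44),
                       ("3.0 GHz P4", 77),
                       ("2.0 GHz Sempron", 110)]:
    print(machine, round(points_per_day(333, hours), 1))
```

Under those assumptions the three machines come out at roughly 182, 104 and 73 PPD.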