Merged problems with projects 6903/6904, Part 1

Posted: Sat Jan 28, 2012 9:16 pm
by Grandpa_01
You may want to keep an eye on this one; this rig has not failed a WU in ages. Also take a look at the log: v7 was capable of sending 2 WUs and downloading 1 at the same time, and it appears there was no problem. It looks like Stanford got both WUs, the 6901 and the failed 6903, while a 6904 was downloading. Now how often will that happen, 3 transmissions at 1 time with F@H? 8-)

Code:

19:16:44:Starting Unit 00
19:16:44:Running core: /var/lib/fahclient/cores/www.stanford.edu/~pande/Linux/AMD64/beta/Core_a5.fah/FahCore_a5 -dir 00 -suffix 01 -lifeline 1094 -version 701 -checkpoint 15 -np 12
19:16:44:Started core on PID 2181
19:16:44:FahCore 0xa5 started
19:16:44:FahCore, running Unit 00, returned: FILE_IO_ERROR (117 = 0x75)
[93m19:16:44:WARNING: Unit 00 Fatal error, dumping[0m
19:16:44:Sending unit results: id:00 state:SEND error:DUMPED project:6903 run:6 clone:2 gen:75 core:0xa5 unit:0x0000006852be746d4de924152624a1d8
19:16:44:Connecting to 130.237.232.237:8080
19:16:45:Connecting to assign3.stanford.edu:8080
19:16:45:Server responded WORK_ACK (400)
19:16:45:Cleaning up Unit 00

Code:

18:58:09:Unit 01:Completed 247500 out of 250000 steps  (99%)
18:58:10:Connecting to assign3.stanford.edu:8080
18:58:10:News: Welcome to Folding@Home
18:58:10:Assigned to work server 130.237.232.237
18:58:10:Requesting new work unit for slot 00: RUNNING smp:12 from 130.237.232.237
18:58:10:Connecting to 130.237.232.237:8080
18:58:11:Slot 00: Downloading 512B
18:58:11:Slot 00: Download complete
18:58:11:Received Unit: id:00 state:DOWNLOAD error:OK project:6903 run:6 clone:2 gen:75 core:0xa5 unit:0x0000006852be746d4de924152624a1d8
19:16:12:Unit 01:Completed 250000 out of 250000 steps  (100%)
19:16:20:Unit 01:DynamicWrapper: Finished Work Unit: sleep=10000
19:16:30:Unit 01:
19:16:30:Unit 01:Finished Work Unit:
19:16:30:Unit 01:- Reading up to 52713120 from "01/wudata_01.trr": Read 52713120
19:16:30:Unit 01:trr file hash check passed.
19:16:30:Unit 01:- Reading up to 47028628 from "01/wudata_01.xtc": Read 47028628
19:16:30:Unit 01:xtc file hash check passed.
19:16:30:Unit 01:edr file hash check passed.
19:16:30:Unit 01:logfile size: 203005
19:16:30:Unit 01:Leaving Run
19:16:33:Unit 01:- Writing 100114701 bytes of core data to disk...
19:16:34:Unit 01:  ... Done.
19:16:43:Unit 01:- Shutting down core
19:16:43:Unit 01:
19:16:43:Unit 01:Folding@home Core Shutdown: FINISHED_UNIT
19:16:44:FahCore, running Unit 01, returned: FINISHED_UNIT (100 = 0x64)
19:16:44:Sending unit results: id:01 state:SEND error:OK project:6901 run:16 clone:4 gen:98 core:0xa5 unit:0x0000007952be746d4d5b0476e47cbcb3
19:16:44:Unit 01: Uploading 95.48MiB to 130.237.232.237
19:16:44:Connecting to 130.237.232.237:8080
19:16:44:Starting Unit 00
19:16:44:Running core: /var/lib/fahclient/cores/www.stanford.edu/~pande/Linux/AMD64/beta/Core_a5.fah/FahCore_a5 -dir 00 -suffix 01 -lifeline 1094 -version 701 -checkpoint 15 -np 12
19:16:44:Started core on PID 2181
19:16:44:FahCore 0xa5 started
19:16:44:FahCore, running Unit 00, returned: FILE_IO_ERROR (117 = 0x75)
[93m19:16:44:WARNING: Unit 00 Fatal error, dumping[0m
19:16:44:Sending unit results: id:00 state:SEND error:DUMPED project:6903 run:6 clone:2 gen:75 core:0xa5 unit:0x0000006852be746d4de924152624a1d8
19:16:44:Connecting to 130.237.232.237:8080
19:16:45:Connecting to assign3.stanford.edu:8080
19:16:45:Server responded WORK_ACK (400)
19:16:45:Cleaning up Unit 00
19:16:45:News: Welcome to Folding@Home
19:16:45:Assigned to work server 130.237.232.237
19:16:45:Requesting new work unit for slot 00: READY smp:12 from 130.237.232.237
19:16:45:Connecting to 130.237.232.237:8080
19:16:50:Unit 01: 2.06%
19:16:56:Unit 01: 4.61%
19:16:58:Slot 00: Downloading 54.57MiB
19:17:02:Unit 01: 7.00%
19:17:04:Slot 00: 1.40%
19:17:08:Unit 01: 9.55%
19:17:10:Slot 00: 3.70%
19:17:14:Unit 01: 11.81%
19:17:16:Slot 00: 4.86%
19:17:20:Unit 01: 14.13%
19:17:22:Slot 00: 6.05%
19:17:26:Unit 01: 16.10%
19:17:28:Slot 00: 8.01%
19:17:34:Slot 00: 9.95%
19:17:34:Unit 01: 17.62%
19:17:40:Slot 00: 11.84%
19:17:40:Unit 01: 19.72%
19:17:46:Unit 01: 21.29%
19:17:46:Slot 00: 13.70%
19:17:52:Unit 01: 22.85%
19:17:52:Slot 00: 15.64%
19:17:58:Unit 01: 24.40%
19:17:58:Slot 00: 17.58%
19:18:04:Unit 01: 25.95%
19:18:04:Slot 00: 19.51%
19:18:10:Unit 01: 27.52%
19:18:10:Slot 00: 21.45%
19:18:16:Slot 00: 23.27%
19:18:16:Unit 01: 28.71%
19:18:22:Slot 00: 25.21%
19:18:24:Unit 01: 30.53%
19:18:28:Slot 00: 27.14%
19:18:30:Unit 01: 32.71%
19:18:34:Slot 00: 29.08%
19:18:37:Unit 01: 34.10%
19:18:40:Slot 00: 31.02%
19:18:44:Unit 01: 36.06%
19:18:46:Slot 00: 32.95%
19:18:50:Unit 01: 37.90%
19:18:52:Slot 00: 34.77%
19:18:56:Unit 01: 39.43%
19:18:58:Slot 00: 36.71%
19:19:02:Unit 01: 40.99%
19:19:04:Slot 00: 38.64%
19:19:08:Unit 01: 42.55%
19:19:10:Slot 00: 40.58%
19:19:14:Unit 01: 44.10%
19:19:16:Slot 00: 42.41%
19:19:20:Unit 01: 45.64%
19:19:22:Slot 00: 44.34%
19:19:26:Unit 01: 46.85%
19:19:28:Slot 00: 46.27%
19:19:32:Unit 01: 48.76%
19:19:34:Slot 00: 48.21%
19:19:38:Unit 01: 50.31%
19:19:40:Slot 00: 50.15%
19:19:44:Unit 01: 51.87%
19:19:46:Slot 00: 51.97%
19:19:50:Unit 01: 53.43%
19:19:52:Slot 00: 53.91%
19:19:56:Unit 01: 54.99%
19:19:58:Slot 00: 55.84%
19:20:02:Unit 01: 56.50%
19:20:04:Slot 00: 57.78%
19:20:09:Unit 01: 57.82%
19:20:10:Slot 00: 59.71%
19:20:15:Unit 01: 59.92%
19:20:16:Slot 00: 61.65%
19:20:22:Slot 00: 63.47%
19:20:23:Unit 01: 61.39%
19:20:28:Slot 00: 65.41%
19:20:30:Unit 01: 63.35%
19:20:34:Slot 00: 67.34%
19:20:36:Unit 01: 65.37%
19:20:40:Slot 00: 69.28%
19:20:42:Unit 01: 66.92%
19:20:46:Slot 00: 71.22%
19:20:48:Unit 01: 68.47%
19:20:52:Slot 00: 73.10%
19:20:54:Unit 01: 70.04%
19:20:58:Slot 00: 74.97%
19:21:00:Unit 01: 71.59%
19:21:04:Slot 00: 76.91%
19:21:06:Unit 01: 73.13%
19:21:10:Slot 00: 78.85%
19:21:12:Unit 01: 74.37%
19:21:16:Slot 00: 80.78%
19:21:18:Unit 01: 76.27%
19:21:22:Slot 00: 82.72%
19:21:26:Unit 01: 77.72%
19:21:28:Slot 00: 84.59%
19:21:32:Unit 01: 79.72%
19:21:34:Slot 00: 86.48%
19:21:38:Unit 01: 81.46%
19:21:40:Slot 00: 88.41%
19:21:44:Unit 01: 83.02%
19:21:46:Slot 00: 90.35%
19:21:50:Unit 01: 84.58%
19:21:52:Slot 00: 92.29%
19:21:56:Unit 01: 86.11%
19:21:58:Slot 00: 94.22%
19:22:02:Unit 01: 87.68%
19:22:04:Slot 00: 96.04%
19:22:08:Unit 01: 88.61%
19:22:10:Slot 00: 97.98%
19:22:15:Unit 01: 90.55%
19:22:16:Slot 00: 99.97%
19:22:18:Slot 00: Download complete
19:22:18:Received Unit: id:02 state:DOWNLOAD error:OK project:6904 run:1 clone:27 gen:35 core:0xa5 unit:0x0000003352be746d4e15caab179c5071
19:22:18:Starting Unit 02
19:22:18:Running core: /var/lib/fahclient/cores/www.stanford.edu/~pande/Linux/AMD64/beta/Core_a5.fah/FahCore_a5 -dir 02 -suffix 01 -lifeline 1094 -version 701 -checkpoint 15 -np 12
19:22:18:Started core on PID 2187
19:22:18:FahCore 0xa5 started
19:22:18:Unit 02:
19:22:18:Unit 02:*------------------------------*
19:22:18:Unit 02:Folding@Home Gromacs SMP Core
19:22:18:Unit 02:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
19:22:18:Unit 02:
19:22:18:Unit 02:Preparing to commence simulation
19:22:18:Unit 02:- Looking at optimizations...
19:22:18:Unit 02:- Created dyn
19:22:18:Unit 02:- Files status OK
19:22:21:Unit 01: 92.61%
19:22:21:Unit 02:- Expanded 57215202 -> 71843392 (decompressed 50.5 percent)
19:22:21:Unit 02:Called DecompressByteArray: compressed_data_size=57215202 data_size=71843392, decompressed_data_size=71843392 diff=0
19:22:21:Unit 02:- Digital signature verified
19:22:21:Unit 02:
19:22:21:Unit 02:Project: 6904 (Run 1, Clone 27, Gen 35)
19:22:21:Unit 02:
19:22:21:Unit 02:Assembly optimizations on if available.
19:22:21:Unit 02:Entering M.D.
19:22:28:Unit 02:Mapping NT from 12 to 12 
19:22:29:Unit 01: 94.13%
19:22:34:Unit 02:Completed 0 out of 250000 steps  (0%)
19:22:36:Unit 01: 96.07%
19:22:42:Unit 01: 98.06%
19:22:48:Unit 01: 99.62%
19:22:55:Unit 01: Upload complete
19:22:55:Server responded WORK_ACK (400)
19:22:55:Final credit estimate, 69752.00 points
19:22:55:Cleaning up Unit 01
20:20:56:Unit 02:Completed 2500 out of 250000 steps  (1%)
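
For anyone who wants to check frame times from a v7 log like the one above, here is a minimal Python sketch. It assumes progress lines look exactly like the "Completed N out of M steps" lines in this log; the function name `seconds_per_percent` and the `SAMPLE` list are made up for illustration and are not part of the F@H client.

```python
import re
from datetime import datetime, timedelta

# Matches v7 progress lines such as:
# 19:22:34:Unit 02:Completed 0 out of 250000 steps  (0%)
PROGRESS = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):Unit (\d+):Completed (\d+) out of (\d+) steps"
)

def seconds_per_percent(log_lines, unit):
    """Average seconds per 1% of a WU, using the first and last
    progress line found for the given unit (e.g. "02")."""
    points = []
    for line in log_lines:
        m = PROGRESS.match(line)
        if m and m.group(2) == unit:
            stamp = datetime.strptime(m.group(1), "%H:%M:%S")
            points.append((stamp, int(m.group(3)), int(m.group(4))))
    if len(points) < 2:
        return None  # not enough progress lines to measure anything
    (t0, s0, total), (t1, s1, _) = points[0], points[-1]
    if s1 <= s0:
        return None  # no forward progress between the two lines
    # The log has no dates, so tolerate one midnight rollover.
    elapsed = ((t1 - t0) % timedelta(days=1)).total_seconds()
    percent_done = 100 * (s1 - s0) / total
    return elapsed / percent_done

# Progress lines copied from the log above.
SAMPLE = [
    "18:58:09:Unit 01:Completed 247500 out of 250000 steps  (99%)",
    "19:16:12:Unit 01:Completed 250000 out of 250000 steps  (100%)",
    "19:22:34:Unit 02:Completed 0 out of 250000 steps  (0%)",
    "20:20:56:Unit 02:Completed 2500 out of 250000 steps  (1%)",
]

for unit in ("01", "02"):
    print(unit, seconds_per_percent(SAMPLE, unit))
```

With these four lines, Unit 01 (P6901) comes out at 1083 seconds per 1% and Unit 02 (P6904) at 3502 seconds, a little over 58 minutes.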

Re: project:6903 run:6 clone:2 gen:75

Posted: Sun Jan 29, 2012 10:22 am
by PantherX
There's no data in the WU Database yet so I have marked it for follow-up.

BTW, is there anything unique about your system since it displayed this (2 square boxes):
[93m19:16:44:WARNING: Unit 00 Fatal error, dumping[0m

Re: project:6903 run:6 clone:2 gen:75

Posted: Sun Jan 29, 2012 5:18 pm
by Grandpa_01
No, nothing unique. It is an i7 970 running Ubuntu 10.10 64-bit with the latest F@H v7.

Re: project:6903 run:6 clone:2 gen:75

Posted: Sun Jan 29, 2012 6:50 pm
by PantherX
Okay, thanks. I guess I will just chalk it up to a Linux symbol that isn't displayed by the forum/Windows.
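
The two boxes are most likely the ESC byte (0x1B) of ANSI color codes: `ESC[93m` switches a terminal to bright yellow text and `ESC[0m` resets it, which is a common way to highlight a warning. That the v7 client writes these into its log is an assumption based on the leftover `[93m` and `[0m` in the pasted line. A minimal Python sketch for stripping such codes before pasting a log (the name `strip_ansi` is just for illustration):

```python
import re

# SGR (color/style) sequences: ESC, "[", optional numeric parameters, "m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def strip_ansi(line):
    """Remove ANSI SGR color sequences from one log line."""
    return ANSI_SGR.sub("", line)

# The warning line from the log, with the ESC bytes put back in.
raw = "\x1b[93m19:16:44:WARNING: Unit 00 Fatal error, dumping\x1b[0m"
print(strip_ansi(raw))
```

This only removes color/style codes; other terminal control sequences would need a broader pattern.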

Project: 6904 (Run 2, Clone 14, Gen 51)

Posted: Sat Feb 04, 2012 4:07 pm
by -alias-
Look at this P6904 WU with 13 million steps, which means that 1% will take approximately 60 hours to complete. Is this one of the new bigadv-16 WUs?

Code:

[15:21:39] - Files status OK
[15:21:42] - Expanded 46509013 -> 71843392 (decompressed 62.1 percent)
[15:21:42] Called DecompressByteArray: compressed_data_size=46509013 data_size=71843392, decompressed_data_size=71843392 diff=0
[15:21:42] - Digital signature verified
[15:21:42] 
[15:21:42] Project: 6904 (Run 2, Clone 14, Gen 51)
[15:21:42] 
[15:21:42] Entering M.D.
[15:21:48] Using Gromacs checkpoints
[15:21:51] Mapping NT from 12 to 12 
[15:21:58] Resuming from checkpoint
[15:22:02] Verified work/wudata_00.log
[15:22:03] Verified work/wudata_00.trr
[15:22:03] Verified work/wudata_00.xtc
[15:22:03] Verified work/wudata_00.edr
[15:22:05] Completed 5730 out of 13000000 steps  (0%)
I started it and let it run for 2 hours without it reaching 1%, so I stopped it and started it again to see the progress. It's a little frightening to see 13 million steps!

Re: Project: 6904 (Run 2, Clone 14, Gen 51)

Posted: Sat Feb 04, 2012 4:16 pm
by 7im
16-core WUs haven't been announced yet, so not likely. And not likely on your 8-core box.

Re: Project: 6904 (Run 2, Clone 14, Gen 51)

Posted: Sat Feb 04, 2012 4:38 pm
by -alias-
No, that's what I thought, but then there must be something wrong with this one, as it shows a WU with 13,000,000 steps.

Re: Project: 6904 (Run 2, Clone 14, Gen 51)

Posted: Sat Feb 04, 2012 5:14 pm
by Nathan_P
A 6904 should have 250,000 steps. To me it looks like you are trying to fold the entire 51-gen series of that PRC. Something somewhere has gone wrong with that series of WUs.
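
The numbers fit that reading. A quick Python check, assuming that a normal 6904 generation is 250,000 steps (as stated above) and that gens are numbered from 0; `gens_covered` is a made-up helper name:

```python
STEPS_PER_GEN = 250_000  # normal P6904 WU size, per the post above

def gens_covered(total_steps, steps_per_gen=STEPS_PER_GEN):
    """Number of normal-sized generations that add up to total_steps."""
    gens, remainder = divmod(total_steps, steps_per_gen)
    if remainder:
        raise ValueError("total_steps is not a whole number of generations")
    return gens

# P6904 (Run 2, Clone 14, Gen 51) reported 13,000,000 steps.
print(gens_covered(13_000_000))  # 52, i.e. gens 0 through 51
```

13,000,000 / 250,000 = 52, which is exactly gens 0 through 51. It also means 1% of this WU (130,000 steps) is as much work as 52 normal frames of 2,500 steps each.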

Re: Project: 6904 (Run 2, Clone 14, Gen 51)

Posted: Sun Feb 05, 2012 2:37 am
by Grandpa_01
Hey, I got a 6903 with 10,000,000 steps, so we are special: we get to do the whole thing. I wonder when it will finish. Sounds like something has gone wrong somewhere. I ran it for a couple of hours and thought it was hung; when I restarted it from a saved copy, that is when I noticed the size. Am I supposed to delete this monster?

Code:

Note: Please read the license agreement (fah6 -license). Further 
use of this software requires that you have read and accepted this agreement.

48 cores detected


--- Opening Log file [February 5 02:18:56 UTC] 


# Linux SMP Console Edition ###################################################
###############################################################################

                       Folding@Home Client Version 6.34

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: /home/rick/fah
Executable: ./fah6
Arguments: -smp -bigbeta 

[02:18:56] - Ask before connecting: No
[02:18:56] - User name: Grandpa (Team 183368)
[02:18:56] - User ID: 577FA1AA1F809D6E
[02:18:56] - Machine ID: 7
[02:18:56] 
[02:18:56] Loaded queue successfully.
[02:18:56] 
[02:18:56] + Processing work unit
[02:18:56] Core required: FahCore_a5.exe
[02:18:56] Core found.
[02:18:56] Working on queue slot 03 [February 5 02:18:56 UTC]
[02:18:56] + Working ...
thekraken: The Kraken 0.6 (compiled Wed Jan 25 04:31:22 PST 2012 by rick@rick-H8QG6)
thekraken: Processor affinity wrapper for Folding@Home
thekraken: The Kraken comes with ABSOLUTELY NO WARRANTY; licensed under GPLv2
thekraken: PID: 2770
thekraken: Logging to thekraken.log
[02:18:57] 
[02:18:57] *------------------------------*
[02:18:57] Folding@Home Gromacs SMP Core
[02:18:57] Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
[02:18:57] 
[02:18:57] Preparing to commence simulation
[02:18:57] - Ensuring status. Please wait.
[02:19:06] - Looking at optimizations...
[02:19:06] - Working with standard loops on this execution.
[02:19:06] - Previous termination of core was improper.
[02:19:06] - Files status OK
[02:19:12] - Expanded 46512850 -> 71846524 (decompressed 62.1 percent)
[02:19:12] Called DecompressByteArray: compressed_data_size=46512850 data_size=71846524, decompressed_data_size=71846524 diff=0
[02:19:12] - Digital signature verified
[02:19:12] 
[02:19:12] Project: 6903 (Run 2, Clone 13, Gen 39)
[02:19:12] 
[02:19:12] Entering M.D.
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                            :-)  VERSION 4.5.3  (-:

        Written by Emile Apol, Rossen Apostolov, Herman J.C. Berendsen,
      Aldert van Buuren, Pär Bjelkmar, Rudi van Drunen, Anton Feenstra, 
        Gerrit Groenhof, Peter Kasson, Per Larsson, Pieter Meulenhoff, 
           Teemu Murtola, Szilard Pall, Sander Pronk, Roland Schulz, 
                Michael Shirts, Alfons Sijbers, Peter Tieleman,

               Berk Hess, David van der Spoel, and Erik Lindahl.

       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
            Copyright (c) 2001-2010, The GROMACS development team at
        Uppsala University & The Royal Institute of Technology, Sweden.
            check out http://www.gromacs.org for more information.


                               :-)  Gromacs  (-:

Reading file work/wudata_03.tpr, VERSION 4.5.4-dev-20110530-cc815 (single precision)
[02:19:21] Mapping NT from 48 to 48 
Starting 48 threads
Making 2D domain decomposition 8 x 6 x 1

WARNING: This run will generate roughly 7425 Mb of data

starting mdrun 'Overlay'
10000000 steps,  40000.0 ps.
[02:19:25] Completed 0 out of 10000000 steps  (0%)


Re: Project: 6904 (Run 2, Clone 14, Gen 51)

Posted: Sun Feb 05, 2012 3:30 am
by 7im
If you won't make the deadline, dump the WU, clean your TPR files (work dir), and restart.

Re: Project: 6904 (Run 2, Clone 14, Gen 51)

Posted: Sun Feb 05, 2012 4:05 am
by Grandpa_01
I just wonder what the deadline would be on something like that. That was on a 48-core rig and it did not do 1% in 2 hours. The file size will be 7+ GB when complete. :lol:

Re: project:6903 run:6 clone:2 gen:75

Posted: Sun Feb 05, 2012 9:57 pm
by sick willie
This WU hasn't failed yet, but my rig is taking 30 minutes per frame longer than normal for a 6903. I've restarted the machine with no change in this behavior.

Re: Project: 6904 (Run 2, Clone 14, Gen 51)

Posted: Sun Feb 05, 2012 10:05 pm
by sbinh
And we still haven't had any official word from the VJ folks?

Re: Project: 6904 (Run 2, Clone 14, Gen 51)

Posted: Sun Feb 05, 2012 10:11 pm
by toTOW
I think this is an old bug in WU generation which is coming back ... I pointed Kasson to this thread.

Re: Project: 6904 (Run 2, Clone 14, Gen 51)

Posted: Sun Feb 05, 2012 11:09 pm
by Leonardo
There are reports on Team forums pointing to the same problems in the last 24 hours with P6903 units as well.