172 meg unitinfo.txt / Progress: 1723161600%

Posted: Wed Nov 12, 2008 5:56 pm
by parkut
Two WUs are running on my Quad 6600 Linux boxes.
Both unitinfo.txt files are 172,316,322 bytes.

# cut -c 1-60 unitinfo.txt
Current Work Unit
-----------------
Name: Gromacs
Tag: P2674R3C99G21
Download time: November 12 15:48:59
Due time: November 15 15:48:59
Progress: 1723161600% [||||||||||||||||||||||||||||||||||||

# grep Completed fah.log | tail -5
[17:24:55] Completed 37509 out of 249999 steps (15%)
[17:31:18] Completed 40009 out of 249999 steps (16%)
[17:37:41] Completed 42509 out of 249999 steps (17%)
[17:44:03] Completed 45009 out of 249999 steps (18%)
[17:50:29] Completed 47509 out of 249999 steps (19%)

# cut -c 1-60 unitinfo.txt
Current Work Unit
-----------------
Name: Gromacs
Tag: P2674R3C19G21
Download time: November 12 15:05:55
Due time: November 15 15:05:55
Progress: 1723161600% [||||||||||||||||||||||||||||||||||||

[root@quad1 fah6]# grep Completed fah.log | tail -5
[17:24:13] Completed 55009 out of 249999 steps (22%)
[17:30:31] Completed 57509 out of 249999 steps (23%)
[17:36:48] Completed 60009 out of 249999 steps (24%)
[17:43:04] Completed 62509 out of 249999 steps (25%)
[17:49:21] Completed 65009 out of 249999 steps (26%)

Re: 172 meg unitinfo.txt / Progress: 1723161600%

Posted: Wed Nov 12, 2008 6:29 pm
by Ivoshiee
There have been some reports of this recently.
The client does need some fixing.

Re: 172 meg unitinfo.txt / Progress: 1723161600%

Posted: Wed Nov 12, 2008 6:33 pm
by Xilikon
I like how the file size matches the % in unitinfo.txt. This means there is no cap check in the client (if the % is over 100, something is wrong), so the pipe character is repeated non-stop through the file.
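
The missing cap check described above can be illustrated with a minimal sketch. This is not the client's code: the function name `render_progress`, the 10-character bar width, and the `_` filler are assumptions made only for illustration. The point is that clamping the percentage before drawing keeps the bar at a fixed length no matter what value comes in.

```python
def render_progress(percent: int, width: int = 10) -> str:
    """Draw a fixed-width progress line (hypothetical layout, not the client's)."""
    # Cap check: anything outside 0..100 is clamped before the bar is drawn.
    capped = max(0, min(100, percent))
    filled = capped * width // 100
    return f"Progress: {capped}%  [{'|' * filled}{'_' * (width - filled)}]"


print(render_progress(19))
print(render_progress(1723161600))  # bogus input, still a 10-character bar
```

A real client would probably also want to log the out-of-range value rather than silently hide it, since it points at bad step data upstream.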

Re: 172 meg unitinfo.txt / Progress: 1723161600%

Posted: Wed Nov 12, 2008 6:52 pm
by VijayPande
We'll take a look.

Re: 172 meg unitinfo.txt / Progress: 1723161600%

Posted: Wed Nov 12, 2008 9:29 pm
by parkut
Here are two more:

86 KB unitinfo.txt / Progress: 859018%

# cut -c 1-60 unitinfo.txt
Current Work Unit
-----------------
Name: Gromacs
Tag: P2673R1C10G17
Download time: November 12 16:04:09
Due time: November 18 16:04:09
Progress: 859018% [||||||||||||||||||||||||||||||||||||||||

# ls -alt unitinfo.txt
-rw-r--r-- 1 root root 86,059 Nov 12 16:15 unitinfo.txt

# grep Completed fah.log | tail -5
[20:25:27] Completed 105010 out of 500001 steps (21%)
[20:37:52] Completed 110010 out of 500001 steps (22%)
[20:50:17] Completed 115010 out of 500001 steps (23%)
[21:02:42] Completed 120010 out of 500001 steps (24%)
[21:15:08] Completed 125010 out of 500001 steps (25%)



172 meg unitinfo.txt / Progress: 1723161600%

# cut -c 1-60 unitinfo.txt
Current Work Unit
-----------------
Name: Gromacs
Tag: P2674R1C157G21
Download time: November 12 17:57:58
Due time: November 15 17:57:58
Progress: 1723161600% [||||||||||||||||||||||||||||||||||||

# ls -alt unitinfo.txt
-rw-r--r-- 1 root root 172,316,323 Nov 12 16:23 unitinfo.txt

# grep Completed fah.log | tail -5
[21:03:39] Completed 92509 out of 249999 steps (37%)
[21:08:40] Completed 95009 out of 249999 steps (38%)
[21:13:41] Completed 97509 out of 249999 steps (39%)
[21:18:42] Completed 100009 out of 249999 steps (40%)
[21:23:42] Completed 102509 out of 249999 steps (41%)

Re: 172 meg unitinfo.txt / Progress: 1723161600%

Posted: Wed Nov 12, 2008 9:35 pm
by smoking2000
Xilikon wrote: I like how the file size matches the % in unitinfo.txt. This means there is no cap check in the client (if the % is over 100, something is wrong), so the pipe character is repeated non-stop through the file.
There is a | character for each % reported as completed in unitinfo.txt. I counted them when I first analyzed this bug, after it caused FCI to slow to a crawl: FCI stored the progress bar in an XML file, and parsing an XML file that contains one or more progress bars of several MB is not recommended ;).
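
The file sizes parkut posted allow a rough cross-check of how the bar length tracks the percentage. The sketch below is arithmetic on those three reports only. The ratio of one bar character per 10 percentage points is an assumption inferred from the posted sizes, not something taken from the client's code, and `overhead` is a hypothetical helper name.

```python
# (tag, reported percent, file size in bytes) as posted in this thread
reports = [
    ("P2674R3C99G21", 1723161600, 172_316_322),
    ("P2673R1C10G17", 859018, 86_059),
    ("P2674R1C157G21", 1723161600, 172_316_323),
]


def overhead(percent: int, size_bytes: int, points_per_bar: int = 10) -> int:
    """Bytes left over after subtracting the assumed number of bar characters."""
    return size_bytes - percent // points_per_bar


for tag, percent, size in reports:
    print(tag, overhead(percent, size))
# prints 162, 158 and 163 bytes of non-bar content respectively
```

Under that assumption the leftover is a header of roughly 160 bytes in every case, and the small differences line up with the length of the percentage digits and of the tag. With one character per percentage point, the 1723161600% files would be about ten times larger than the sizes reported.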

I've only seen this happen with projects for the a2 core. I have a copy of the folding directory with such a case saved for testing.

Dick Howell's wuinfo (a utility that parses the work/wuinfo_??.dat files) shows the following for this case:

Code:

 index 6:
   Core: Core_a2
   Name: Gromacs
   Progress: 1718012% (4295029 of 250 steps)

This shows that either the completed step count or the total step count is incorrect.
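
The percentage wuinfo prints follows directly from the two step values it read. A minimal sketch of that arithmetic, assuming the percentage is simply completed divided by total and rounded to the nearest integer (the rounding is a guess that happens to reproduce the reported figure, and `progress_percent` is a hypothetical name):

```python
def progress_percent(completed: int, total: int) -> int:
    """Percentage of steps done, rounded to the nearest integer."""
    return round(100 * completed / total)


# Values from the wuinfo output above
print(progress_percent(4295029, 250))      # 1718012

# Values from the last "Completed" line of work/logfile_06.txt
print(progress_percent(62510, 250001))     # 25
```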

The corresponding work/logfile_06.txt:

Code:

*------------------------------*
Folding@Home Gromacs SMP Core
Version 2.01 (Wed Aug 13 13:11:25 PDT 2008)

Preparing to commence simulation
- Ensuring status. Please wait.
- Assembly optimizations manually forced on.
- Not checking prior termination.
- Expanded 4838305 -> 24033653 (decompressed 496.7 percent)
Called DecompressByteArray: compressed_data_size=4838305 data_size=24033653, decompressed_data_size=24033653 diff=0
- Digital signature verified

Project: 2670 (Run 5, Clone 11, Gen 17)

Assembly optimizations on if available.
Entering M.D.
Completed 2510 out of 250001 steps  (1%)
Completed 5010 out of 250001 steps  (2%)
[....]
Completed 60010 out of 250001 steps  (24%)
Completed 62510 out of 250001 steps  (25%)

This progress looks sane, and it allows qd to show the correct progress (qd doesn't use wuinfo_??.dat if logfile_??.txt provides progress data).

Code:

qd released 7 September 2008 (fr 071); qd info 10 November 2008 (update-qd.pl)
qd executed Wed Nov 12 22:20:35 CET 2008 (Wed Nov 12 21:20:35 UTC 2008)
Queue version 5.01
Current index: 6
[...]
 Index 6: folding now 1920.00 pts (104.433 pt/hr) 3.92 X min speed; 25% complete
   server: 171.67.108.24:8080; project: 2670
   Folding: run 5, clone 11, generation 17; benchmark 0; misc: 500, 200
   Project: 2670 (Run 5, Clone 11, Gen 17)
   issue: Tue Oct 28 12:20:38 2008; begin: Tue Oct 28 12:21:00 2008
   expect: Wed Oct 29 06:44:05 2008; due: Fri Oct 31 12:21:00 2008 (3 days)
   preferred: Fri Oct 31 12:21:00 2008 (3 days)
   core URL: http://www.stanford.edu/~pande/Linux/x86/Core_a2.fah (V2.01)
   CPU: 1,0 x86; OS: 4,0 Linux
   smp cores: 4; cores to use: 4
   tag: P2670R5C11G17
   flops: 1061161541 (1061.161541 megaflops)
   assignment info (le): Tue Oct 28 12:20:37 2008; BBC399E2
   CS: 171.67.108.25; P limit: 524286976
   user: [DPC]_Fatal_Error_Group0smoking2000; team: 92; ID: 9E3B81209D0E757D; mach ID: 1
   work/wudata_06.dat file size: 4838817; WU type: Folding@Home

Results successfully sent: Fri Jun  6 16:28:16 2008
Average download rate 377.409 KB/s (u=4); upload rate 65.087 KB/s (u=4)
Performance fraction 0.750157 (u=4)
Average pph: 75.427, ppd: 1810.25, ppw: 12671.7, ppy: 661176
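
The preference described above, log progress first and the wuinfo value only as a fallback, can be sketched as follows. This is not qd's actual code: the function names and the regular expression are assumptions, written only against the "Completed ... out of ... steps" lines shown in this thread.

```python
import re

# Matches lines such as "Completed 62510 out of 250001 steps  (25%)"
COMPLETED = re.compile(r"Completed (\d+) out of (\d+) steps\s+\((\d+)%\)")


def progress_from_log(log_text: str):
    """Return the percentage from the last 'Completed' line, or None."""
    matches = COMPLETED.findall(log_text)
    return int(matches[-1][2]) if matches else None


def choose_progress(log_text: str, wuinfo_percent: int) -> int:
    """Prefer the log's progress; fall back to the wuinfo value."""
    pct = progress_from_log(log_text)
    return pct if pct is not None else wuinfo_percent


sample = (
    "Completed 60010 out of 250001 steps  (24%)\n"
    "Completed 62510 out of 250001 steps  (25%)\n"
)
print(choose_progress(sample, 1718012))  # 25
print(choose_progress("", 1718012))      # 1718012, nothing better available
```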

The work/wudata_06.log file, on the other hand, shows the following step values:

Code:

$ grep step work/wudata_06.log
   nsteps               = 250001
   init_step            = 4250000
   em_stepsize          = 0.01
   fc_stepsize          = 0
will use an extra communication step for exclusion forces for Reaction-Field
Charge group distribution at step 4250000: 13534 19290 16635 15187
DD  step 4250009  vol min/aver 1.000  load imb.: force 23.9%
DD  step 4250999  vol min/aver 0.754  load imb.: force  9.0%
Writing checkpoint, step 4251370 at Tue Oct 28 12:26:20 2008
DD  step 4251999  vol min/aver 0.775  load imb.: force  5.2%
Writing checkpoint, step 4252470 at Tue Oct 28 12:31:22 2008
[...]
DD  step 4308999  vol min/aver 0.719  load imb.: force 11.4%
Writing checkpoint, step 4309560 at Tue Oct 28 16:46:20 2008
DD  step 4309999  vol min/aver 0.773  load imb.: force  3.0%
Writing checkpoint, step 4310950 at Tue Oct 28 16:51:20 2008
DD  step 4310999  vol min/aver 0.780  load imb.: force  3.7%
DD  step 4311999  vol min/aver 0.773  load imb.: force  3.5%
Writing checkpoint, step 4312370 at Tue Oct 28 16:56:19 2008

Interestingly, the closest match in this log to the step that wuinfo_06.dat reports as completed is the following, which is not near the end of the file:

Code:

DD  step 4294999  vol min/aver 0.799  load imb.: force  5.0%

           Step           Time         Lambda
        4295000     8590.00000        0.00000

   Energies (kJ/mol)
           Bond          Angle Ryckaert-Bell.          LJ-14     Coulomb-14
    1.76815e+04    4.78336e+04    4.69625e+04    2.17307e+04    3.00686e+05
        LJ (SR)  Disper. corr.   Coulomb (SR)       RF excl.      Potential
    1.96170e+05   -1.70938e+04   -2.29351e+06   -2.00248e+05   -1.87979e+06
    Kinetic En.   Total Energy    Temperature Pressure (bar)  Cons. rmsd ()
    3.98393e+05   -1.48140e+06    3.12924e+02   -2.40168e+02    3.67236e-06

Writing checkpoint, step 4295270 at Tue Oct 28 15:51:21 2008
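
To compare these absolute step numbers with the "Completed" lines in logfile_06.txt, the init_step offset has to be subtracted first. The sketch below does only that arithmetic with the values from wudata_06.log. Treating wuinfo's completed value (4295029) as an absolute step is a guess, not something the files confirm, and the helper names are hypothetical.

```python
INIT_STEP = 4_250_000   # init_step from wudata_06.log
NSTEPS = 250_001        # nsteps from wudata_06.log


def relative_step(absolute_step: int, init_step: int = INIT_STEP) -> int:
    """Steps done in this work unit, given an absolute step number."""
    return absolute_step - init_step


def percent_of_run(absolute_step: int) -> int:
    """Whole percent of this work unit reached at an absolute step."""
    return 100 * relative_step(absolute_step) // NSTEPS


print(percent_of_run(4312370))  # last checkpoint in the log: 24
print(percent_of_run(4295029))  # wuinfo's completed value, if absolute: 18
```

Read this way, the last checkpoint sits just below the 25% the core log reports, while the wuinfo value would correspond to an earlier point in the run.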

I can send a copy of this folding directory for analysis if that would help.