
P2605 Run11 clone 497 Generation 71

Posted: Thu Jul 10, 2008 3:56 pm
by dutchmm
It's World Taekwondo Federation time again. This unit seemed to complete successfully. Here is the nohup.out output:

Code:

[14:45:57] Writing local files
[14:45:57] Completed 495000 out of 500000 steps  (99 percent)
[14:58:05] Writing local files
[14:58:05] Completed 500000 out of 500000 steps  (100 percent)



        M E G A - F L O P S   A C C O U N T I N G

        Parallel run - timing based on wallclock.
   RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy
   T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)
   NF=No Forces

 Computing:                        M-Number         M-Flops  % of Flops
----[14:58:05] Writing final coordinates.
-------------------------------------------------------------------
 VdW(T)                      1364641.594393 73690646.097222    16.6
 RF Coul                      545869.157252 18013682.189316     4.1
 RF Coul [W3]                   2257.835336   221267.862928     0.0
 RF Coul + VdW(T)             617564.629043 40141700.887795     9.0
 RF Coul + VdW(T) [W3]        280820.440646 36506657.283980     8.2
 RF Coul + VdW(T) [W3-W3]     769776.706149 246328545.967680    55.5
 Outer nonbonded loop         237847.898865  2378478.988650     0.5
 1,4 nonbonded interactions     1059.002118    95310.190620     0.0
 NS-Pairs                     501069.584250 10522461.269250     2.4
 Reset In Box                   3878.527569    34906.748121     0.0
 Shift-X                       77469.154938   464814.929628     0.1
 CG-CoM                         1767.035340    51244.024860     0.0
 Sum Forces                   116353.732707   116353.732707     0.0
 Bonds                         13273.526547   570761.641521     0.1
 Angles                        15621.031242  2546228.092446     0.6
 Propers                        5386.510773  1233510.967017     0.3
 Impropers                      1008.002016   209664.419328     0.0
 RB-Dihedrals                  14140.528281  3492710.485407     0.8
 Virial                        38838.577677   699094.398186     0.2
 Update                        38784.577569  1202321.904639     0.3
 Stop-CM                       38784.500000   387845.000000     0.1
 P-Coupling                    38784.577569   232707.465414     0.1
 Calc-Ekin                     38784.655138  1047185.688726     0.2
 Constraint-V                  38784.577569   232707.465414     0.1
 Constraint-Vir                25212.050424   605089.210176     0.1
 Settle                         8404.016808  2714497.428984     0.6
-----------------------------------------------------------------------
 Total                                      443740394.340015   100.0
-----------------------------------------------------------------------

               NODE (s)   Real (s)      (%)
       Time:  71349.000  71349.000    100.0
                       19h49:09
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     50.189      6.219      1.211     19.819
[14:58:05] Past main M.D. loop
[14:58:05] Will end MPI now
[14:59:05]
[14:59:05] Finished Work Unit:
[14:59:05] - Reading up to 3723552 from "work/wudata_05.arc": Read 3723552
[14:59:05] - Reading up to 1779300 from "work/wudata_05.xtc": Read 1779300
[14:59:05] goefile size: 0
[14:59:05] logfile size: 16925
[14:59:05] Leaving Run
[14:59:08] - Writing 5524177 bytes of core data to disk...
[14:59:08]   ... Done.
[14:59:09] - Shutting down core
[14:59:09]
After that, it sat there like a sausage for 25 minutes. So I verified that my network connection was still active, and that there were no fah6 processes still using the processor, and killed the parent.

qfix shows:

Code:

entry 6, status 0, address 171.64.65.56:8080
entry 7, status 0, address 171.64.65.56:8080
entry 8, status 0, address 171.64.65.56:8080
entry 9, status 0, address 171.64.65.56:8080
entry 0, status 0, address 171.64.65.56:8080
entry 1, status 0, address 171.64.65.56:8080
entry 2, status 0, address 171.64.65.56:8080
entry 3, status 0, address 171.64.65.56:8080
entry 4, status 0, address 171.64.65.56:8080
entry 5, status 1, address 171.64.65.56:8080
  Found results <work/wuresults_05.dat>: proj 2605, run 11, clone 497, gen 71
   -- queue entry: proj 2605, run 11, clone 497, gen 71
   -- queue entry isn't empty
File is OK
But ./fah6 -send 5 -verbosity 9 gives me:

Code:

Launch directory: /home/mike/Download
Executable: ./fah6
Arguments: -send 5 -verbosity 9

[15:52:05] - Ask before connecting: No
[15:52:05] - User name: Dutchmm (Team 31574)
[15:52:05] - User ID: 2C9583CC4EA765F8
[15:52:05] - Machine ID: 1
[15:52:05]
[15:52:05] Loaded queue successfully.
[15:52:05] Attempting to return result(s) to server...
[15:52:05] - Warning: Asked to send unfinished unit to server
[15:52:05] - Failed to send unit 05 to server
[15:52:05] ***** Got a SIGTERM signal (15)
[15:52:05] Killing all core threads

Folding@Home Client Shutdown.
Nor will ./fah6 -smp start another unit:

Code:

Launch directory: /home/mike/Download
Executable: ./fah6
Arguments: -smp

[15:53:24] - Ask before connecting: No
[15:53:24] - User name: Dutchmm (Team 31574)
[15:53:24] - User ID: 2C9583CC4EA765F8
[15:53:24] - Machine ID: 1
[15:53:24]
[15:53:24] Loaded queue successfully.
[15:53:24]
[15:53:24] + Processing work unit
[15:53:24] Core required: FahCore_a1.exe
[15:53:24] Core found.
[15:53:24] Working on Unit 05 [July 10 15:53:24]
[15:53:24] + Working ...
[15:53:25]
[15:53:25] *------------------------------*
[15:53:25] Folding@Home Gromacs SMP Core
[15:53:25] Version 1.74 (November 27, 2006)
[15:53:25]
[15:53:25] Preparing to commence simulation
[15:53:25] - Ensuring status. Please wait.
[15:53:42] - Looking at optimizations...
[15:53:42] - Working with standard loops on this execution.
[15:53:42] Examination of work files indicates 8 consecutive improper terminations of core.
[15:53:42] Finalizing output
[15:53:42] - Starting from initial work packet
[15:53:42]
[15:53:42] Project: 0 (Run 0, Clone 0, Gen 0)
[15:53:42]
[15:53:42] Error: Could not write local file.  Exiting.
[15:53:42] - Shutting down core
I am sorry to say this is urgent: I am leaving for a month's holiday tomorrow afternoon, and I shall lose about 40 WUs if we are unable to solve this problem. At the moment, my plan is to save the results file somewhere else, then use qdclear to nix the work directory, so I can get the show on the road.

Thx in advance for any advice from Taekwondo blackbelts or folding experts.

Mike

Re: P2605 Run11 clone 497 Generation 71

Posted: Thu Jul 10, 2008 4:09 pm
by toTOW
Try:
> ./fah6 -delete 05
> qfix
> ./fah6 -send all

Re: P2605 Run11 clone 497 Generation 71

Posted: Thu Jul 10, 2008 4:30 pm
by uncle_fungus
There is a reason that this topic is a sticky at the top of the Linux SMP forums: viewtopic.php?f=12&t=1938 ;)

Re: P2605 Run11 clone 497 Generation 71

Posted: Thu Jul 10, 2008 4:42 pm
by dutchmm
Tried it (after saving the results, in case they are of any interest to the Pande group), but:

Code:

Launch directory: /home/mike/Download
Executable: ./fah6
Arguments: -smp

[15:53:24] - Ask before connecting: No
[15:53:24] - User name: Dutchmm (Team 31574)
[15:53:24] - User ID: 2C9583CC4EA765F8
[15:53:24] - Machine ID: 1
[15:53:24]
[15:53:24] Loaded queue successfully.
[15:53:24]
[15:53:24] + Processing work unit
[15:53:24] Core required: FahCore_a1.exe
[15:53:24] Core found.
[15:53:24] Working on Unit 05 [July 10 15:53:24]
[15:53:24] + Working ...
[15:53:25]
[15:53:25] *------------------------------*
[15:53:25] Folding@Home Gromacs SMP Core
[15:53:25] Version 1.74 (November 27, 2006)
[15:53:25]
[15:53:25] Preparing to commence simulation
[15:53:25] - Ensuring status. Please wait.
[15:53:42] - Looking at optimizations...
[15:53:42] - Working with standard loops on this execution.
[15:53:42] Examination of work files indicates 8 consecutive improper terminations of core.
[15:53:42] Finalizing output
[15:53:42] - Starting from initial work packet
[15:53:42]
[15:53:42] Project: 0 (Run 0, Clone 0, Gen 0)
[15:53:42]
[15:53:42] Error: Could not write local file.  Exiting.
[15:53:42] - Shutting down core
What the Federation next?

Mike

Re: P2605 Run11 clone 497 Generation 71

Posted: Thu Jul 10, 2008 4:46 pm
by uncle_fungus
Have you got the log showing your -delete, qfix, -send all attempts?

Re: [solved] P2605 Run11 clone 497 Generation 71

Posted: Thu Jul 10, 2008 6:07 pm
by dutchmm
Hadn't seen the sticky ... although I thought I had read all of them in this forum. Anyway, problem solved, and folding again. Now I need to come up with a script to check the status and correct it in case it goes tits up again while I am away.
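
A minimal sketch of what such a status check could look like, in Python. It only scans log text for the failure messages quoted in this thread and builds the -delete / qfix / -send all sequence suggested above; it does not run anything, since -delete discards the unit. The function names and the choice of markers are assumptions made for illustration, not part of the client, and whether those three commands are the right fix for a given failure is for the operator to judge.

```python
# Failure messages copied from the client logs posted in this thread.
# Assumption: any of these appearing in the log means the unit is stuck.
FAILURE_MARKERS = (
    "Could not write local file",
    "improper terminations of core",
    "Asked to send unfinished unit",
)


def find_failures(log_text):
    """Return the log lines that contain one of the known failure markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]


def recovery_commands(entry):
    """Build the command sequence suggested in this thread for a queue entry.

    The commands are returned as argument lists and are NOT executed here,
    because ./fah6 -delete throws away the work unit.
    """
    return [
        ["./fah6", "-delete", f"{entry:02d}"],
        ["qfix"],
        ["./fah6", "-send", "all"],
    ]
```

To use it, read the client log (nohup.out in the posts above) into a string, pass it to find_failures, and if the result is not empty, review the commands from recovery_commands for the affected queue entry before running them by hand or from a cron job.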