Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 1:46 am
by DrSpalding
I tried qfix, but it didn't repair the queue.dat file to match the wuresults_XX.dat files still on the machine. Perhaps I have an old qfix.exe, but I downloaded it on 2/10/2010 from the location noted above.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 2:14 am
by Sahkuhnder
DrSpalding wrote:I tried qfix, but it didn't repair the queue.dat file to match the wuresults_XX.dat files still on the machine.
I have had good results from qfix in the past but couldn't get it to work this time either. :(

Others also report the same:

smoking2000
smoking2000 wrote:Neither the "old" qfix (Dick Howell's last release) nor the "new" qfix (my release supporting the v6 client, only a version bump with no other changes) supports the new wuresults_<nn>.dat format used by the new v5 Work Servers.
Pette Broad
Pette Broad wrote:QFix won't work on GPU units.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 4:33 am
by weedacres
When I ran qfix the output format was all messed up. Also notice the bogus PRCG.

Code:

C:\fahbackup\gpu1>qfix
entry 3, status 0, address 0.0.0.0
  Found results <work\wuresults_03.dat>: proj 15699, run 0, clone 27315, gen 193
13
   -- queue entry: proj 0, run 0, clone 0, gen 0
   -- doesn't match queue entry
entry 4, status 0, address 0.0.0.0
  Found results <work\wuresults_04.dat>: proj 19240, run 0, clone 45005, gen 193
17
   -- queue entry: proj 0, run 0, clone 0, gen 0
   -- doesn't match queue entry
entry 5, status 0, address 0.0.0.0
  Found results <work\wuresults_05.dat>: proj 25205, run 0, clone 43141, gen 193
10
   -- queue entry: proj 0, run 0, clone 0, gen 0
   -- doesn't match queue entry
entry 6, status 0, address 0.0.0.0
  Found results <work\wuresults_06.dat>: proj 22399, run 0, clone 27475, gen 193
13
   -- queue entry: proj 0, run 0, clone 0, gen 0
   -- doesn't match queue entry
entry 7, status 0, address 0.0.0.0
entry 8, status 0, address 0.0.0.0
entry 9, status 0, address 0.0.0.0
  Found results <work\wuresults_09.dat>: proj 12806, run 0, clone 42846, gen 193
10
   -- queue entry: proj 0, run 0, clone 0, gen 0
   -- doesn't match queue entry
entry 0, status 0, address 0.0.0.0
  Found results <work\wuresults_00.dat>: proj 14107, run 0, clone 42876, gen 193
10
   -- queue entry: proj 0, run 0, clone 0, gen 0
   -- doesn't match queue entry
entry 1, status 0, address 171.64.65.20:8080
entry 2, status 1, address 171.64.122.70:8080
  Found results <work\wuresults_02.dat>: proj 14355, run 0, clone 14878, gen 193
11
   -- queue entry: proj 5907, run 1, clone 1, gen 1
   -- doesn't match queue entry
File is OK
It's possible this is a result of running the earlier version of qfix first. When I ran that last week there was no output; it just went back to the prompt. At this point I don't remember which clients I ran it against.
Anyway, I'm not going to waste any more time on this. I only lost about 150 WUs.
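
For anyone who wants to compare qfix output across several clients, here is a minimal Python sketch that lists the entries where the results file and queue.dat disagree. It only understands the text layout pasted above. It also assumes that a line consisting solely of digits (like the "13" after "gen 193") is the console wrapping the previous line; that is a guess based on this one paste. The function names are made up for illustration and are not part of qfix.

```python
import re

ENTRY_RE = re.compile(r"^entry (\d+), status (\d+), address (\S+)")
FOUND_RE = re.compile(
    r"Found results <([^>]+)>: proj (\d+), run (\d+), clone (\d+), gen (\d+)"
)
QUEUE_RE = re.compile(
    r"-- queue entry: proj (\d+), run (\d+), clone (\d+), gen (\d+)"
)


def unwrap(text):
    """Rejoin wrapped lines (assumption: a digits-only line is a continuation)."""
    out = []
    for line in text.splitlines():
        if out and line.strip().isdigit():
            out[-1] += line.strip()
        else:
            out.append(line)
    return out


def mismatched_entries(text):
    """Return one dict per entry whose results file disagrees with queue.dat."""
    results = []
    current = None
    for line in unwrap(text):
        m = ENTRY_RE.match(line)
        if m:
            # Start of a new queue slot report.
            current = {"entry": int(m.group(1)), "address": m.group(3)}
            continue
        if current is None:
            continue
        m = FOUND_RE.search(line)
        if m:
            current["file"] = m.group(1)
            # (proj, run, clone, gen) as read from the results file
            current["results_prcg"] = tuple(int(x) for x in m.groups()[1:])
            continue
        m = QUEUE_RE.search(line)
        if m and "file" in current:
            # (proj, run, clone, gen) as stored in queue.dat
            current["queue_prcg"] = tuple(int(x) for x in m.groups())
            if current["queue_prcg"] != current["results_prcg"]:
                results.append(current)
    return results
```

Feed it the captured text (for example from `qfix > out.txt`) and it returns the slot number, the results file name and both P/R/C/G tuples for each mismatch. It only reports; it does not touch queue.dat.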

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 4:37 am
by Tobit
weedacres wrote:Anyway, I'm not going to waste any more time on this. I only lost about 150 WUs.
Sounds to me like you were overly impatient and broke things by using several different versions of qfix. I didn't have to run qfix at all, and all of my pending WUs were eventually uploaded just fine.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 5:43 am
by weedacres
Tobit wrote:
weedacres wrote:Anyway, I'm not going to waste any more time on this. I only lost about 150 WUs.
Sounds to me like you were overly impatient and broke things by using several different versions of qfix. I didn't have to run qfix at all, and all of my pending WUs were eventually uploaded just fine.
It might seem that way, but I only tried qfix on 2 of 10 clients. The rest are sitting there gathering dust.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 9:34 am
by noorman
.
C:\Documents and Settings\x>cd C:\trial-fah

C:\trial-fah>qfix
entry 7, status 0, address 171.67.108.11:8080
entry 8, status 0, address 171.64.122.70:8080
entry 9, status 0, address 171.67.108.11:8080
entry 0, status 0, address 171.64.122.70:8080
entry 1, status 0, address 171.67.108.11:8080
entry 2, status 0, address 171.64.65.20:8080
entry 3, status 0, address 171.64.65.20:8080
entry 4, status 0, address 171.64.65.71:8080
entry 5, status 0, address 171.64.122.70:8080
entry 6, status 1, address 171.64.122.70:8080
File is OK

C:\trial-fah>
.

I just ran this test on a copy of my GPU rig's queue.dat; the result looks normal (like it used to).
From what I see here, IMO it still works.
Because I don't have a busted one, I can't be 100% sure though.
You should try this version on an untouched copy in a separate folder!

http://linuxminded.nl/?target=software- ... s.plc#qfix (it's the 2nd program, 10.00 kB in size)
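
Making the untouched copy can be scripted. This is a rough Python sketch, assuming the usual layout seen in this thread (queue.dat in the client folder and a work subfolder holding the wuresults_NN.dat files). The function name and the folder names in the usage note are only examples.

```python
import shutil
from pathlib import Path


def make_scratch_copy(client_dir, scratch_dir):
    """Copy queue.dat and the work folder into a fresh scratch folder."""
    client = Path(client_dir)
    scratch = Path(scratch_dir)
    # Refuse to reuse an existing folder so an earlier experiment
    # is never mixed with the new copy.
    scratch.mkdir(parents=True, exist_ok=False)
    shutil.copy2(client / "queue.dat", scratch / "queue.dat")
    work = client / "work"
    if work.is_dir():
        shutil.copytree(work, scratch / "work")
    return scratch
```

For example, `make_scratch_copy(r"C:\fahbackup\gpu1", r"C:\trial-fah")` would leave the original folder alone, so qfix can be run inside the copy. Stop the client before copying, otherwise the files may change halfway through.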

.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 10:26 am
by noorman
.

For those who fold with a GUI/tray client: you can right-click the icon and view the queue info from there.
You can select each slot (#) in the queue and get some idea of what is in it.
It's an easy check of what's in there, and you can compare it with the output of qfix, for example.

.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 11:09 am
by noorman
.

@ Bruce or 7im:

Is qgen still usable (for a v6 client)?

That way, he could discard the 'mangled' queue.dat and generate a new one.

Ref.: http://linuxminded.xs4all.nl/mirror/www ... /qgen.html

.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 11:22 am
by bollix47
Not Bruce or 7im but I did try qgen. It uses an older format and does not produce a currently valid queue.dat. The P/R/C/G are all wrong after running it as I'm sure are other pieces of info.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 12:05 pm
by noorman
bollix47 wrote:Not Bruce or 7im but I did try qgen. It uses an older format and does not produce a currently valid queue.dat. The P/R/C/G are all wrong after running it as I'm sure are other pieces of info.
.

I feared as much, and I seem to remember that I couldn't use qgen for the v6 client SMP stuff either.
Luckily, back then we only needed the v6-compatible version of qfix.

It's sad that no programmer has felt the need to update this little tool for emergency use.
We don't have as many problems as we did in the past, but in certain circumstances we could still fix things if we had tools compatible with the client version currently in use.
I'm sorry to say that I can't code; if I could, I'd give it a try.


.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 3:24 pm
by DavidMudkips
Sorry if this was brought up during the past twenty pages, but has anyone found a way to upload WUs that have since been overwritten in the queue? I think one of them has passed its deadline, but the other 4 have relatively long deadlines.

I have 5 WU .dat files sitting in a backup folder. All of them got "Server has already received unit" errors, and the client never tried to resend them, even when they were still in the queue.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 3:56 pm
by noorman
DavidMudkips wrote:Sorry if this was brought up during the past twenty pages, but has anyone found a way to upload WUs that have since been overwritten in the queue? I think one of them has passed its deadline, but the other 4 have relatively long deadlines.

I have 5 WU .dat files sitting in a backup folder. All of them got "Server has already received unit" errors, and the client never tried to resend them, even when they were still in the queue.
.

Have you checked whether these got a "successfully sent" and "thank you ..." from the server?
That might be the cause of that message (as mentioned before).
Why the client would still be retrying, I don't know.

Best to check the log(s) for the telling details on these WUs (results).
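
Checking the log by hand gets tedious with several units, so here is a small Python sketch of the idea. It assumes the client log wording "Project: P (Run R, Clone C, Gen G)" and "Results successfully sent", and it assumes that a success line belongs to the most recent Project line before it. That pairing is my simplification, not something the client guarantees.

```python
import re

PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def acknowledged_units(log_lines):
    """Return the set of (project, run, clone, gen) that the log shows as sent."""
    sent = set()
    last = None
    for line in log_lines:
        m = PROJECT_RE.search(line)
        if m:
            # Remember the unit the client is currently talking about.
            last = tuple(int(x) for x in m.groups())
        elif "Results successfully sent" in line and last is not None:
            sent.add(last)
            last = None  # do not credit the same Project line twice
    return sent
```

If a unit from the backup folder shows up in the returned set, the server already acknowledged it at some point, which would explain a "Server has already received unit" reply on a later attempt.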

.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 6:48 pm
by Teddy
Sadly the crap continues

Code:

[18:02:00] Completed 100%
[18:02:00] Successful run
[18:02:00] DynamicWrapper: Finished Work Unit: sleep=10000
[18:02:10] Reserved 101264 bytes for xtc file; Cosm status=0
[18:02:10] Allocated 101264 bytes for xtc file
[18:02:10] - Reading up to 101264 from "work/wudata_00.xtc": Read 101264
[18:02:10] Read 101264 bytes from xtc file; available packet space=786329200
[18:02:10] xtc file hash check passed.
[18:02:10] Reserved 30216 30216 786329200 bytes for arc file=<work/wudata_00.trr> Cosm status=0
[18:02:10] Allocated 30216 bytes for arc file
[18:02:10] - Reading up to 30216 from "work/wudata_00.trr": Read 30216
[18:02:10] Read 30216 bytes from arc file; available packet space=786298984
[18:02:10] trr file hash check passed.
[18:02:10] Allocated 560 bytes for edr file
[18:02:10] Read bedfile
[18:02:10] edr file hash check passed.
[18:02:10] Logfile not read.
[18:02:10] GuardedRun: success in DynamicWrapper
[18:02:10] GuardedRun: done
[18:02:10] Run: GuardedRun completed.
[18:02:15] + Opened results file
[18:02:15] - Writing 132552 bytes of core data to disk...
[18:02:15] Done: 132040 -> 131571 (compressed to 99.6 percent)
[18:02:15]   ... Done.
[18:02:15] DeleteFrameFiles: successfully deleted file=work/wudata_00.ckp
[18:02:15] Shutting down core 
[18:02:15] 
[18:02:15] Folding@home Core Shutdown: FINISHED_UNIT
[18:02:17] CoreStatus = 64 (100)
[18:02:17] Unit 0 finished with 98 percent of time to deadline remaining.
[18:02:17] Updated performance fraction: 0.981101
[18:02:17] Sending work to server
[18:02:17] Project: 10105 (Run 163, Clone 5, Gen 11)


[18:02:17] + Attempting to send results [February 21 18:02:17 UTC]
[18:02:17] - Reading file work/wuresults_00.dat from core
[18:02:17]   (Read 132083 bytes from disk)
[18:02:17] Connecting to http://171.64.65.71:8080/
[18:02:18] - Couldn't send HTTP request to server
[18:02:18] + Could not connect to Work Server (results)
[18:02:18]     (171.64.65.71:8080)
[18:02:18] + Retrying using alternative port
[18:02:18] Connecting to http://171.64.65.71:80/
[18:02:26] Posted data.
[18:02:26] Initial: 0000; - Server does not have record of this unit. Will try again later.
[18:02:26] - Error: Could not transmit unit 00 (completed February 21) to work server.
[18:02:26] - 1 failed uploads of this unit.
[18:02:26]   Keeping unit 00 in queue.
[18:02:26] Trying to send all finished work units
[18:02:26] Project: 10105 (Run 163, Clone 5, Gen 11)


[18:02:26] + Attempting to send results [February 21 18:02:26 UTC]
[18:02:26] - Reading file work/wuresults_00.dat from core
[18:02:26]   (Read 132083 bytes from disk)
[18:02:26] Connecting to http://171.64.65.71:8080/
[18:02:32] Posted data.
[18:02:32] Initial: 0000; - Uploaded at ~21 kB/s
[18:02:32] - Averaged speed for that direction ~22 kB/s
[18:02:32] - Server does not have record of this unit. Will try again later.
[18:02:32] - Error: Could not transmit unit 00 (completed February 21) to work server.
[18:02:32] - 2 failed uploads of this unit.


[18:02:32] + Attempting to send results [February 21 18:02:32 UTC]
[18:02:32] - Reading file work/wuresults_00.dat from core
[18:02:32]   (Read 132083 bytes from disk)
[18:02:32] Connecting to http://171.67.108.26:8080/
[18:22:33] Posted data.
The client has just sat there for 20 minutes now, not doing anything.
And I thought there had been some changes...
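
To see which work server keeps answering "no record", the log can be tallied with a few lines of Python. This is only a sketch: it assumes a log verbose enough to contain the "Connecting to http://IP:port/" lines, as in the paste above, and it attributes each rejection to the last server the client connected to.

```python
import re
from collections import Counter

CONNECT_RE = re.compile(r"Connecting to http://([\d.]+):(\d+)/")
NO_RECORD = "Server does not have record of this unit"


def no_record_failures(log_lines):
    """Count 'no record' rejections per work server IP address."""
    counts = Counter()
    server = None
    for line in log_lines:
        m = CONNECT_RE.search(line)
        if m:
            server = m.group(1)  # ports 8080 and 80 count as the same server
        elif NO_RECORD in line and server is not None:
            counts[server] += 1
    return counts
```

Logs written at a lower verbosity do not show the "Connecting to" lines, so their rejections are skipped rather than guessed.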

Teddy

no record of unit problem is back

Posted: Sun Feb 21, 2010 6:58 pm
by dschief
This box has been locked up for 40+ minutes with no activity.

Code:

[18:05:06] Completed 98%
[18:06:39] Completed 99%
Writing final coordinates.



	M E G A - F L O P S   A C C O U N T I N G

   RF=Reaction-field  Free=Free Energy  SC=Softcore
   T=Tabulated        S=Solvent         W=Water     WW=Water-Water

           Computing:      M-Number      M-Flop's  % Flop's
             NS-Pairs      0.785631     16.498251     0.0
               CG-CoM      0.001253      0.036337     0.0
               Update  12540.001254  388740.038874    67.4
            Calc-Ekin      0.001254      0.033858     0.0
              Shake-V  12540.001254  188100.018810    32.6
          Total                576856.62613   100.0

               NODE (s)   Real (s)      (%)
       Time:   9241.141   9244.000    100.0
                       2h34:01
               (Mnbf/s)   (MFlops) (ps/NODE hour) (NODE hour/ns)
Performance:      0.000     62.423   7791.246      0.128
[18:08:11] Completed 100%
[18:08:11] Successful run
[18:08:11] DynamicWrapper: Finished Work Unit: sleep=10000
GuardedRun: success in DynamicWrapper
[18:08:21] Reserved 101656 bytes for xtc file; Cosm status=0
[18:08:21] Allocated 101656 bytes for xtc file
[18:08:21] - Reading up to 101656 from "work/wudata_09.xtc": Read 101656
[18:08:21] Read 101656 bytes from xtc file; available packet space=786328808
[18:08:21] xtc file hash check passed.
[18:08:21] Reserved 30216 30216 786328808 bytes for arc file=<work/wudata_09.trr> Cosm status=0
[18:08:21] Allocated 30216 bytes for arc file
[18:08:21] - Reading up to 30216 from "work/wudata_09.trr": Read 30216
[18:08:21] Read 30216 bytes from arc file; available packet space=786298592
[18:08:21] trr file hash check passed.
[18:08:21] Allocated 560 bytes for edr file
[18:08:21] Read bedfile
[18:08:21] edr file hash check passed.
[18:08:21] Logfile not read.
[18:08:21] GuardedRun: success in DynamicWrapper
[18:08:21] GuardedRun: done
[18:08:21] Run: GuardedRun completed.
[18:08:26] + Opened results file
[18:08:26] - Writing 132944 bytes of core data to disk...
[18:08:26] Done: 132432 -> 131998 (compressed to 99.6 percent)
[18:08:26]   ... Done.
[18:08:26] DeleteFrameFiles: successfully deleted file=work/wudata_09.ckp
[18:08:26] Shutting down core 
[18:08:26] 
[18:08:26] Folding@home Core Shutdown: FINISHED_UNIT
[18:08:29] CoreStatus = 64 (100)
[18:08:29] Sending work to server
[18:08:29] Project: 10105 (Run 20, Clone 3, Gen 15)
[18:08:29] - Read packet limit of 540015616... Set to 524286976.


[18:08:29] + Attempting to send results [February 21 18:08:29 UTC]
[18:08:34] - Server does not have record of this unit. Will try again later.
[18:08:34] - Error: Could not transmit unit 09 (completed February 21) to work server.
[18:08:34]   Keeping unit 09 in queue.
[18:08:34] Project: 10105 (Run 20, Clone 3, Gen 15)
[18:08:34] - Read packet limit of 540015616... Set to 524286976.


[18:08:34] + Attempting to send results [February 21 18:08:34 UTC]
[18:08:37] - Server does not have record of this unit. Will try again later.
[18:08:37] - Error: Could not transmit unit 09 (completed February 21) to work server.
[18:08:37] - Read packet limit of 540015616... Set to 524286976.


[18:08:37] + Attempting to send results [February 21 18:08:37 UTC]


Killed the client with Ctrl+C.
Waited a couple of minutes, then restarted; the package uploaded right away.

Code:

[18:55:56] Project: 10105 (Run 20, Clone 3, Gen 15)
[18:55:56] - Read packet limit of 540015616... Set to 524286976.


[18:55:56] + Attempting to send results [February 21 18:55:56 UTC]
[18:55:56] - Preparing to get new work unit...
[18:55:56] + Attempting to get work packet
[18:55:56] - Connecting to assignment server
[18:55:58] - Successful: assigned to (171.64.65.20).
[18:55:58] + News From Folding@Home: Welcome to Folding@Home
[18:55:59] Loaded queue successfully.
[18:56:00] + Results successfully sent
[18:56:00] Thank you for your contribution to Folding@Home.
[18:56:00] + Number of Units Completed: 17

[18:56:00] + Closed connections
[18:56:00] 
[18:56:00] + Processing work unit
[18:56:00] Core required: FahCore_14.exe
[18:56:00] Core found.
[18:56:00] Working on queue slot 00 [February 21 18:56:00 UTC]
[18:56:00] + Working ...
[18:56:00] 
[18:56:00] *------------------------------*
[18:56:00] Folding@Home GPU Core - Beta
[18:56:00] Version 1.26 (Wed Oct 14 13:09:26 PDT 2009)
[18:56:00] 
[18:56:00] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[18:56:00] Build host: vspm46
[18:56:00] Board Type: Nvidia
[18:56:00] Core      : 
[18:56:00] Preparing to commence simulation
[18:56:00] - Looking at optimizations...
[18:56:00] - Created dyn
[18:56:00] - Files status OK
[18:56:00] - Expanded 70190 -> 360060 (decompressed 512.9 percent)
[18:56:00] Called DecompressByteArray: compressed_data_size=70190 data_size=360060, decompressed_data_size=360060 diff=0
[18:56:00] - Digital signature verified
[18:56:00] 
[18:56:00] Project: 5910 (Run 14, Clone 117, Gen 13)
[18:56:00] 
[18:56:00] Assembly optimizations on if available.
[18:56:00] Entering M.D.
[18:56:06] Tpr hash work/wudata_00.tpr:  1857284088 2042218122 3721759850 2062985753 3590496030
[18:56:07] Working on Protein
[18:56:08] Client config found, loading data.
[18:56:08] Starting GUI Server
[18:56:59] Completed 1%
[18:58:13] Completed 2%
[18:59:26] Completed 3%
There still seems to be an issue with server comms; if I hadn't checked, it could have sat there for hours.

Re: GPU server status 171.67.108.21, 171.64.65.71,171.67.108.26

Posted: Sun Feb 21, 2010 7:07 pm
by bollix47
Yes, here too with the same server 171.64.65.71

Code:

[18:06:01] Sending work to server
[18:06:01] Project: 10105 (Run 100, Clone 0, Gen 14)
[18:06:01] - Read packet limit of 540015616... Set to 524286976.
[18:06:01] + Attempting to send results [February 21 18:06:01 UTC]
[18:06:01] - Reading file work/wuresults_03.dat from core
[18:06:01]   (Read 132690 bytes from disk)
[18:06:01] Connecting to http://171.64.65.71:8080/
[18:06:02] Posted data.
[18:06:02] Initial: 0000; - Uploaded at ~130 kB/s
[18:06:02] - Averaged speed for that direction ~66 kB/s
[18:06:02] - Server does not have record of this unit. Will try again later.
[18:06:02] - Error: Could not transmit unit 03 (completed February 21) to work server.
[18:06:02] - 1 failed uploads of this unit.
[18:06:02]   Keeping unit 03 in queue.
[18:06:02] Trying to send all finished work units
[18:06:02] Project: 10105 (Run 100, Clone 0, Gen 14)
[18:06:02] - Read packet limit of 540015616... Set to 524286976.
[18:06:02] + Attempting to send results [February 21 18:06:02 UTC]
[18:06:02] - Reading file work/wuresults_03.dat from core
[18:06:02]   (Read 132690 bytes from disk)
[18:06:02] Connecting to http://171.64.65.71:8080/
[18:06:04] Posted data.
[18:06:04] Initial: 0000; - Uploaded at ~65 kB/s
[18:06:04] - Averaged speed for that direction ~66 kB/s
[18:06:04] - Server does not have record of this unit. Will try again later.
[18:06:04] - Error: Could not transmit unit 03 (completed February 21) to work server.
[18:06:04] - 2 failed uploads of this unit.
[18:06:04] - Read packet limit of 540015616... Set to 524286976.
[18:06:04] + Attempting to send results [February 21 18:06:04 UTC]
[18:06:04] - Reading file work/wuresults_03.dat from core
[18:06:04]   (Read 132690 bytes from disk)
[18:06:04] Connecting to http://171.67.108.26:8080/
At this point the client sat there for 30 minutes, doing what I don't know.

A restart of the client sent the WU to 171.64.65.71 and a new WU from a different server was received.
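
Since several people in this thread only noticed the hang by chance, here is a rough Python sketch for spotting such silent periods in a saved log. It looks at the [HH:MM:SS] stamps only and assumes consecutive lines are less than 24 hours apart, because the log lines carry no date. The 10 minute threshold is an arbitrary pick of mine and would need raising for units whose progress lines are further apart.

```python
import re

TS_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")


def find_stalls(log_lines, min_gap_s=600):
    """Yield (gap_seconds, line_before, line_after) for long silent periods."""
    prev_t = None
    prev_line = None
    for line in log_lines:
        m = TS_RE.match(line)
        if not m:
            continue  # core output without a timestamp is ignored
        h, mins, s = (int(x) for x in m.groups())
        t = h * 3600 + mins * 60 + s
        if prev_t is not None:
            # Modulo handles a log that runs past midnight UTC.
            gap = (t - prev_t) % 86400
            if gap >= min_gap_s:
                yield gap, prev_line, line
        prev_t, prev_line = t, line
```

Note that this only finds gaps that were closed by a later log line. A client that is still hung has written nothing yet, so for a live check you would compare the time of the last log line with the current time instead.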