General server issues?

endrik
Posts: 34
Joined: Mon Dec 10, 2007 10:41 pm
Location: Wroclaw, Poland

Post by endrik »

Hi all,

During the last week I lost two or three WUs that were uploaded with 'all green' messages but never credited with points. I wasn't especially concerned and blamed some overly hectic sneakernetting on my side, but then three other people in our team reported similar problems, so I decided to post about it. Right now I don't know their details, just that it was the same story (a 'Thank you for your contribution to f@h' message from the 171.64.65.20 GPU server, and no points).
I do know my own details, though. For example, two WUs were done simultaneously on a dual core, 2620 (49,96,4) and 2621 (50,71,4), both from the same 171.64.65.65 server (completed Jan 5). The funny thing is that both were successfully uploaded, but only one was credited. The same happened with 2606 (35,0,135), completed Jan 2 (sent/accepted only yesterday, as earlier the 'server had no knowledge' of that WU), and with 2417 (29,39,13) from 171.65.103.162. To tell the truth, I can't be quite sure that all these complaints are valid - the trouble is that when you upload several WUs, you don't really know which one was credited - but this is as close as I can get. Still, it is certain that some of them went missing.
As I said, originally I just wanted to let it go and wait some more, but since other people have encountered it as well... Any ideas?

PS. Last night's 2566 (Run 73, Clone 88, Gen 12), uploaded to 171.64.122.139, has not shown up so far either. Some kind of New Year's hangover or what?
yours,
endrik

*Bookworms will rule the world
(after we finish the background reading).
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: General server issues?

Post by bruce »

endrik wrote:Hi all,

I do know my own details, though. For example, two WUs were done simultaneously on a dual core, 2620 (49,96,4) and 2621 (50,71,4), both from the same 171.64.65.65 server (completed Jan 5). The funny thing is that both were successfully uploaded, but only one was credited. The same happened with 2606 (35,0,135), completed Jan 2 (sent/accepted only yesterday, as earlier the server had no knowledge of this WU), and with 2417 (29,39,13) from 171.65.103.162. To tell the truth, I can't be quite sure that all these complaints are valid - the trouble is that when you upload several WUs, you don't really know which one was credited - but this is as close as I can get. Still, it is certain that some of them went missing.
As I said, originally I just wanted to let it go and wait some more, but since other people have encountered it as well... Any ideas?
Apparently you are manipulating the data assigned to you in some way that is confusing both the client(s) and the servers because the server is receiving more than one upload for each WU. Are you moving data between more than one computer? Are you running more than one client on a single computer without unique MachineIDs? Are you using the -local flag? Which clients are you running? What kind of hardware do you have?

Please find the portions of FAHlog.txt or FAHlog-Prev.txt showing the download and upload of Project: 2620 (Run 49 Clone 96 Gen 4). I can see that it was uploaded but no credit was granted. Without the data from your log, I can't find a good explanation.
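
For anyone hunting through a long FAHlog.txt for one WU, here is a minimal Python sketch that locates the lines naming a given Project/Run/Clone/Gen. It assumes the log uses the `[hh:mm:ss] Project: N (Run r, Clone c, Gen g)` line format seen in the client logs in this thread; the function name `find_work_unit` and the sample list are made up for illustration and are not part of the FAH client.

```python
import re

# Matches client log lines such as:
# [12:27:44] Project: 2620 (Run 49, Clone 96, Gen 4)
PROJECT_LINE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)"
)

def find_work_unit(log_lines, project, run, clone, gen):
    """Return (line_index, timestamp) for each log line naming the given WU."""
    wanted = (project, run, clone, gen)
    hits = []
    for index, line in enumerate(log_lines):
        match = PROJECT_LINE.match(line)
        if match and tuple(int(g) for g in match.groups()[1:]) == wanted:
            hits.append((index, match.group(1)))
    return hits

# Hypothetical three-line excerpt in the client's log format.
sample_log = [
    "[12:27:38] Working on Unit 01 [January 5 12:27:38]",
    "[12:27:44] Project: 2620 (Run 49, Clone 96, Gen 4)",
    "[12:28:06] Protein: p2620_p1475_tet1_03_1 t= 20000.00000",
]
hits = find_work_unit(sample_log, 2620, 49, 96, 4)
print(hits)
```

The returned line indexes only mark where the WU is named; the download and upload messages around those positions still have to be read by eye.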

Project: 2621 Run 50 Clone 71 Gen 4 was uploaded three times but received credit only for the first upload.
Hi endrik (team 276),
Your WU (P2621 R50 C71 G4) was added to the stats database on 2008-01-05 07:02:19 for 292 points of credit.
Your WU (P2621 R50 C71 G4) was added to the stats database on 2008-01-05 19:01:02 for 0 points of credit.
Your WU (P2621 R50 C71 G4) was added to the stats database on 2008-01-05 19:01:02 for 0 points of credit.

Project: 2606 (Run 35, Clone 0, Gen 135) is like the first one -- uploaded twice but not credited. I need the data from FAHlog.txt to determine what happened.

Project: 2417 Run 29 Clone 39 Gen 13 is quite clear. You uploaded it twice -- once with the incorrect UserName/TeamNumber and then again with the correct UserName/TeamNumber. No matter which UserName/TeamNumber receives credit for a given WU, you can only get credit for it once.
Hi _ (team 0),
Your WU (P2417 R29 C39 G13) was added to the stats database on 2008-01-01 14:55:14 for 500 points of credit.
Hi endrik (team 276),
Your WU (P2417 R29 C39 G13) was added to the stats database on 2008-01-05 18:57:53 for 0 points of credit.
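
The pattern in these records (one credited upload, then zero-point repeats) can be tallied mechanically. The following Python sketch groups the stats-database lines quoted above by WU and counts uploads and total credit. It only assumes the line wording shown in this post; the helper names `group_uploads` and `summarize` are invented for the example and do not correspond to any FAH tool.

```python
import re
from collections import defaultdict

# Pattern for the stats-database lines quoted in this thread.
STATS_LINE = re.compile(
    r"Your WU \(P(\d+) R(\d+) C(\d+) G(\d+)\) was added to the stats database "
    r"on (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) for (\d+) points of credit\."
)

def group_uploads(lines):
    """Return {(project, run, clone, gen): [(timestamp, points), ...]}."""
    uploads = defaultdict(list)
    for line in lines:
        match = STATS_LINE.search(line)
        if match:
            wu = tuple(int(g) for g in match.groups()[:4])
            uploads[wu].append((match.group(5), int(match.group(6))))
    return dict(uploads)

def summarize(uploads):
    """Map each WU to (number of uploads, total points credited)."""
    return {wu: (len(entries), sum(points for _, points in entries))
            for wu, entries in uploads.items()}

# The five records quoted in this post.
records = [
    "Your WU (P2621 R50 C71 G4) was added to the stats database on 2008-01-05 07:02:19 for 292 points of credit.",
    "Your WU (P2621 R50 C71 G4) was added to the stats database on 2008-01-05 19:01:02 for 0 points of credit.",
    "Your WU (P2621 R50 C71 G4) was added to the stats database on 2008-01-05 19:01:02 for 0 points of credit.",
    "Your WU (P2417 R29 C39 G13) was added to the stats database on 2008-01-01 14:55:14 for 500 points of credit.",
    "Your WU (P2417 R29 C39 G13) was added to the stats database on 2008-01-05 18:57:53 for 0 points of credit.",
]
summary = summarize(group_uploads(records))
print(summary)
```

A WU with more than one upload and a single non-zero entry is the "credited once, duplicates get zero" case described above.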

Re: General server issues?

Post by endrik »

bruce wrote:Please find the portions of FAHlog.txt or FAHlog-Prev.txt showing the download and upload of Project: 2620 (Run 49 Clone 96 Gen 4).

[12:27:38] Working on Unit 01 [January 5 12:27:38]
[12:27:44] Project: 2620 (Run 49, Clone 96, Gen 4)
[12:28:06] Protein: p2620_p1475_tet1_03_1 t= 20000.00000
[12:28:11] Extra SSE boost OK.
[12:49:03] Completed 123750 out of 125000 steps  (99)
[13:07:59] Completed 125000 out of 125000 steps  (100)(...)
[13:09:05] - Writing 9007326 bytes of core data to disk...
[13:09:12] Folding@home Core Shutdown: FINISHED_UNIT
[13:09:14] CoreStatus = 64 (100)
[13:09:14] Sending work to server
[13:09:14] + Attempting to send results
[13:09:14] - Couldn't send HTTP request to server
[13:09:14] + Could not connect to Work Server (results)
[13:09:14]     (171.64.65.65:8080)
[13:09:14] - Error: Could not transmit unit 01 (completed January 5) to work server.
[13:09:14]   Keeping unit 01 in queue.
(...)
--- Opening Log file [January 6 01:30:33] (...)
[01:34:04] + Attempting to get work packet
[01:34:04] - Connecting to assignment server
[01:34:08] - Successful: assigned to (130.49.240.81).
[01:34:08] + News From Folding@Home: Welcome to Folding@Home
[01:34:08] Loaded queue successfully.
[01:34:22] + Results successfully sent
[01:34:22] Thank you for your contribution to Folding@Home.
Folding@Home Client Shutdown.
bruce wrote:Project: 2606 (Run 35, Clone 0, Gen 135) is like the first one -- uploaded twice but not credited. I need the data from FAHlog.txt to determine what happened.

[11:48:32] Working on Unit 01 [January 2 11:48:32]
[11:48:50] Project: 2606 (Run 35, Clone 0, Gen 135)
[11:49:16] (Starting from checkpoint)
[11:49:16] Protein: p2606_tet_1499
[11:49:16] Completed 123104 out of 125000 steps  (98)
[12:21:02] Completed 125000 out of 125000 steps  (100)
[12:21:02] Writing final coordinates. (...)
[01:13:56] + Could not connect to Work Server (results)
[01:13:56]     (171.64.65.65:8080)
[01:13:56] - Error: Could not transmit unit 01 (completed January 2) to work server.
[01:14:39] + Could not connect to Work Server (results)
[01:14:39]     (171.65.103.100:8080)
[01:14:39]   Could not transmit unit 01 to Collection server; keeping in queue.
--- Opening Log file [January 6 01:15:59] 
[01:16:01] + Attempting to send results
[01:16:01] - Presenting message box asking to network.
[01:19:18] + Results successfully sent
[01:19:18] Thank you for your contribution to Folding@Home.
[01:19:18] + Number of Units Completed: 8
Then, there is the last one:

[23:56:59] Working on Unit 03 [January 5 23:56:59]
[23:57:00] Project: 2566 (Run 73, Clone 88, Gen 12)
[23:57:20] Protein: p2566_BBA5_ext
[23:59:30] Completed 198000 out of 200000 steps  (99)
[00:14:44] Completed 200000 out of 200000 steps  (100)
[00:14:44] Writing final coordinates. (...)
[00:16:09] - Error: Could not transmit unit 03 (completed January 6) to work server.
[00:16:09]   Keeping unit 03 in queue.
[00:16:09] + Attempting to send results
[00:16:41] + Results successfully sent
[00:16:41] Thank you for your contribution to Folding@Home.
The multiple uploads represent my manual efforts to get the results through, and that 2417 was my ill-fated sneakernetting, as I suspected (good to know that the results DID reach Stanford - that's what matters in the end).
I did some editing in the logs above in order to shorten them, but I hope everything crucial still shows.
Thank you for your reaction - I can bring the other complainers here too, but I don't want to drown you. One prolific cruncher of ours says he is just used to some 10% of his contribution vanishing into the Universe...
Last edited by endrik on Sun Jan 06, 2008 7:28 pm, edited 1 time in total.
yours,
endrik

Obcy_from_Poznan
Posts: 11
Joined: Sun Jan 06, 2008 2:06 pm
Hardware configuration: HIS Radeon HD4890
Location: Poland Poznan

Re: General server issues?

Post by Obcy_from_Poznan »

Hello!
I confirm what Endrik writes and complains about.
There must be some severe issues with the FAH servers, because I have also noticed that I was not granted points for a couple of WUs.
I use the GPU Graphical User Interface client and employ my X1950Pro for this task. I usually score about 330 points per day, so it is easy for me to check in the stats whether the points are accounted for fairly.
The first case took place in the middle of December, but I do not have any logs from that time any more. I thought that once in a long while a single WU might go missing in the Folding stats, although it is always somewhat disappointing when you have friends in your team competing with you and suddenly you score 0 points for a whole day of your computer's work. Frustrating for a dedicated "folder" :roll:
The second case when I was granted no points took place the day before yesterday, i.e. on January the 4th, and luckily I have the logs:

--- Opening Log file [January 4 10:16:16] 


# Windows Graphical GPU Edition ###############################################
###############################################################################

                       Folding@Home Client Version 5.91beta6

                          http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\FX\FAH5.91beta6-GPU-GUI


[10:16:16] - Ask before connecting: No
[10:16:16] - User name: Obcy_from_Poznan (Team 276)
[10:16:16] - User ID: 37C8D734257148D2
[10:16:16] - Machine ID: 1
[10:16:16] 
[10:16:16] Loaded queue successfully.
[10:16:16] Initialization complete
[10:16:16] + Benchmarking ...
[10:16:17] 
[10:16:17] + Processing work unit
[10:16:17] Core required: FahCore_10.exe
[10:16:17] Core found.
[10:16:17] Working on Unit 07 [January 4 10:16:17]
[10:16:17] + Working ...
[10:16:17] 
[10:16:17] *------------------------------*
[10:16:17] Folding@Home GPU Core - Beta
[10:16:17] Version 0.10 (Mon Oct 30 12:32:17 PST 2006)
[10:16:17] 
[10:16:17] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
[10:16:17] Build host: CYGWIN_NT-5.1 vishal-gpu 1.5.19(0.150/4/2) 2006-01-20 13:28 i686 Cygwin
[10:16:17] Preparing to commence simulation
[10:16:17] - Looking at optimizations...
[10:16:17] - Files status OK
[10:16:17] - Expanded 87055 -> 443705 (decompressed 509.6 percent)
[10:16:17] 
[10:16:17] Project: 2723 (Run 0, Clone 483, Gen 9)
[10:16:17] 
[10:16:17] Assembly optimizations on if available.
[10:16:17] Entering M.D.
[10:16:23] Will resume from checkpoint file
[10:16:23] Working on Protein
[10:16:24] Starting GUI Server
[10:16:31] Resuming from checkpoint
[10:16:31] Verified work/wudata_07.log
[10:16:31] Verified work/wudata_07.edr
[10:16:31] Verified work/wudata_07.trr
[10:16:32] Verified work/wudata_07.xtc
[10:16:32] Completed 81
[10:26:23] Completed 82
[10:36:03] Completed 83
[10:45:32] Completed 84
[10:55:00] Completed 85
[11:04:30] Completed 86
[11:13:57] Completed 87
[11:23:27] Completed 88
[11:32:55] Completed 89
[11:42:24] Completed 90
[11:51:52] Completed 91
[12:01:21] Completed 92
[12:10:49] Completed 93
[12:20:18] Completed 94
[12:29:46] Completed 95
[12:39:16] Completed 96
[12:48:46] Completed 97
[12:58:16] Completed 98
[13:07:45] Completed 99
[13:18:15] 
[13:18:15] Finished Work Unit:
[13:18:15] - Reading up to 45408 from "work/wudata_07.trr": Read 45408
[13:18:15] - Reading up to 971404 from "work/wudata_07.xtc": Read 971404
[13:18:15] logfile size: 34834
[13:18:15] Leaving Run
[13:18:15] - Writing 1052718 bytes of core data to disk...
[13:18:16] Done: 1052206 -> 1017367 (compressed to 96.6 percent)
[13:18:16]   ... Done.
[13:18:16] - Shutting down core
[13:18:16] 
[13:18:16] Folding@home Core Shutdown: FINISHED_UNIT
[13:18:19] CoreStatus = 64 (100)
[13:18:19] Sending work to server


[13:18:19] + Attempting to send results
[13:18:55] + Results successfully sent
[13:18:55] Thank you for your contribution to Folding@Home.
[13:18:55] + Number of Units Completed: 15

Unlike in Endrik's case, my GPU is a single-core processor, so there is no possibility that your server was confused or misled by User or Machine IDs.
Please do something to fix the issue of missing points, because it is most irritating, especially for people who are new to the project.

Re: General server issues?

Post by bruce »

endrik wrote:The multiple uploads represent my manual efforts to get the results through, and that 2417 was my ill-fated sneakernetting, as I suspected (good to know that the results DID reach Stanford - that's what matters in the end).
I did some editing in the logs above in order to shorten them, but I hope everything crucial still shows.
Thank you for your reaction - I can bring the other complainers here too, but I don't want to drown you. One prolific cruncher of ours says he is just used to some 10% of his contribution vanishing into the Universe...
You didn't post the log showing the downloads. You may not be able to find them because you're not keeping all the critical files together. If that's true, then you're not doing the sneakernetting correctly and you will get duplicate assignments which result in lost credits.

Each MachineID must return the previous results (or at least have them waiting to upload) BEFORE that machineID can be allowed to download a new assignment. Keeping queue.dat and FAHlog.txt and FAHlog-Prev.txt and WORK and client.cfg together at all times will prevent that problem.
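
As a rough aid for sneakernetting, the Python sketch below checks that a client folder still holds the items listed above before it is moved. The item list is taken from this post (with the work directory compared case-insensitively); the function name `missing_items` and the throwaway demo folder are assumptions made for illustration, and a freshly installed client may legitimately lack FAHlog-Prev.txt.

```python
import os
import tempfile

# Items listed in this post as belonging together (compared case-insensitively).
REQUIRED_ITEMS = ("queue.dat", "FAHlog.txt", "FAHlog-Prev.txt", "work", "client.cfg")

def missing_items(client_dir):
    """Return the required items that are absent from client_dir."""
    present = {name.lower() for name in os.listdir(client_dir)}
    return [item for item in REQUIRED_ITEMS if item.lower() not in present]

# Demonstration on a throwaway folder that lacks FAHlog-Prev.txt and client.cfg.
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, "work"))
    for name in ("queue.dat", "FAHlog.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_items(demo_dir)

print(demo_missing)
```

An empty result only says the files are present; it cannot tell whether client.cfg and queue.dat actually belong to the same MachineID.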

The upload as UserName="_" TeamNumber=0 happened because you edited (or lost) client.cfg. Changes to that file must only be made with the client (using -config or -configonly).

Re: General server issues?

Post by bruce »

Obcy_from_Poznan wrote:There must be some severe issues with the FAH servers, because I have also noticed that I was not granted points for a couple of WUs.
As of 8 hours ago, there were at least two servers which were accepting uploads but whose stats reports were not being received by the stats server. It being Sunday morning, I don't know when the issue will be corrected, but if you happen to be involved with one of those servers, the credits are just delayed.

If you have been unable to upload your results, that's an entirely different issue. If the WU expired or had an error, there may be a record indicating what happened, but we would need the Proj/Run/Clone/Gen numbers.
Obcy_from_Poznan wrote:The second case when I was granted no points took place the day before yesterday, i.e. on January the 4th, and luckily I have the logs:

(...)
Unlike in Endrik's case, my GPU is a single-core processor, so there is no possibility that your server was confused or misled by User or Machine IDs.
Please do something to fix the issue of missing points, because it is most irritating, especially for people who are new to the project.
I can confirm that this WU was uploaded and received zero points:
Hi Obcy_from_Poznan (team 276),
Your WU (P2723 R0 C483 G9) was added to the stats database on 2008-01-04 06:54:27 for 0 points of credit.

The error code indicates that the WU may have expired or may have been uploaded twice. Additional information may be available to the Pande Group through a manual search of the logs but they generally don't do that, especially when you're running a beta client and the WU has been reassigned to someone else and completed by them.

When did you download that WU?

Re: General server issues?

Post by endrik »

bruce wrote:You didn't post the log showing the downloads.
Because you didn't ask - and I was focused on the uploads only :) Well then, let me see... here are the beginnings:
--- Opening Log file [January 1 14:53:28]
[14:53:31] - Connecting to assignment server
[14:53:32] - Successful: assigned to (171.64.65.65).
[14:53:32] + News From Folding@Home: Welcome to Folding@Home
[14:53:32] Loaded queue successfully.
[14:54:54] + Connections closed: You may now disconnect(...)
[15:01:46] Working on Unit 01 [January 1 15:01:46]
[15:01:46] - Created dyn
[15:01:46] - Files status OK
[15:01:53] - Expanded 3633499 -> 18750173 (decompressed 516.0 percent)
[15:01:53] - Starting from initial work packet
[15:01:53] Project: 2620 (Run 49, Clone 96, Gen 4)
[15:01:54] Entering M.D.
[15:02:02] Protein: p2620_p1475_tet1_03_1 t= 20000.00000
--- Opening Log file [December 28 16:26:08]
[16:26:11] - Connecting to assignment server
[16:26:13] - Successful: assigned to (171.64.65.65).
[16:26:13] + News From Folding@Home: Welcome to Folding@Home
[16:26:13] Loaded queue successfully.
[16:28:15] + Connections closed: You may now disconnect (...)
[19:39:54] Working on Unit 01 [December 28 19:39:54]
[19:39:54] - Created dyn
[19:39:54] - Files status OK
[19:40:00] - Starting from initial work packet
[19:40:00] Project: 2606 (Run 35, Clone 0, Gen 135)
[19:40:08] Protein: p2606_tet_1499
[19:40:15] Writing local files
[19:40:15] Completed 0 out of 125000 steps (0)(...)
--- Opening Log file [January 2 18:52:45]
[18:54:30] + Attempting to get work packet
[18:54:30] - Connecting to assignment server
[18:54:31] - Successful: assigned to (171.64.122.139).
[18:54:31] + News From Folding@Home: Welcome to Folding@Home
[18:54:31] Loaded queue successfully.
[18:54:46] + Closed connections
[18:54:46] + Processing work unit (...) (downloading core)
[18:55:13] Preparing to commence simulation
[18:55:13] - Created dyn
[18:55:13] - Files status OK
[18:55:14] - Starting from initial work packet
[18:55:14] Project: 2566 (Run 73, Clone 88, Gen 12)
[18:55:14] Entering M.D.
[18:55:21] Protein: p2566_BBA5_ext
[19:01:20] Completed 0 out of 200000 steps (0)(...)
[23:56:59] Working on Unit 03 [January 5 23:56:59]
[23:57:00] Project: 2566 (Run 73, Clone 88, Gen 12)
[23:57:20] Protein: p2566_BBA5_ext
[23:59:30] Completed 198000 out of 200000 steps (99)
[00:14:44] Completed 200000 out of 200000 steps (100)
[00:14:44] Writing final coordinates. (...)
[00:16:09] - Error: Could not transmit unit 03 (completed January 6) to work server.
[00:16:09] Keeping unit 03 in queue.
[00:16:09] + Attempting to send results
[00:16:41] + Results successfully sent
[00:16:41] Thank you for your contribution to Folding@Home.
You were worrying:
You may not be able to find them because you're not keeping all the critical files together.

Well, each client has its own folder and I am just transferring these folders (cores and all), not playing around with individual files. All instances bear the same User ID of 73AE5F0E346697F4; only the Machine IDs differ. I think I used to change these Machine IDs (just editing client.cfg, as it is easier than running config) - am I to understand that a specific WU is assigned not only to a specific User ID but also to a Machine ID, and that any change will confuse the server? I was sure these IDs were there just to avoid potential conflicts between several clients on the same machine and that otherwise they were not that relevant. If I am mistaken in this and the server remembers which Machine ID a WU is assigned to, then we have the explanation.

So far I took the credit points as a visible sign that all is OK, that my results were accepted and will be utilized - that's why I was worried. If it is possible that results are accepted even when the WUs are not credited - never mind the whole story.

So: what exactly does it mean when you write "I can see that it was uploaded but no credit was granted"? How is such a situation possible at all? Will downloading a WU to a Machine ID that is crunching on another PC erase that previous WU server-side?

Sorry for all this fuss - I can only promise that once informed I won't be troubling you again, will set the -local flags properly and so on; and let's hope someone else will read this one day and get educated too ;)
yours,
endrik


Re: General server issues?

Post by bruce »

endrik wrote:I think I used to change these Machine IDs (just editing client.cfg, as it is easier than running config) - am I to understand that a specific WU is assigned not only to a specific User ID but also to a Machine ID, and that any change will confuse the server? I was sure these IDs were there just to avoid potential conflicts between several clients on the same machine and that otherwise they were not that relevant. If I am mistaken in this and the server remembers which Machine ID a WU is assigned to, then we have the explanation.
First, the client.cfg may look like an editable ASCII file, but it's not -- it's a machine-readable file. When FAH detects a change in format, it discards the file and builds a set of default values. That's where the Team 0 came from. It only takes a minute to run -configonly and it saves a lot of potential grief -- but you shouldn't need to do that anyway. Move the files as a group if you need to.

Yes, the server knows exactly which WU has been assigned to each client (clients are identified by UserID+MachineID). If you change the MachineID, you need to move the associated WU to whichever client gets that MachineID. That's what I meant about keeping the client.cfg with the queue.dat and WORK files -- specifically, the MachineID inside the client.cfg.

Example: Suppose the server assigns WU "A" to MachineID=1. That client starts processing it. Meanwhile, you modify the MachineIDs or move the WUs in such a way that the WU is now running on MachineID=2 and some other client is now MachineID=1. Whatever is running on the new MachineID=1 finishes, and now MachineID=1 requests a new assignment. From the perspective of the server, WU "A" was assigned to that machine and the result has not been uploaded. Moreover, it's reasonable to assume that WU "A" has been lost -- otherwise MachineID=1 wouldn't be asking for a new assignment -- so the server reissues WU "A" to MachineID=1, since the WU has obviously been corrupted or lost and still needs to be completed. (Of course the server doesn't know that the first copy of WU "A" is still being worked on somewhere else.) Now two of your clients are working on the same assignment. The first one that finishes will get credit and the second one will get zero credit because it's a duplicate.
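
The example can be replayed with a toy model. The Python sketch below is not the real assignment server; it is a simplified illustration under stated assumptions: assignments are tracked per (UserID, MachineID) pair, a client asking for work while its WU is outstanding gets the same WU reissued, and only the first upload of a WU earns points. The class `ToyServer`, its method names, and the flat 100-point value are all invented for this sketch.

```python
class ToyServer:
    """Toy model of the assignment logic described above (not the real server)."""

    def __init__(self, work_units):
        self.pending = list(work_units)   # WUs not yet handed out
        self.assigned = {}                # (user_id, machine_id) -> outstanding WU
        self.credited = set()             # WUs that already earned credit

    def request_work(self, user_id, machine_id):
        key = (user_id, machine_id)
        if key in self.assigned:
            # The client asks for work while its WU is still outstanding:
            # assume the WU was lost and reissue the same one.
            return self.assigned[key]
        wu = self.pending.pop(0)
        self.assigned[key] = wu
        return wu

    def upload(self, user_id, machine_id, wu, points):
        # Assumption: an upload clears the assignment record for that WU,
        # whichever MachineID it arrives from.
        for key, assigned_wu in list(self.assigned.items()):
            if assigned_wu == wu:
                del self.assigned[key]
        if wu in self.credited:
            return 0                      # duplicate result: zero credit
        self.credited.add(wu)
        return points


server = ToyServer(["A", "B", "C"])
wu_x = server.request_work("user", 1)     # folder X, MachineID=1, receives "A"
wu_y = server.request_work("user", 2)     # folder Y, MachineID=2, receives "B"

# The MachineIDs are swapped by hand: folder Y now reports as MachineID=1.
credit_b = server.upload("user", 1, wu_y, points=100)
reissued = server.request_work("user", 1) # server still expects "A" from MachineID=1

credit_first = server.upload("user", 2, wu_x, points=100)       # folder X returns "A"
credit_second = server.upload("user", 1, reissued, points=100)  # folder Y returns "A" again
print(reissued, credit_b, credit_first, credit_second)
```

In this run the second copy of "A" is accepted but earns nothing, which is the duplicate case described in the example.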
endrik wrote:So far I took the credit points as a visible sign that all is OK, that my results were accepted and will be utilized - that's why I was worried. If it is possible that results are accepted even when the WUs are not credited - never mind the whole story.
It's possible that a WU may be accepted and the credits are delayed. We had a case of that today, but the credits should be showing up about now. See the posts regarding server .138.
endrik wrote:So: what exactly does it mean when you write "I can see that it was uploaded but no credit was granted"? How is such a situation possible at all? Will downloading a WU to a Machine ID that is crunching on another PC erase that previous WU server-side?
WUs can be uploaded and rejected for a number of reasons. The most obvious is when a duplicate WU is uploaded. Data corruption is another possibility, and there are others. Results which are correct but incomplete are generally awarded partial credit.

A WU is "erased" from the download queue when the corresponding result is uploaded. If the result contains an error or is incomplete, the WU is reissued to someone else. If the result has been completed, a new WU with Gen (N+1) is generated.
endrik wrote:Sorry for all this fuss - I can only promise that once informed I won't be troubling you again, will set the -local flags properly and so on; and let's hope someone else will read this one day and get educated too ;)

Re: General server issues?

Post by endrik »

bruce wrote:First, the client.cfg may look like an editable ASCII file, but it's not -- it's a machine-readable file.
All right - sorry about that and I'll behave in this aspect ;)
But even then, all you say means that when the server remembers the UserID/MachineID configuration, it uses it to avoid assigning another WU, so that the same WU stays assigned to a specific client. That's clear, but it still doesn't explain the mysterious disappearance of the three WUs listed above. If they were duplicates, they should be showing as submitted a week or so earlier. As of now, they were accepted but not credited, and it looks like they are lost for good, since they are a week overdue already. So my previous question still holds - what happened to them? Does the lack of credit mean that they were ultimately rejected (though received and confirmed, so there was no data corruption!), or will they be scientifically 'usable'?
I am feeling increasingly sorry for taking your time - just let me know when you think enough is enough :)
yours,
endrik


Re: General server issues?

Post by Obcy_from_Poznan »

bruce wrote: I can confirm that this WU was uploaded and received zero points:
Hi Obcy_from_Poznan (team 276),
Your WU (P2723 R0 C483 G9) was added to the stats database on 2008-01-04 06:54:27 for 0 points of credit.
This is it! I uploaded a ready WU and received 0 points for it.
BTW: Is the information that you have just shown me available to any of us "folding crunchers" or just to the F@h admins?
bruce wrote: The error code indicates that the WU may have expired or may have been uploaded twice. Additional information may be available to the Pande Group through a manual search of the logs but they generally don't do that, especially when you're running a beta client and the WU has been reassigned to someone else and completed by them.

When did you download that WU?
I've just double-checked my logs, and now I am sure that I downloaded the WU in question the previous day, i.e. the 3rd of January, and I am also positive that it was not uploaded twice. Its deadline must have been more than 32 hours, at least (usually 5 days). For your reference and confirmation I am pasting a fragment of my log:

[13:36:40] - Preparing to get new work unit...
[13:36:40] + Attempting to get work packet
[13:36:40] - Connecting to assignment server
[13:36:42] - Successful: assigned to (171.65.103.159).
[13:36:42] + News From Folding@Home: GPU folding beta
[13:36:42] Loaded queue successfully.
[13:36:45] + Closed connections
[13:36:45] 
[13:36:45] + Processing work unit
[13:36:45] Core required: FahCore_10.exe
[13:36:45] Core found.
[13:36:45] Working on Unit 07 [January 3 13:36:45]
[13:36:45] + Working ...
[13:36:45] 
[13:36:45] *------------------------------*
[13:36:45] Folding@Home GPU Core - Beta
[13:36:45] Version 0.10 (Mon Oct 30 12:32:17 PST 2006)
[13:36:45] 
[13:36:45] Compiler  : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
[13:36:45] Build host: CYGWIN_NT-5.1 vishal-gpu 1.5.19(0.150/4/2) 2006-01-20 13:28 i686 Cygwin
[13:36:45] Preparing to commence simulation
[13:36:45] - Looking at optimizations...
[13:36:45] - Created dyn
[13:36:45] - Files status OK
[13:36:45] - Expanded 87055 -> 443705 (decompressed 509.6 percent)
[13:36:45] 
[13:36:45] Project: 2723 (Run 0, Clone 483, Gen 9)
[13:36:45] 
[13:36:45] Assembly optimizations on if available.
[13:36:45] Entering M.D.
[13:36:52] Working on Protein
[13:36:52] Starting GUI Server
[13:46:32] Completed 1
[13:56:03] Completed 2
[14:05:36] Completed 3
(etc.)

Bruce, I've already come to terms with the loss of a few hundred points, but I wish that in future all completed WUs would be granted points. It would be a real pity for the project if new folders saw this happen too often. It happened to me twice in 2 weeks, during which 15 WUs were completed on my GPU. I understand that it is still a beta client, but please let the gentlemen responsible for the servers and for granting points know there is a problem. :)
Thank you for your time and help.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: General server issues?

Post by bruce »

Obcy_from_Poznan wrote:
bruce wrote: I can confirm that this WU was uploaded and received zero points:
Hi Obcy_from_Poznan (team 276),
Your WU (P2723 R0 C483 G9) was added to the stats database on 2008-01-04 06:54:27 for 0 points of credit.
This is it! I uploaded a ready WU and received 0 points for it.
BTW: Is the information, that you have just presented me, available to anyone of us "folding crunchers" or just for the F@h admins?
It's only available to the Pande Group and the forum moderators. There's an enhancement request for that sort of data on the development list so you may get that capability someday, but probably not soon. There are just too many other nice things on the suggestion list.
Obcy_from_Poznan wrote:
bruce wrote: The error code indicates that the WU may have expired or may have been uploaded twice. Additional information may be available to the Pande Group through a manual search of the logs but they generally don't do that, especially when you're running a beta client and the WU has been reassigned to someone else and completed by them.

When did you download that WU?
I've just double-checked my logs, and now I am sure that I downloaded the WU in question the previous day, i.e. the 3rd of January, and I am also positive that it was not uploaded twice. Its deadline must have been longer than 32 hours at least (usually 5 days). For your reference and confirmation I am pasting a fragment of my log:
<snip>
Bruce, I've already come to terms with the loss of a few hundred points, but I wish that in future all completed WUs would be granted points. It would be a real pity for the project if new folders see this happen too often. It happened to me twice over 2 weeks, when 15 WUs were completed on my GPU. I understand that it is still a beta client, but please let the gentlemen responsible for servers and granting points know there is a problem. :)
I agree. At this point, I've told you all I know. I'll pass the data on to the Pande Group, but I have no way of knowing when they'll be able to look at it or what the outcome might be.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: General server issues?

Post by bruce »

endrik wrote:That's clear, but still doesn't explain mysterious disappearance of the three WUs listed above. If they were duplicated, they should be showing as submitted a week earlier or so. As for now, they were accepted but not credited, and it looks like they are generally lost, since it's a week overdue already. So my previous question still holds - what happened to them? Does lack of credit mean that they were ultimately rejected (though received and confirmed, so there was no data corruption!), or they will be scientifically 'usable' ?
I am feeling increasingly sorry for taking your time - just let me know when you think enough is enough :)
I spend my time here by choice. FAH is a good project -- worth whatever I can give to it.

The missing WUs may be the same situation that I just reported to the Pande Group for Obcy_from_Poznan -- or it may be something else. I'll recheck your data later (I have to go to a meeting right now) and see if there's anything else I can figure out from what they let me see. Unfortunately it's not as good a candidate for the Pande Group to research since the sneakernetting issues make it very difficult to know what actually happened. (And technically, sneakernetting is not "supported")
endrik
Posts: 34
Joined: Mon Dec 10, 2007 10:41 pm
Location: Wroclaw, Poland
Contact:

Re: General server issues?

Post by endrik »

bruce wrote: I'll recheck your data later (I have to go to a meeting right now) and see if there's anything else I can figure out from what they let me see. Unfortunately it's not as good a candidate for the Pande Group to research since the sneakernetting issues make it very difficult to know what actually happened. (And technically, sneakernetting is not "supported")
The good news is, we don't need a detailed investigation as to the reasons - if sneakernetting is not supported, then it is not supported and that's it :) I'll be more careful and let's forget the matter.
I would like to know only what I wrote earlier:
So far I took the credit points as a visible sign that all is OK, my results were accepted and will be utilized - that's why I was worried. Does lack of credit mean that they were ultimately rejected (though received and confirmed, so there was no data corruption!), or will they be scientifically 'usable'? If it is possible that the results are accepted even when the WUs are not credited - never mind the whole story.
yours,
endrik

*Bookworms will rule the world
(after we finish the background reading).
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: General server issues?

Post by bruce »

endrik wrote: So far I took the credit points as a visible sign that all is OK, my results were accepted and will be utilized - that's why I was worried. Does lack of credit mean that they were ultimately rejected (though received and confirmed, so there was no data corruption!), or will they be scientifically 'usable'? If it is possible that the results are accepted even when the WUs are not credited - never mind the whole story.
FAH is designed to protect itself from missing scientific results. One aspect of that design is reissuing WUs which pass the Preferred Deadline. Another aspect of that design is reissuing a WU to a client which appears to have lost the WU.
In the case of Project 2620, Run 49, Clone 96, Gen 4, it appears that both aspects were activated. Whether or not credit is issued for the second or third submittal, the scientific need was satisfied by the first one, which was returned by somebody else four days earlier. (This WU would never have been issued to you if xxxxxxx had returned the WU within the deadline. That's one reason we stress that WUs must be returned promptly.)
Hi xxxxxxx (team 0), UserID: 03CCXXXXXXXXXXXXXX
Your WU (P2620 R49 C96 G4) was added to the stats database on 2008-01-01 19:03:44 for 292 points of credit.
Hi endrik (team 276), UserID: 73AEXXXXXXXXXXXXXX
Your WU (P2620 R49 C96 G4) was added to the stats database on 2008-01-05 07:02:19 for 0 points of credit.
Hi _ (team 0), UserID: 73AEXXXXXXXXXXXXXX
Your WU (P2620 R49 C96 G4) was added to the stats database on 2008-01-05 19:01:02 for 0 points of credit.

As I said earlier, Project 2606, Run 35, Clone 0, Gen 135 is virtually the same.
Hi xxxxxx (team xxx),
Your WU (P2606 R35 C0 G135) was added to the stats database on 2007-12-29 11:02:02 for 292 points of credit.
Hi endrik (team 276),
Your WU (P2606 R35 C0 G135) was added to the stats database on 2008-01-02 09:04:31 for 0 points of credit.
Hi endrik (team 276),
Your WU (P2606 R35 C0 G135) was added to the stats database on 2008-01-05 19:01:02 for 0 points of credit.
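
For anyone who wants to line up several of these records by hand, a small sketch follows. It only tabulates the lines quoted above, grouped by WU; it does not reproduce how the stats servers decide on credit, and the function names (`group_submissions`, `summarize`) are hypothetical ones chosen for this illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# The six stats records quoted above (user names and IDs left out).
RECORDS = """\
Your WU (P2620 R49 C96 G4) was added to the stats database on 2008-01-01 19:03:44 for 292 points of credit.
Your WU (P2620 R49 C96 G4) was added to the stats database on 2008-01-05 07:02:19 for 0 points of credit.
Your WU (P2620 R49 C96 G4) was added to the stats database on 2008-01-05 19:01:02 for 0 points of credit.
Your WU (P2606 R35 C0 G135) was added to the stats database on 2007-12-29 11:02:02 for 292 points of credit.
Your WU (P2606 R35 C0 G135) was added to the stats database on 2008-01-02 09:04:31 for 0 points of credit.
Your WU (P2606 R35 C0 G135) was added to the stats database on 2008-01-05 19:01:02 for 0 points of credit.
"""

RECORD_RE = re.compile(
    r"Your WU \(P(\d+) R(\d+) C(\d+) G(\d+)\) was added to the stats database "
    r"on (\d{4}-\d\d-\d\d \d\d:\d\d:\d\d) for (\d+) points of credit\."
)


def group_submissions(text):
    """Map (project, run, clone, gen) to a time-sorted list of (when, points)."""
    by_wu = defaultdict(list)
    for m in RECORD_RE.finditer(text):
        wu = tuple(int(g) for g in m.groups()[:4])
        when = datetime.strptime(m.group(5), "%Y-%m-%d %H:%M:%S")
        by_wu[wu].append((when, int(m.group(6))))
    for subs in by_wu.values():
        subs.sort()
    return dict(by_wu)


def summarize(by_wu):
    """For each WU: how many submissions, when credit was first given, how many got 0."""
    summary = {}
    for wu, subs in by_wu.items():
        credited = [when for when, pts in subs if pts > 0]
        summary[wu] = {
            "submissions": len(subs),
            "credited_at": credited[0] if credited else None,
            "zero_credit": sum(1 for _, pts in subs if pts == 0),
        }
    return summary


summary = summarize(group_submissions(RECORDS))
for wu, info in summary.items():
    print(wu, info)
```

In both WUs listed here the earliest record is the one that carries the 292 points, and the two later records carry zero.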
brityank
Posts: 161
Joined: Wed Dec 05, 2007 9:16 pm
Location: SE Pennsylvania

Re: General server issues?

Post by brityank »

endrik and bruce --

Thank you both for an interesting discussion of the behind the scenes actions of the infrastructure. :D
... ... Free Republic Folders - A Tribute to Ronald Reagan ... ...