I'd suggest that such WUs be kept within Stanford's walls, to be run and checked/finished there instead of being sent out for months on end ...
That way, troublesome WUs would leave the loop earlier and their faults would be discovered sooner too!
Project: 2665 (Run 1, Clone 649, Gen 6)
-
- Posts: 270
- Joined: Sun Dec 02, 2007 2:26 pm
- Hardware configuration: Folders: Intel C2D E6550 @ 3.150 GHz + GPU XFX 9800GTX+ @ 765 MHZ w. WinXP-GPU
AMD A2X64 3800+ @ stock + GPU XFX 9800GTX+ @ 775 MHZ w. WinXP-GPU
Main rig: an old Athlon Barton 2500+ @2.25 GHz & 2* 512 MB RAM Apacer, Radeon 9800Pro, WinXP SP3+
- Location: Belgium, near the International Sea-Port of Antwerp
Re: Project: 2665 (Run 1, Clone 649, Gen 6)
- stopped Linux SMP w. HT on i7-860@3.5 GHz
....................................
Folded since 10-06-04 till 09-2010
Re: Project: 2665 (Run 1, Clone 649, Gen 6)
I got this work unit.
Fixed problem. Changed machine ID, and voilà! New WU!
[23:01:53] Initial: 0000; - Receiving payload (expected size: 4659162)
[23:06:35] - Downloaded at ~16 kB/s
[23:06:35] - Averaged speed for that direction ~61 kB/s
[23:06:35] + Received work.
[23:06:35] Trying to send all finished work units
[23:06:35] + No unsent completed units remaining.
[23:06:35] + Closed connections
[23:06:35]
[23:06:35] + Processing work unit
[23:06:35] Work type a1 not eligible for variable processors
[23:06:35] Core required: FahCore_a1.exe
[23:06:35] Core found.
[23:06:35] Working on queue slot 04 [August 21 23:06:35 UTC]
[23:06:35] + Working ...
[23:06:35] - Calling 'mpiexec -np 4 -channel shm -env MPICH_USE_SMP_OPTIMIZATIONS 1 -host 127.0.0.1 FahCore_a1.exe -dir work/ -suffix 04 -checkpoint 15 -verbose -lifeline 3764 -version 622'
[23:06:36]
[23:06:36] *------------------------------*
[23:06:36] Folding@Home Gromacs SMP Core
[23:06:36] Version 1.76 (February 23, 2008)
[23:06:36]
[23:06:36] Preparing to commence simulation
[23:06:36] - Ensuring status. Please wait- Created dyn
[23:06:36] - Files status OK
[23:06:36] 4.sas
[23:06:36] - Failed to- Created dyn
[23:06:36] - Files status OK
[23:06:36] ng: check for stray files
[23:06:36] - Created dyn
[23:06:36] - Files status OK
[23:06:48] 65 (Run 1, Clone 649, Gen 6)
[23:06:48]
[23:06:48] packet
[23:06:48]
[23:06:48] Project: 2665 (Run 1, - Starting from initial work packet
[23:06:48]
[23:06:48] Project: 2665 (Run 1, Clone 649, Gen 6)
[23:06:48]
[23:06:51] Assembly optimizations on if available.
[23:06:51] Entering M.D.
[23:07:07] files
[23:07:07] n: IBX in water
[23:07:07] Writing local files
[23:07:09] Extra SSE boost OK.
[23:07:17] Gromacs cannot continue further.
[23:07:17] Going to send back what have done.
[23:07:17] logfil- Failed to d- Writing 9958 bytes - FaNo C.P. to delete.
[23:07:17] - Failed to delete work/wudata_04.bed
[23:07:17] - Failed to delNo C.P. to delete.
[23:07:17] - Failed to delet- Failed to delete work/wudata_04.bed
[23:07:17] - Failed to delete
[23:07:17] Folding@home Core Shutdown: EARLY_UNIT_END
[23:07:17] Finalizing output
[23:07:17] ng: check for stray files
[23:09:17]
[23:09:17] Folding@home Core Shutdown: EARLY_UNIT_END
[23:09:17] Finalizing output
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: Project: 2665 (Run 1, Clone 649, Gen 6)
Our server code will give a WU a certain # of tries and then stop it. We'll look to see if this code is not working in this case or if there's something else going on here.
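As a rough illustration of the "certain # of tries" idea, here is a minimal sketch of a per-WU failure cap. It is not the actual server code: the class `WorkUnitTracker`, its methods, and the limit `MAX_FAILURES` are all hypothetical, and the real threshold is not stated in this thread.

```python
from collections import defaultdict

MAX_FAILURES = 5  # hypothetical limit; the real server value is not given here


class WorkUnitTracker:
    """Counts failed returns per work unit and retires units that hit the cap."""

    def __init__(self, max_failures=MAX_FAILURES):
        self.max_failures = max_failures
        self.failures = defaultdict(int)  # wu_id -> number of failed attempts
        self.retired = set()              # wu_ids that must not be reassigned

    def report_failure(self, wu_id):
        """Record one failed attempt; return True if the WU is now retired."""
        self.failures[wu_id] += 1
        if self.failures[wu_id] >= self.max_failures:
            self.retired.add(wu_id)
        return wu_id in self.retired

    def can_assign(self, wu_id):
        """A WU may be handed out again only while it is not retired."""
        return wu_id not in self.retired


# Example: a WU identified by (project, run, clone, gen) fails three times.
tracker = WorkUnitTracker(max_failures=3)
wu = (2665, 1, 649, 6)
for _ in range(3):
    tracker.report_failure(wu)
print(tracker.can_assign(wu))
```

With a cap like this, a WU that keeps returning EARLY_UNIT_END would stop being reassigned once enough clients had reported it.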
Re: Project: 2665 (Run 1, Clone 649, Gen 6)
Problem is, the client sits there hung after the WU gets an EUE. It sat there for over two hours before I found it, then I couldn't get rid of it after deleting it several times.
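For anyone who wants to spot this without watching the client, the following is a small sketch that scans a client log for WUs that ended with EARLY_UNIT_END. It assumes the log format seen in this thread (a `Project: ... (Run ..., Clone ..., Gen ...)` line followed later by a `Core Shutdown: EARLY_UNIT_END` line); the function name `find_early_unit_ends` is made up for illustration and is not part of the client.

```python
import re

# Matches lines such as "Project: 2665 (Run 1, Clone 649, Gen 6)".
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")


def find_early_unit_ends(log_lines):
    """Return (project, run, clone, gen) tuples for WUs that hit EARLY_UNIT_END."""
    current = None  # most recent WU seen in the log, if any
    failed = []
    for line in log_lines:
        match = PROJECT_RE.search(line)
        if match:
            current = tuple(int(g) for g in match.groups())
        elif "EARLY_UNIT_END" in line and current is not None:
            failed.append(current)
            current = None  # ignore repeated shutdown lines for the same attempt
    return failed


sample = [
    "[23:06:48] Project: 2665 (Run 1, Clone 649, Gen 6)",
    "[23:07:17] Folding@home Core Shutdown: EARLY_UNIT_END",
    "[23:09:17] Folding@home Core Shutdown: EARLY_UNIT_END",
]
print(find_early_unit_ends(sample))
```

Lines garbled by interleaved MPI output will not match the pattern, so this only picks up the cleanly printed project line.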
-
- Posts: 33
- Joined: Sun May 25, 2008 7:40 pm
Re: Project: 2665 (Run 1, Clone 649, Gen 6)
VijayPande wrote: Our server code will give a WU a certain # of tries and then stop it. We'll look to see if this code is not working in this case or if there's something else going on here.
+1 more. Most of my Linux boxes and now starting on my Windows machines. This WU stalls the client w/ a seg fault in Linux and results in no further activity (w/o a F@H restart) in Windows.
Re: Project: 2665 (Run 1, Clone 649, Gen 6)
Thanks for being patient, SW. Let me check again.
Paula
-
- Pande Group Member
- Posts: 2058
- Joined: Fri Nov 30, 2007 6:25 am
- Location: Stanford
Re: Project: 2665 (Run 1, Clone 649, Gen 6)
I was hoping the code would take care of this automatically (since we can't kill WU's by hand too frequently), but that's not working. We're working on a server code update to handle that. For now, I've manually killed this WU.
-
- Posts: 33
- Joined: Sun May 25, 2008 7:40 pm
Re: Project: 2665 (Run 1, Clone 649, Gen 6)
Thank you.