2605 (multiple)

Posted: Thu Apr 24, 2008 11:20 am
by PlayLoud
All of a sudden, I am having trouble starting work units. I am running Ubuntu and the SMP 6.02 beta 1 client. My last work unit finished with no problems. However, tonight I tried to start another, and it keeps hanging at the beginning. I have tried deleting the work unit and starting over, and even deleting Folding@home, redownloading, and reinstalling, but I still get the same problem. The failing work units were all project 2605, but the run/clone/gen were different each time.

Please help.

Dell Inspiron 1520
Intel Core 2 Duo T7100 - 1.8 GHz
2048 MB RAM
Ubuntu 7.10 - 64-bit

Code:

[11:10:23] Verifying core Core_a1.fah...
[11:10:23] Signature is VALID
[11:10:23] 
[11:10:23] Trying to unzip core FahCore_a1.exe
[11:10:24] Decompressed FahCore_a1.exe (3625104 bytes) successfully
[11:10:24] + Core successfully engaged
[11:10:29] 
[11:10:29] + Processing work unit
[11:10:29] Core required: FahCore_a1.exe
[11:10:29] Core found.
[11:10:29] Working on Unit 01 [April 24 11:10:29]
[11:10:29] + Working ...
[11:10:29] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a1.exe -dir work/ -suffix 01 -checkpoint 15 -forceasm -verbose -lifeline 7887 -version 602'

[11:10:29] 
[11:10:29] *------------------------------*
[11:10:29] Folding@Home Gromacs SMP Core
[11:10:29] Version 1.74 (November 27, 2006)
[11:10:29] 
[11:10:29] Preparing to commence simulation
[11:10:29] - Ensuring status. Please wait.
[11:10:29] Finalizing output
[11:10:30] - Expanded 2435181 -> 12885597 (decompressed 529.1 percent)
[11:10:30] - Starting from initial work packet
[11:10:30] 
[11:10:30] Project: 2605 (Run 13, Clone 205, Gen 41)
[11:10:30] 
[11:10:30] Assembly optimizations on if available.
[11:10:30] Entering M.D.
[11:10:40] CoreStatus = 0 (0)
[11:10:40] Client-core communications error: ERROR 0x0
[11:10:40] Deleting current work unit & continuing...
[11:12:44] ***** Got an Activate signal (2)
[11:12:44] Killing all core threads

Folding@Home Client Shutdown.
[11:12:44] - Warning: Could not delete all work unit files (1): Core file absent
[11:12:44] Trying to send all finished work units
[11:12:44] + No unsent completed units remaining.
[11:12:44] - Preparing to get new work unit...
[11:12:44] + Attempting to get work packet
[11:12:44] - Will indicate memory of 2006 MB
[11:12:44] - Connecting to assignment server
[11:12:44] Connecting to http://assign.stanford.edu:8080/

Re: 2605 (multiple)

Posted: Thu Apr 24, 2008 11:30 am
by PlayLoud
Also, when I tried to run it again, it said something about a broken pipe...

Oh, I also tried it with the -forceasm flag off (which I normally have on), but it still didn't work.

Code:

[11:24:08] 
[11:24:08] *------------------------------*
[11:24:08] Folding@Home Gromacs SMP Core
[11:24:08] Version 1.74 (November 27, 2006)
[11:24:08] 
[11:24:08] Preparing to commence simulation
[11:24:08] - Ensuring status. Please wait.
[11:24:25] - Assembly optimizations manually forced on.
[11:24:25] - Not checking prior termination.
[11:24:25] Error: Work unit read from disk is invalid
[11:24:25] Finalizing output
[11:24:26] - Expanded 2435181 -> 12885597 (decompressed 529.1 percent)
[11:24:26] - Starting from initial work packet
[11:24:26] 
[11:24:26] Project: 2605 (Run 13, Clone 205, Gen 41)
[11:24:26] 
[11:24:27] Assembly optimizations on if available.
[11:24:27] Entering M.D.
[unset]: write_line error; fd=8 buf=:cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
:
system msg for write_line failure : Broken pipe
[unset]: write_line error; fd=14 buf=:cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
:
system msg for write_line failure : Broken pipe
[0]0:Return code = 0, signaled with Quit
[0]1:Return code = 0, signaled with Floating point exception
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 0, signaled with Floating point exception
[11:24:38] CoreStatus = 0 (0)
[11:24:38] Client-core communications error: ERROR 0x0
[11:24:38] Deleting current work unit & continuing...

Re: 2605 (multiple)

Posted: Thu Apr 24, 2008 6:22 pm
by PlayLoud
Is anybody able to help with this problem? It now seems to happen with every work unit I try on this machine (it always pulls 2605). I would think there is something wrong with the machine, but it did the last work unit just fine. Is there anything in the errors that can tell what is going on? My 2nd best machine is basically down until I figure this out.

Re: 2605 (multiple)

Posted: Thu Apr 24, 2008 6:43 pm
by Ren02
Have you tried to remove the -smp flag and let it run as a uniproc client as a test? If that doesn't work either then it looks like a hardware issue. If that works, then there might be something wrong with the network interface. Unfortunately I'm not an Ubuntu expert though. :(

Re: 2605 (multiple)

Posted: Thu Apr 24, 2008 7:12 pm
by PlayLoud
I took out the -smp flag, and it runs fine.

I don't know exactly what that means though. SMP was working just a couple of days ago. I have changed nothing. :cry:

Re: 2605 (multiple)

Posted: Thu Apr 24, 2008 8:28 pm
by bruce
If you were running Windows, I'd tell you to re-register your password, but that doesn't apply to Linux.

Assuming you've got N copies of the non-SMP client running, check the temperature. It may be that enough dust has accumulated in your heat sink that your CPU (or RAM, or chipset) has finally passed the critical temperature above which the system is no longer stable.

Re: 2605 (multiple)

Posted: Thu Apr 24, 2008 8:57 pm
by PlayLoud
I checked for dust just a couple of weeks ago. It's clean, and temps are good.

Re: 2605 (multiple)

Posted: Thu Apr 24, 2008 9:30 pm
by Ren02
When you type ifconfig in the console, does it report any errors or dropped packets?
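
For anyone who would rather not read the counters by eye, here is a minimal Python sketch that scans ifconfig output for non-zero error or dropped counts. It assumes the older net-tools layout (lines like "RX packets:... errors:... dropped:..."); newer ifconfig versions print a different format and would need a different pattern. The function name interface_problems and the SAMPLE text are made up for illustration and are not real output from this machine.

```python
import re

# Matches old-style net-tools counter lines such as:
#   RX packets:1200 errors:0 dropped:0 overruns:0 frame:0
COUNTER_LINE = re.compile(r"\b(RX|TX) packets:(\d+) errors:(\d+) dropped:(\d+)")


def interface_problems(ifconfig_text):
    """Return {(interface, direction): (errors, dropped)} for non-zero counters."""
    problems = {}
    iface = None
    for line in ifconfig_text.splitlines():
        # A new interface block starts in column 0, e.g. "lo   Link encap:..."
        if line and not line[0].isspace():
            iface = line.split()[0]
        match = COUNTER_LINE.search(line)
        if match and iface is not None:
            direction, _packets, errors, dropped = match.groups()
            if int(errors) or int(dropped):
                problems[(iface, direction)] = (int(errors), int(dropped))
    return problems


# Hypothetical sample text, only to show the expected layout.
SAMPLE = """\
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          RX packets:1200 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1200 errors:0 dropped:3 overruns:0 carrier:0
"""

print(interface_problems(SAMPLE))
```

An empty result means no errors or drops were reported. Real output could be passed in with subprocess.run(["ifconfig"], capture_output=True, text=True).stdout.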

Re: 2605 (multiple)

Posted: Thu Apr 24, 2008 11:15 pm
by bruce
There should be an entry in /etc/hosts that says
127.0.0.1 localhost <yourhostname>
and you should get the same results from all three of
ping 127.0.0.1
ping localhost
ping <yourhostname>
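
The same comparison can be sketched in Python instead of three separate pings. This is only a rough equivalent: it checks name resolution, not that packets actually come back, and it assumes "the same results" means all three names resolve to the same address. The helper names resolve_all and all_same are made up for this example.

```python
import socket


def resolve_all(names, resolve=socket.gethostbyname):
    """Map each name to the IPv4 address it resolves to, or None on failure."""
    addresses = {}
    for name in names:
        try:
            addresses[name] = resolve(name)
        except OSError:
            # socket.gaierror is a subclass of OSError: the name did not resolve
            addresses[name] = None
    return addresses


def all_same(addresses):
    """True only if every name resolved, and all to one and the same address."""
    values = set(addresses.values())
    return len(values) == 1 and None not in values
```

On the affected machine, all_same(resolve_all(["127.0.0.1", "localhost", socket.gethostname()])) returning False would suggest the hostname is not mapped to 127.0.0.1 in /etc/hosts. The resolve parameter is there only so the logic can be tried with a fake lookup table.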

Re: 2605 (multiple)

Posted: Fri Apr 25, 2008 1:32 am
by PlayLoud
Ren02 wrote: When you type ifconfig in the console, does it report any errors or dropped packets?
None.

Re: 2605 (multiple)

Posted: Fri Apr 25, 2008 1:36 am
by PlayLoud
bruce wrote: There should be an entry in /etc/hosts that says
127.0.0.1 localhost <yourhostname>
and you should get the same results from all three of
ping 127.0.0.1
ping localhost
ping <yourhostname>
Yes, they all ping fine.

Re: 2605 (multiple)

Posted: Fri Apr 25, 2008 1:38 am
by PlayLoud
If I can't figure this out tonight (another 6.5 hours until the single-core WU I started is complete), I am probably going to wipe the Linux install, as Ubuntu 8.04 has just been released anyway. Perhaps a fresh install of the OS will help.

Re: 2605 (multiple)

Posted: Sun Apr 27, 2008 7:13 pm
by PlayLoud
Well, I deleted Ubuntu 7.10, and installed Ubuntu 8.04.

I can start SMP WUs again, so that is good. However, now I have a new problem: I can't fold very fast. SpeedStep is throttling down my processor, even though it should be running at full speed right now. Project 2605 is taking over 40 minutes/frame. If I disable SpeedStep, I get 30 minutes/frame. I should be getting about 20 minutes/frame. (Disabling SpeedStep in Ubuntu 7.10 also slowed the project down for some reason.)

I started a new thread for my new problem...

viewtopic.php?f=12&t=2342

Re: 2605 (multiple) Fixed

Posted: Sat May 10, 2008 9:58 pm
by DreadedOne509
I'm having this exact same issue with various project 2605 WUs (different run/clone/gen as well).
Ubuntu 6.10, Intel C2D 6400 originally at 2.9 GHz, dropped to 2.4 GHz.
No dust, heat OK, memtest stable for 3 passes (even overclocked).

Just stops after Entering M.D. and does nothing.

As an edit: I found several other posts regarding this issue, some covering Linux
(Ubuntu specifically) and Mac OS X. All seem to point to the /etc/hosts file as
being the culprit, or permissions therein. My hosts file looks like this:

127.0.0.1 localhost
127.0.0.1 Dakota.APACHE

Then the IPv6 lines. I've tried it with all of the IPv6 lines uncommented, and with them commented out,
restarting networking each time, to no avail.

FIXED -

Changing the /etc/hosts file to look like this fixed it:

127.0.0.1 localhost Dakota
127.0.0.1 Dakota.APACHE
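
For anyone wanting to check their own file for the same thing, here is a small Python sketch that lists the names a hosts file maps to 127.0.0.1. It assumes the short hostname on this box is "Dakota" (going by the line that fixed it) and uses the two versions of the file shown above as test input. The function name loopback_names is made up for the example.

```python
def loopback_names(hosts_text, address="127.0.0.1"):
    """Collect every name that the hosts text maps to the given address."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        fields = line.split()
        # A usable entry is an address followed by at least one name
        if len(fields) >= 2 and fields[0] == address:
            names.update(fields[1:])
    return names


# The two versions of the file from this post (IPv6 lines left out).
BEFORE = "127.0.0.1 localhost\n127.0.0.1 Dakota.APACHE\n"
AFTER = "127.0.0.1 localhost Dakota\n127.0.0.1 Dakota.APACHE\n"

print("Dakota" in loopback_names(BEFORE))
print("Dakota" in loopback_names(AFTER))
```

To run it against a real file, pass in open("/etc/hosts").read() and look for the name reported by the hostname command.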