UNSTABLE_MACHINE resets GPU at ~1.51%

CaulfieldCap
Posts: 3
Joined: Mon Jun 17, 2013 7:35 pm

UNSTABLE_MACHINE resets GPU at ~1.51%

Post by CaulfieldCap »

I'm very new to Folding@Home, as I've just created my account and installed everything. I apologize if this isn't an exceptional question, but it is one that I can't seem to figure out.

So right now, my GPU (nVidia GeForce 660Ti) is running PRCG 8074 (42, 27, 69). Every time it reaches 1.51 or 1.52 percent complete, it resets back to 0% complete, reaches 1.51 or 1.52 percent complete again, and repeats. How can I fix this, and is there any information anyone would need to help diagnose my problem? Thank you very much!

-CC
bollix47
Posts: 2958
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Folding process running on GPU resets at ~1.51%

Post by bollix47 »

Welcome to the folding@home support forum CaulfieldCap.

If you could please supply the log as described here we will try to help.
CaulfieldCap
Posts: 3
Joined: Mon Jun 17, 2013 7:35 pm

Re: Folding process running on GPU resets at ~1.51%

Post by CaulfieldCap »

Sure, here's what shows up in the console as it resets.

As you can see, after it passes 2%, mdrun_gpu returns 52 and UNSTABLE_MACHINE comes up...

Then it appears to restart.

Code:

19:48:03:WU02:FS00:0x15:Setting checkpoint frequency: 500000
19:48:03:WU02:FS00:0x15:Completed         3 out of 50000000 steps (0%).
19:48:36:WARNING:Exception: 8:127.0.0.1: Send error: 10053: An established connection was aborted by the software in your host machine.
19:49:22:WU00:FS01:0xa3:Completed 15000 out of 500000 steps  (3%)
19:49:39:12:127.0.0.1:New Web connection
19:50:03:WU02:FS00:0x15:Completed    500000 out of 50000000 steps (1%).
19:52:02:WU02:FS00:0x15:Completed   1000000 out of 50000000 steps (2%).
19:52:02:WU02:FS00:0x15:mdrun_gpu returned 52
19:52:02:WU02:FS00:0x15:NANs detected on GPU
19:52:02:WU02:FS00:0x15:
19:52:02:WU02:FS00:0x15:Folding@home Core Shutdown: UNSTABLE_MACHINE
19:52:02:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
19:52:02:WARNING:WU02:FS00:Too many errors, failing
19:52:02:WU02:FS00:Sending unit results: id:02 state:SEND error:FAILED project:8074 run:42 clone:27 gen:69 core:0x15 unit:0x0000004f6652edb450b430db82969efb
19:52:02:WU02:FS00:Connecting to 171.67.108.36:8080
19:52:02:WU01:FS00:Connecting to assign-GPU.stanford.edu:80
19:52:03:WU02:FS00:Server responded WORK_QUIT (404)
19:52:03:WARNING:WU02:FS00:Server did not like results, dumping
19:52:03:WU02:FS00:Cleaning up
19:52:03:WU01:FS00:News: Welcome to Folding@Home
19:52:03:WU01:FS00:Assigned to work server 171.67.108.36
19:52:03:WU01:FS00:Requesting new work unit for slot 00: READY gpu:0:GK104 [GeForce GTX 660 Ti] from 171.67.108.36
19:52:03:WU01:FS00:Connecting to 171.67.108.36:8080
19:52:04:WU01:FS00:Downloading 59.59KiB
19:52:04:WU01:FS00:Download complete
19:52:04:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8074 run:53 clone:27 gen:54 core:0x15 unit:0x0000003a6652edb450b431192c7ad41a
19:52:04:WU01:FS00:Starting
19:52:04:WU01:FS00:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" "C:/Users/Ian Zane/AppData/Roaming/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/Fermi/Core_15.fah/FahCore_15.exe" -dir 01 -suffix 01 -version 703 -lifeline 4148 -checkpoint 15 -gpu 0 -gpu-vendor nvidia
19:52:04:WU01:FS00:Started FahCore on PID 6400
19:52:04:WU01:FS00:Core PID:6696
19:52:04:WU01:FS00:FahCore 0x15 started
19:52:05:WU01:FS00:0x15:
bollix47
Posts: 2958
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Folding process running on GPU resets at ~1.51%

Post by bollix47 »

As long as your next work unit is progressing properly you should be okay. You might want to check your temperatures to ensure they're not too high. If you have this type of error every time on your 660ti you could also run a memory test.

There are no returns in the database for project:8074 run:42 clone:27 gen:69 at this time. I will mark it for followup.
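For anyone who wants to quote the PRCG numbers of a failed work unit when reporting it, here is a minimal sketch that pulls them out of a client log line. The helper name `parse_prcg` and the regular expression are assumptions made for illustration, not part of the FAH client; the sample line is trimmed from the log posted above.

```python
import re

# Hypothetical helper, not part of FAHClient: matches the
# "project:N run:N clone:N gen:N" fragment seen in the posted log.
PRCG_RE = re.compile(r"project:(\d+) run:(\d+) clone:(\d+) gen:(\d+)")

def parse_prcg(line):
    """Return (project, run, clone, gen) as ints, or None if absent."""
    match = PRCG_RE.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())

# Log line from the thread, trimmed before the unit id.
failed = ("19:52:02:WU02:FS00:Sending unit results: id:02 state:SEND "
          "error:FAILED project:8074 run:42 clone:27 gen:69 core:0x15")

print(parse_prcg(failed))
```

Lines without a PRCG fragment, such as the progress lines, simply return `None`.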

Status 7A
CaulfieldCap
Posts: 3
Joined: Mon Jun 17, 2013 7:35 pm

Re: Folding process running on GPU resets at ~1.51%

Post by CaulfieldCap »

My GPU is actually running relatively cool, at only 65 C (in game it runs at ~84 C). What do you mean by my next work unit? I'm sorry, I'm very new. Will I ever reach my next work unit if I don't complete this one?

Also, after doing a bit more research, I found that the issue may have come up due to an unstable overclock on my GPU. I've un-overclocked it now, and it seems to be working better (currently at 4.09%). I'll post later in this thread if it remains stable.
bollix47
Posts: 2958
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Folding process running on GPU resets at ~1.51%

Post by bollix47 »

What do you mean by my next work unit? I'm sorry, I'm very new. Will I ever reach my next work unit if I don't complete this one?
You received a different work unit after the one you reported (project:8074 run:42 clone:27 gen:69) was dumped: project:8074 run:53 clone:27 gen:54

Sometimes the same work unit (same PRCG numbers) is reassigned to you if it failed. That can happen a number of times (I think up to 5) before the server realizes it's not going to get this work unit back from you and moves on to a different one, which is what you're working on now. Since the new one is working better, the failed one could very well have been a bad work unit. Or it could have been a bad overclock. We'll know more if and when the work unit is returned by someone else.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Folding process running on GPU resets at ~1.51%

Post by bruce »

FAH is known to "push" GPUs harder than many other applications, which has often meant that unstable overclocks are "discovered" by FAH.
bollix47
Posts: 2958
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: UNSTABLE_MACHINE resets GPU at ~1.51%

Post by bollix47 »

The work unit was completed by another folder:

Hi ***** (team *****),
Your WU (P8074 R42 C27 G69) was added to the stats database on 2013-06-17 21:00:09 for 3874 points of credit.

Looks more likely that the overclock may have caused the problem.

Followup report closed.
ntsarb
Posts: 10
Joined: Fri Sep 13, 2013 6:38 pm

Re: UNSTABLE_MACHINE resets GPU at ~1.51%

Post by ntsarb »

I think I'm dealing with the same problem. Any help would be appreciated.

The machine is based on i7 860 (stock clock), 2 x MSI GTX 660 Twin Frozr (stock clock), 4 x 4 GB DDR3 1333MHz RAM

Here's the Log:

Code:

05:21:12:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
05:45:04:WARNING:WU01:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
05:45:06:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:80': Empty work server assignment
05:45:06:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:8080': Empty work server assignment
05:45:06:ERROR:WU03:FS01:Exception: Could not get an assignment
05:45:07:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:80': Empty work server assignment
05:45:08:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:8080': Empty work server assignment
05:45:08:ERROR:WU03:FS01:Exception: Could not get an assignment
05:46:07:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:80': Empty work server assignment
05:46:08:WARNING:WU03:FS01:Failed to get assignment from 'assign-GPU.stanford.edu:8080': Empty work server assignment
05:46:08:ERROR:WU03:FS01:Exception: Could not get an assignment
08:14:03:WARNING:WU02:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
08:14:03:WARNING:WU03:FS01:Detected clock skew (1 mins 06 secs), adjusting time estimates
******************************* Date: 2013-09-12 *******************************
10:33:25:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:34:31:WARNING:WU02:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
11:03:29:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
11:04:35:WARNING:WU02:FS00:Detected clock skew (1 mins 05 secs), adjusting time estimates
14:10:23:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
14:11:29:WARNING:WU02:FS00:Detected clock skew (1 mins 05 secs), adjusting time estimates
******************************* Date: 2013-09-12 *******************************
19:09:20:WARNING:WU01:FS01:Detected clock skew (1 mins 05 secs), adjusting time estimates
19:09:21:WARNING:WU02:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
******************************* Date: 2013-09-13 *******************************
02:36:57:WARNING:WU01:FS01:Detected clock skew (1 mins 08 secs), adjusting time estimates
02:36:58:WARNING:WU02:FS00:Detected clock skew (1 mins 09 secs), adjusting time estimates
04:09:03:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
04:10:11:WARNING:WU02:FS00:Detected clock skew (1 mins 08 secs), adjusting time estimates
04:26:53:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
04:28:00:WARNING:WU02:FS00:Detected clock skew (1 mins 07 secs), adjusting time estimates
06:19:36:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
06:20:43:WARNING:WU02:FS00:Detected clock skew (1 mins 07 secs), adjusting time estimates
06:24:08:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
06:25:13:WARNING:WU03:FS01:Detected clock skew (1 mins 05 secs), adjusting time estimates
06:51:24:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
06:52:30:WARNING:WU02:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
08:31:42:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
08:32:48:WARNING:WU03:FS01:Detected clock skew (1 mins 05 secs), adjusting time estimates
******************************* Date: 2013-09-13 *******************************
09:20:13:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:07:15:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:08:21:WARNING:WU03:FS01:Detected clock skew (1 mins 05 secs), adjusting time estimates
10:12:34:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:12:34:WARNING:WU02:FS00:Too many errors, failing
10:12:35:WARNING:WU02:FS00:Server did not like results, dumping
10:14:49:WARNING:WU03:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:14:49:WARNING:WU03:FS01:Too many errors, failing
10:14:50:WARNING:WU03:FS01:Server did not like results, dumping
10:16:38:WARNING:WU01:FS00:FahCore has not changed since last download, aborting core update
10:32:30:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:33:37:WARNING:WU01:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
10:48:30:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
10:49:37:WARNING:WU01:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
11:03:27:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
11:04:33:WARNING:WU01:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates
11:11:57:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
11:13:02:WARNING:WU01:FS00:Detected clock skew (1 mins 04 secs), adjusting time estimates
11:20:38:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
11:20:38:WARNING:WU01:FS00:Too many errors, failing
11:20:39:WARNING:WU01:FS00:Server did not like results, dumping
******************************* Date: 2013-09-13 *******************************
Mod edit: Added Code tags to log
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: UNSTABLE_MACHINE resets GPU at ~1.51%

Post by P5-133XL »

Not even close to enough information to diagnose. Please include the entire log, including the system and config portions and everything from where you received the WUs to after they failed.
Joe_H
Site Admin
Posts: 7937
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: UNSTABLE_MACHINE resets GPU at ~1.51%

Post by Joe_H »

You provided too small a section of your log to tell much of anything from. Please provide the beginning section of the log which contains version and system information as well as the system configuration so we can tell what folding slots correspond to what. The Welcome to the Forum posts at the top of this sub-forum give useful information on how to post and find log information.

Also post the part of the log where the error messages start; the messages you posted are somewhat after the fact.

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: UNSTABLE_MACHINE resets GPU at ~1.51%

Post by bruce »

The default installation for any system that folds with a mixture of ATI and NVidia GPUs is likely to be misconfigured. Providing the requested log information will certainly be helpful. I recommend you start by removing either type of GPU, leaving a single type. That will change the configuration and there's a (slim) chance it will start folding. If not, try deleting all of the GPU slots and let the system rebuild them. With more information, the next steps will become clear.
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona

Re: UNSTABLE_MACHINE resets GPU at ~1.51%

Post by 7im »

05:21:12:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
05:45:04:WARNING:WU01:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)

Very unusual to have two failures within minutes of each other, unless there was a driver crash, or the system was overheating. This points more towards a systemic issue than a work unit or client issue, otherwise only one would have failed, not both, IMO. Having bad weather? Power brownout from lightning or too much AC? System on a UPS? Gaming while folding? Lots of things can contribute to a problem like this.
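To put a number on how close together the two quoted failures are, here is a minimal sketch that subtracts the timestamps at the start of each log line. The function name `failure_gap` is hypothetical, and the code assumes every line begins with an `HH:MM:SS` stamp, as in the posted log.

```python
from datetime import datetime, timedelta

def failure_gap(line_a, line_b):
    """Time elapsed between two log lines that start with HH:MM:SS."""
    t_a = datetime.strptime(line_a[:8], "%H:%M:%S")
    t_b = datetime.strptime(line_b[:8], "%H:%M:%S")
    gap = t_b - t_a
    if gap < timedelta(0):
        # Assumption: the second line was logged after midnight.
        gap += timedelta(days=1)
    return gap

first = "05:21:12:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)"
second = "05:45:04:WARNING:WU01:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)"

print(failure_gap(first, second))  # 0:23:52
```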

Like they said above, please post more info.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
ntsarb
Posts: 10
Joined: Fri Sep 13, 2013 6:38 pm

Re: UNSTABLE_MACHINE resets GPU at ~1.51%

Post by ntsarb »

Many thanks for your responses. I have now run the Windows Memory Diagnostics utility (extended run) and it did not find any issues.

The previous log was automatically deleted. I let FaH run overnight and found lots of errors in the morning, but the log is too long to fit into a forum message (over 60000 characters). Is there an alternative way to share it? Here's the configuration part:

Code:

*********************** Log Started 2013-09-13T22:31:51Z ***********************
22:31:51:************************* Folding@home Client *************************
22:31:51:      Website: http://folding.stanford.edu/
22:31:51:    Copyright: (c) 2009-2013 Stanford University
22:31:51:       Author: Joseph Coffland <joseph@cauldrondevelopment.com>
22:31:51:         Args: --open-web-control
22:31:51:       Config: C:/Users/Nikos/AppData/Roaming/FAHClient/config.xml
22:31:51:******************************** Build ********************************
22:31:51:      Version: 7.3.6
22:31:51:         Date: Feb 18 2013
22:31:51:         Time: 15:25:17
22:31:51:      SVN Rev: 3923
22:31:51:       Branch: fah/trunk/client
22:31:51:     Compiler: Intel(R) C++ MSVC 1500 mode 1200
22:31:51:      Options: /TP /nologo /EHa /Qdiag-disable:4297,4103,1786,279 /Ox -arch:SSE
22:31:51:               /QaxSSE2,SSE3,SSSE3,SSE4.1,SSE4.2 /Qopenmp /Qrestrict /MT /Qmkl
22:31:51:     Platform: win32 XP
22:31:51:         Bits: 32
22:31:51:         Mode: Release
22:31:51:******************************* System ********************************
22:31:51:          CPU: Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz
22:31:51:       CPU ID: GenuineIntel Family 6 Model 30 Stepping 5
22:31:51:         CPUs: 8
22:31:51:       Memory: 16.00GiB
22:31:51:  Free Memory: 13.16GiB
22:31:51:      Threads: WINDOWS_THREADS
22:31:51:  Has Battery: false
22:31:51:   On Battery: false
22:31:51:   UTC offset: 1
22:31:51:          PID: 7908
22:31:51:          CWD: C:/Users/Nikos/AppData/Roaming/FAHClient
22:31:51:           OS: Windows 7 Home Premium
22:31:51:      OS Arch: AMD64
22:31:51:         GPUs: 2
22:31:51:        GPU 0: NVIDIA:3 GK106 [GeForce GTX 660]
22:31:51:        GPU 1: NVIDIA:3 GK106 [GeForce GTX 660]
22:31:51:         CUDA: 3.0
22:31:51:  CUDA Driver: 5050
22:31:51:Win32 Service: false
22:31:51:***********************************************************************
ntsarb
Posts: 10
Joined: Fri Sep 13, 2013 6:38 pm

Re: UNSTABLE_MACHINE resets GPU at ~1.51%

Post by ntsarb »

By filtering the Log entries per "Slot", I found that only the first GPU is exhibiting the problem. Notably, the second GPU and the CPU did not give any errors.
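The per-slot filtering can be sketched in a few lines of Python. This is an illustration rather than the exact method used here: the function name `failures_per_slot` and the regular expression are assumptions, and the sample lines come from the shorter excerpt posted earlier in the thread, not from the overnight log.

```python
import re
from collections import Counter

# Assumed line layout, based on the excerpts posted in this thread:
# HH:MM:SS:WARNING:WUnn:FSnn:FahCore returned: UNSTABLE_MACHINE ...
FAILURE_RE = re.compile(
    r"^\d\d:\d\d:\d\d:WARNING:WU\d+:(FS\d+):FahCore returned: UNSTABLE_MACHINE"
)

def failures_per_slot(lines):
    """Count UNSTABLE_MACHINE failures for each folding slot (FSnn)."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    "05:21:12:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
    "05:45:04:WARNING:WU01:FS01:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
    "08:14:03:WARNING:WU02:FS00:Detected clock skew (1 mins 06 secs), adjusting time estimates",
    "10:33:25:WARNING:WU02:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)",
]

for slot, count in sorted(failures_per_slot(sample).items()):
    print(slot, count)
```

To run it on a full log, pass the open file object instead of `sample`.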

I ran FaH using the following configurations (one GPU at a time):

1) 1st GPU card on 1st PCI-E slot, 2nd PCI-E slot empty
Produces errors within 2-3 minutes.

2) 1st GPU card on 2nd PCI-E slot, 1st PCI-E slot empty
No errors.

3) 2nd GPU card on 1st PCI-E slot, 2nd PCI-E slot empty
Produces errors within 2-3 minutes.

4) 2nd GPU card on 2nd PCI-E slot, 1st PCI-E slot empty
No errors.

Looks like the GPU cards are good and the 1st (blue) PCI-E slot on ASUS P7P55D-E EVO motherboard is problematic.
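The reasoning behind that conclusion can be written down as a small sketch: record each run as (card, slot, had errors) and look for a card or slot that failed in every run it took part in. The labels `card1`, `slot1` and so on are made up for illustration, and run 4 is treated as the 2nd card in the 2nd slot.

```python
# Each tuple: (card, pcie_slot, had_errors), transcribed from the four runs.
# Labels are hypothetical; run 4 is taken to be the 2nd card.
runs = [
    ("card1", "slot1", True),
    ("card1", "slot2", False),
    ("card2", "slot1", True),
    ("card2", "slot2", False),
]

def suspects(runs, index):
    """Values of one factor (0 = card, 1 = slot) that failed in every run."""
    results_by_value = {}
    for run in runs:
        results_by_value.setdefault(run[index], []).append(run[2])
    return sorted(
        value for value, results in results_by_value.items() if all(results)
    )

print(suspects(runs, 0))  # [] : neither card fails everywhere
print(suspects(runs, 1))  # ['slot1'] : the 1st slot fails with both cards
```

Both cards work in the 2nd slot and both fail in the 1st, so the failures follow the slot rather than the card.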

Many thanks to 7im (who helped me focus on system issues) and all users who kindly offered to help.
Last edited by ntsarb on Sat Sep 14, 2013 4:50 pm, edited 4 times in total.