Project: 9406, several "potential energy" errors

Moderators: Site Moderators, FAHC Science Team

ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Project: 9406, several "potential energy" errors

Post by ChristianVirtual »

A bad WU?

This was on a GTX 780 under Ubuntu 14.04; the WUs before and after it were OK.

Code: Select all

07:38:40:WU00:FS03:0x17:*********************** Log Started 2014-07-31T07:38:39Z ***********************
07:38:40:WU00:FS03:0x17:Project: 9406 (Run 661, Clone 0, Gen 94)
07:38:40:WU00:FS03:0x17:Unit: 0x000000920a3b1e5c533e43ae5193ba65
07:38:40:WU00:FS03:0x17:CPU: 0x00000000000000000000000000000000
07:38:40:WU00:FS03:0x17:Machine: 3
07:38:40:WU00:FS03:0x17:Reading tar file state.xml
07:38:40:WU00:FS03:0x17:Reading tar file system.xml
07:38:41:WU00:FS03:0x17:Reading tar file integrator.xml
07:38:41:WU00:FS03:0x17:Reading tar file core.xml
07:38:41:WU00:FS03:0x17:Digital signatures verified
07:38:48:WU02:FS03:Upload complete
07:38:48:WU02:FS03:Server responded WORK_ACK (400)
07:38:48:WU02:FS03:Final credit estimate, 28409.00 points
07:38:48:WU02:FS03:Cleaning up
07:42:09:WU00:FS03:0x17:ERROR:exception: Potential energy error of 30.6949, threshold of 10
07:42:09:WU00:FS03:0x17:ERROR:Reference Potential Energy: -904184 | Given Potential Energy: -904215
07:42:09:WU00:FS03:0x17:Saving result file logfile_01.txt
07:42:09:WU00:FS03:0x17:Saving result file badStateCheckpoint_253423090
07:42:10:WU00:FS03:0x17:Saving result file badStateForceGroup0_253423090Core.xml
07:42:13:WU00:FS03:0x17:Saving result file badStateForceGroup0_253423090Ref.xml
07:42:16:WU00:FS03:0x17:Saving result file badStateForceGroup1_253423090Core.xml
07:42:18:WU00:FS03:0x17:Saving result file badStateForceGroup1_253423090Ref.xml
07:42:21:WU00:FS03:0x17:Saving result file badStateForceGroup2_253423090Core.xml
07:42:23:WU00:FS03:0x17:Saving result file badStateForceGroup2_253423090Ref.xml
07:42:26:WU00:FS03:0x17:Saving result file log.txt
07:42:26:WU00:FS03:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
07:42:26:WARNING:WU00:FS03:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
07:42:26:WU00:FS03:Sending unit results: id:00 state:SEND error:FAULTY project:9406 run:661 clone:0 gen:94 core:0x17 unit:0x000000920a3b1e5c533e43ae5193ba65
07:42:26:WU00:FS03:Uploading 26.00MiB to 171.64.65.56
07:42:26:WU00:FS03:Connecting to 171.64.65.56:8080
07:42:27:WU02:FS03:Connecting to 171.67.108.201:80
07:42:32:WU00:FS03:Upload 33.41%
07:42:38:WU00:FS03:Upload 75.00%
07:42:43:WU00:FS03:Upload complete
07:42:43:WU00:FS03:Server responded WORK_ACK (400)
07:42:43:WU00:FS03:Cleaning up
And, in case it is useful, the config:

Code: Select all

11:14:57:************************* Folding@home Client *************************
11:14:57: Website: http://folding.stanford.edu/
11:14:57: Copyright: (c) 2009-2014 Stanford University
11:14:57: Author: Joseph Coffland <joseph@cauldrondevelopment.com>
11:14:57: Args: --child --lifeline 1126 /etc/fahclient/config.xml --run-as
11:14:57: fahclient --pid-file=/var/run/fahclient.pid --daemon
11:14:57: Config: /etc/fahclient/config.xml
11:14:57:******************************** Build ********************************
11:14:57: Version: 7.4.4
11:14:57: Date: Mar 4 2014
11:14:57: Time: 12:02:38
11:14:57: SVN Rev: 4130
11:14:57: Branch: fah/trunk/client
11:14:57: Compiler: GNU 4.4.7
11:14:57: Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
11:14:57: -fno-unsafe-math-optimizations -msse2
11:14:57: Platform: linux2 3.2.0-1-amd64
11:14:57: Bits: 64
11:14:57: Mode: Release
11:14:57:******************************* System ********************************
11:14:57: CPU: Intel(R) Core(TM) i7-2600S CPU @ 2.80GHz
11:14:57: CPU ID: GenuineIntel Family 6 Model 42 Stepping 7
11:14:57: CPUs: 8
11:14:57: Memory: 7.74GiB
11:14:57:Free Memory: 7.34GiB
11:14:57: Threads: POSIX_THREADS
11:14:57: OS Version: 3.13
11:14:57:Has Battery: false
11:14:57: On Battery: false
11:14:57: UTC Offset: 9
11:14:57: PID: 1128
11:14:57: CWD: /var/lib/fahclient
11:14:57: OS: Linux 3.13.0-29-generic x86_64
11:14:57: OS Arch: AMD64
11:14:57: GPUs: 2
11:14:57: GPU 0: NVIDIA:3 GK110 [GeForce GTX 780]
11:14:57: GPU 1: NVIDIA:3 GK110 [GeForce GTX 780]
11:14:57: CUDA: 3.5
11:14:57:CUDA Driver: 6000
11:14:57:***********************************************************************
11:14:57:<config>
11:14:57: <!-- HTTP Server -->
11:14:57: <!-- Logging -->
11:14:57: <log-rotate-max v='1024'/>
11:14:57:
11:14:57: <!-- Network -->
11:14:57: <proxy v=':8080'/>
11:14:57:
11:14:57:
11:14:57: <!-- Slot Control -->
11:14:57: <power v='full'/>
11:14:57:
11:14:57:
11:14:57: <!-- Folding Slots -->
11:14:57: <slot id='0' type='CPU'>
11:14:57: <client-type v='beta'/>
11:14:57: <cpus v='6'/>
11:14:57: <pause-on-start v='true'/>
11:14:57: <paused v='true'/>
11:14:57: </slot>
11:14:57: <slot id='3' type='GPU'>
11:14:57: <client-type v='beta'/>
11:14:57: <pause-on-start v='true'/>
11:14:57: <paused v='true'/>
11:14:57: </slot>
11:14:57: <slot id='1' type='GPU'>
11:14:57: <client-type v='beta'/>
11:14:57: <pause-on-start v='true'/>
11:14:57: <paused v='true'/>
11:14:57: </slot>
11:14:57:</config>
11:14:57:Switching to user fahclient
11:14:57:Trying to access database...
11:14:57:Successfully acquired database lock
11:14:57:Enabled folding slot 00: PAUSED cpu:6 (by user)
11:14:57:Enabled folding slot 03: PAUSED gpu:0:GK110 [GeForce GTX 780] (by user)
11:14:57:Enabled folding slot 01: PAUSED gpu:1:GK110 [GeForce GTX 780] (by user)
11:16:39:FS03:Unpaused
The slots were then unpaused.
Last edited by ChristianVirtual on Sat Sep 06, 2014 5:19 am, edited 2 times in total.
Please contribute your logs to http://ppd.fahmm.net
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4 4x512GB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for aprox. 70K PPD
Location: Salem. OR USA

Re: Project: 9406 (Run 661, Clone 0, Gen 94), potential ener

Post by P5-133XL »

Someone else successfully completed the wu ...
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project: 9406 (Run 661, Clone 0, Gen 94), potential ener

Post by ChristianVirtual »

Thanks
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project: 9406 (Run 661, Clone 0, Gen 94), potential ener

Post by ChristianVirtual »

And another one: 9406 (Run 76, Clone 0, Gen 112)

Same hardware as above (Ubuntu 14.04, GTX 780, no OC)

Code: Select all

03:56:05:WU03:FS01:0x17:*********************** Log Started 2014-08-07T03:56:05Z ***********************
03:56:05:WU03:FS01:0x17:Project: 9406 (Run 76, Clone 0, Gen 112)
03:56:05:WU03:FS01:0x17:Unit: 0x000000a50a3b1e5c533dda2c423aee18
03:56:05:WU03:FS01:0x17:CPU: 0x00000000000000000000000000000000
03:56:05:WU03:FS01:0x17:Machine: 1
03:56:05:WU03:FS01:0x17:Reading tar file state.xml
03:56:06:WU03:FS01:0x17:Reading tar file system.xml
03:56:06:WU03:FS01:0x17:Reading tar file integrator.xml
03:56:06:WU03:FS01:0x17:Reading tar file core.xml
03:56:06:WU03:FS01:0x17:Digital signatures verified
03:56:11:WU00:FS01:Upload 64.05%
03:56:26:WU00:FS01:Upload complete
03:56:26:WU00:FS01:Server responded WORK_ACK (400)
03:56:26:WU00:FS01:Final credit estimate, 38708.00 points
03:56:26:WU00:FS01:Cleaning up
03:57:45:WU01:FS00:0xa3:Completed 480000 out of 500000 steps (96%)

03:59:38:WU03:FS01:0x17:ERROR:exception: Potential energy error of 12.4874, threshold of 10
03:59:38:WU03:FS01:0x17:ERROR:Reference Potential Energy: -913192 | Given Potential Energy: -913204
03:59:38:WU03:FS01:0x17:Saving result file logfile_01.txt
03:59:38:WU03:FS01:0x17:Saving result file badStateCheckpoint_1269370778
03:59:39:WU03:FS01:0x17:Saving result file badStateForceGroup0_1269370778Core.xml
03:59:42:WU03:FS01:0x17:Saving result file badStateForceGroup0_1269370778Ref.xml
03:59:45:WU03:FS01:0x17:Saving result file badStateForceGroup1_1269370778Core.xml
03:59:47:WU03:FS01:0x17:Saving result file badStateForceGroup1_1269370778Ref.xml
03:59:50:WU03:FS01:0x17:Saving result file badStateForceGroup2_1269370778Core.xml
03:59:52:WU03:FS01:0x17:Saving result file badStateForceGroup2_1269370778Ref.xml
03:59:55:WU03:FS01:0x17:Saving result file log.txt
03:59:55:WU03:FS01:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
03:59:55:WARNING:WU03:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
03:59:55:WU03:FS01:Sending unit results: id:03 state:SEND error:FAULTY project:9406 run:76 clone:0 gen:112 core:0x17 unit:0x000000a50a3b1e5c533dda2c423aee18

The following assignment started successfully and is still crunching; temps are OK.
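
For anyone comparing several of these failures, the relevant numbers can be pulled out of the log with a short script. This is only a sketch: the regular expressions are written against the log lines quoted in this thread, the function name `find_energy_errors` is made up for illustration, and it assumes the core always prints the two ERROR lines back to back in this format.

```python
import re

# Patterns written against the log lines quoted in this thread; other
# core versions may format these messages differently (assumption).
ERROR_RE = re.compile(
    r"(WU\d+):(FS\d+):0x17:ERROR:exception: "
    r"Potential energy error of ([\d.]+), threshold of ([\d.]+)"
)
ENERGY_RE = re.compile(
    r"Reference Potential Energy: (-?[\d.]+) \| "
    r"Given Potential Energy: (-?[\d.]+)"
)


def find_energy_errors(log_text):
    """Return one dict per 'Potential energy error' found in log_text."""
    results = []
    current = None
    for line in log_text.splitlines():
        m = ERROR_RE.search(line)
        if m:
            current = {
                "wu": m.group(1),
                "slot": m.group(2),
                "error": float(m.group(3)),
                "threshold": float(m.group(4)),
            }
            results.append(current)
            continue
        # The energies are expected on the line right after the exception.
        m = ENERGY_RE.search(line)
        if m and current is not None:
            current["reference"] = float(m.group(1))
            current["given"] = float(m.group(2))
            current = None
    return results


# The two failures posted so far in this thread.
SAMPLE = """\
07:42:09:WU00:FS03:0x17:ERROR:exception: Potential energy error of 30.6949, threshold of 10
07:42:09:WU00:FS03:0x17:ERROR:Reference Potential Energy: -904184 | Given Potential Energy: -904215
03:59:38:WU03:FS01:0x17:ERROR:exception: Potential energy error of 12.4874, threshold of 10
03:59:38:WU03:FS01:0x17:ERROR:Reference Potential Energy: -913192 | Given Potential Energy: -913204
"""

for e in find_energy_errors(SAMPLE):
    # Difference between the two printed (apparently rounded) energies.
    gap = abs(e["given"] - e["reference"])
    print(e["slot"], e["wu"], e["error"], "rounded gap:", gap)
```

In both samples the reported error is close to the difference between the two printed energies, which look rounded in the log, so the check appears to be a plain comparison of that difference against the threshold of 10. That reading is inferred from the log output only, not from the core's source.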
bollix47
Posts: 2982
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: Project: 9406 (Run 661, Clone 0, Gen 94), potential ener

Post by bollix47 »

Another folder has completed this wu successfully:

Hi xxxx(team xxxx),
Your WU (P9406 R76 C0 G112) was added to the stats database on 2014-08-07 03:04:46 for 40666.5 points of credit.

Have you tried reducing your GPU's memory speed to stock specifications? Memory speed has very little effect on folding speed but when it's too high it can cause work units to fail.

This article may be helpful.
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 9406 (Run 661, Clone 0, Gen 94), potential ener

Post by bruce »

Someone else successfully completed the wu ...
When a project fails, it is reissued, as I'm sure you know. In fact, more than one reassignment is often distributed. (I don't understand the exact logic, so don't ask.)

In this case, Project: 9406 (Run 661, Clone 0, Gen 94) was successfully completed TWICE and it failed THREE times. Sometimes people have reported that a particular project is "sensitive" to overheating or overclocking. In fact, that's probably a reasonable excuse to justify a marginally stable system, one which fails more frequently when a WU uses resources more efficiently than whatever benchmarking tools were used to define "stable" without enough margin.

Personally, I have a non-overclocked GPU which only fails when the weather is hot, so it's really a question of the default fan profile being set by the manufacturer for "quiet" rather than for "certified as stable under elevated room temperatures." My choices seem to be underclocking or fixing the fan profile or providing better case cooling.
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project: 9406 (Run 661, Clone 0, Gen 94), potential ener

Post by ChristianVirtual »

My two GTX 780s are stock (as far as I know); I force the fans to a constant 80% via Coolbits.
One card often runs up to 80C but never passes that; the second card normally does not go above 70C.

But yeah, ambient temp might be a factor. Although the AC is running in the room, it "only" keeps it at 30C. The main purpose of the AC is also to get the humidity out. Hot and wet summer in Japan. Not nice for hardware or humans.
I will push the fans up another 5% and see if that makes a difference, and see if I can increase the airflow for the "hotter" card. On the other hand: on both cards, other WUs from 9406 before and after completed successfully. The two that failed did so directly after loading, in their very first frame.

Code: Select all

PRCG                    Status    TPF   Runtime   Credit    PPD
9406 (R:171 C:1 G:14)   running   4:10  06:56:56  ~ 35,817  ~ 123,814
9406 (R:772 C:0 G:17)   running   3:46  06:17:59  ~ 37,644  ~ 143,436
9406 (R:1023 C:0 G:31)  NO_ERROR  3:27  05:45:47  39,338    163,932
9406 (R:149 C:0 G:125)  NO_ERROR  3:59  06:38:54  36,627    132,304
9406 (R:593 C:0 G:67)   NO_ERROR  3:22  05:38:13  39,776    169,464
9406 (R:141 C:0 G:87)   NO_ERROR  3:31  05:52:11  38,975    159,484
9406 (R:76 C:0 G:112)   FAULTY    0:02  00:04:02  0         0
9406 (R:126 C:1 G:25)   NO_ERROR  3:22  05:36:45  39,804    170,572
9406 (R:598 C:0 G:43)   NO_ERROR  3:34  05:57:12  38,708    156,136
9406 (R:914 C:0 G:69)   NO_ERROR  3:44  06:13:35  37,847    145,978
Here is a two-day plot of the temps (on one of the days I stopped one GTX 780 during the daytime).
Image

The hotter card is also the one driving the UI for Ubuntu, and it sits in the lower position on a vertically mounted motherboard. So I need to see if the airflow can be improved a bit, or switch to a horizontal motherboard orientation.
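
For anyone who wants to collect the same kind of temperature and fan data, here is a minimal sketch built on `nvidia-smi`. It assumes the installed driver supports the `--query-gpu` fields used below on this card; the function names are invented, and the sample readings at the end are made up purely so the parser can be tried without a GPU.

```python
import subprocess

# nvidia-smi query for GPU index, core temperature (C) and fan speed (%).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,fan.speed",
    "--format=csv,noheader,nounits",
]


def parse_readings(csv_text):
    """Parse 'index, temp, fan' rows into (index, temp_c, fan_pct) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, temp, fan = [field.strip() for field in line.split(",")]
        # Some cards report the fan speed as "[N/A]"; keep that as None.
        fan_pct = int(fan) if fan.isdigit() else None
        rows.append((int(index), int(temp), fan_pct))
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_readings(out.stdout)


# Made-up sample output, not a real measurement from this machine.
print(parse_readings("0, 78, 80\n1, 69, 80\n"))
```

Calling `read_gpus()` from a loop with a timestamp would give a log that can be lined up against the failure times in the client log.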
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 9406 (Run 661, Clone 0, Gen 94), potential ener

Post by bruce »

Presumably the failed WUs were assigned to GPU0 (the hotter one). Can you check that? Do you have a graph covering the times of those failures?

It is really too bad that GPU0 is so much hotter than GPU1. Moreover, I'd really like to see the ability to run that GPU at half power so that it runs somewhere between almost 80C and close to 35C (not folding) but as I understand it, that's not something FAH can provide. The only option I see is a fan speed that brings the almost 80C down to around 76C or better case airflow that does the same thing. Check IRC for a recent report on improvements to CoolBits.
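
The slot-to-GPU assignment can be read from the client log itself: the `Enabled folding slot` lines at startup name the GPU index for each slot, and every WU line carries the slot as `FSxx`. A rough sketch of that lookup follows; the helper name `slot_to_gpu` is invented, and the pattern has only been matched against the startup lines posted earlier in this thread.

```python
import re

# Matches startup lines such as:
#   Enabled folding slot 03: PAUSED gpu:0:GK110 [GeForce GTX 780] (by user)
# CPU slots ("cpu:6") are intentionally not matched.
SLOT_RE = re.compile(r"Enabled folding slot (\d+): \S+ gpu:(\d+):")


def slot_to_gpu(log_text):
    """Map slot labels like 'FS03' to the GPU index from the startup lines."""
    return {
        "FS" + m.group(1): int(m.group(2))
        for m in SLOT_RE.finditer(log_text)
    }


STARTUP = """\
11:14:57:Enabled folding slot 00: PAUSED cpu:6 (by user)
11:14:57:Enabled folding slot 03: PAUSED gpu:0:GK110 [GeForce GTX 780] (by user)
11:14:57:Enabled folding slot 01: PAUSED gpu:1:GK110 [GeForce GTX 780] (by user)
"""

mapping = slot_to_gpu(STARTUP)
# Slots of the two failed WUs reported so far in this thread.
for slot in ("FS03", "FS01"):
    print(slot, "-> GPU", mapping[slot])
```

Whether the client's `gpu:0` is physically the hotter card still has to be checked against the numbering used by the temperature tool, since the two may not agree.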
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project: 9406 (Run 661, Clone 0, Gen 94), potential ener

Post by ChristianVirtual »

Each GPU had one event. The later one was on the cooler card.

Zooming in on that time, the temps look normal.
http://imageshack.com/a/img540/9732/3PC5QF.jpg

The temperature dip is a bit wider than usual because the first WU failed, but a new one arrived a few seconds later and kept folding.
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project: 9406 (Run 661, Clone 0, Gen 94), potential ener

Post by ChristianVirtual »

Another bad sample: 9406 (Run 689, Clone 0, Gen 38)

Nothing else changed in the config; it is the same GTX 780 that happily crunches other 9406 and 9202 WUs.

Code: Select all

18:29:51:WU01:FS03:0x17:*********************** Log Started 2014-09-04T18:29:51Z ***********************
18:29:51:WU01:FS03:0x17:Project: 9406 (Run 689, Clone 0, Gen 38)
18:29:51:WU01:FS03:0x17:Unit: 0x000000370a3b1e5c533e496e044a5ca3
18:29:51:WU01:FS03:0x17:CPU: 0x00000000000000000000000000000000
18:29:51:WU01:FS03:0x17:Machine: 3
18:29:51:WU01:FS03:0x17:Reading tar file state.xml
18:29:52:WU01:FS03:0x17:Reading tar file system.xml
18:29:52:WU01:FS03:0x17:Reading tar file integrator.xml
18:29:52:WU01:FS03:0x17:Reading tar file core.xml
18:29:52:WU01:FS03:0x17:Digital signatures verified
18:29:57:WU03:FS03:Upload 95.11%
18:30:01:WU03:FS03:Upload complete
18:30:01:WU03:FS03:Server responded WORK_ACK (400)
18:30:01:WU03:FS03:Final credit estimate, 28666.00 points
18:30:01:WU03:FS03:Cleaning up
18:31:16:WU02:FS01:0x17:Completed 900000 out of 2000000 steps (45%)
18:33:14:WU01:FS03:0x17:ERROR:exception: Potential energy error of 37.3903, threshold of 10
18:33:14:WU01:FS03:0x17:ERROR:Reference Potential Energy: -886405 | Given Potential Energy: -886443
18:33:14:WU01:FS03:0x17:Saving result file logfile_01.txt
18:33:14:WU01:FS03:0x17:Saving result file badStateCheckpoint_1696143153
18:33:15:WU01:FS03:0x17:Saving result file badStateForceGroup0_1696143153Core.xml
18:33:18:WU01:FS03:0x17:Saving result file badStateForceGroup0_1696143153Ref.xml
18:33:21:WU01:FS03:0x17:Saving result file badStateForceGroup1_1696143153Core.xml
18:33:23:WU01:FS03:0x17:Saving result file badStateForceGroup1_1696143153Ref.xml
18:33:26:WU01:FS03:0x17:Saving result file badStateForceGroup2_1696143153Core.xml
18:33:28:WU01:FS03:0x17:Saving result file badStateForceGroup2_1696143153Ref.xml
18:33:30:WU01:FS03:0x17:Saving result file log.txt
18:33:30:WU01:FS03:0x17:Folding@home Core Shutdown: BAD_WORK_UNIT
18:33:31:WARNING:WU01:FS03:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
18:33:31:WU01:FS03:Sending unit results: id:01 state:SEND error:FAULTY project:9406 run:689 clone:0 gen:38 core:0x17 unit:0x000000370a3b1e5c533e496e044a5ca3
18:33:31:WU01:FS03:Uploading 25.38MiB to 171.64.65.56
18:33:31:WU01:FS03:Connecting to 171.64.65.56:8080
18:33:32:WU03:FS03:Connecting to 171.67.108.201:80
18:33:37:WU01:FS03:Upload 27.58%
18:33:43:WU01:FS03:Upload 50.23%
18:33:49:WU01:FS03:Upload 82.49%
18:33:52:WU01:FS03:Upload complete
18:33:52:WU01:FS03:Server responded WORK_ACK (400)
18:33:52:WU01:FS03:Cleaning up
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 9406, several potential energy

Post by bruce »

The developers are apparently aware of this problem and have fixed it in the Windows Core17. We'll have to put up with the problem until they either update the Linux Core17 or that FahCore is classified as obsolete.

Fortunately very little time is lost before a new WU is assigned.
ChristianVirtual
Posts: 1576
Joined: Tue May 28, 2013 12:14 pm
Location: Tokyo

Re: Project: 9406, several potential energy

Post by ChristianVirtual »

Agreed, the impact is rather low since the fault happens directly after the start, so only a few minutes are lost. Let's wait for the fix. I guess further failure reports are not needed for this project/error condition?
Last edited by ChristianVirtual on Sat Sep 06, 2014 5:20 am, edited 1 time in total.
bruce
Posts: 20822
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: Project: 9406, several potential energy

Post by bruce »

Right