List of SMP WUs with the "1 core usage" issue


toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

List of SMP WUs with the "1 core usage" issue

Post by toTOW »

Please use this thread to report all WUs with this issue: SMP WUs using only one core (another symptom is a reduced download size of about 1.5 MB). Most reports are from Linux, but it might affect OS X too (no reports on Windows yet).
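
If you want to confirm whether a running WU is affected before reporting it, a rough check on Linux is to look at per-process CPU usage for the core processes. This is only a generic sketch (it assumes the usual FahCore_a2.exe process name; column layout varies by distribution):

Code: Select all

# On a healthy SMP WU each FahCore_a2 process should be busy on its own core;
# on a "bad" WU only one of them accumulates CPU time.
ps -o pid,pcpu,etime,stat,args -C FahCore_a2.exe

# Or watch it live for a few refreshes:
top -b -n 3 -d 5 | grep FahCore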

Known WUs with this issue:

Project: 2669 (Run 0, Clone 32, Gen 188)
Project: 2669 (Run 0, Clone 131, Gen 128)
Project: 2665 (Run 0, Clone 616, Gen 118)
Project: 2669 (Run 1, Clone 164, Gen 150)
Project: 2669 (Run 3, Clone 67, Gen 66)
Project: 2669 (Run 4, Clone 180, Gen 199)
Project: 2669 (Run 5, Clone 57, Gen 194)
Project: 2669 (Run 7, Clone 51, Gen 110)
Project: 2669 (Run 11, Clone 11, Gen 99)
Project: 2669 (Run 12, Clone 50, Gen 160)
Project: 2669 (Run 13, Clone 29, Gen 178)
Project: 2669 (Run 13, Clone 29, Gen 179)
Project: 2669 (Run 14, Clone 166, Gen 39)
Project: 2669 (Run 15, Clone 62, Gen 193)
Project: 2669 (Run 15, Clone 175, Gen 185)
Project: 2669 (Run 17, Clone 94, Gen 118)
Project: 2671 (Run 12, Clone 40, Gen 88)
Project: 2671 (Run 32, Clone 41, Gen 88)
Project: 2671 (Run 37, Clone 79, Gen 78)
Project: 2671 (Run 52, Clone 43, Gen 82)
Project: 2672 (Run 0, Clone 9, Gen 194)
Project: 2675 (Run 2, Clone 45, Gen 150)
Project: 2675 (Run 2, Clone 93, Gen 141)
Project: 2675 (Run 3, Clone 182, Gen 153)
Project: 2677 (Run 1, Clone 30, Gen 31)
Project: 2677 (Run 1, Clone 70, Gen 38)
Project: 2677 (Run 3, Clone 10, Gen 36)
Project: 2677 (Run 3, Clone 57, Gen 29)
Project: 2677 (Run 3, Clone 78, Gen 28)
Project: 2677 (Run 3, Clone 79, Gen 33)
Project: 2677 (Run 5, Clone 21, Gen 30)
Project: 2677 (Run 5, Clone 43, Gen 32)
Project: 2677 (Run 5, Clone 54, Gen 34)
Project: 2677 (Run 8, Clone 40, Gen 34)
Project: 2677 (Run 10, Clone 62, Gen 34)
Project: 2677 (Run 14, Clone 50, Gen 28)
Project: 2677 (Run 17, Clone 35, Gen 36)
Project: 2677 (Run 17, Clone 42, Gen 34)
Project: 2677 (Run 17, Clone 74, Gen 29)
Project: 2677 (Run 19, Clone 51, Gen 34)
Project: 2677 (Run 19, Clone 97, Gen 30)
Project: 2677 (Run 23, Clone 74, Gen 28)
Project: 2677 (Run 23, Clone 97, Gen 31)
Project: 2677 (Run 24, Clone 13, Gen 29)
Project: 2677 (Run 24, Clone 17, Gen 32)
Project: 2677 (Run 25, Clone 35, Gen 32)
Project: 2677 (Run 27, Clone 19, Gen 30)
Project: 2677 (Run 27, Clone 20, Gen 31)
Project: 2677 (Run 30, Clone 64, Gen 26)
Project: 2677 (Run 33, Clone 42, Gen 31)
Project: 2677 (Run 33, Clone 64, Gen 28)
Project: 2677 (Run 34, Clone 17, Gen 37)
Project: 2677 (Run 34, Clone 40, Gen 30)
Project: 2677 (Run 35, Clone 54, Gen 25)
Project: 2677 (Run 35, Clone 76, Gen 35)

I'll forward this thread to kasson so that he can keep track of the bad WUs.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
ChasR
Posts: 402
Joined: Sun Dec 02, 2007 5:36 am
Location: Atlanta, GA

Re: List of A2 WU with the "1 core usage" issue

Post by ChasR »

All running core 2.08

Project: 2669 (Run 1, Clone 164, Gen 150)
Project: 2669 (Run 17, Clone 94, Gen 118)
Project: 2677 (Run 5, Clone 43, Gen 32)
Project: 2677 (Run 19, Clone 51, Gen 34)

Do you suppose these are all bad projects or a bug in core 2.08?
I suspect some of the FahCore_a2 processes fail to close on completion of the previous WU. One machine I checked had six core processes open, with only one using CPU time.

Edit:
Closing FAH left two core processes showing as stopped. I killed those processes and restarted. Four core processes started and began to use CPU time, but three quickly dropped to 0% usage. Not what I suspected.
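
For anyone who wants to check for leftover core processes themselves, the generic Linux commands are something like this (nothing FAH-specific, and make sure the client is stopped before killing anything):

Code: Select all

# List any FahCore_a2 processes that are still around
pgrep -lf FahCore_a2

# With the client stopped, kill the leftovers before restarting
pkill -9 -f FahCore_a2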
toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

Re: List of A2 WU with the "1 core usage" issue

Post by toTOW »

I think this is an issue with the WUs ... but we need to wait for more details to see if kasson can figure out what's going on.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
parkut
Posts: 363
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: List of A2 WU with the "1 core usage" issue

Post by parkut »

Project: 2677 (Run 10, Clone 62, Gen 34) is another WU using only one core.
compressed_data_size=1504548 (a 1.5 MB download)

Code: Select all

conroe11.parkut.com
 21:42:01 up 8 days, 18:45,  0 users,  load average: 0.99, 1.00, 1.70
11694 97.5 11694 S ?        00:10:44 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 04 -checkpoint 10 -verbose -lifeline 2918 -version 624
11697  0.5 11697 S ?        00:00:03 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 04 -checkpoint 10 -verbose -lifeline 2918 -version 624
11695  0.2 11695 S ?        00:00:01 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 04 -checkpoint 10 -verbose -lifeline 2918 -version 624
11696  0.2 11696 S ?        00:00:01 ./FahCore_a2.exe -dir work/ -nice 19 -suffix 04 -checkpoint 10 -verbose -lifeline 2918 -version 624
...
model name	: Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
cpu MHz		: 3010.135
cache size	: 6144 KB
Memory: 1.96 GB physical, 1.94 GB virtual
...
Client Version 6.24R3  
Core: FahCore_a2.exe
Core Version 2.08 (Mon May 18 14:47:42 PDT 2009)
Current Work Unit
-----------------
Name: p2677_IBX in water
Tag: P2677R10C62G34
Download time: August 17 09:29:37
Due time: August 20 09:29:37
Progress: 0%  [__________]
...
Project: 2677 (Run 10, Clone 62, Gen 34) 1920.00 pts  
HayesK
Posts: 342
Joined: Sun Feb 22, 2009 4:23 pm
Hardware configuration: hardware folding: 24 GPUs (8-GTX980Ti, 2-GTX1060, 2-GTX1070, 6-1080Ti, 2-1660Ti, 4-2070 Super)
hardware idle: 7-2600K, 1-i7-950, 2-i7-930, 2-i7-920, 3-i7-860, 2-Q9450, 4-L5640, 1-Q9550, 2-GTS450, 2-GTX550Ti, 3-GTX560Ti, 3-GTX650Ti, 11-GTX650Ti-Boost, 4-GTX660Ti, 2-GTX670, 6-GTX750Ti, 7-GTX970
Location: La Porte, Texas

Re: List of A2 WU with the "1 core usage" issue

Post by HayesK »

I have had a similar problem with seven different "bad" A2 WUs over the past week. They appear to use one core at up to 100%, then switch to another core. I even got the same "bad" WU on two different rigs. Most have been on 2-core VMs running Ubuntu 8.04, but I caught a P2671 today on a Q9550 running Ubuntu 8.04 that was taking 53 minutes/frame rather than the typical 4 minutes/frame; rigs that typically take 10 minutes/frame have gone to 65 minutes/frame. When the first "bad" WU arrived about a week ago, I thought something was wrong with the VM, but even after rebooting the rig and moving the WU to a Q9550 Ubuntu 8.04 rig, the WU still had the single-core, long-frame-time problem. My solution has been to sort the HFM "frame time" column and delete the "bad" WU when detected.
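
The deletion itself is just the usual dump-a-WU procedure for the Linux v6 client; a sketch, assuming a default install directory, and with the understanding that it throws away everything currently queued:

Code: Select all

# Stop the client first, then remove the current work unit and queue
cd ~/fah            # example install directory
rm -rf work/ queue.dat
./fah6 -smp         # restart; the client will download a fresh WU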

Below is a list of the "bad" WUs received.

F47-VM1, p2677 (19-51-34) (dup with ChasR)
F41-VM2, p2677 (33-42-31)
F32-Linux1, p2669 (12-50-160)
F45-VM1, p2669 (14-166-39)
F38-VM1, p2677 (17-74-29) (dup with ToTOW)
F43-VM1, p2677 (19-51-34) (dup with ChasR and HayesK)
F41-VM1, p2677 (34-40-30)
F30-Linux1, p2671 (37-79-78)
Last edited by HayesK on Tue Aug 18, 2009 2:02 am, edited 1 time in total.
folding for OCF T32
<= 10-GPU ( 8-GTX980Ti, 2-RTX2070Super ) as HayesK =>
<= 24-GPU ( 3-650TiBoost, 1-660Ti, 3-750Ti, 1-960m, 4-970, 2-1060, 2-1070, 6-1080Ti, 2-1660Ti, 2-2070Super )
as HayesK_ALL_18SjyNbF8VdXaNAFCVfG4rAHUyvtdmoFvX =>
pclement
Posts: 2
Joined: Tue Aug 18, 2009 1:53 am

Re: List of A2 WU with the "1 core usage" issue

Post by pclement »

Project 2677 (R30, C64, G26) is having the issue also. :(
parkut
Posts: 363
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: List of A2 WU with the "1 core usage" issue

Post by parkut »

Project: 2677 (Run 34, Clone 40, Gen 30) is another WU with this issue; compressed_data_size=1501726.

Code: Select all

conroe11.parkut.com
 21:48:01 up 8 days, 18:51,  1 user,  load average: 1.06, 1.04, 1.48
skinnykid63
Posts: 8
Joined: Mon Jun 23, 2008 2:11 pm

Re: List of A2 WU with the "1 core usage" issue

Post by skinnykid63 »

Project 2677 (Run 23, Clone 97, Gen 31) and 2677 (Run 24, Clone 13, Gen 29) are only using one core here.
toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

Re: List of A2 WU with the "1 core usage" issue

Post by toTOW »

First post edited with new reported WUs.

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
susato
Site Moderator
Posts: 511
Joined: Fri Nov 30, 2007 4:57 am
Location: Team MacOSX
Contact:

Re: List of A2 WU with the "1 core usage" issue

Post by susato »

P2665 R0 C616 G118 appears to be another one - odd because it's a core_a1 WU.

Thanks ToTOW and MtM.
Oldhat
Posts: 30
Joined: Mon Dec 03, 2007 11:42 am
Location: Auckland

Re: List of A2 WU with the "1 core usage" issue

Post by Oldhat »

Just got my first one of these.

Initially I thought it might have been the X720 triple-core having problems, but the Q9450 is just as bad. :)

Project: 2677 (Run 27, Clone 20, Gen 31)

Folding@Home Client Version 6.02
Arguments: -local -forceasm -smp 4
Version 2.08 (Mon May 18 14:47:42 PDT 2009)

I deleted the above unit and received Project: 2677 (Run 29, Clone 9, Gen 35), and frame times are back to 5:31.
toTOW
Site Moderator
Posts: 6359
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France
Contact:

Re: List of SMP WUs with the "1 core usage" issue

Post by toTOW »

Added last p2665 and p2677 mentioned above.

This issue seems to be spreading to the A1 core ... :(

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
BrokenWolf
Posts: 126
Joined: Sat Aug 02, 2008 3:08 am

Re: List of SMP WUs with the "1 core usage" issue

Post by BrokenWolf »

Got another one while I was sleeping: P2669 R5/C57/G194.

It would take 7 more days to finish; it was downloaded 8 hours ago and is earning 253.54 PPD.

Code: Select all

[04:58:10] Connecting to http://assign.stanford.edu:8080/
[04:58:10] Posted data.
[04:58:10] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[04:58:10] + News From Folding@Home: Welcome to Folding@Home
[04:58:10] Loaded queue successfully.
[04:58:10] Connecting to http://171.64.65.56:8080/
[04:58:15] Posted data.
[04:58:15] Initial: 0000; - Receiving payload (expected size: 1512874)
[04:58:17] - Downloaded at ~738 kB/s
[04:58:17] - Averaged speed for that direction ~1029 kB/s
[04:58:17] + Received work.
[04:58:17] Trying to send all finished work units
[04:58:17] + No unsent completed units remaining.
[04:58:17] + Closed connections
[04:58:17] 
[04:58:17] + Processing work unit
[04:58:17] At least 4 processors must be requested.Core required: FahCore_a2.exe
[04:58:17] Core found.
[04:58:17] Working on queue slot 01 [August 18 04:58:17 UTC]
[04:58:17] + Working ...
[04:58:17] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 01 -priority 96 -checkpoint 15 -verbose -lifeline 31119 -version 624'

[04:58:17] 
[04:58:17] *------------------------------*
[04:58:17] Folding@Home Gromacs SMP Core
[04:58:17] Version 2.08 (Mon May 18 14:47:42 PDT 2009)
[04:58:17] 
[04:58:17] Preparing to commence simulation
[04:58:17] - Ensuring status. Please wait.
[04:58:18] Called DecompressByteArray: compressed_data_size=1512362 data_size=23981533, decompressed_data_size=23981533 diff=0
[04:58:18] - Digital signature verified
[04:58:18] 
[04:58:18] Project: 2669 (Run 5, Clone 57, Gen 194)
[04:58:18] 
[04:58:18] Assembly optimizations on if available.
[04:58:18] Entering M.D.
[04:58:28] Run 5, Clone 57, Gen 194)
[04:58:28] 
[04:58:28] Entering M.D.
[04:59:01] Completed 0 out of 250000 steps  (0%)
[06:47:51] Completed 2500 out of 250000 steps  (1%)
[08:36:53] Completed 5000 out of 250000 steps  (2%)
[09:49:38] - Autosending finished units... [August 18 09:49:38 UTC]
[09:49:38] Trying to send all finished work units
[09:49:38] + No unsent completed units remaining.
[09:49:38] - Autosend completed
[10:25:55] Completed 7500 out of 250000 steps  (3%)
[12:14:58] Completed 10000 out of 250000 steps  (4%)
BrokenWolf
Posts: 126
Joined: Sat Aug 02, 2008 3:08 am

Re: List of SMP WUs with the "1 core usage" issue

Post by BrokenWolf »

I received P2677 R19/C51/G34 on a client that has the 2.07 A2 core, and it barfed right after it wrote "Entering M.D." to FAHlog.txt. I had to Ctrl-C to stop the client, but it was not processing anything. To me it looks like the 2.07 A2 core barfs on these WUs, while the 2.08 A2 core processes them on one CPU/core.

BW


From the log file:

Code: Select all

[19:04:23] - Will indicate memory of 3040 MB
[19:04:23] - Connecting to assignment server
[19:04:23] Connecting to http://assign.stanford.edu:8080/
[19:04:24] Posted data.
[19:04:24] Initial: 40AB; - Successful: assigned to (171.64.65.56).
[19:04:24] + News From Folding@Home: Welcome to Folding@Home
[19:04:24] Loaded queue successfully.
[19:04:24] Connecting to http://171.64.65.56:8080/
[19:04:31] Posted data.
[19:04:31] Initial: 0000; - Receiving payload (expected size: 1503377)
[19:04:32] - Downloaded at ~1468 kB/s
[19:04:32] - Averaged speed for that direction ~1121 kB/s
[19:04:32] + Received work.
[19:04:33] Trying to send all finished work units
[19:04:33] + No unsent completed units remaining.
[19:04:33] + Closed connections
[19:04:33] 
[19:04:33] + Processing work unit
[19:04:33] At least 4 processors must be requested.Core required: FahCore_a2.exe
[19:04:33] Core found.
[19:04:33] Working on queue slot 00 [August 18 19:04:33 UTC]
[19:04:33] + Working ...
[19:04:33] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -priority 96 -checkpoint 15 -verbose -lifeline 6256 -version 624'

[19:04:33] 
[19:04:33] *------------------------------*
[19:04:33] Folding@Home Gromacs SMP Core
[19:04:33] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[19:04:33] 
[19:04:33] Preparing to commence simulation
[19:04:33] - Ensuring status. Please wait.
[19:04:33] Called DecompressByteArray: compressed_data_size=1502865 data_size=24031357, decompressed_data_size=24031357 diff=0
[19:04:34] - Digital signature verified
[19:04:34] 
[19:04:34] Project: 2677 (Run 19, Clone 51, Gen 34)
[19:04:34] 
[19:04:34] Assembly optimizations on if available.
[19:04:34] Entering M.D.
[19:04:43] Run 19, Clone 51, Gen 34)
[19:04:43] 
[19:04:43] Entering M.D.
[19:04:53] 
[19:04:53] Folding@home Core Shutdown: INTERRUPTED
From the screen:

Code: Select all

[19:04:31] Initial: 0000; - Receiving payload (expected size: 1503377)
[19:04:32] - Downloaded at ~1468 kB/s
[19:04:32] - Averaged speed for that direction ~1121 kB/s
[19:04:32] + Received work.
[19:04:33] Trying to send all finished work units
[19:04:33] + No unsent completed units remaining.
[19:04:33] + Closed connections
[19:04:33]
[19:04:33] + Processing work unit
[19:04:33] At least 4 processors must be requested.Core required: FahCore_a2.exe[19:04:33] Core found.
[19:04:33] Working on queue slot 00 [August 18 19:04:33 UTC]
[19:04:33] + Working ...
[19:04:33] - Calling './mpiexec -np 4 -host 127.0.0.1 ./FahCore_a2.exe -dir work/ -suffix 00 -priority 96 -checkpoint 15 -verbose -lifeline 6256 -version 624'

[19:04:33]
[19:04:33] *------------------------------*
[19:04:33] Folding@Home Gromacs SMP Core
[19:04:33] Version 2.07 (Sun Apr 19 14:51:09 PDT 2009)
[19:04:33]
[19:04:33] Preparing to commence simulation
[19:04:33] - Ensuring status. Please wait.
[19:04:33] Called DecompressByteArray: compressed_data_size=1502865 data_size=24031357, decompressed_data_size=24031357 diff=0
[19:04:34] - Digital signature verified
[19:04:34]
[19:04:34] Project: 2677 (Run 19, Clone 51, Gen 34)
[19:04:34]
[19:04:34] Assembly optimizations on if available.
[19:04:34] Entering M.D.
[19:04:43] Run 19, Clone 51, Gen 34)
[19:04:43]
[19:04:43] Entering M.D.
NNODES=4, MYRANK=0, HOSTNAME=rhel4bw1.lab1.com
NNODES=4, MYRANK=1, HOSTNAME=rhel4bw1.lab1.com
NNODES=4, MYRANK=3, HOSTNAME=rhel4bw1.lab1.com
NNODES=4, MYRANK=2, HOSTNAME=rhel4bw1.lab1.com
NODEID=0 argc=20
NODEID=2 argc=20
                         :-)  G  R  O  M  A  C  S  (-:

                   Groningen Machine for Chemical Simulation

                 :-)  VERSION 4.0.99_development_20090307  (-:


      Written by David van der Spoel, Erik Lindahl, Berk Hess, and others.
       Copyright (c) 1991-2000, University of Groningen, The Netherlands.
             Copyright (c) 2001-2008, The GROMACS development team,
            check out http://www.gromacs.org for more information.


                                :-)  mdrun  (-:

Reading file work/wudata_00.tpr, VERSION 3.3.99_development_20070618 (single precision)
NODEID=1 argc=20
NODEID=3 argc=20
Note: tpx file_version 48, software version 64

NOTE: The tpr file used for this simulation is in an old format, for less memory usage and possibly more performance create a new tpr file with an up to date version of grompp

Making 1D domain decomposition 1 x 1 x 4
starting mdrun '22869 system in water'
8750001 steps,  17500.0 ps (continuing from step 8500001,  17000.0 ps).

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483269. It should have been within [ 0 .. 9464 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 0, will try to stop all the nodes

-------------------------------------------------------
Program mdrun, VERSION 4.0.99_development_20090307
Source code file: nsgrid.c, line: 357

Range checking error:
Explanation: During neighborsearching, we assign each particle to a grid
based on its coordinates. If your system contains collisions or parameter
errors that give particles very high velocities you might end up with some
coordinates being +-Infinity or NaN (not-a-number). Obviously, we cannot
put these on a grid, so this is usually where we detect those errors.
Make sure your system is properly energy-minimized and that the potential
energy seems reasonable before trying again.

Variable ci has value -2147483611. It should have been within [ 0 .. 256 ]

For more information and tips for trouble shooting please check the GROMACS Wiki at
http://wiki.gromacs.org/index.php/Errors
-------------------------------------------------------

Thanx for Using GROMACS - Have a Nice Day

Error on node 3, will try to stop all the nodes
Halting parallel program mdrun on CPU 3 out of 4

gcq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
Halting parallel program mdrun on CPU 0 out of 4

g[19:04:53]
[19:04:53] Folding@home Core Shutdown: INTERRUPTED
cq#0: Thanx for Using GROMACS - Have a Nice Day

[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

parkut
Posts: 363
Joined: Tue Feb 12, 2008 7:33 am
Hardware configuration: Running exclusively Linux headless blades. All are dedicated crunching machines.
Location: SE Michigan, USA

Re: List of SMP WUs with the "1 core usage" issue

Post by parkut »

Another one: Project: 2677 (Run 8, Clone 40, Gen 34), compressed_data_size=1496912.

Code: Select all

conroe8.parkut.com
 08:36:01 up 10 days,  5:38,  0 users,  load average: 1.02, 1.01, 1.00