SUSE VMs on Xen

FAH provides a V7 client installer for Debian / Mint / Ubuntu / RedHat / CentOS / Fedora. Installation on other distros may or may not be easy but if you can offer help to others, they would appreciate it.

Moderators: Site Moderators, FAHC Science Team

bigvbguy
Posts: 22
Joined: Sat Dec 29, 2007 8:15 am

SUSE VMs on Xen

Post by bigvbguy »

I'm trying to get F@H 7.4.4 running in a SLE (SUSE Linux Enterprise) VM on top of Xen. When I install the OS in fully virtualized mode, everything works as expected. However, when I install the VM as a paravirtualized guest, the folding client successfully downloads the work unit and begins folding but never makes any progress. Eventually the work unit fails with UNSTABLE_MACHINE. Any suggestions?

# cat /tmp/log-20140718-184025.txt 
*********************** Log Started 2014-07-18T00:23:07Z ***********************
00:23:07:************************* Folding@home Client *************************
00:23:07:    Website: http://folding.stanford.edu/
00:23:07:  Copyright: (c) 2009-2014 Stanford University
00:23:07:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
00:23:07:       Args: --child --lifeline 28478 /etc/fahclient/config.xml --run-as
00:23:07:             fahclient --pid-file=/var/run/fahclient.pid --daemon
00:23:07:     Config: /etc/fahclient/config.xml
00:23:07:******************************** Build ********************************
00:23:07:    Version: 7.4.4
00:23:07:       Date: Mar 4 2014
00:23:07:       Time: 12:01:17
00:23:07:    SVN Rev: 4130
00:23:07:     Branch: fah/trunk/client
00:23:07:   Compiler: GNU 4.1.2 20080704 (Red Hat 4.1.2-46)
00:23:07:    Options: -std=gnu++98 -O3 -funroll-loops -mfpmath=sse -ffast-math
00:23:07:             -fno-unsafe-math-optimizations -msse2
00:23:07:   Platform: linux2 2.6.18-164.11.1.el5
00:23:07:       Bits: 64
00:23:07:       Mode: Release
00:23:07:******************************* System ********************************
00:23:07:        CPU: Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
00:23:07:     CPU ID: GenuineIntel Family 6 Model 62 Stepping 7
00:23:07:       CPUs: 116
00:23:07:     Memory: 8.01GiB
00:23:07:Free Memory: 6.82GiB
00:23:07:    Threads: POSIX_THREADS
00:23:07: OS Version: 3.12
00:23:07:Has Battery: false
00:23:07: On Battery: false
00:23:07: UTC Offset: -6
00:23:07:        PID: 28480
00:23:07:        CWD: /var/lib/fahclient
00:23:07:         OS: Linux 3.12.24-7-xen x86_64
00:23:07:    OS Arch: AMD64
00:23:07:       GPUs: 0
00:23:07:       CUDA: Not detected
00:23:07:***********************************************************************
00:23:07:<config>
00:23:07:  <!-- Folding Core -->
00:23:07:  <checkpoint v='9'/>
00:23:07:
00:23:07:  <!-- Folding Slot Configuration -->
00:23:07:  <client-type v='bigadv'/>
00:23:07:  <gpu v='false'/>
00:23:07:  <max-packet-size v='big'/>
00:23:07:
00:23:07:  <!-- HTTP Server -->
00:23:07:  <allow v='127.0.0.1 137.65.135.16'/>
00:23:07:
00:23:07:  <!-- Network -->
00:23:07:  <proxy v=':8080'/>
00:23:07:
00:23:07:  <!-- Remote Command Server -->
00:23:07:  <command-allow-no-pass v='127.0.0.1 137.65.135.16'/>
00:23:07:
00:23:07:  <!-- Slot Control -->
00:23:07:  <pause-on-battery v='false'/>
00:23:07:  <power v='full'/>
00:23:07:
00:23:07:  <!-- User Information -->
00:23:07:  <passkey v='********************************'/>
00:23:07:  <team v='12841'/>
00:23:07:  <user v='jdouglas'/>
00:23:07:
00:23:07:  <!-- Folding Slots -->
00:23:07:  <slot id='0' type='CPU'/>
00:23:07:</config>
00:23:07:Switching to user fahclient
00:23:07:Trying to access database...
00:23:07:Successfully acquired database lock
00:23:07:Enabled folding slot 00: READY cpu:116
00:23:07:WU00:FS00:Starting
00:23:07:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 00 -suffix 01 -version 704 -lifeline 28480 -checkpoint 9 -np 116
00:23:07:WU00:FS00:Started FahCore on PID 28503
00:23:07:WU00:FS00:Core PID:28510
00:23:07:WU00:FS00:FahCore 0xa5 started
00:23:07:WU00:FS00:0xa5:
00:23:07:WU00:FS00:0xa5:*------------------------------*
00:23:07:WU00:FS00:0xa5:Folding@Home Gromacs SMP Core
00:23:07:WU00:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
00:23:07:WU00:FS00:0xa5:
00:23:07:WU00:FS00:0xa5:Preparing to commence simulation
00:23:07:WU00:FS00:0xa5:- Looking at optimizations...
00:23:07:WU00:FS00:0xa5:- Files status OK
00:23:09:WU00:FS00:0xa5:- Expanded 30307939 -> 33158020 (decompressed 109.4 percent)
00:23:09:WU00:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30307939 data_size=33158020, decompressed_data_size=33158020 diff=0
00:23:09:WU00:FS00:0xa5:- Digital signature verified
00:23:09:WU00:FS00:0xa5:
00:23:09:WU00:FS00:0xa5:Project: 8101 (Run 27, Clone 1, Gen 177)
00:23:09:WU00:FS00:0xa5:
00:23:09:WU00:FS00:0xa5:Assembly optimizations on if available.
00:23:09:WU00:FS00:0xa5:Entering M.D.
00:23:16:WU00:FS00:0xa5:Mapping NT from 116 to 116 
01:27:25:WU00:FS00:0xa5:mdrun returned 255
01:27:25:WU00:FS00:0xa5:Going to send back what have done -- stepsTotalG=250000
01:27:25:WU00:FS00:0xa5:Work fraction=760209211392.0000 steps=250000.
01:27:29:WU00:FS00:0xa5:logfile size=6822 infoLength=6822 edr=25 trr=1
01:27:29:WU00:FS00:0xa5:logfile size: 6822 info=6822 bed=25 hdr=1
01:27:29:WU00:FS00:0xa5:- Writing 7360 bytes of core data to disk...
01:27:29:WU00:FS00:0xa5:Done: 6848 -> 2447 (compressed to 35.7 percent)
01:27:29:WU00:FS00:0xa5:  ... Done.
01:58:10:WU00:FS00:0xa5:
01:58:10:WU00:FS00:0xa5:Folding@home Core Shutdown: UNSTABLE_MACHINE
01:58:10:WARNING:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
01:58:10:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:8101 run:27 clone:1 gen:177 core:0xa5 unit:0x00000115088988e14f9997e5a2be4d8c
01:58:10:WU00:FS00:Uploading 2.89KiB to 128.143.231.201
01:58:10:WU00:FS00:Connecting to 128.143.231.201:8080
01:58:11:WU01:FS00:Connecting to 171.67.108.200:8080
01:58:11:WU00:FS00:Upload complete
01:58:11:WU00:FS00:Server responded WORK_ACK (400)
01:58:11:WU00:FS00:Cleaning up
01:58:11:WU01:FS00:Assigned to work server 128.143.231.201
01:58:11:WU01:FS00:Requesting new work unit for slot 00: READY cpu:116 from 128.143.231.201
01:58:11:WU01:FS00:Connecting to 128.143.231.201:8080
01:58:18:WU01:FS00:Downloading 28.91MiB
01:58:22:WU01:FS00:Download complete
01:58:22:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8101 run:26 clone:2 gen:491 core:0xa5 unit:0x0000028b088988e14f99973fa602c330
01:58:22:WU01:FS00:Starting
01:58:22:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 01 -suffix 01 -version 704 -lifeline 28480 -checkpoint 9 -np 116
01:58:22:WU01:FS00:Started FahCore on PID 35508
01:58:22:WU01:FS00:Core PID:35512
01:58:22:WU01:FS00:FahCore 0xa5 started
01:58:23:WU01:FS00:0xa5:
01:58:23:WU01:FS00:0xa5:*------------------------------*
01:58:23:WU01:FS00:0xa5:Folding@Home Gromacs SMP Core
01:58:23:WU01:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
01:58:23:WU01:FS00:0xa5:
01:58:23:WU01:FS00:0xa5:Preparing to commence simulation
01:58:23:WU01:FS00:0xa5:- Looking at optimizations...
01:58:23:WU01:FS00:0xa5:- Created dyn
01:58:23:WU01:FS00:0xa5:- Files status OK
01:58:25:WU01:FS00:0xa5:- Expanded 30309468 -> 33158020 (decompressed 109.3 percent)
01:58:25:WU01:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30309468 data_size=33158020, decompressed_data_size=33158020 diff=0
01:58:25:WU01:FS00:0xa5:- Digital signature verified
01:58:25:WU01:FS00:0xa5:
01:58:25:WU01:FS00:0xa5:Project: 8101 (Run 26, Clone 2, Gen 491)
01:58:25:WU01:FS00:0xa5:
01:58:25:WU01:FS00:0xa5:Assembly optimizations on if available.
01:58:25:WU01:FS00:0xa5:Entering M.D.
01:58:31:WU01:FS00:0xa5:Mapping NT from 116 to 116 
03:02:42:WU01:FS00:0xa5:mdrun returned 255
03:02:42:WU01:FS00:0xa5:Going to send back what have done -- stepsTotalG=250000
03:02:42:WU01:FS00:0xa5:Work fraction=2108828942336.0000 steps=250000.
03:02:46:WU01:FS00:0xa5:logfile size=6823 infoLength=6823 edr=25 trr=1
03:02:46:WU01:FS00:0xa5:logfile size: 6823 info=6823 bed=25 hdr=1
03:02:46:WU01:FS00:0xa5:- Writing 7361 bytes of core data to disk...
03:02:46:WU01:FS00:0xa5:Done: 6849 -> 2459 (compressed to 35.9 percent)
03:02:46:WU01:FS00:0xa5:  ... Done.
03:33:41:WU01:FS00:0xa5:
03:33:41:WU01:FS00:0xa5:Folding@home Core Shutdown: UNSTABLE_MACHINE
03:33:42:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
03:33:42:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:8101 run:26 clone:2 gen:491 core:0xa5 unit:0x0000028b088988e14f99973fa602c330
03:33:42:WU01:FS00:Uploading 2.90KiB to 128.143.231.201
03:33:42:WU01:FS00:Connecting to 128.143.231.201:8080
03:33:42:WU00:FS00:Connecting to 171.67.108.200:8080
03:33:42:WU01:FS00:Upload complete
03:33:42:WU01:FS00:Server responded WORK_ACK (400)
03:33:42:WU01:FS00:Cleaning up
03:33:42:WU00:FS00:Assigned to work server 128.143.231.201
03:33:42:WU00:FS00:Requesting new work unit for slot 00: READY cpu:116 from 128.143.231.201
03:33:42:WU00:FS00:Connecting to 128.143.231.201:8080
03:33:48:WU00:FS00:Downloading 28.90MiB
03:33:54:WU00:FS00:Download 50.60%
03:33:57:WU00:FS00:Download complete
03:33:57:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:8101 run:8 clone:10 gen:465 core:0xa5 unit:0x000002e6088988e14f296baf6a940599
03:33:57:WU00:FS00:Starting
03:33:57:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 00 -suffix 01 -version 704 -lifeline 28480 -checkpoint 9 -np 116
03:33:57:WU00:FS00:Started FahCore on PID 36279
03:33:57:WU00:FS00:Core PID:36283
03:33:57:WU00:FS00:FahCore 0xa5 started
03:33:58:WU00:FS00:0xa5:
03:33:58:WU00:FS00:0xa5:*------------------------------*
03:33:58:WU00:FS00:0xa5:Folding@Home Gromacs SMP Core
03:33:58:WU00:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
03:33:58:WU00:FS00:0xa5:
03:33:58:WU00:FS00:0xa5:Preparing to commence simulation
03:33:58:WU00:FS00:0xa5:- Looking at optimizations...
03:33:58:WU00:FS00:0xa5:- Created dyn
03:33:58:WU00:FS00:0xa5:- Files status OK
03:33:59:WU00:FS00:0xa5:- Expanded 30305819 -> 33158020 (decompressed 109.4 percent)
03:33:59:WU00:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30305819 data_size=33158020, decompressed_data_size=33158020 diff=0
03:34:00:WU00:FS00:0xa5:- Digital signature verified
03:34:00:WU00:FS00:0xa5:
03:34:00:WU00:FS00:0xa5:Project: 8101 (Run 8, Clone 10, Gen 465)
03:34:00:WU00:FS00:0xa5:
03:34:00:WU00:FS00:0xa5:Assembly optimizations on if available.
03:34:00:WU00:FS00:0xa5:Entering M.D.
03:34:06:WU00:FS00:0xa5:Mapping NT from 116 to 116 
04:37:57:WU00:FS00:0xa5:mdrun returned 255
04:37:57:WU00:FS00:0xa5:Going to send back what have done -- stepsTotalG=250000
04:37:57:WU00:FS00:0xa5:Work fraction=1997159792640.0000 steps=250000.
04:38:01:WU00:FS00:0xa5:logfile size=6823 infoLength=6823 edr=25 trr=1
04:38:01:WU00:FS00:0xa5:logfile size: 6823 info=6823 bed=25 hdr=1
04:38:01:WU00:FS00:0xa5:- Writing 7361 bytes of core data to disk...
04:38:01:WU00:FS00:0xa5:Done: 6849 -> 2448 (compressed to 35.7 percent)
04:38:01:WU00:FS00:0xa5:  ... Done.
05:11:07:WU00:FS00:0xa5:
05:11:07:WU00:FS00:0xa5:Folding@home Core Shutdown: UNSTABLE_MACHINE
05:11:07:WARNING:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
05:11:07:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:8101 run:8 clone:10 gen:465 core:0xa5 unit:0x000002e6088988e14f296baf6a940599
05:11:07:WU00:FS00:Uploading 2.89KiB to 128.143.231.201
05:11:07:WU00:FS00:Connecting to 128.143.231.201:8080
05:11:07:WU00:FS00:Upload complete
05:11:07:WU00:FS00:Server responded WORK_ACK (400)
05:11:08:WU00:FS00:Cleaning up
05:11:08:WU01:FS00:Connecting to 171.67.108.200:8080
05:11:08:WU01:FS00:Assigned to work server 128.143.231.201
05:11:08:WU01:FS00:Requesting new work unit for slot 00: READY cpu:116 from 128.143.231.201
05:11:08:WU01:FS00:Connecting to 128.143.231.201:8080
05:11:14:WU01:FS00:Downloading 28.91MiB
05:11:20:WU01:FS00:Download 49.72%
05:11:24:WU01:FS00:Download complete
05:11:24:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8101 run:25 clone:2 gen:304 core:0xa5 unit:0x000001af088988e14f9996956b74f85a
05:11:24:WU01:FS00:Starting
05:11:24:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 01 -suffix 01 -version 704 -lifeline 28480 -checkpoint 9 -np 116
05:11:24:WU01:FS00:Started FahCore on PID 37177
05:11:24:WU01:FS00:Core PID:37181
05:11:24:WU01:FS00:FahCore 0xa5 started
05:11:25:WU01:FS00:0xa5:
05:11:25:WU01:FS00:0xa5:*------------------------------*
05:11:25:WU01:FS00:0xa5:Folding@Home Gromacs SMP Core
05:11:25:WU01:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
05:11:25:WU01:FS00:0xa5:
05:11:25:WU01:FS00:0xa5:Preparing to commence simulation
05:11:25:WU01:FS00:0xa5:- Looking at optimizations...
05:11:25:WU01:FS00:0xa5:- Created dyn
05:11:25:WU01:FS00:0xa5:- Files status OK
05:11:26:WU01:FS00:0xa5:- Expanded 30317159 -> 33158020 (decompressed 109.3 percent)
05:11:26:WU01:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30317159 data_size=33158020, decompressed_data_size=33158020 diff=0
05:11:27:WU01:FS00:0xa5:- Digital signature verified
05:11:27:WU01:FS00:0xa5:
05:11:27:WU01:FS00:0xa5:Project: 8101 (Run 25, Clone 2, Gen 304)
05:11:27:WU01:FS00:0xa5:
05:11:27:WU01:FS00:0xa5:Assembly optimizations on if available.
05:11:27:WU01:FS00:0xa5:Entering M.D.
05:11:33:WU01:FS00:0xa5:Mapping NT from 116 to 116 
06:15:48:WU01:FS00:0xa5:mdrun returned 255
06:15:48:WU01:FS00:0xa5:Going to send back what have done -- stepsTotalG=250000
06:15:48:WU01:FS00:0xa5:Work fraction=1305670057984.0000 steps=250000.
06:15:51:WU01:FS00:0xa5:logfile size=6822 infoLength=6822 edr=25 trr=1
06:15:51:WU01:FS00:0xa5:logfile size: 6822 info=6822 bed=25 hdr=1
06:15:52:WU01:FS00:0xa5:- Writing 7360 bytes of core data to disk...
06:15:52:WU01:FS00:0xa5:Done: 6848 -> 2456 (compressed to 35.8 percent)
06:15:52:WU01:FS00:0xa5:  ... Done.
******************************* Date: 2014-07-18 *******************************
06:47:12:WU01:FS00:0xa5:
06:47:12:WU01:FS00:0xa5:Folding@home Core Shutdown: UNSTABLE_MACHINE
06:47:13:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
06:47:13:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:8101 run:25 clone:2 gen:304 core:0xa5 unit:0x000001af088988e14f9996956b74f85a
06:47:13:WU01:FS00:Uploading 2.90KiB to 128.143.231.201
06:47:13:WU01:FS00:Connecting to 128.143.231.201:8080
06:47:13:WU00:FS00:Connecting to 171.67.108.200:8080
06:47:13:WU01:FS00:Upload complete
06:47:13:WU01:FS00:Server responded WORK_ACK (400)
06:47:13:WU01:FS00:Cleaning up
06:47:14:WU00:FS00:Assigned to work server 128.143.231.201
06:47:14:WU00:FS00:Requesting new work unit for slot 00: READY cpu:116 from 128.143.231.201
06:47:14:WU00:FS00:Connecting to 128.143.231.201:8080
06:47:21:WU00:FS00:Downloading 28.94MiB
06:47:26:WU00:FS00:Download complete
06:47:26:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:8103 run:0 clone:17 gen:419 core:0xa5 unit:0x00000271088988e1511d1e5f01bcd840
06:47:26:WU00:FS00:Starting
06:47:26:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 00 -suffix 01 -version 704 -lifeline 28480 -checkpoint 9 -np 116
06:47:26:WU00:FS00:Started FahCore on PID 37949
06:47:26:WU00:FS00:Core PID:37953
06:47:26:WU00:FS00:FahCore 0xa5 started
06:47:26:WU00:FS00:0xa5:
06:47:26:WU00:FS00:0xa5:*------------------------------*
06:47:26:WU00:FS00:0xa5:Folding@Home Gromacs SMP Core
06:47:26:WU00:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
06:47:26:WU00:FS00:0xa5:
06:47:26:WU00:FS00:0xa5:Preparing to commence simulation
06:47:26:WU00:FS00:0xa5:- Looking at optimizations...
06:47:26:WU00:FS00:0xa5:- Created dyn
06:47:26:WU00:FS00:0xa5:- Files status OK
06:47:28:WU00:FS00:0xa5:- Expanded 30350220 -> 33163648 (decompressed 109.2 percent)
06:47:28:WU00:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30350220 data_size=33163648, decompressed_data_size=33163648 diff=0
06:47:28:WU00:FS00:0xa5:- Digital signature verified
06:47:28:WU00:FS00:0xa5:
06:47:28:WU00:FS00:0xa5:Project: 8103 (Run 0, Clone 17, Gen 419)
06:47:28:WU00:FS00:0xa5:
06:47:28:WU00:FS00:0xa5:Assembly optimizations on if available.
06:47:28:WU00:FS00:0xa5:Entering M.D.
06:47:35:WU00:FS00:0xa5:Mapping NT from 116 to 116 
07:21:43:WU00:FS00:0xa5:mdrun returned 255
07:21:43:WU00:FS00:0xa5:Going to send back what have done -- stepsTotalG=250000
07:21:43:WU00:FS00:0xa5:Work fraction=1799591297024.0000 steps=250000.
07:21:47:WU00:FS00:0xa5:logfile size=6823 infoLength=6823 edr=25 trr=1
07:21:47:WU00:FS00:0xa5:logfile size: 6823 info=6823 bed=25 hdr=1
07:21:47:WU00:FS00:0xa5:- Writing 7361 bytes of core data to disk...
07:21:47:WU00:FS00:0xa5:Done: 6849 -> 2453 (compressed to 35.8 percent)
07:21:47:WU00:FS00:0xa5:  ... Done.
07:52:25:WU00:FS00:0xa5:
07:52:25:WU00:FS00:0xa5:Folding@home Core Shutdown: UNSTABLE_MACHINE
07:52:25:WARNING:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
07:52:25:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:8103 run:0 clone:17 gen:419 core:0xa5 unit:0x00000271088988e1511d1e5f01bcd840
07:52:25:WU00:FS00:Uploading 2.90KiB to 128.143.231.201
07:52:25:WU00:FS00:Connecting to 128.143.231.201:8080
07:52:26:WU00:FS00:Upload complete
07:52:26:WU00:FS00:Server responded WORK_ACK (400)
07:52:26:WU00:FS00:Cleaning up
07:52:26:WU01:FS00:Connecting to 171.67.108.200:8080
07:52:26:WU01:FS00:Assigned to work server 128.143.231.201
07:52:26:WU01:FS00:Requesting new work unit for slot 00: READY cpu:116 from 128.143.231.201
07:52:26:WU01:FS00:Connecting to 128.143.231.201:8080
07:52:32:WU01:FS00:Downloading 28.91MiB
07:52:38:WU01:FS00:Download 95.99%
07:52:38:WU01:FS00:Download complete
07:52:38:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8101 run:25 clone:3 gen:483 core:0xa5 unit:0x000002b2088988e14f99969a8a167674
07:52:38:WU01:FS00:Starting
07:52:38:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 01 -suffix 01 -version 704 -lifeline 28480 -checkpoint 9 -np 116
07:52:38:WU01:FS00:Started FahCore on PID 38495
07:52:38:WU01:FS00:Core PID:38499
07:52:38:WU01:FS00:FahCore 0xa5 started
07:52:38:WU01:FS00:0xa5:
07:52:38:WU01:FS00:0xa5:*------------------------------*
07:52:38:WU01:FS00:0xa5:Folding@Home Gromacs SMP Core
07:52:38:WU01:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
07:52:38:WU01:FS00:0xa5:
07:52:38:WU01:FS00:0xa5:Preparing to commence simulation
07:52:38:WU01:FS00:0xa5:- Looking at optimizations...
07:52:38:WU01:FS00:0xa5:- Created dyn
07:52:38:WU01:FS00:0xa5:- Files status OK
07:52:40:WU01:FS00:0xa5:- Expanded 30312114 -> 33158020 (decompressed 109.3 percent)
07:52:40:WU01:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30312114 data_size=33158020, decompressed_data_size=33158020 diff=0
07:52:40:WU01:FS00:0xa5:- Digital signature verified
07:52:40:WU01:FS00:0xa5:
07:52:40:WU01:FS00:0xa5:Project: 8101 (Run 25, Clone 3, Gen 483)
07:52:40:WU01:FS00:0xa5:
07:52:40:WU01:FS00:0xa5:Assembly optimizations on if available.
07:52:40:WU01:FS00:0xa5:Entering M.D.
07:52:47:WU01:FS00:0xa5:Mapping NT from 116 to 116 
08:56:56:WU01:FS00:0xa5:mdrun returned 255
08:56:56:WU01:FS00:0xa5:Going to send back what have done -- stepsTotalG=250000
08:56:56:WU01:FS00:0xa5:Work fraction=2074469203968.0000 steps=250000.
08:57:00:WU01:FS00:0xa5:logfile size=6823 infoLength=6823 edr=25 trr=1
08:57:00:WU01:FS00:0xa5:logfile size: 6823 info=6823 bed=25 hdr=1
08:57:00:WU01:FS00:0xa5:- Writing 7361 bytes of core data to disk...
08:57:00:WU01:FS00:0xa5:Done: 6849 -> 2451 (compressed to 35.7 percent)
08:57:00:WU01:FS00:0xa5:  ... Done.
09:29:18:WU01:FS00:0xa5:
09:29:18:WU01:FS00:0xa5:Folding@home Core Shutdown: UNSTABLE_MACHINE
09:29:18:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
09:29:18:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:8101 run:25 clone:3 gen:483 core:0xa5 unit:0x000002b2088988e14f99969a8a167674
09:29:18:WU01:FS00:Uploading 2.89KiB to 128.143.231.201
09:29:18:WU01:FS00:Connecting to 128.143.231.201:8080
09:29:18:WU00:FS00:Connecting to 171.67.108.200:8080
09:29:18:WU01:FS00:Upload complete
09:29:18:WU01:FS00:Server responded WORK_ACK (400)
09:29:18:WU01:FS00:Cleaning up
09:29:19:WU00:FS00:Assigned to work server 128.143.231.201
09:29:19:WU00:FS00:Requesting new work unit for slot 00: READY cpu:116 from 128.143.231.201
09:29:19:WU00:FS00:Connecting to 128.143.231.201:8080
09:29:25:WU00:FS00:Downloading 28.90MiB
09:29:30:WU00:FS00:Download complete
09:29:30:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:8101 run:1 clone:7 gen:293 core:0xa5 unit:0x00000195088988e14f0e22023cf83a43
09:29:30:WU00:FS00:Starting
09:29:30:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 00 -suffix 01 -version 704 -lifeline 28480 -checkpoint 9 -np 116
09:29:30:WU00:FS00:Started FahCore on PID 39244
09:29:30:WU00:FS00:Core PID:39248
09:29:30:WU00:FS00:FahCore 0xa5 started
09:29:31:WU00:FS00:0xa5:
09:29:31:WU00:FS00:0xa5:*------------------------------*
09:29:31:WU00:FS00:0xa5:Folding@Home Gromacs SMP Core
09:29:31:WU00:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
09:29:31:WU00:FS00:0xa5:
09:29:31:WU00:FS00:0xa5:Preparing to commence simulation
09:29:31:WU00:FS00:0xa5:- Looking at optimizations...
09:29:31:WU00:FS00:0xa5:- Created dyn
09:29:31:WU00:FS00:0xa5:- Files status OK
09:29:32:WU00:FS00:0xa5:- Expanded 30304783 -> 33158020 (decompressed 109.4 percent)
09:29:32:WU00:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30304783 data_size=33158020, decompressed_data_size=33158020 diff=0
09:29:33:WU00:FS00:0xa5:- Digital signature verified
09:29:33:WU00:FS00:0xa5:
09:29:33:WU00:FS00:0xa5:Project: 8101 (Run 1, Clone 7, Gen 293)
09:29:33:WU00:FS00:0xa5:
09:29:33:WU00:FS00:0xa5:Assembly optimizations on if available.
09:29:33:WU00:FS00:0xa5:Entering M.D.
09:29:39:WU00:FS00:0xa5:Mapping NT from 116 to 116 
10:33:49:WU00:FS00:0xa5:mdrun returned 255
10:33:49:WU00:FS00:0xa5:Going to send back what have done -- stepsTotalG=250000
10:33:49:WU00:FS00:0xa5:Work fraction=1258425417728.0000 steps=250000.
10:33:53:WU00:FS00:0xa5:logfile size=6822 infoLength=6822 edr=25 trr=1
10:33:53:WU00:FS00:0xa5:logfile size: 6822 info=6822 bed=25 hdr=1
10:33:53:WU00:FS00:0xa5:- Writing 7360 bytes of core data to disk...
10:33:53:WU00:FS00:0xa5:Done: 6848 -> 2450 (compressed to 35.7 percent)
10:33:53:WU00:FS00:0xa5:  ... Done.
11:04:51:WU00:FS00:0xa5:
11:04:51:WU00:FS00:0xa5:Folding@home Core Shutdown: UNSTABLE_MACHINE
11:04:51:WARNING:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
11:04:51:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:8101 run:1 clone:7 gen:293 core:0xa5 unit:0x00000195088988e14f0e22023cf83a43
11:04:51:WU00:FS00:Uploading 2.89KiB to 128.143.231.201
11:04:51:WU00:FS00:Connecting to 128.143.231.201:8080
11:04:52:WU00:FS00:Upload complete
11:04:52:WU00:FS00:Server responded WORK_ACK (400)
11:04:52:WU00:FS00:Cleaning up
11:04:52:WU01:FS00:Connecting to 171.67.108.200:8080
11:04:52:WU01:FS00:Assigned to work server 128.143.231.201
11:04:52:WU01:FS00:Requesting new work unit for slot 00: READY cpu:116 from 128.143.231.201
11:04:52:WU01:FS00:Connecting to 128.143.231.201:8080
11:04:58:WU01:FS00:Downloading 28.91MiB
11:05:02:WU01:FS00:Download complete
11:05:02:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8101 run:3 clone:3 gen:498 core:0xa5 unit:0x00000358088988e14f0e233bb91642a9
11:05:02:WU01:FS00:Starting
11:05:02:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 01 -suffix 01 -version 704 -lifeline 28480 -checkpoint 9 -np 116
11:05:02:WU01:FS00:Started FahCore on PID 40005
11:05:03:WU01:FS00:Core PID:40009
11:05:03:WU01:FS00:FahCore 0xa5 started
11:05:03:WU01:FS00:0xa5:
11:05:03:WU01:FS00:0xa5:*------------------------------*
11:05:03:WU01:FS00:0xa5:Folding@Home Gromacs SMP Core
11:05:03:WU01:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
11:05:03:WU01:FS00:0xa5:
11:05:03:WU01:FS00:0xa5:Preparing to commence simulation
11:05:03:WU01:FS00:0xa5:- Looking at optimizations...
11:05:03:WU01:FS00:0xa5:- Created dyn
11:05:03:WU01:FS00:0xa5:- Files status OK
11:05:05:WU01:FS00:0xa5:- Expanded 30313352 -> 33158020 (decompressed 109.3 percent)
11:05:05:WU01:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30313352 data_size=33158020, decompressed_data_size=33158020 diff=0
11:05:05:WU01:FS00:0xa5:- Digital signature verified
11:05:05:WU01:FS00:0xa5:
11:05:05:WU01:FS00:0xa5:Project: 8101 (Run 3, Clone 3, Gen 498)
11:05:05:WU01:FS00:0xa5:
11:05:05:WU01:FS00:0xa5:Assembly optimizations on if available.
11:05:05:WU01:FS00:0xa5:Entering M.D.
11:05:11:WU01:FS00:0xa5:Mapping NT from 116 to 116 
12:08:51:WU01:FS00:0xa5:mdrun returned 255
12:08:51:WU01:FS00:0xa5:Going to send back what have done -- stepsTotalG=250000
12:08:51:WU01:FS00:0xa5:Work fraction=2138893713408.0000 steps=250000.
12:08:55:WU01:FS00:0xa5:logfile size=6823 infoLength=6823 edr=25 trr=1
12:08:55:WU01:FS00:0xa5:logfile size: 6823 info=6823 bed=25 hdr=1
12:08:55:WU01:FS00:0xa5:- Writing 7361 bytes of core data to disk...
12:08:55:WU01:FS00:0xa5:Done: 6849 -> 2452 (compressed to 35.8 percent)
12:08:55:WU01:FS00:0xa5:  ... Done.
12:40:36:WU01:FS00:0xa5:
12:40:36:WU01:FS00:0xa5:Folding@home Core Shutdown: UNSTABLE_MACHINE
12:40:36:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
12:40:36:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:8101 run:3 clone:3 gen:498 core:0xa5 unit:0x00000358088988e14f0e233bb91642a9
12:40:36:WU01:FS00:Uploading 2.89KiB to 128.143.231.201
12:40:36:WU01:FS00:Connecting to 128.143.231.201:8080
12:40:37:WU00:FS00:Connecting to 171.67.108.200:8080
12:40:37:WU01:FS00:Upload complete
12:40:37:WU01:FS00:Server responded WORK_ACK (400)
12:40:37:WU01:FS00:Cleaning up
12:40:37:WU00:FS00:Assigned to work server 128.143.231.201
12:40:37:WU00:FS00:Requesting new work unit for slot 00: READY cpu:116 from 128.143.231.201
12:40:37:WU00:FS00:Connecting to 128.143.231.201:8080
12:40:46:WU00:FS00:Downloading 28.90MiB
12:40:50:WU00:FS00:Download complete
12:40:50:WU00:FS00:Received Unit: id:00 state:DOWNLOAD error:NO_ERROR project:8105 run:0 clone:72 gen:449 core:0xa5 unit:0x00000232088988e15164334474066162
12:40:50:WU00:FS00:Starting
12:40:50:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 00 -suffix 01 -version 704 -lifeline 28480 -checkpoint 9 -np 116
12:40:50:WU00:FS00:Started FahCore on PID 40914
12:40:50:WU00:FS00:Core PID:40918
12:40:50:WU00:FS00:FahCore 0xa5 started
12:40:50:WU00:FS00:0xa5:
12:40:50:WU00:FS00:0xa5:*------------------------------*
12:40:50:WU00:FS00:0xa5:Folding@Home Gromacs SMP Core
12:40:50:WU00:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
12:40:50:WU00:FS00:0xa5:
12:40:50:WU00:FS00:0xa5:Preparing to commence simulation
12:40:50:WU00:FS00:0xa5:- Looking at optimizations...
12:40:50:WU00:FS00:0xa5:- Created dyn
12:40:50:WU00:FS00:0xa5:- Files status OK
12:40:52:WU00:FS00:0xa5:- Expanded 30303453 -> 33130012 (decompressed 109.3 percent)
12:40:52:WU00:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30303453 data_size=33130012, decompressed_data_size=33130012 diff=0
12:40:52:WU00:FS00:0xa5:- Digital signature verified
12:40:52:WU00:FS00:0xa5:
12:40:52:WU00:FS00:0xa5:Project: 8105 (Run 0, Clone 72, Gen 449)
12:40:52:WU00:FS00:0xa5:
12:40:52:WU00:FS00:0xa5:Assembly optimizations on if available.
12:40:52:WU00:FS00:0xa5:Entering M.D.
12:40:59:WU00:FS00:0xa5:Mapping NT from 116 to 116 
******************************* Date: 2014-07-18 *******************************
13:14:35:WU00:FS00:0xa5:mdrun returned 255
13:14:35:WU00:FS00:0xa5:Going to send back what have done -- stepsTotalG=250000
13:14:35:WU00:FS00:0xa5:Work fraction=1928440315904.0000 steps=250000.
13:14:39:WU00:FS00:0xa5:logfile size=6823 infoLength=6823 edr=25 trr=1
13:14:39:WU00:FS00:0xa5:logfile size: 6823 info=6823 bed=25 hdr=1
13:14:39:WU00:FS00:0xa5:- Writing 7361 bytes of core data to disk...
13:14:40:WU00:FS00:0xa5:Done: 6849 -> 2446 (compressed to 35.7 percent)
13:14:40:WU00:FS00:0xa5:  ... Done.
13:44:33:WU00:FS00:0xa5:
13:44:33:WU00:FS00:0xa5:Folding@home Core Shutdown: UNSTABLE_MACHINE
13:44:33:WARNING:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
13:44:33:WU00:FS00:Sending unit results: id:00 state:SEND error:FAULTY project:8105 run:0 clone:72 gen:449 core:0xa5 unit:0x00000232088988e15164334474066162
13:44:33:WU00:FS00:Uploading 2.89KiB to 128.143.231.201
13:44:33:WU00:FS00:Connecting to 128.143.231.201:8080
13:44:34:WU01:FS00:Connecting to 171.67.108.200:8080
13:44:34:WU00:FS00:Upload complete
13:44:34:WU00:FS00:Server responded WORK_ACK (400)
13:44:34:WU00:FS00:Cleaning up
13:44:34:WU01:FS00:Assigned to work server 128.143.231.201
13:44:34:WU01:FS00:Requesting new work unit for slot 00: READY cpu:116 from 128.143.231.201
13:44:34:WU01:FS00:Connecting to 128.143.231.201:8080
13:44:41:WU01:FS00:Downloading 28.91MiB
13:44:44:WU01:FS00:Download complete
13:44:44:WU01:FS00:Received Unit: id:01 state:DOWNLOAD error:NO_ERROR project:8101 run:8 clone:6 gen:390 core:0xa5 unit:0x00000223088988e14f296b98661443f1
13:44:44:WU01:FS00:Starting
13:44:44:WU01:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a5.fah/FahCore_a5 -dir 01 -suffix 01 -version 704 -lifeline 28480 -checkpoint 9 -np 116
13:44:44:WU01:FS00:Started FahCore on PID 41462
13:44:44:WU01:FS00:Core PID:41466
13:44:44:WU01:FS00:FahCore 0xa5 started
13:44:44:WU01:FS00:0xa5:
13:44:44:WU01:FS00:0xa5:*------------------------------*
13:44:44:WU01:FS00:0xa5:Folding@Home Gromacs SMP Core
13:44:44:WU01:FS00:0xa5:Version 2.27 (Thu Feb 10 09:46:40 PST 2011)
13:44:44:WU01:FS00:0xa5:
13:44:44:WU01:FS00:0xa5:Preparing to commence simulation
13:44:44:WU01:FS00:0xa5:- Looking at optimizations...
13:44:44:WU01:FS00:0xa5:- Created dyn
13:44:44:WU01:FS00:0xa5:- Files status OK
13:44:46:WU01:FS00:0xa5:- Expanded 30315714 -> 33158020 (decompressed 109.3 percent)
13:44:46:WU01:FS00:0xa5:Called DecompressByteArray: compressed_data_size=30315714 data_size=33158020, decompressed_data_size=33158020 diff=0
13:44:46:WU01:FS00:0xa5:- Digital signature verified
13:44:46:WU01:FS00:0xa5:
13:44:46:WU01:FS00:0xa5:Project: 8101 (Run 8, Clone 6, Gen 390)
13:44:46:WU01:FS00:0xa5:
13:44:46:WU01:FS00:0xa5:Assembly optimizations on if available.
13:44:46:WU01:FS00:0xa5:Entering M.D.
13:44:53:WU01:FS00:0xa5:Mapping NT from 116 to 116 
14:48:18:WU01:FS00:0xa5:mdrun returned 255
14:48:18:WU01:FS00:0xa5:Going to send back what have done -- stepsTotalG=250000
14:48:18:WU01:FS00:0xa5:Work fraction=1675037245440.0000 steps=250000.
14:48:23:WU01:FS00:0xa5:logfile size=6822 infoLength=6822 edr=25 trr=1
14:48:23:WU01:FS00:0xa5:logfile size: 6822 info=6822 bed=25 hdr=1
14:48:23:WU01:FS00:0xa5:- Writing 7360 bytes of core data to disk...
14:48:23:WU01:FS00:0xa5:Done: 6848 -> 2454 (compressed to 35.8 percent)
14:48:23:WU01:FS00:0xa5:  ... Done.
15:18:24:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
15:18:24:WU01:FS00:Sending unit results: id:01 state:SEND error:FAULTY project:8101 run:8 clone:6 gen:390 core:0xa5 unit:0x00000223088988e14f296b98661443f1
15:18:24:WU01:FS00:Uploading 2.90KiB to 128.143.231.201
15:18:24:WU01:FS00:Connecting to 128.143.231.201:8080
15:18:25:WU01:FS00:Upload complete
15:18:25:WU01:FS00:Server responded WORK_ACK (400)
15:18:25:WU01:FS00:Cleaning up
By the way, the same thing happens with fewer CPUs assigned as well.
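For anyone reading the log above, the timing of the failure can be pulled out with a small script. This is only a minimal Python sketch and not part of FAHClient: the function names (`parse_events`, `seconds_between`, `fahcore_result`) are made up for illustration, and it assumes the `HH:MM:SS:` prefix seen in this log and that the excerpt does not cross midnight. The four sample lines are copied from the log.

```python
import re
from datetime import datetime

# Four lines copied from the log in this post.
SAMPLE_LOG = """\
13:44:46:WU01:FS00:0xa5:Entering M.D.
13:44:53:WU01:FS00:0xa5:Mapping NT from 116 to 116
14:48:18:WU01:FS00:0xa5:mdrun returned 255
15:18:24:WARNING:WU01:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
"""

# Assumed line layout: "HH:MM:SS:" followed by the rest of the message.
STAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}):(.*)$")
RESULT = re.compile(r"FahCore returned: (\w+) \((\d+) = 0x[0-9a-fA-F]+\)")


def parse_events(log_text):
    """Return a list of (time, message) pairs for every timestamped line."""
    events = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%H:%M:%S")
            events.append((when, match.group(2)))
    return events


def seconds_between(events, start_text, end_text):
    """Seconds from the first message containing start_text to the first containing end_text."""
    start = next(when for when, msg in events if start_text in msg)
    end = next(when for when, msg in events if end_text in msg)
    return int((end - start).total_seconds())


def fahcore_result(events):
    """Return (name, code) from the 'FahCore returned' line, or None if absent."""
    for _, msg in events:
        match = RESULT.search(msg)
        if match:
            return match.group(1), int(match.group(2))
    return None


events = parse_events(SAMPLE_LOG)
print(seconds_between(events, "Entering M.D.", "mdrun returned"))
print(fahcore_result(events))
```

On this excerpt the gap between "Entering M.D." and "mdrun returned 255" comes out at 3812 seconds (about 63 minutes), and the final result is UNSTABLE_MACHINE (122).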
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4, 4x512MB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for approx. 70K PPD
Location: Salem, OR USA

Re: SUSE VMs on Xen

Post by P5-133XL »

Interesting question that I have no answer for.
bigvbguy
Posts: 22
Joined: Sat Dec 29, 2007 8:15 am

Re: SUSE VMs on Xen

Post by bigvbguy »

One more thing -- I'm suspicious that this is somehow network related, since the paravirtualized network often behaves a bit differently from the fully virtualized (FV) one. Is there any way to debug this to find out if network binding is the issue? Maybe a higher log level?
bigvbguy
Posts: 22
Joined: Sat Dec 29, 2007 8:15 am

Re: SUSE VMs on Xen

Post by bigvbguy »

Well, in posting this, I thought of a couple of other ideas to try, and one of them worked. I switched the VM to a static IP address configuration (instead of DHCP), and it seems to be folding correctly now. I'm still not sure what caused the problem, but at least I have a workaround. I'd welcome feedback if anyone has an idea why a static IP address would make a difference.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: SUSE VMs on Xen

Post by bruce »

You have 116 CPUs. Are those virtualized CPUs that are mapped onto a smaller number of real CPUs? If so, that's your problem. There is no value whatsoever to running FAH on any number of CPUs which cannot be dedicated to the folding process. Performance degrades first, and then the virtual machine thrashes, trying to complete all 116 threads simultaneously, leading to a hopeless deadlock. [This may not be your problem, but I have to ask.]

How much RAM is available to FAH? Again, each of those 116 threads needs a segment of real RAM. How much of your 8 GB of virtual RAM maps onto real RAM that's available on a semi-dedicated basis?

I would try using a smaller number of CPUs and find out if it works. Start with maybe 12 and then gradually increase the number until you find a point where problems start happening.

Anything is possible, of course, but I wouldn't count on networking being the problem. In the portion of log that you posted, FAHClient successfully uploaded and downloaded WUs. It then failed after it began to process the WU. At that point, it didn't need anything further from the internet until after the error.
bigvbguy
Posts: 22
Joined: Sat Dec 29, 2007 8:15 am

Re: SUSE VMs on Xen

Post by bigvbguy »

bruce wrote:You have 116 CPUs. Are those virtualized CPUs that are mapped onto a smaller number of real CPUs? If so, that's your problem. There is no value whatsoever to running FAH on any number of CPUs which cannot be dedicated to the folding process. Performance degrades first, and then the virtual machine thrashes, trying to complete all 116 threads simultaneously, leading to a hopeless deadlock. [This may not be your problem, but I have to ask.]
No, the physical machine has 144 CPUs, and I have this VM pinned to 116 of them.
bruce wrote:How much RAM is available to FAH? Again, each of those 116 threads needs a segment of real RAM. How much of your 8 GB of virtual RAM maps onto real RAM that's available on a semi-dedicated basis?
The VM has 8 GB of dedicated physical RAM (no memory overcommit in this Xen configuration), but I had given it as much as 64 GB before, with no change.
bruce wrote:I would try using a smaller number of CPUs and find out if it works. Start with maybe 12 and then gradually increase the number until you find a point where problems start happening.

Anything is possible, of course, but I wouldn't count on networking being the problem. In the portion of log that you posted, FAHClient successfully uploaded and downloaded WUs. It then failed after it began to process the WU. At that point, it didn't need anything further from the internet until after the error.
As I mentioned, I had tried as few as 8 vCPUs with no change in behavior. It wasn't until I changed the network to a static configuration that it started working (even with 116 vCPUs, although I had to spin up a different VM on that machine, so it's now only running with 96).
bigvbguy
Posts: 22
Joined: Sat Dec 29, 2007 8:15 am

Re: SUSE VMs on Xen

Post by bigvbguy »

bruce, how much RAM would you recommend per core?
bollix47
Posts: 2957
Joined: Sun Dec 02, 2007 5:04 am
Location: Canada

Re: SUSE VMs on Xen

Post by bollix47 »

FYI

My 64 core system running Ubuntu 14.04 natively uses ~62.5MB per core when folding bigadv for a total of ~4GB.
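As a rough check of that arithmetic, here is a minimal Python sketch. It takes the ~62.5 MB per core figure from the post above and simply scales it linearly with thread count; that linear scaling is an assumption, not something measured on the Xen VM, and `estimated_ram_mb` is a hypothetical helper name.

```python
# ~62.5 MB per core is the figure reported above for bigadv on a native 64-core system.
MB_PER_CORE = 62.5


def estimated_ram_mb(threads, mb_per_thread=MB_PER_CORE):
    """Estimate RAM in MB, assuming usage scales linearly with thread count."""
    return threads * mb_per_thread


# 64 is the reported system; 96 and 116 are the vCPU counts mentioned in this thread.
for threads in (64, 96, 116):
    print(threads, "threads:", estimated_ram_mb(threads), "MB")
```

Under that assumption, 64 threads come to 4000 MB (matching the ~4GB total above) and 116 threads to 7250 MB, which would be close to the 8 GB given to the VM.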
folding_hoomer
Posts: 349
Joined: Sun Feb 10, 2013 6:06 pm
Hardware configuration: Sys 1: I7 2700K@4,4GHz with NH-C14
8GB G.Skill Sniper DDR3 1866MHz CL 9-10-9-28
MSI Z68A-GD65 (G3), various operating systems (WinXP, Ubuntu: 10.4.3 LTS, 12.04.2 LTS)
Optional: GTX560TI 448@stock/OC´d

Sys 2: I7 3930K@4,4GHz with Corsair H110
16GB G.Skill Ripjaws X DDR3 1866MHz CL 9-10-9-28
ASUS Rampage IV Formula, Ubuntu 10.10

Sys 3 i7 875K@3,826 GHz with Scythe Mine2
8GB G.Skill Sniper DDR3 1866MHz CL 9-10-9-28
MSI P55-GD80, Win7 64Bit Pro
Sapphire Radeon HD5870@1,163V 900/1250MHz
Sapphire Radeon HD7870@1,218V 1200/1300MHz

Sys 4 i7 2600K@4,4GHz with Scythe Mine2
8GB G.Skill Sniper DDR3 1866MHz CL 9-10-9-28
MSI Z68A-GD65 (G3), various operating systems (WinXP, Ubuntu: 10.4.3 LTS, 12.04.2 LTS)
Optional: GTX560TI 448@stock/OC´d

Optional:
ASUS P5Q Pro with Q9550
ASUS P5Q Pro with Q6300
Location: Bavaria, Germany

Re: SUSE VMs on Xen

Post by folding_hoomer »

bigvbguy wrote:
bruce wrote:You have 116 CPUs. Are those virtualized CPUs that are mapped onto a smaller number of real CPUs? If so, that's your problem. There is no value whatsoever to running FAH on any number of CPUs which cannot be dedicated to the folding process. Performance degrades first, and then the virtual machine thrashes, trying to complete all 116 threads simultaneously, leading to a hopeless deadlock. [This may not be your problem, but I have to ask.]
No, the physical machine has 144 CPUs, and I have this VM pinned to 116 of them.
Are you really sure about 144 cores?
The specification shows only 15 cores / 30 threads per CPU, so you will get a max of 120 cores . . .
See: http://ark.intel.com/products/75251/Int ... e-2_80-GHz
Joe_H
Site Admin
Posts: 7929
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: SUSE VMs on Xen

Post by Joe_H »

One suggestion I would make for future runs is to avoid the 116 setting. You had a number of immediate failures at the point where the folding core would be splitting the processing up over the threads. As 116 is a multiple of a large prime number, 29, that is probably the cause of those failures. 112 might work better, as its largest prime factor is 7. 7 is known to cause problems with some WUs, but may be okay with these bigadv ones. Another number that might work okay is 108; avoid 104, as that is a multiple of 13.
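The factorizations behind those numbers can be checked with a short Python sketch. It only does the arithmetic; whether a given prime factor actually causes a failure depends on the core and the WU, and `prime_factors` is just an illustrative helper.

```python
def prime_factors(n):
    """Return the prime factors of n (with repetition) by trial division."""
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


# Thread counts discussed in this thread.
for threads in (116, 112, 108, 104, 96):
    factors = prime_factors(threads)
    print(threads, "=", " x ".join(map(str, factors)), "-> largest prime", max(factors))
```

This gives 29 as the largest prime factor for 116, 7 for 112, 3 for 108, 13 for 104, and 3 for 96.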

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3