repeated failure with project 100nn [Too many cores]
Moderators: Site Moderators, FAHC Science Team
The two topics reporting these problems have been merged after it was determined that both were caused by WUs intended for a small number of cores being mis-assigned to many-core systems.
My SR-2 has been getting Project 10090 (Run 98, Clone 23, Gen 0) and then failing to run it. It then cycles through several attempts to upload and start the project, with repeated failures. Finally it loads a different project, which then works fine. The typical log entry is as follows:
21:36:40:WU01:FS01:0xa4:Project: 10090 (Run 98, Clone 23, Gen 0)
21:36:40:WU01:FS01:0xa4:
21:36:40:WU01:FS01:0xa4:Entering M.D.
21:36:46:WU01:FS01:0xa4:Mapping NT from 22 to 20
21:36:58:WARNING:WU01:FS01:FahCore returned an unknown error code which probably indicates that it crashed
21:36:58:WARNING:WU01:FS01:FahCore returned: UNKNOWN_ENUM (-1073741783 = 0xc0000029)
21:37:30:WU01:FS01:Starting
21:37:30:WU01:FS01:Running FahCore: "C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:/Users/Fred/AppData/Roaming/FAHClient/cores/web.stanford.edu/~pande/Win32/AMD64/Core_a4.fah/FahCore_a4.exe -dir 01 -suffix 01 -version 704 -lifeline 2548 -checkpoint 15 -np 22
21:37:30:WU01:FS01:Started FahCore on PID 1068
21:37:30:WU01:FS01:Core PID:4124
21:37:30:WU01:FS01:FahCore 0xa4 started
Then it repeats the failure:
21:37:40:WU01:FS01:0xa4:- Files status OK
21:37:40:WU01:FS01:0xa4:- Expanded 45496 -> 206116 (decompressed 453.0 percent)
21:37:40:WU01:FS01:0xa4:Called DecompressByteArray: compressed_data_size=45496 data_size=206116, decompressed_data_size=206116 diff=0
21:37:40:WU01:FS01:0xa4:- Digital signature verified
21:37:40:WU01:FS01:0xa4:
21:37:40:WU01:FS01:0xa4:Project: 10090 (Run 98, Clone 23, Gen 0)
21:37:40:WU01:FS01:0xa4:
21:37:40:WU01:FS01:0xa4:Entering M.D.
21:37:46:WU01:FS01:0xa4:Mapping NT from 22 to 20
21:37:46:WU01:FS01:0xa4:mdrun returned 255
21:37:46:WU01:FS01:0xa4:Going to send back what have done -- stepsTotalG=10000000
21:37:46:WU01:FS01:0xa4:Work fraction=0.0000 steps=10000000.
21:37:50:WU01:FS01:0xa4:logfile size=0 infoLength=0 edr=0 trr=25
21:37:50:WU01:FS01:0xa4:logfile size: 0 info=0 bed=0 hdr=25
21:37:50:WU01:FS01:0xa4:- Writing 642 bytes of core data to disk...
21:37:50:WU01:FS01:0xa4:Done: 130 -> 143 (compressed to 110.0 percent)
21:37:50:WU01:FS01:0xa4: ... Done.
21:37:50:WU01:FS01:0xa4:
21:37:50:WU01:FS01:0xa4:Folding@home Core Shutdown: EARLY_UNIT_END
21:37:51:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)
21:37:51:WU01:FS01:Sending unit results: id:01 state:SEND error:FAULTY project:10090 run:98 clone:23 gen:0 core:0xa4 unit:0x000000000001329c546e75549e6f2853
Not sure what is wrong, but this project won't run on my computer. The computer is an SR-2 with E5649 CPUs. I also have a GTX780 Ti in it running a GPU client.
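The two numbers in the `UNKNOWN_ENUM (-1073741783 = 0xc0000029)` line above are the same 32-bit value, printed once as a signed integer and once in hex. A minimal sketch of that conversion follows; the helper names are illustrative and not part of FAHClient. Values of 0xC0000000 and above follow the Windows NTSTATUS convention for error-severity codes, which would be consistent with the client's guess that the core crashed rather than returning one of its own result codes.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def has_error_severity(code: int) -> bool:
    """True when the top two bits are set (NTSTATUS 'error' severity)."""
    return (to_unsigned32(code) >> 30) == 0b11


signed_code = -1073741783  # value taken from the log above
print(hex(to_unsigned32(signed_code)))
print(has_error_severity(signed_code))
```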
-
- Site Admin
- Posts: 7927
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
- Location: W. MA
Re: repeated failure with project 10090 (Run 98, Clone 23, G
This is just one WU from Project 10090, and it has been successfully completed by another folder. Are you also getting other WUs from this project that fail to run? If so, please give us a list of those that do not work on your setup. It is possible that 20 threads is too many for WUs from this project in general, or that it has a problem with decomposition that involves 5 as a factor. If there is more than one example, we will bring it to the attention of the researcher running this project so the assignment settings can be modified and this project is not assigned to systems similar to yours.
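To make the remark about factors concrete, here is a small sketch that lists the prime factors of the thread counts reported in this thread (22 requested and 20 after the "Mapping NT from 22 to 20" line, plus the 24 and 48 core systems). It only illustrates the arithmetic; it is not the decomposition logic of the core itself, and `prime_factors` is a hypothetical helper.

```python
def prime_factors(n: int) -> list[int]:
    """Return the prime factors of n in ascending order, with repeats."""
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


for threads in (20, 22, 24, 48):
    print(threads, prime_factors(threads))
```

Under this view, 22 carries the large factor 11, and the remapped count of 20 still carries a factor of 5, which is the case the post above singles out.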
iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3
Project 10085 Failed (48 core system)
Hey all,
So I've been getting assigned these WUs regularly and they continue to fail on my machine. The error returned in this log is just one example; it seems to return different error messages all the time. I can successfully fold other projects; just this one continues to fail. Also, it is not just this WU but others in the same project: looking through the log, I can see that after a couple of failed attempts it would try a new WU, also a 10085 but with a different run, clone and gen, which would then fail again. If you need more info, please let me know and I'll try to get around to posting it.
22:39:25:WU00:FS00:0xa4:Project: 10085 (Run 5, Clone 214, Gen 3)
22:39:25:WU00:FS00:0xa4:
22:39:25:WU00:FS00:0xa4:Assembly optimizations on if available.
22:39:25:WU00:FS00:0xa4:Entering M.D.
22:39:31:WU00:FS00:0xa4:mdrun returned 255
22:39:31:WU00:FS00:0xa4:Going to send back what have done -- stepsTotalG=10000000
22:39:31:WU00:FS00:0xa4:Work fraction=12884901888.0000 steps=10000000.
22:39:35:WU00:FS00:0xa4:logfile size=7942 infoLength=7942 edr=25 trr=1
22:39:35:WU00:FS00:0xa4:logfile size: 7942 info=7942 bed=25 hdr=1
22:39:35:WU00:FS00:0xa4:- Writing 8480 bytes of core data to disk...
22:39:35:WU00:FS00:0xa4:Done: 7968 -> 2787 (compressed to 34.9 percent)
22:39:35:WU00:FS00:0xa4: ... Done.
22:39:36:WU00:FS00:0xa4:
22:39:36:WU00:FS00:0xa4:Folding@home Core Shutdown: UNSTABLE_MACHINE
22:39:36:WARNING:WU00:FS00:FahCore returned: UNSTABLE_MACHINE (122 = 0x7a)
Re: Project 10085 Failed (48 core system)
I have to make a correction... I am also getting the errors on P10083 on various run clone and gen...
Re: repeated failure with project 10090 (Run 98, Clone 23, G
My SR-2 has had the same problem with projects 10090 (run 118, clone 9, gen 1), 10090 (run 161, clone 11, gen 0), 10090 (run 128, clone 13, gen 0) and a 10070, though I don't remember which run, clone and gen of the 10070. I'll watch it and copy down which ones fail during the next few days. I have other computers that are 6/12 CPU types and they have not run into any problems.
-
- Site Admin
- Posts: 7927
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
- Location: W. MA
Re: repeated failure with project 10090 (Run 98, Clone 23, G
The client by default keeps the 16 most recent logs in a folder in the F@H data directory along with the current log. If you could search those and post the PRCG's of the failing work units, that would help.
As for the ones you just mentioned, all have been successfully completed. The second one did have a couple failures as well.
P.S. A message has been sent to the researcher in charge of this project.
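Searching the saved logs for failing PRCGs, as requested above, could be sketched roughly as follows. The regular expressions are written against the log lines quoted in this thread. The function name is hypothetical, and treating every `FahCore returned:` code other than `FINISHED_UNIT` as a failure is an assumption.

```python
import re
from collections import defaultdict

# Patterns based on the log lines quoted in this thread.
SLOT_RE = re.compile(r":(WU\d+):(FS\d+):")
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
RETURN_RE = re.compile(r"FahCore returned: (\w+) \((-?\d+) = 0x[0-9a-fA-F]+\)")


def failing_prcgs(lines):
    """Map each (project, run, clone, gen) to the set of failure codes seen for it."""
    current = {}                 # (WU, FS) -> PRCG most recently announced there
    failures = defaultdict(set)
    for line in lines:
        slot = SLOT_RE.search(line)
        if not slot:
            continue
        key = slot.groups()
        project = PROJECT_RE.search(line)
        if project:
            current[key] = tuple(int(x) for x in project.groups())
            continue
        returned = RETURN_RE.search(line)
        # Assumption: anything other than FINISHED_UNIT counts as a failure.
        if returned and returned.group(1) != "FINISHED_UNIT" and key in current:
            failures[current[key]].add(returned.group(1))
    return dict(failures)


sample = [
    "21:36:40:WU01:FS01:0xa4:Project: 10090 (Run 98, Clone 23, Gen 0)",
    "21:36:58:WARNING:WU01:FS01:FahCore returned: UNKNOWN_ENUM (-1073741783 = 0xc0000029)",
    "21:37:40:WU01:FS01:0xa4:Project: 10090 (Run 98, Clone 23, Gen 0)",
    "21:37:51:WARNING:WU01:FS01:FahCore returned: BAD_WORK_UNIT (114 = 0x72)",
]
for prcg, codes in failing_prcgs(sample).items():
    print(prcg, sorted(codes))
```

To use it on real logs, read the lines of each saved log file and pass them to `failing_prcgs`. The location of the log folder depends on the installation, so it is left out here.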
-
- Site Admin
- Posts: 7927
- Joined: Tue Apr 21, 2009 4:41 pm
- Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
- Location: W. MA
Re: Project 10085 Failed (48 core system)
Please also post the beginning section of your log file that shows the version information, system info and the folding configuration. More of the log that also showed the beginning of the core starting up with the WU that failed would also be useful.
For each of the other WU's that have failed on your system, could you post the Project, Run, Clone and Generation numbers. Those specify unique WU's that can be checked for problems, or to see if there is a pattern of which ones fail.
P.S. Since the failure appears similar to the one reported here - viewtopic.php?f=19&t=27097, a message has been sent to the researcher responsible for the server these projects come from.
Re: Project 10085 Failed (48 core system)
10084 (Run 4, Clone 429, Gen 0) and also 10083 of various types - everything else runs fine.....
24 core system - the other lower core count systems I have are all humming along.
In all cases FahCore crashes and logs: FahCore returned: UNKNOWN_ENUM (-1073741783 = 0xc0000029)
Re: Project 10085 Failed (48 core system)
I will try to get around to fishing out the logs tomorrow afternoon. I see that it has failed a bunch more WUs last night as well, and I will confirm which precise WUs are failing tomorrow.
-
- Posts: 10179
- Joined: Thu Nov 29, 2007 4:30 pm
- Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
- Location: Arizona
- Contact:
Re: Project 10085 Failed (48 core system)
As a test, do these projects complete if you change to 24 cores?
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Re: Project 10085 Failed (48 core system)
I hope this is enough of the log... I think that is what you need anyways. I will now try and make a list of WU that have failed...
16:03:39:WU00:FS00:Starting
16:03:39:WU00:FS00:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/Core_a4.fah/FahCore_a4 -dir 00 -suffix 01 -version 703 -lifeline 1597 -checkpoint 24 -np 48
16:03:39:WU00:FS00:Started FahCore on PID 21985
16:03:39:WU00:FS00:Core PID:21989
16:03:39:WU00:FS00:FahCore 0xa4 started
16:03:39:WU00:FS00:0xa4:
16:03:39:WU00:FS00:0xa4:*------------------------------*
16:03:39:WU00:FS00:0xa4:Folding@Home Gromacs GB Core
16:03:39:WU00:FS00:0xa4:Version 2.27 (Dec. 15, 2010)
16:03:39:WU00:FS00:0xa4:
16:03:39:WU00:FS00:0xa4:Preparing to commence simulation
16:03:39:WU00:FS00:0xa4:- Ensuring status. Please wait.
16:03:48:WU00:FS00:0xa4:- Looking at optimizations...
16:03:48:WU00:FS00:0xa4:- Working with standard loops on this execution.
16:03:48:WU00:FS00:0xa4:Examination of work files indicates 8 consecutive improper terminations of core.
16:03:48:WU00:FS00:0xa4:- Expanded 53806 -> 201448 (decompressed 374.3 percent)
16:03:48:WU00:FS00:0xa4:Called DecompressByteArray: compressed_data_size=53806 data_size=201448, decompressed_data_size=201448 diff=0
16:03:48:WU00:FS00:0xa4:- Digital signature verified
16:03:48:WU00:FS00:0xa4:
16:03:48:WU00:FS00:0xa4:Project: 10085 (Run 2, Clone 656, Gen 4)
16:03:48:WU00:FS00:0xa4:
16:03:48:WU00:FS00:0xa4:Entering M.D.
16:03:54:WU00:FS00:0xa4:mdrun returned 255
16:03:54:WU00:FS00:0xa4:Going to send back what have done -- stepsTotalG=10000000
16:03:54:WU00:FS00:0xa4:Work fraction=17179869184.0000 steps=10000000.
16:03:55:WU00:FS00:FahCore returned: INTERRUPTED (102 = 0x66)
Re: repeated failure with project 10090 (Run 98, Clone 23, G
Yesterday (Friday) I received several more projects that failed. They are: 10090 (212, 24, 0), 10090 (211, 26, 0), 10090 (6, 34, 0), 10090 (47, 7, 0), 10090 (151, 35, 0), 10090 (18, 7, 1), 10090 (160, 35, 0), 10090 (165, 35, 0), 10090 (210, 34, 0), and 10084 (5, 151, 3).
Whereas earlier I would get switched to an older project which did run after a failure of one of the problem WUs, I started receiving nothing but WUs that won't run. I finally paused the client and gave up. I don't have the time to just sit at the computer, close the program and wait for a new WU assigned only to get another one which won't run.
Re: Project 10085 Failed (48 core system)
P10083,4,118,1
P10083,0,172,1
P10085,4,92,4
P10084,5,67,2
P10083,4,177,0
P10085,3,59,7
P10084,2,97,1
P10083,2,103,6
P10083,4,208,0
P10085,0,206,2
P10083,5,208,0
P10085,5,214,3
P10085,4,31,5
P10084,1,352,0
P10084,2,352,0
P10083,2,395,1
P10084,2,443,0
P10083,6,387,1
P10084,6,443,0
P10083,4,350,2
P10085,4,533,0
P10083,2,403,1
P10083,1,433,1
P10083,3,309,1
P10083,0,435,1
P10085,5,223,6
P10083,3,466,0
P10085,4,274,2
P10085,4,6,9
P10085,4,582,1
P10084,0,484,1
P10083,5,480,2
P10085,3,464,1
P10083,1,509,3
P10084,5,977,1
P10083,5,933,1
P10084,2,956,1
P10084,0,885,2
P10085,2,656,4
P10085,3,85,8
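A list in the `Pproject,run,clone,gen` form used above can be tallied per project to see which projects fail most often. The sketch below is illustrative only: the function names are hypothetical and the sample holds just the first four entries of the list.

```python
from collections import Counter


def parse_prcg(entry: str) -> tuple[int, int, int, int]:
    """Parse an entry such as 'P10083,4,118,1' into (project, run, clone, gen)."""
    project, run, clone, gen = entry.strip().lstrip("P").split(",")
    return int(project), int(run), int(clone), int(gen)


def failures_per_project(entries):
    """Count how many listed WUs belong to each project, skipping blank lines."""
    return Counter(parse_prcg(e)[0] for e in entries if e.strip())


sample = ["P10083,4,118,1", "P10083,0,172,1", "P10085,4,92,4", "P10084,5,67,2"]
for project, count in sorted(failures_per_project(sample).items()):
    print(project, count)
```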
Re: Project 10085 Failed (48 core system)
That is the list so far. When my current WU finishes I will split the slot into 2 and see whether 24 cores allows it to fold. I will also add a 4-core slot, an 8 and a 12, and I will watch it to the best of my ability.
Re: Project 10085 Failed (48 core system)
Similar to AtwaterFS, I have received a project 10084 (5, 151, 3) that refused to run on my SR-2. I only have 22 cores engaged, as I leave 2 cores for the GPU client. I've received a project 10070 that wouldn't run and several project 10090 series that will not run, with the log entry: FahCore returned: UNKNOWN_ENUM (-1073741783 = 0xc0000029).