Page 1 of 2
					
				Memory leak @ 2416
				Posted: Wed Dec 12, 2007 12:29 am
				by czonkin
Please help: on my Linux machine this is running:
...
[23:49:22] Folding@Home Gromacs Core
[23:49:22] Version 1.90 (March 8, 2006)
...
[23:49:24] Project: 2416 (Run 60, Clone 62, Gen 7)
...
and it's eating almost all my memory (CPU time shows as sys, not nice) and growing at ~1 MB/s ...
Cpu(s):  0.0%us, 99.3%sy,  0.7%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    953948k total,   947952k used,     5996k free,    71904k buffers
Swap:  1270072k total,  1270028k used,       44k free,   101644k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3661 root      39  19 1788m 596m  752 R 99.6 64.1  36:11.59 FahCore_78.exe
 3429 stanisla  15   0 14644  708  468 R  0.7  0.1   0:07.76 top
    1 root      15   0 10316  264  236 S  0.0  0.0   0:00.30 init
I'm not sure what to do, or if I make some mistake ...
Thanks!
Stanislav
			 
			
					
				
				Posted: Wed Dec 12, 2007 11:04 am
				by toTOW
Is it doing the same if you restart your client?

What happens if you let it grow?

Does it stop, or does the machine crash?
			
					
				
				Posted: Wed Dec 12, 2007 11:41 am
				by czonkin
Yes, it's the same with both standard and optimized loops. No, it grows until the machine (Fedora 7 on a Sempron 2200+) slows down to an unusable state (swap full). A hard reset was the only solution.
Should I delete this unit, or try something else?
			 
			
					
				
				Posted: Wed Dec 12, 2007 12:53 pm
				by toTOW
Make a backup of your FAH folder (including everything), just in case you need to send it to Stanford ... then delete the WU.
We'll wait for an answer from Stanford to see whether they need you to send the backup or whether they can have a look at that particular WU themselves.

edit: I sent a mail to Paula, who is in charge of this project ... let's wait for her answer.

 
			
					
				
				Posted: Wed Dec 12, 2007 2:43 pm
				by Ivoshiee
				You can send the WU to me as well.
			 
			
					
				I will take a look
				Posted: Wed Dec 12, 2007 6:28 pm
				by ppetrone
				Hey guys!
thank you for taking care of this.
I will take a look and come back to you asap.
pau
			 
			
					
				Re: I will take a look
				Posted: Thu Dec 13, 2007 12:52 am
				by czonkin
Okay, if you are still interested in it, have a look at
http://rapidshare.com/files/76186778/FA ... k.zip.html
Thanks!
Stanislav
 
			
					
				Re: I will take a look
				Posted: Thu Dec 13, 2007 10:00 am
				by Ivoshiee
				
This WU is broken. I hope Pande Group can nail at least one cause of the 0x79 error with this WU.
The FAH504-Linux.exe client reports this:
[10:05:40] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:05:40] 
[10:05:40] Assembly optimizations on if available.
[10:05:40] Entering M.D.
  Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands.
  This inclusion of Gromacs code in the Folding@Home Core is under a special
  license (see http://folding.stanford.edu/gromacs.html) specially granted to
  Stanford by the copyright holders. If you are interested in using Gromacs,
  visit http://www.gromacs.org where you can download a free version of Gromacs
  under the terms of the GNU General Public License (GPL) as published by the
  Free Software Foundation; either version 2 of the License, or (at your
  option) any later version.
[10:05:47] Protein: p2416_Ribosome_Na
[10:05:47] 
[10:05:47] Writing local files
Fatal error: realloc for nlist->jjnr (1041039360 bytes, file ns.c, line 388, nlist->jjnr=0x0x74c48008): Cannot allocate memory
[10:07:01] Gromacs error.
[10:07:01] 
[10:07:01] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:07:02] CoreStatus = 79 (121)
[10:07:02] Client-core communications error: ERROR 0x79
[10:07:02] Deleting current work unit & continuing...
[10:07:19] - Preparing to get new work unit...
[10:07:19] + Attempting to get work packet
[10:07:19] - Connecting to assignment server
[10:07:20] - Successful: assigned to (171.65.103.162).
[10:07:20] + News From Folding@Home: Welcome to Folding@Home
My box has 2 GB of memory, and it managed to allocate 80% of it before erroring out.
With fah6:
[10:11:36] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:11:36] 
[10:11:36] Assembly optimizations on if available.
[10:11:36] Entering M.D.
  Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands.
  This inclusion of Gromacs code in the Folding@Home Core is under a special
  license (see http://folding.stanford.edu/gromacs.html) specially granted to
  Stanford by the copyright holders. If you are interested in using Gromacs,
  visit http://www.gromacs.org where you can download a free version of Gromacs
  under the terms of the GNU General Public License (GPL) as published by the
  Free Software Foundation; either version 2 of the License, or (at your
  option) any later version.
[10:11:43] Protein: p2416_Ribosome_Na
[10:11:43] 
[10:11:43] Writing local files
Fatal error: realloc for nlist->jjnr (1055457280 bytes, file ns.c, line 388, nlist->jjnr=0x0x73e61008): Cannot allocate memory
[10:13:13] Gromacs error.
[10:13:13] 
[10:13:13] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:13:13] CoreStatus = 79 (121)
[10:13:13] Client-core communications error: ERROR 0x79
[10:13:13] Deleting current work unit & continuing...
[10:13:24] - Preparing to get new work unit...
[10:13:24] + Attempting to get work packet
[10:13:24] - Connecting to assignment server
This time it managed to allocate 85% of memory before dying.
 
			
					
				Thank you!
				Posted: Thu Dec 13, 2007 6:18 pm
				by ppetrone
				Ok. I am convinced  
 
 
I will remove it for now. In the meantime, I am still running it...
Thank you everybody!
pau
 
			
					
				
				Posted: Thu Dec 13, 2007 8:15 pm
				by gwildperson
				Thank you, Paula.
We do hope that "development" can isolate this problem and deliver some new code that deals with this issue promptly.
			 
			
					
				Re: Thank you!
				Posted: Thu Dec 13, 2007 8:54 pm
				by Ivoshiee
				ppetrone wrote:Ok. I am convinced  
 
 
I will remove it for now. In the meantime, I am still running it...
Thank you everybody!
pau
 
I hope the action is not only pulling the WU, but also research into why it does what it does and a fix for that issue in the FAH Core files.
 
			
					
				Re: Thank you!
				Posted: Thu Dec 13, 2007 10:06 pm
				by ppetrone
				Yes, exactly. That is the only reason why I am running it.
The research will be to find out whether this is a specific WU problem (most likely) or a more general problem.
Thanks,
Pau
			 
			
					
				
				Posted: Thu Dec 13, 2007 10:55 pm
				by codysluder
				ppetrone wrote:The research will be to find out whether this is a specific WU problem (most likely) or a more general problem.
Even if you find that this specific WU has a problem, there is also a more general problem: the client deleted the WU rather than reporting the failure to the server. You (i.e. Stanford) should have some indication that this WU has failed 100 times (or some much smaller number), so you can decide whether it needs to be removed from circulation without having us keep the statistics for you.
 
			
					
				
				Posted: Thu Dec 13, 2007 11:56 pm
				by ppetrone
As I said before, I am currently working on this specific WU to understand whether the issue is generalized or not.
I am sorry if it seems as if "you" are keeping statistics for "us".
I understand Folding@Home as a big group of people (donors+scientists) doing statistics *together* to solve relevant biological questions. For that reason, I believe there is a tacit agreement of collaboration and patience.
Paula
			 
			
					
				
				Posted: Fri Dec 14, 2007 1:31 am
				by codysluder
				ppetrone wrote:I am sorry if it seems as if "you" are keeping statistics for "us".
Sorry, I didn't mean that there was a big distinction between "you" and "us" but I can see how it sounded like that.
In fact, there are three types of things collaboratively at work in FAH. Some things are best done by the Pande-group-type people. Some things are best done by the donor-type people. Some things are best done by software.
If several of the donor-type-people all encounter the same error, it's statistically unlikely that they'll find each other.  Nevertheless, in this instance they did.  Because of that, the donor-type-people were able to generate a request to find out what's going on with this case. (And a big thank you for accepting this responsibility)
Figuring out why the WU failed is best done by you, and that issue is (probably) important in more WUs than just that one, but even if it's unique, it's important.
If the FAH client is able to report this condition to the server, it's a statistical certainty that those various error reports can find each other and be examined by the Pande-group-type-people like yourself.  I decided to call that a universal bug, though maybe it can be considered as an enhancement request.  In any case, the reports to the server are a universal problem that is best done by improved software (no matter what is actually wrong with the WU).
As a donor-type-person, I'm also saying that it seems like we're wasting valuable resources (much more than usual) repeating the same WUs with the same errors many, many times, and there ought to be a better way to find them and transfer them from our queue to your queue with less waste.