Memory leak @ 2416

czonkin
Posts: 3
Joined: Wed Dec 12, 2007 12:20 am
Location: Czech republic

Post by czonkin »

Please help. My Linux machine is running this:
...
[23:49:22] Folding@Home Gromacs Core
[23:49:22] Version 1.90 (March 8, 2006)
...
[23:49:24] Project: 2416 (Run 60, Clone 62, Gen 7)
...
and it is eating almost all my memory (the CPU time shows up as sys, not nice) and still growing at roughly 1 MB/sec:
Cpu(s): 0.0%us, 99.3%sy, 0.7%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 953948k total, 947952k used, 5996k free, 71904k buffers
Swap: 1270072k total, 1270028k used, 44k free, 101644k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3661 root 39 19 1788m 596m 752 R 99.6 64.1 36:11.59 FahCore_78.exe
3429 stanisla 15 0 14644 708 468 R 0.7 0.1 0:07.76 top
1 root 15 0 10316 264 236 S 0.0 0.0 0:00.30 init
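
A growth rate like the one reported above can be measured by sampling the process a few times instead of reading it off top. The following is a minimal Python sketch, assuming a Linux /proc filesystem; the function names are illustrative and not part of any FAH tool.

```python
import time


def parse_vm_kb(status_text, field="VmRSS"):
    """Return the kB value of one field from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            # Lines look like "VmRSS:     610304 kB"
            return int(line.split()[1])
    return None


def growth_kb_per_sec(samples):
    """Average growth between the first and last (timestamp_s, kb) sample."""
    (t0, kb0), (t1, kb1) = samples[0], samples[-1]
    if t1 == t0:
        raise ValueError("samples must span a non-zero time interval")
    return (kb1 - kb0) / (t1 - t0)


def sample_process(pid, count=5, interval=1.0, field="VmRSS"):
    """Collect (timestamp_s, kb) samples for a running process (Linux only)."""
    samples = []
    for _ in range(count):
        with open(f"/proc/{pid}/status") as f:
            samples.append((time.time(), parse_vm_kb(f.read(), field)))
        time.sleep(interval)
    return samples


# Canned samples (made-up numbers) so the sketch runs without a live process.
demo = [(0.0, 600000), (5.0, 605120), (10.0, 610240)]
print(f"{growth_kb_per_sec(demo):.0f} kB/s")
```

Once swap is in use, resident memory (VmRSS) can stay flat while the process keeps growing, so passing field="VmSize" to sample_process may be the more telling measurement in a case like this one.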

I'm not sure what to do, or whether I'm making a mistake somewhere.

Thanks!

Stanislav
toTOW
Site Moderator
Posts: 6349
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Post by toTOW »

Does it do the same if you restart your client?

What happens if you let it grow? Does it stop, or does the machine crash?

Folding@Home beta tester since 2002. Folding Forum moderator since July 2008.
czonkin
Posts: 3
Joined: Wed Dec 12, 2007 12:20 am
Location: Czech republic

Post by czonkin »

Yes, it is the same with standard loops or optimized ones. And no, it does not stop: it grows until the machine (Fedora 7 on a Sempron 2200+) slows down to an unusable state with swap full. A hard reset was the only way out.

Do I have to delete this unit, or should I try something else?
toTOW
Site Moderator
Posts: 6349
Joined: Sun Dec 02, 2007 10:38 am
Location: Bordeaux, France

Post by toTOW »

Make a backup of your FAH folder (including everything), just in case you need to send it to Stanford ... then delete the WU.

We'll wait for an answer from Stanford to see if they need you to send the backup or if they can have a look at that particular WU ;)

Edit: I sent a mail to Paula, who is in charge of this project ... let's wait for her answer ;)
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Post by Ivoshiee »

You can send the WU to me as well.
ppetrone
Pande Group Member
Posts: 115
Joined: Wed Dec 12, 2007 6:20 pm
Location: Stanford

I will take a look

Post by ppetrone »

Hey guys!
Thank you for taking care of this.
I will take a look and get back to you ASAP.

pau
czonkin
Posts: 3
Joined: Wed Dec 12, 2007 12:20 am
Location: Czech republic

Re: I will take a look

Post by czonkin »

Okay, if you are still interested in it, have a look at http://rapidshare.com/files/76186778/FA ... k.zip.html

Thanks!

Stanislav
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: I will take a look

Post by Ivoshiee »

czonkin wrote: Okay, if you are still interested in it, have a look at http://rapidshare.com/files/76186778/FA ... k.zip.html

Thanks!

Stanislav
This WU is broken. I hope the Pande Group can nail down at least one cause of the 0x79 error with this WU.

The FAH504-Linux.exe will report this:
[10:05:40] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:05:40]
[10:05:40] Assembly optimizations on if available.
[10:05:40] Entering M.D.

Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit http://www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.

[10:05:47] Protein: p2416_Ribosome_Na
[10:05:47]
[10:05:47] Writing local files
Fatal error: realloc for nlist->jjnr (1041039360 bytes, file ns.c, line 388, nlist->jjnr=0x0x74c48008): Cannot allocate memory
[10:07:01] Gromacs error.
[10:07:01]
[10:07:01] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:07:02] CoreStatus = 79 (121)
[10:07:02] Client-core communications error: ERROR 0x79
[10:07:02] Deleting current work unit & continuing...
[10:07:19] - Preparing to get new work unit...
[10:07:19] + Attempting to get work packet
[10:07:19] - Connecting to assignment server
[10:07:20] - Successful: assigned to (171.65.103.162).
[10:07:20] + News From Folding@Home: Welcome to Folding@Home
My box has 2 GB of memory, and the core managed to allocate about 80% of it before erroring out.


With fah6:
[10:11:36] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:11:36]
[10:11:36] Assembly optimizations on if available.
[10:11:36] Entering M.D.

Gromacs is Copyright (c) 1991-2003, University of Groningen, The Netherlands
This inclusion of Gromacs code in the Folding@Home Core is under
a special license (see http://folding.stanford.edu/gromacs.html)
specially granted to Stanford by the copyright holders. If you
are interested in using Gromacs, visit http://www.gromacs.org where
you can download a free version of Gromacs under
the terms of the GNU General Public License (GPL) as published
by the Free Software Foundation; either version 2 of the License,
or (at your option) any later version.

[10:11:43] Protein: p2416_Ribosome_Na
[10:11:43]
[10:11:43] Writing local files
Fatal error: realloc for nlist->jjnr (1055457280 bytes, file ns.c, line 388, nlist->jjnr=0x0x73e61008): Cannot allocate memory
[10:13:13] Gromacs error.
[10:13:13]
[10:13:13] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:13:13] CoreStatus = 79 (121)
[10:13:13] Client-core communications error: ERROR 0x79
[10:13:13] Deleting current work unit & continuing...
[10:13:24] - Preparing to get new work unit...
[10:13:24] + Attempting to get work packet
[10:13:24] - Connecting to assignment server
This time it managed to allocate 85% of memory before death.
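
To put the two "Fatal error: realloc" lines above in perspective, the requested sizes can be pulled out and converted to MiB. This is a small Python sketch; the regular expression is only an assumption fitted to the two log lines quoted in this thread, not a documented Gromacs format.

```python
import re

# Pattern fitted to the two error lines quoted in this thread (an assumption).
REALLOC_RE = re.compile(
    r"realloc for (?P<var>\S+) \((?P<bytes>\d+) bytes, "
    r"file (?P<file>[^,\s]+), line (?P<line>\d+)"
)


def parse_realloc_error(log_line):
    """Return details of a failed realloc, or None if the line does not match."""
    m = REALLOC_RE.search(log_line)
    if m is None:
        return None
    size = int(m.group("bytes"))
    return {
        "variable": m.group("var"),
        "bytes": size,
        "mib": size / 2**20,  # 1 MiB = 2**20 bytes
        "file": m.group("file"),
        "line": int(m.group("line")),
    }


lines = [
    "Fatal error: realloc for nlist->jjnr (1041039360 bytes, file ns.c, "
    "line 388, nlist->jjnr=0x0x74c48008): Cannot allocate memory",
    "Fatal error: realloc for nlist->jjnr (1055457280 bytes, file ns.c, "
    "line 388, nlist->jjnr=0x0x73e61008): Cannot allocate memory",
]
for text in lines:
    info = parse_realloc_error(text)
    print(f"{info['variable']}: {info['mib']:.1f} MiB at {info['file']}:{info['line']}")
```

Both failed requests come out at roughly 1 GiB for a single array (nlist->jjnr), which is about half of the 2 GB of RAM in this box.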
ppetrone
Pande Group Member
Posts: 115
Joined: Wed Dec 12, 2007 6:20 pm
Location: Stanford

Thank you!

Post by ppetrone »

Ok. I am convinced :?
I will remove it for now. In the meantime, I am still running it...

Thank you everybody!

pau
gwildperson
Posts: 450
Joined: Tue Dec 04, 2007 8:36 pm

Post by gwildperson »

Thank you, Paula.

We do hope that "development" can isolate this problem and deliver some new code that deals with this issue promptly.
Ivoshiee
Site Moderator
Posts: 822
Joined: Sun Dec 02, 2007 12:05 am
Location: Estonia

Re: Thank you!

Post by Ivoshiee »

ppetrone wrote:Ok. I am convinced :?
I will remove it for now. In the meantime, I am still running it...

Thank you everybody!

pau
I hope the action is not just pulling the WU, but also researching why it does what it does and implementing a fix for that issue in the FAH Core files.
ppetrone
Pande Group Member
Posts: 115
Joined: Wed Dec 12, 2007 6:20 pm
Location: Stanford

Re: Thank you!

Post by ppetrone »

Yes, exactly. That is the only reason why I am running it.
The research will be to find out whether this is a specific WU problem (most likely) or a more general problem.
Thanks,

Pau
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Post by codysluder »

ppetrone wrote:The research will be to find out whether this is a specific WU problem (most likely) or a more general problem.
Even if you find that this specific WU has a problem, there is also a more general problem. The client deleted the WU rather than reporting the problem to the server. You (i.e., Stanford) should have some indication that this WU has failed 100 times (or some much smaller number) so you can decide if it needs to be removed from circulation without having us keep the statistics for you.
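
The kind of server-side bookkeeping suggested here could be as simple as a per-WU failure counter with a threshold. The following Python sketch is purely hypothetical: the class name, the threshold value, and the idea of keying on (project, run, clone, gen) are assumptions for illustration, not a description of how the FAH servers actually work.

```python
from collections import Counter


class FailureTally:
    """Hypothetical per-WU failure counter, keyed on (project, run, clone, gen)."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed cut-off; the post suggests "100 or much smaller"
        self.counts = Counter()

    def report(self, project, run, clone, gen):
        """Record one failure; return True once the WU has reached the threshold."""
        key = (project, run, clone, gen)
        self.counts[key] += 1
        return self.counts[key] >= self.threshold

    def flagged(self):
        """Return all WUs whose failure count has reached the threshold."""
        return sorted(k for k, n in self.counts.items() if n >= self.threshold)


tally = FailureTally(threshold=2)
for _ in range(2):
    needs_review = tally.report(2416, 60, 62, 7)
print(needs_review, tally.flagged())
```

With a tally like this, independent failures of the same WU on different machines would add up automatically instead of depending on donors finding each other on the forum.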
ppetrone
Pande Group Member
Posts: 115
Joined: Wed Dec 12, 2007 6:20 pm
Location: Stanford

Post by ppetrone »

As I said before, I am currently working on this specific WU to understand whether the issue is general or not.

I am sorry if it seems as if "you" are keeping statistics for "us".

I understand Folding@Home as a big group of people (donors+scientists) doing statistics *together* to solve relevant biological questions. For that reason, I believe there is a tacit agreement of collaboration and patience.

Paula
codysluder
Posts: 1024
Joined: Sun Dec 02, 2007 12:43 pm

Post by codysluder »

ppetrone wrote:I am sorry if it seems as if "you" are keeping statistics for "us".
Sorry, I didn't mean that there was a big distinction between "you" and "us" but I can see how it sounded like that.

In fact, there are three types of things that are collaboratively working on FAH. Some things are best done by the Pande-group-type people. Some things are best done by the donor-type people. Some things are best done by software.

If several of the donor-type-people all encounter the same error, it's statistically unlikely that they'll find each other. Nevertheless, in this instance they did. Because of that, the donor-type-people were able to generate a request to find out what's going on with this case. (And a big thank you for accepting this responsibility)

Figuring out why the WU failed is best done by you, and that issue is (probably) important in more WUs than just that one, but even if it's unique, it's important.

If the FAH client is able to report this condition to the server, it's a statistical certainty that those various error reports can find each other and be examined by the Pande-group-type-people like yourself. I decided to call that a universal bug, though maybe it can be considered as an enhancement request. In any case, the reports to the server are a universal problem that is best done by improved software (no matter what is actually wrong with the WU).

As a donor-type-person, I'm also saying that it seems like we're wasting valuable resources (much more than "usually") repeating the same WUs with the same errors many, many times, and there ought to be a better way to find them and transfer them from our queue to your queue with fewer wasted resources.
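
As a rough illustration of what such an error report could contain, the WU identity and core status can be extracted from the client log lines quoted earlier in this thread. This is a hypothetical Python sketch: the patterns are fitted to those quoted lines only, and the function name and record layout are assumptions, not part of the FAH client.

```python
import re

# Patterns fitted to the log lines quoted in this thread (assumptions).
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
STATUS_RE = re.compile(r"CoreStatus = ([0-9A-Fa-f]+) \((\d+)\)")


def extract_core_status(log_text):
    """Return the first WU identity plus the core status that follows it, or None."""
    wu = None
    for line in log_text.splitlines():
        m = PROJECT_RE.search(line)
        if m:
            wu = tuple(int(x) for x in m.groups())
            continue
        m = STATUS_RE.search(line)
        if m and wu is not None:
            project, run, clone, gen = wu
            return {
                "project": project,
                "run": run,
                "clone": clone,
                "gen": gen,
                "core_status": int(m.group(1), 16),  # logged in hex, e.g. 79
            }
    return None


sample_log = """\
[10:05:40] Project: 2416 (Run 60, Clone 62, Gen 7)
[10:07:01] Folding@home Core Shutdown: UNKNOWN_ERROR
[10:07:02] CoreStatus = 79 (121)
[10:07:02] Client-core communications error: ERROR 0x79
"""
record = extract_core_status(sample_log)
print(record)
```

A record like this is small enough to send along before the client deletes the work unit, and it carries everything needed to match reports for the same WU from different machines.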