benchmarking F@H

Moderators: Site Moderators, FAHC Science Team

alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

benchmarking F@H

Post by alpha754293 »

Is there a really good, reliable, (quick) way of benchmarking F@H?

I was thinking of keeping a WU to as close to its deadline as possible, but I don't think that's a very good idea in the end.

It'll give me consistent data to work with, but there's always the risk that it will run over.

Then I started wondering whether there would be a way for me to repeatedly run just ONE specific project # for the comparison.

I'm trying to find out which Linux distro will generate the most PPD; the candidates range from CNL to CentOS to SuSE to Ubuntu.

If I let the F@H client actually work for a few days each, the testing is expected to take anywhere from a month to about 100 days.

I WISH I could test HTT on a Core i7, but alas, the closest thing I've got is a Q9550, and I'm holding off on changing that config because I might need that system for some of my class work.

Any ideas?
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: benchmarking F@H

Post by bruce »

alpha754293 wrote:Is there a really good, reliable, (quick) way of benchmarking F@H?

I was thinking of keeping a WU to as close to its deadline as possible, but I don't think that's a very good idea in the end.
No, that's not a good idea. How about this:

If the current WU is nearly finished, let it complete and download a new one.
Make a backup of your current FAH directory.
Allow the current WU to finish naturally (perhaps with the -oneunit flag, if that's appropriate), noting the time for, say, 10 frames.
Restore the backup to some other platform and reprocess the same 10 frames. Repeat as necessary.
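
For the timing step, here's a minimal sketch in Python of how you might pull per-frame times out of the log. It assumes the classic client's FAHlog.txt format; the exact "Completed ... (N%)" wording varies by core version, so check your own log and adjust the regex:

[code]
# Sketch: estimate time-per-frame from a FAH client log (format assumed; adjust regex).
import re
import sys
from datetime import datetime, timedelta

FRAME_RE = re.compile(r"\[(\d\d:\d\d:\d\d)\]\s+Completed \d+ out of \d+ steps\s+\((\d+)%\)")

def frame_timestamps(log_path):
    """Return the wall-clock timestamp of every completed frame, in order."""
    stamps = []
    with open(log_path) as log:
        for line in log:
            m = FRAME_RE.search(line)
            if m:
                stamps.append(datetime.strptime(m.group(1), "%H:%M:%S"))
    return stamps

def elapsed(stamps):
    """Sum the gaps between successive frames, allowing for wrap past midnight."""
    total = timedelta(0)
    for a, b in zip(stamps, stamps[1:]):
        gap = b - a
        if gap < timedelta(0):
            gap += timedelta(days=1)
        total += gap
    return total

if __name__ == "__main__":
    stamps = frame_timestamps(sys.argv[1] if len(sys.argv) > 1 else "FAHlog.txt")
    n = min(len(stamps) - 1, 10)   # use up to 10 frame-to-frame intervals
    if n < 1:
        sys.exit("not enough completed frames in the log yet")
    span = elapsed(stamps[: n + 1])
    print("%d frames in %s (%s per frame)" % (n, span, span / n))
[/code]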

For the Windows client, there's an advanced setting that informs the client that the local clock cannot be trusted. If you use that setting, the client will not delete the WU when it expires. I don't remember if Linux/MacOS has the same setting.

Depending on how (quick) or (accurate) you want your measurement to be, use fewer or more frames.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: benchmarking F@H

Post by alpha754293 »

bruce wrote:
alpha754293 wrote:Is there a really good, reliable, (quick) way of benchmarking F@H?

I was thinking of keeping a WU to as close to its deadline as possible, but I don't think that's a very good idea in the end.
No, that's not a good idea. How about this:

If the current WU is nearly finished, let it complete and download a new one.
Make a backup of your current FAH directory.
Allow the current WU to finish naturally (perhaps with the -oneunit flag, if that's appropriate), noting the time for, say, 10 frames.
Restore the backup to some other platform and reprocess the same 10 frames. Repeat as necessary.

For the Windows client, there's an advanced setting that informs the client that the local clock cannot be trusted. If you use that setting, the client will not delete the WU when it expires. I don't remember if Linux/MacOS has the same setting.

Depending on how (quick) or (accurate) you want your measurement to be, use fewer or more frames.
Is a "frame" a percentage point in the analysis?

See, I still worry about doing that. (It's similar to the idea I had; I apologize that I didn't explain it right. The checkpoint would be stored on one of my other servers, purely for the purposes of the benchmark.) However, the benchmarking takes a while: for example, one of my current WUs (p2665) has a deadline of 5 days, 13 hours, which gives an idea of how long it actually takes to run.

It's taken the Windows SMP client 4 h 10 m to get to 10% (which I'm assuming means 10 frames). So, suppose it takes 2 hours per OS install: within the 5 d 13 h, if I don't sleep, I'd only be able to test 22 installations/configurations.

On the assumption that I sleep for 8 hours a day, that'd mean only 15 installations.

And with my current estimate of 48 hours to actually complete the unit, I'd really only have about 72 hours or so before the WU expires altogether and is considered "lost".

So I'm not sure if there's a better way to benchmark the F@H performance.
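
As a quick sanity check of that arithmetic (a couple of lines of Python; all numbers from above):

[code]
# Sanity check of the schedule arithmetic above.
deadline_h = 5 * 24 + 13            # p2665 deadline: 5 d 13 h = 133 h
install_h = 2.0                     # 2 hours per OS install
bench_h = 4 + 10 / 60               # 4 h 10 m to reach 10 frames (10%)
per_config_h = install_h + bench_h  # about 6.2 h per configuration

print("around the clock: %d configs" % (deadline_h // per_config_h))              # 21
print("with 8 h sleep/day: %d configs" % (deadline_h * 16 / 24 // per_config_h))  # 14
[/code]

That lands within one configuration of the estimates above, so the schedule really is that tight.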
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: benchmarking F@H

Post by bruce »

Rather than benchmarking with the SMP client, consider benchmarking with the uniprocessor client. The deadlines are MUCH longer so that shouldn't be a problem. Once you've eliminated most of the slower choices, you can always re-benchmark a few systems with the SMP client and see if it's significantly different. You can also use StressCPU2 which is a lot like the Gromacs code but doesn't require you to waste a potentially useful result by downloading it and (perhaps) not returning it promptly.

Most WUs have 100 frames, so yes, a frame is 1%, but in rare cases, there are exceptions.

A WU is considered "lost" when it passes the Preferred Deadline and a new one is issued.
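
Once you have a per-frame time, converting it to PPD is just the WU's credit divided by the days the WU would take. A sketch in Python, with made-up numbers for illustration:

[code]
def ppd(wu_credit, seconds_per_frame, frames=100):
    """Points per day: the WU's credit divided by the days the WU takes."""
    wu_days = seconds_per_frame * frames / 86400.0
    return wu_credit / wu_days

# Hypothetical example: a 1920-point WU at 15 minutes per frame
print("%.0f PPD" % ppd(1920, 15 * 60))   # ~1843 PPD
[/code]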
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: benchmarking F@H

Post by alpha754293 »

bruce wrote:Rather than benchmarking with the SMP client, consider benchmarking with the uniprocessor client. The deadlines are MUCH longer so that shouldn't be a problem. Once you've eliminated most of the slower choices, you can always re-benchmark a few systems with the SMP client and see if it's significantly different. You can also use StressCPU2 which is a lot like the Gromacs code but doesn't require you to waste a potentially useful result by downloading it and (perhaps) not returning it promptly.

Most WUs have 100 frames, so yes, a frame is 1%, but in rare cases, there are exceptions.

A WU is considered "lost" when it passes the Preferred Deadline and a new one is issued.
Well... the idea for the benchmarking really started when some people were saying that the Linux client, particularly on Ubuntu 8.04 with a Core i7 with HTT enabled, is quite possibly one of the fastest configurations you can have nowadays for F@H.

I would have reckoned that enabling HTT would be a bad idea when the F@H core is computationally intensive (which it often is; hence the premise of a distributed computing platform), and yet it seems to be a performance booster.

And then in another post, someone mentioned that there were problems with the initial release of the kernel included with Ubuntu 8.10, and that apparently that kernel has been patched, but the fix doesn't ship natively with the distribution (and it's predicted that it won't be fixed until Ubuntu 9).

So, that was how that came about.

As far as I know, the uniprocessor client ISN'T in a beta phase, which would tend to indicate that it's pretty functional. On the other hand, if there are signs that it exhibits some of the "unusual" behavior of the SMP beta client, then sure, I can use the uniprocessor client to do the same tests.

And if there's some way to correlate the PPD results from the uniprocessor client to the PPD results of the SMP beta client, then absolutely!

Otherwise, I would normally test in an "exact" manner, rather than trying to guesstimate/extrapolate results, especially if I know that the tests CAN be performed.

The question being posed here is: what IS the best method to go about testing it? Given the timeframes we're working with, computational resources are scarce, as is time, as always; so there's got to be a way for us to conduct these tests with as little adverse impact as possible. Don't you agree?
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: benchmarking F@H

Post by bruce »

From a science perspective, the disadvantage of HyperThreading is that many people run two uniprocessor clients on an old P4 rather than one. This means that each one runs at something like 60% of the expected speed, and folks get about 1.2x the number of points. Throughput is limited by 100% FPU utilization (using SSE), of course, but a single non-HyperThreaded client does spend a small amount of time waiting on integer instructions. By running two clients, all of the integer instructions can be overlapped with SSE processing. The scientific cost of running at 60% effective speed is higher than the benefit of doing more total work.
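
In numbers, using the figures above (Python):

[code]
# Two uniprocessor clients on a HyperThreaded P4, each at ~60% speed.
per_client = 0.60
throughput = 2 * per_client   # 1.2x the total work per unit time
latency = 1 / per_client      # but each individual WU takes ~1.67x as long

print("total throughput: %.1fx baseline" % throughput)   # 1.2x
print("per-WU turnaround: %.2fx baseline" % latency)     # 1.67x
[/code]

The ~1.67x slower turnaround on each individual WU is the scientific cost in question.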

For the i7 I suppose there is a similar benefit from the integer/float overlap, but since all virtual CPUs are working on the same WU, there would be no 60% slowdown. Dividing a WU into 8 parallel segments rather than 4 gives back what HyperThreading takes away. Any effects of FPU saturation would still be applied to the same WU anyway.

Of course, if the 8 segments are unequal, you still get a full CPU working on the last 4 segments while the other four are waiting for the synchronization step, so there's very little difference in how an imbalance might be handled.

As far as the problems with specific kernel versions, I don't know any more about it than has been posted elsewhere.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: benchmarking F@H

Post by alpha754293 »

bruce wrote:From a science perspective, the disadvantage of HyperThreading is that many people run two uniprocessor clients on an old P4 rather than one. This means that each one runs at something like 60% of the expected speed, and folks get about 1.2x the number of points. Throughput is limited by 100% FPU utilization (using SSE), of course, but a single non-HyperThreaded client does spend a small amount of time waiting on integer instructions. By running two clients, all of the integer instructions can be overlapped with SSE processing. The scientific cost of running at 60% effective speed is higher than the benefit of doing more total work.

For the i7 I suppose there is a similar benefit from the integer/float overlap, but since all virtual CPUs are working on the same WU, there would be no 60% slowdown. Dividing a WU into 8 parallel segments rather than 4 gives back what HyperThreading takes away. Any effects of FPU saturation would still be applied to the same WU anyway.

Of course, if the 8 segments are unequal, you still get a full CPU working on the last 4 segments while the other four are waiting for the synchronization step, so there's very little difference in how an imbalance might be handled.

As far as the problems with specific kernel versions, I don't know any more about it than has been posted elsewhere.
That's an interesting point you bring up, Bruce, because I've actually tried running two SMP clients on an AMD 4-core system, and it doesn't work so well.

Maybe the AMDs are just better at handling the workload, although they definitely don't do it as fast as the Intels do.

(e.g. on an 8-core system, if you have 3 SMP clients running, expect a 25% overall drop-off in speed, as one of the two 4-core instances will be running slower than the other.)

Here's the thread about the performance difference between the Ubuntu versions.
viewtopic.php?f=16&t=7998&hilit=Ubuntu

*edit*
correction:
"...expect a 25% overall drop off in speed."

that should have read "...expect a 33% overall drop off in speed."
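
For what it's worth, the corrected figure matches simple oversubscription arithmetic, assuming ideal scheduling and ignoring cache effects (Python):

[code]
# 3 SMP clients x 4 threads each = 12 threads competing for 8 cores.
clients, threads_each, cores = 3, 4, 8
per_thread = cores / (clients * threads_each)  # 8/12: each thread runs at 2/3 speed
print("expected slowdown: %.0f%%" % ((1 - per_thread) * 100))  # 33%
[/code]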
Last edited by alpha754293 on Wed Jan 21, 2009 6:48 am, edited 1 time in total.
bruce
Posts: 20824
Joined: Thu Nov 29, 2007 10:13 pm
Location: So. Cal.

Re: benchmarking F@H

Post by bruce »

Running two SMP clients on 4-core hardware is NOT recommended because it often causes cache thrashing, depending on how much memory is needed by the client compared to the size of the cache. The same would be true for the i7, except for the fact that when you divide the data into twice as many segments, those data segments get smaller.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: benchmarking F@H

Post by alpha754293 »

bruce wrote:Running two SMP clients on 4-core hardware is NOT recommended because it often causes cache thrashing, depending on how much memory is needed by the client compared to the size of the cache. The same would be true for the i7, except for the fact that when you divide the data into twice as many segments, those data segments get smaller.
But if you're running the Linux SMP client with the "-smp 8" flag, wouldn't it effectively cause the same thing though?

I'd assume that it'd be trying to spawn 8 core threads, so...I would think that you'd end up with the same result. (even with HTT enabled)

(My quad-core right now has 8 GB of RAM).

The last time I tested it with the number of SMP clients running > # of cores, that system had 16 GB of RAM and was running Red Hat Enterprise Linux 4 AS.

I would also guess that the cache hits don't occur at exactly the same time; therefore, the chances of a cache collision or cache thrashing should be minimal, should they not? (I have no idea. Again, I'm not a programmer.)

*edit*
I'd also have to guess that, given the different projects and different cores, the profile of the core/project combination would be different, and therefore the cache profile will also be different, no?

*edit*
I'm trying to understand this, so bear with me and let me get this straight:

Linux SMP on a Core i7 with HTT enabled - 8 cores is OK because it utilizes the extra free cycles in the CPU.

But Linux SMP on a quad-core processor (without HTT, or HTT N/A) - 8 cores is bad because of cache thrashing???

0.o? They seem like contradictory statements to me.

On the quad-core without HTT, there would still be those free cycles left, and the idea/intent of using two SMP clients (4 cores each) would tend to suggest that it's doing what a "-smp 8" flag would do.

Conversely, if running 8 cores' worth (either via two SMP clients (4 cores each) or with the "-smp 8" flag) = cache thrashing, then since HTT threads aren't physical processors, wouldn't you get the same thing?

I don't get it.
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4, 4x512MB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for approx. 70K PPD
Location: Salem. OR USA

Re: benchmarking F@H

Post by P5-133XL »

If you need more time for benchmarking, then simply change the date on your computer to be inside the deadline. If you keep resetting the date to the original download date, you can continue benchmarking forever with effectively no time worries.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: benchmarking F@H

Post by alpha754293 »

P5-133XL wrote:If you need more time for benchmarking, then simply change the date on your computer to be inside the deadline. If you keep resetting the date to the original download date, you can continue benchmarking forever with effectively no time worries.
Wouldn't that have an adverse impact on the progress of the project overall?
P5-133XL
Posts: 2948
Joined: Sun Dec 02, 2007 4:36 am
Hardware configuration: Machine #1:

Intel Q9450; 2x2GB=8GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460; Windows Server 2008 X64 (SP1).

Machine #2:

Intel Q6600; 2x2GB=4GB Ram; Gigabyte GA-X48-DS4 Motherboard; PC Power and Cooling Q750 PS; 2x GTX 460 video card; Windows 7 X64.

Machine 3:

Dell Dimension 8400, 3.2GHz P4, 4x512MB Ram, Video card GTX 460, Windows 7 X32

I am currently folding just on the 5x GTX 460's for approx. 70K PPD
Location: Salem. OR USA

Re: benchmarking F@H

Post by P5-133XL »

Download a project; make a copy of the folding folder, paying attention to the date; let the project complete and send the results. Then, for all benchmarks on that machine or others, simply change the date on the machine back to the download date, copy the copy, and fold from there.

There is no interference with the project, because the data has been returned and Stanford can continue forward with no delay. If the copy is ever accidentally returned, it will be rejected as a duplicate. The folding can be repeated over and over because the computer date has been changed, so the client will not think the deadline has passed and quit.
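
A minimal sketch of that workflow in Python on Linux. The paths and date here are hypothetical; setting the clock needs root, and you'll want NTP stopped so it doesn't immediately re-sync the time:

[code]
#!/usr/bin/env python3
# Hypothetical helper for the copy-and-rewind benchmarking trick described above.
import shutil
import subprocess

PRISTINE = "/opt/fah-bench/pristine"   # copy of the folding folder made right after download
WORKDIR = "/opt/fah-bench/run"         # folder the client actually runs from
DOWNLOAD_DATE = "2009-01-18 01:13:00"  # when the WU was originally downloaded

def restore_work_folder():
    """Throw away the previous run and restore the pristine copy."""
    shutil.rmtree(WORKDIR, ignore_errors=True)
    shutil.copytree(PRISTINE, WORKDIR)

def rewind_clock():
    """Set the system date back to the download date (GNU coreutils date)."""
    subprocess.check_call(["date", "-s", DOWNLOAD_DATE])

if __name__ == "__main__":
    restore_work_folder()
    rewind_clock()
    print("ready: start the client in", WORKDIR)
[/code]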
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: benchmarking F@H

Post by alpha754293 »

P5-133XL wrote:Download a project; make a copy of the folding folder, paying attention to the date; let the project complete and send the results. Then, for all benchmarks on that machine or others, simply change the date on the machine back to the download date, copy the copy, and fold from there.

There is no interference with the project, because the data has been returned and Stanford can continue forward with no delay. If the copy is ever accidentally returned, it will be rejected as a duplicate. The folding can be repeated over and over because the computer date has been changed, so the client will not think the deadline has passed and quit.
Thanks for the tip. I think I'm just going to do exactly that, then.

That solves the problem; all the outstanding issues are resolved because of it. Thanks.

Ok. So here comes the next question:

what OSes do y'all want me to benchmark?

System hardware:
4x AMD Opteron 880 (2.4 GHz, dual-core) (8 cores total)
16 GB DDR 400 ECC Reg.
alpha754293
Posts: 383
Joined: Sun Jan 18, 2009 1:13 am

Re: benchmarking F@H

Post by alpha754293 »

Prepping system for benchmarking: SuSE Linux Enterprise Desktop 10 SP2 x64...
7im
Posts: 10179
Joined: Thu Nov 29, 2007 4:30 pm
Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengence (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicon fan gaskets and silicon mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
Location: Arizona
Contact:

Re: benchmarking F@H

Post by 7im »

P5-133XL wrote:Download a project; make a copy of the folding folder, paying attention to the date; let the project complete and send the results. Then, for all benchmarks on that machine or others, simply change the date on the machine back to the download date, copy the copy, and fold from there.

There is no interference with the project, because the data has been returned and Stanford can continue forward with no delay. If the copy is ever accidentally returned, it will be rejected as a duplicate. The folding can be repeated over and over because the computer date has been changed, so the client will not think the deadline has passed and quit.
Two suggestions to improve upon the process. First, disable the deadlines in the client setup; then it doesn't matter what the system date is, and you no longer need to worry about tracking it, changing it, etc. Second, consider running with the "Prompt before Connect" setting. The client will finish the WU and prompt you to connect. Cancel out so you don't waste time uploading a duplicate, and don't waste the bandwidth on the internet or on the Stanford servers; the WU has to completely upload to Stanford before it gets rejected as a dupe. Just don't upload, and save the bandwidth for the rest of us uploading work that has not yet been done.

Thanks.

(I was going to point you to the previous thread with tips on benchmarking using the same WU repeatedly, but it was in the previous iteration of this forum. Never mind.)
Last edited by 7im on Fri Jan 23, 2009 10:41 pm, edited 1 time in total.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.