Highwater Mark
Posted: Tue Mar 25, 2008 1:20 pm
by alancabler
We just keep getting faster and faster.
The project has now passed
1.5 PFLOPS
Code: Select all
OS Type              Current TFLOPS*   Active CPUs   Total CPUs
Windows                         179        188452       1965339
Mac OS X/PowerPC                  7          8762        113529
Mac OS X/Intel                   21          6724         43180
Linux                            43         25559        279954
GPU                              27           465          5261
PLAYSTATION®3                  1225         40577        469305
Total                          1502        270539       2876568
Edit: Fixed table layout -UF
Re: Highwater Mark
Posted: Tue Mar 25, 2008 1:35 pm
by John Naylor
What dya think... First project to 10PFLOPS by 2015?
Re: Highwater Mark
Posted: Tue Mar 25, 2008 2:04 pm
by alancabler
Hi John,
John Naylor wrote:What dya think... First project to 10PFLOPS by 2015?
If the project's speed increases according to Moore's Law, then we should be somewhere around 16 PFLOPS by 2015. Since our production depends on the number of donors, as well as machine speed and time spent folding, there are more variables to consider. OTOH, the Moore's Law timescale has been getting shorter, and there are some extraordinary tech advances in view.
Personal TFLOP contributions will become commonplace, and soon...
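For what it's worth, the extrapolation above can be sketched in a few lines. The function name and the clean two-year doubling period are my assumptions, not from the post; real growth also depends on donor counts, as noted.

```python
# Extrapolate project throughput under a simple Moore's-Law doubling model.
# Assumption (mine, not from the post): a clean 2-year doubling period and
# steady donor participation.

def extrapolate_pflops(start_pflops, start_year, target_year, doubling_years=2.0):
    """Project throughput forward assuming exponential growth."""
    doublings = (target_year - start_year) / doubling_years
    return start_pflops * 2 ** doublings

# From 1.5 PFLOPS in 2008 out to 2015:
projected = extrapolate_pflops(1.5, 2008, 2015)
print(f"{projected:.2f} PFLOPS")  # 16.97 PFLOPS, matching the "around 16" figure
```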
Re: Highwater Mark
Posted: Thu Mar 27, 2008 1:49 pm
by alancabler
In just the two days since this thread was started, donors have added 31 TFLOPS of folding power to the project.
http://fah-web.stanford.edu/cgi-bin/mai ... pe=osstats
That's more computing power than the world's 38th largest supercomputer (27.38 TFLOPS) at Lawrence Livermore National Laboratory.
http://www.top500.org/list/2007/11/100
The Folding@home project is currently running faster than the Top 12 supercomputers in the world combined, plus the aforementioned Lawrence Livermore system (as of Nov. '07). Additionally, the project's power is calculated from sustained, actual performance, and not from theoretical peak performance.
Note: F@h is a distributed computing project and is neither a classically defined supercomputer nor a supercluster. You will find F@h mentioned only rarely, if at all, in the publications dealing with supercomputers or superclusters. Cluster computing is anathema to many supercomputing purists (their reasons have merit). Still, those involved in such pursuits studiously ignore the 800-pound gorilla in the room.
Re: Highwater Mark
Posted: Fri Mar 28, 2008 4:13 pm
by butc8
With the new GPU client on the way and a 3870 at ±1 TFLOPS per core, we will see a 2x3870x2 setup run at ±4 TFLOPS on one computer, probably next month. That's faster than some entire DC projects, on just one computer! 1 yottaflop is just around the corner.
http://en.wikipedia.org/wiki/Peta-
Re: Highwater Mark
Posted: Fri Mar 28, 2008 6:04 pm
by Foxery
GPU 27 465 5261
I'm glad you posted this, so we have a reference saved for after the new GPU client arrives. Today it's:
GPU 27 450 5266
butc8 wrote:With the new GPU client on the way and a 3870 at ±1 TFLOPS per core, we will see a 2x3870x2 setup run at ±4 TFLOPS on one computer, probably next month. That's faster than some entire DC projects, on just one computer! 1 yottaflop is just around the corner.
http://en.wikipedia.org/wiki/Peta-
Actually, the X1900 XT was also advertised as 1 TFLOP, so take it with a grain of salt. This doesn't directly translate into actual calculations relevant to protein folding. 27 TFLOPs / 450 GPUs = roughly 60 GFLOPs average per card. Even after considering that many don't crunch 24/7, and many are 1600-series, the actual work performed is nowhere near 1000 GFLOPs apiece, largely because Folding is more complex than rendering polygons.
A more conservative guess would be to believe that new cards are at least ~2X the speed, and at least ~2X as many people will Fold on them. I'd expect to see the GPU Client stats to be in the realm of 150 GFLOPs at the end of April. Hopefully the truth will be double this much again, as I think I am greatly underestimating. I would not, however, expect the GPU total to outpace PS3s any time soon.
Re: Highwater Mark
Posted: Fri Mar 28, 2008 6:12 pm
by zorzyk
How can we translate the power of the SMP client into FLOPS?
I'd like to know what percentage of the 179 TFLOPS Windows total comes from SMP clients.
Is it possible to estimate?
Re: Highwater Mark
Posted: Fri Mar 28, 2008 6:28 pm
by butc8
AFAIK the SMP client is efficient (it splits up the WU); in PC Wizard I'm getting about 13 GFLOPS on my E8400 @ 3.6 GHz.
Edit: There was a post that described the GPU as a drag racer and the CPU as a minivan; now with SMP they're going to bring in trailers, roof racks, and off-road tires, haha
Re: Highwater Mark
Posted: Fri Mar 28, 2008 7:20 pm
by Foxery
I wonder if Stanford can break up the Client Statistics page into Uniprocessor and SMP figures.
edit:
Sorry Beberg. My "bad math" was worse than I thought.
My technical knowledge is too outdated to be more specific.
When my Friday headache clears, I'll do some reading!
Re: Highwater Mark
Posted: Fri Mar 28, 2008 9:01 pm
by Beberg
Foxery wrote:The Core 2 architecture has a pipeline depth of 14, meaning it takes 14 cycles to finish a complex instruction.
That's not how pipelining works. Please consult Wikipedia...
Re: Highwater Mark
Posted: Sat Mar 29, 2008 12:05 am
by Foxery
Shoot... Edited out my previous garbage entirely.
One of Core 2's many improvements over the Pentium 4 stems from its shorter pipeline, but glancing at a few old articles from its introduction, I hadn't fully absorbed the details.
butc8, I think the figure you are looking at shows either Integer ops, or possibly SSE-optimized instructions.
Now that I'm home and have some peace and quiet, I went out and downloaded LINPACK, an old, classic benchmark. Full description quoted below for the curious. Results for one core of my C2 Duo, running at 3.4 GHz, rated me at 1.85 GFLOPS. This implies that all four cores in a Q6600, similarly overclocked, would put out 7.4 GFLOPS. (A general-purpose metric would be: 0.544 GFLOPS per GHz per core.)
I also have a copy that's optimized for SSE2 instructions, which reports 2.08 GFLOPs/core... but Folding cores benefit far more from SSE2, so I'm not sure what to make of that.
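The per-GHz, per-core arithmetic above can be reproduced like this. The helper function is mine, and the four-core scaling step ignores shared-cache and memory contention.

```python
# Derive the "GFLOPS per GHz per core" figure from a single-core Linpack
# result (1.85 GFLOPS at 3.4 GHz), then scale it to four cores at the
# same clock. The linear-scaling assumption is mine, not a measurement.

def gflops_per_ghz_per_core(measured_gflops, clock_ghz, cores=1):
    return measured_gflops / (clock_ghz * cores)

metric = gflops_per_ghz_per_core(1.85, 3.4)
print(f"{metric:.3f} GFLOPS/GHz/core")   # 0.544

# Hypothetical Q6600 at 3.4 GHz, assuming perfect scaling across 4 cores:
q6600_estimate = metric * 3.4 * 4
print(f"{q6600_estimate:.1f} GFLOPS")    # 7.4
```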
LINPACK
The LINPACK Benchmarks are a measure of a system's floating point computing power. Introduced by Jack Dongarra, they measure how fast a computer solves a dense n by n system of linear equations Ax=b, which is a common task in engineering. It was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Pete Stewart, and was intended for use on supercomputers in the 1970s and early 1980s.
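For anyone who wants to try a Linpack-style run themselves, here is a rough modern sketch using NumPy's LAPACK-backed solver and the standard dense-solve operation count. It approximates what the benchmark measures; it is not the original Fortran code, and the matrix size is my choice.

```python
# A miniature Linpack-style measurement: time a dense solve of Ax = b and
# convert to GFLOPS using the conventional (2/3)n^3 + 2n^2 flop count.

import time
import numpy as np

n = 1000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

start = time.perf_counter()
x = np.linalg.solve(A, b)       # LU factorization + triangular solves
elapsed = time.perf_counter() - start

flops = (2 / 3) * n**3 + 2 * n**2
print(f"{flops / elapsed / 1e9:.2f} GFLOPS")

# Sanity check: the solution actually satisfies Ax = b.
print(np.allclose(A @ x, b))  # True
```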
Re: Highwater Mark
Posted: Sat Mar 29, 2008 2:43 am
by alancabler
Greetings
Foxery,
Foxery wrote:
...I also have a copy that's optimized for SSE2 instructions, which reports 2.08 GFLOPs/core... but Folding cores benefit far more from SSE2, so I'm not sure what to make of that.
Pande Group has achieved 3.8 GFLOPS sustained performance from a 3.0 GHz P4 CPU using SSE intrinsics and Intel's C++ Compiler v9.0 running highly optimized, hand-coded Gromacs code, i.e. folding algorithms.*1
A QX9650 will yield 96 GFLOPS.*2
GPUs: the R580 (X1900) achieves 20-40X a 2.8 GHz P4 and has been shown in the old forum to also run FAHcode at 96+ GFLOPS. Current-generation (RV670) ATI chips have demonstrated close to 0.5 TFLOPS performance.*3
PS3 Cell processors running FAHcode achieved ~83 GFLOPS in the early implementations of the FAH client, but several code improvements have since increased the PS3's performance. Sorry, all of my links to the info detailing the PS3's power went through the old folding-community.org forum, which system failures have rendered inaccessible.
*1 N-Body Simulations on GPUs, p. 5
*2 A Portable Run-Time Interface for Multi-level Memory Hierarchies, p. 8
*3 ibid., p. 101
also see GPGPU
Re: Highwater Mark
Posted: Sat Mar 29, 2008 5:37 am
by bruce
butc8 wrote:With the new GPU client on the way and a 3870 at ±1 TFLOPS per core, we will see a 2x3870x2 setup run at ±4 TFLOPS on one computer, probably next month. That's faster than some entire DC projects, on just one computer! 1 yottaflop is just around the corner.
Foxery wrote:Actually, the X1900 XT was also advertised as 1 TFLOP, so take it with a grain of salt. This doesn't directly translate into actual calculations relevant to protein folding. 27 TFLOPs / 450 GPUs = roughly 60 GFLOPs average per card. Even after considering that many don't crunch 24/7, and many are 1600-series, the actual work performed is nowhere near 1000 GFLOPs apiece, largely because Folding is more complex than rendering polygons.
A more conservative guess would be to believe that new cards are at least ~2X the speed, and at least ~2X as many people will Fold on them. I'd expect to see the GPU Client stats to be in the realm of 150 GFLOPs at the end of April. Hopefully the truth will be double this much again, as I think I am greatly underestimating. I would not, however, expect the GPU total to outpace PS3s any time soon.
You're both making one critical mistake. You're assuming that the number of FLOPS actually means something.
Well, it does mean something, but it's not as meaningful as you're making it.
In the hypothetical 2x3870x2 machine, the limiting factor is going to be the PCI-e connection between main RAM and the GPU's VRAM. It's just not possible to move all of the data needed for protein folding in and out of the GPU fast enough to keep it 100% busy. That means that the useful FLOPS are a lot smaller than the potential number that they like to advertise. Moreover, if you actually figured out how to get four GPUs folding from the same motherboard, they'd have to share some or all of the bandwidth of the PCI-e bus, so you'd get less than four times what one would do by itself.
Two x1950xtx's in the same machine fold a little faster than one, but certainly not double.
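bruce's bandwidth argument is essentially what is now called a roofline model, and it can be sketched in a few lines. The link bandwidth and the flops-per-byte figure below are illustrative assumptions of mine, not measurements of any real card.

```python
# Roofline-style sketch of the PCI-e bottleneck: a GPU's useful throughput
# is capped by the slower of its compute rate and the rate at which data
# crosses the link. All numbers are illustrative assumptions.

def useful_gflops(peak_gflops, link_gb_per_s, flops_per_byte):
    """Achievable throughput = min(compute roof, bandwidth roof)."""
    bandwidth_limited = link_gb_per_s * flops_per_byte  # GFLOP/s sustainable from transfers
    return min(peak_gflops, bandwidth_limited)

# One "1 TFLOPS" card on a full x16 link (~4 GB/s assumed), with the kernel
# performing an assumed ~10 floating-point operations per byte moved:
print(useful_gflops(1000, 4.0, 10))      # 40.0 -- far below the advertised peak

# Four such cards sharing the same link bandwidth, each getting a quarter:
print(useful_gflops(1000, 4.0 / 4, 10))  # 10.0 each -- 4 cards total no more than 1 did
```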
Re: Highwater Mark
Posted: Sat Mar 29, 2008 5:35 pm
by Foxery
Right, FLOPS only means Floating (Point) Operations per Second. The result depends on the type of operations you are testing. Linpack is based on linear algebra; Folding@home involves geometry. Popular benchmarks these days are FutureMark, PCMark, Sandra, etc. I don't know what type of math they run, but the big numbers look exciting, right?
It might be more helpful to show the relative speeds of machines throughout the years, rather than fixating on some magical number to represent "performance." Here are a few common examples, using Linpack as the benchmark of choice:
(source:
http://freespace.virgin.net/roy.longbot ... esults.htm)
Code: Select all
CPU                    MHz   MFLOPS
Pentium III            450       61
Athlon                 500      180
Pentium III           1000      316
Athlon-TBird          1000      372
Pentium 4             1700      382   (Note the poor performance vs. slower Athlons/P3s)
Athlon-Barton         1800      659   (Marketed as a "2500" vs. a Pentium 4)
Opteron-?             2000      753
Pentium 4             3066      840
Athlon 64             2200      838
Core 2 Duo-1 Core     2400     1315
Core 2 Duo-1 Core     3400     1844
Re: Highwater Mark
Posted: Sat Mar 29, 2008 7:36 pm
by bruce
Foxery wrote:the big numbers look exciting, right?
right
It might be more helpful to show the relative speeds of machines throughout the years, rather than fixating on some magical number to represent "performance." Here are a few common examples, using Linpack as the benchmark of choice:
But they're not a reasonable comparison.
Suppose my favorite benchmark is calculating a lot of square roots. On machine A, a square root is calculated with a small subprogram that typically requires about 20 Floating Point OPerations. I want to compare that to machine B, but its hardware happens to have a specialized Floating Point OPeration called SQRT that can perform the entire calculation in a single operation, without the help of a subprogram. Machine B running at 200 MFLOPS is doing 20 times as much work as machine A, which is also rated at 200 MFLOPS.
Now add machine C, which can do 1,000 SQRT operations in the same time that machine B performs one of them (so it is rated at 200,000 MFLOPS or 4,000,000 MFLOPS, depending on which method you used in the first answer), but there's a catch. It can perform those operations only if the data is already in VRAM, and it can only load 200 M numbers per second into VRAM. It can load one value and find its square root, but then it has to sit idle for 999 more operation times before it gets the next number to work on. This machine is no faster than machine B, but it has a much higher MFLOPS rating.
So now we have to ask the question: is my benchmark, which shows that machine C can perform only 1/1000th of its rated MFLOPS as USEFUL operations, representative of FAH code? Probably not, but you get the point: USEFUL operations are all that really matter, not the big numbers.
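The three hypothetical machines can be worked through in numbers. The rates are the ones given in the post; "useful" here means square roots completed per second.

```python
# Working through the machine A/B/C thought experiment numerically.
# All rates are the hypothetical ones from the post above.

MFLOPS = 1e6

# Machine A: sqrt via a ~20-instruction subprogram, rated 200 MFLOPS.
a_sqrts_per_s = 200 * MFLOPS / 20          # 10 million sqrt/s

# Machine B: hardware SQRT, one op each, also rated 200 MFLOPS.
b_sqrts_per_s = 200 * MFLOPS / 1           # 200 million sqrt/s -- 20x machine A

# Machine C: 1000x machine B's compute rate, but data arrives in VRAM
# at only 200 M values/s, so the compute units mostly sit idle.
c_compute_rate = 1000 * b_sqrts_per_s
c_load_rate = 200 * MFLOPS
c_useful = min(c_compute_rate, c_load_rate)  # bandwidth-bound

print(b_sqrts_per_s / a_sqrts_per_s)  # 20.0 -- same rating, 20x the work
print(c_useful == b_sqrts_per_s)      # True -- huge rating, no real gain over B
```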