CPU Architecture and FAH
Moderator: Site Moderators
Forum rules
Please read the forum rules before posting.
-
- Posts: 10179
- Joined: Thu Nov 29, 2007 4:30 pm
- Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicone fan gaskets and silicone mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
- Location: Arizona
- Contact:
Re: CPU Architecture and FAH
There isn't enough need for double-precision (SSE2) work to warrant a multicore client for it. The SMP client is SSE-only, as I recall.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
-
- Posts: 87
- Joined: Tue Jul 08, 2008 2:27 pm
- Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers
Re: CPU Architecture and FAH
The two are unrelated. It doesn't matter whether double- or single-precision numbers are used; that part is WU/core specific. There is no efficiency question here: doubles are used only where they are needed. It is not about efficiency, it's about necessity.
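To illustrate the necessity point with a toy example (this is just C I made up, not anything from a FAH core): above 2^24, single precision can no longer represent every integer, so a quantity that needs that range or precision simply has to be a double.

Code: Select all

#include <stdio.h>

int main(void)
{
    /* 2^24 is the limit of float's 24-bit significand */
    float  f = 16777216.0f;
    double d = 16777216.0;

    f += 1.0f;  /* rounds back to 16777216.0f; the increment is lost  */
    d += 1.0;   /* a double has a 53-bit significand, so this is exact */

    printf("float:  %.1f\n", f);  /* 16777216.0 */
    printf("double: %.1f\n", d);  /* 16777217.0 */
    return 0;
}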
The other point, about utilization, is a debatable one. The problem is that the thread scheduler in the current implementation is very naive, and because task switching always comes with a penalty (when there are more tasks than cores), the throughput of a whole unit ends up being bound by the performance of the slowest thread. These overheads, made worse by the unshared caches on the core pairs of a quad Core2, are what leads to the inefficiency. This is why running 4 SMP clients, each affinity-bound to one core, ends up yielding more PPD than running a single SMP client: bandwidth increases because the caches suffer fewer misses (the processes don't migrate between the core pairs), and you overbook the CPU enough that the time that would normally sit idle waiting on the slowest thread gets utilized. Running multiple clients each bound to one core will yield higher throughput, but it will also have almost proportionally worse latency, and that is a problem: when a particularly interesting protein folding case needs to be analyzed, it is more important to get each time frame (WU) back as quickly as possible so that it can be analyzed and researched further, which guides the creation of new WUs.
The switching overheads and cache-sharing issues are significantly reduced on CPUs that don't consist of multiple physical dies, as I said in an earlier post. Core2 Duo, Nehalem and Phenom CPUs don't get hammered as badly by the overheads because, although the WU is still bound by the speed of the slowest thread, switching between threads is faster.
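For anyone who wants to experiment with the affinity binding, this is roughly what taskset -c 0 does on Linux. A minimal sketch of the pinning itself, not anything from the actual client; core number 0 is just an example:

Code: Select all

/* Pin the calling process to core 0 (Linux-specific). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);   /* start from an empty CPU mask */
    CPU_SET(0, &set); /* allow only core 0            */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* From here on the kernel won't migrate this process between
     * core pairs, so its working set stays in one cache.          */
    printf("pinned to core 0\n");
    return 0;
}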
-
- Posts: 1024
- Joined: Sun Dec 02, 2007 12:43 pm
Re: CPU Architecture and FAH
Maybe I can explain it in a different way than either Bruce or ShatteredSilicon.
Suppose two donors have different hardware and one gets a very good PPD/Watt and the other gets a very poor PPD/Watt. The Pande Group is going to assign a project to both of them indiscriminately.
You see, the Pande Group has a different perspective than an individual user. They use donated resources, and they have no way to know what PPD/Watt is being achieved. They're only concerned with the turn-around time for each assignment. You, on the other hand, measure your performance primarily by the points you earn. Optimizing toward two different goals sometimes leads to different measures of what is "best".
If your machine has SSE2, then it also has the single-precision instructions (SSE). If your machine is multi-cored, it is capable of running SMP. If you can do both, then the Pande Group really doesn't care which feature you use, as long as somebody else can run the "other" WU that you might have run instead.
From your perspective, however, there is probably a "best" that might include SSE2 or might include SMP, but since it won't be doing both at the same time, you'll have to choose which works best for you. Remember, though, that you cannot CHOOSE projects with SSE2; they just happen, and not on any pre-announced frequency. You CAN choose SMP, and if you do, that's the only thing you'll get.
-
- Posts: 87
- Joined: Tue Jul 08, 2008 2:27 pm
- Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers
Re: CPU Architecture and FAH
Another thing worth pointing out is that SSE2-capable units will still run on SSE1-only hardware; they'll just run slower. The core auto-detects what can be used and uses whatever is available. Any floating-point features that are missing from the hardware get executed using 387 FPU calls instead. In fact, you don't even need SSE: you can run the classic non-SMP client on a Pentium Pro, but it'll run _really_ slowly.
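If you're curious, the detection is the sort of thing you can reproduce yourself with the CPUID instruction. A rough sketch of my own (not the core's actual code) using GCC's <cpuid.h>; the bit positions come from CPUID leaf 1's EDX register:

Code: Select all

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 1 unavailable: 387-only path\n");
        return 0;
    }

    if (edx & (1u << 26))        /* EDX bit 26 = SSE2 */
        printf("SSE2 available\n");
    else if (edx & (1u << 25))   /* EDX bit 25 = SSE  */
        printf("SSE available\n");
    else
        printf("no SSE: fall back to 387 FPU instructions\n");

    return 0;
}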
-
- Posts: 10179
- Joined: Thu Nov 29, 2007 4:30 pm
- Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicone fan gaskets and silicone mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
- Location: Arizona
- Contact:
Re: CPU Architecture and FAH
One should also note that an SSE2 work unit will fold at the same Points Per Day as an SSE work unit on CPUs that only have SSE.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Re: CPU Architecture and FAH
I think this is the most specifically helpful answer; although Bruce and shattered have done a good job of addressing the specifics, I think you see what I was getting at. If I'm understanding you correctly, the Pande Group doesn't have any plans to further diversify clients to take advantage of this, because the need is not present. The only metric they are using for the project's efficiency is the turnaround time of the work returned to their servers.
codysluder wrote:Maybe I can explain it in a different way than either Bruce or ShatteredSilicon.
Suppose two donors have different hardware and one gets a very good PPD/Watt and the other gets a very poor PPD/Watt. The Pande Group is going to assign a project to both of them indiscriminately.
You see, the Pande Group has a different perspective than an individual user. They use donated resources, and they have no way to know what PPD/Watt is being achieved. They're only concerned with the turn-around time for each assignment. You, on the other hand, measure your performance primarily by the points you earn. Optimizing toward two different goals sometimes leads to different measures of what is "best".
I was mistakenly under the impression that they would approach the donors' machines as a large cluster, and actively approach issues of efficiency and overhead reduction by releasing cores and work that would take the best advantage of the hardware currently being used in the cluster. Instead, it sounds more like a magic box where work goes in and results come out, and they don't look inside the box. I'm not looking for Stanford to micro-manage, or trying to give them more work to do. But my idea was that since there are so many dual cores, something could be released to take particular advantage of them in addition to the existing work. I'm certain that it would shift things around (current SMP units not being done on those duals), but it might provide some reduced overhead (dual-core work could be used to 'check' other projects), allowing Stanford to work with more projects and potentially wait for fewer runs to get the same result.
The reason I brought this up in the first place is that I am considering a new standalone folding system, but there isn't a 'best' choice in terms of the cost effectiveness of the parts I would be buying vs. the overall impact they will have on the project over their lifetime. A PS3 is probably the best investment in PPD/$, but if everyone just goes to PS3s then the project suffers in the long run. Diversifying the hardware the donors use is good to an extent, but it would also be ideal, as a donor, to make sure that I am providing lasting worth to the project with my purchases. I am not looking for any 'official hardware list'; this gives me some insight into underlying goals that may otherwise not be very clear.
helping find cures since Dec 2004
Folding Wolves (186785)
-
- Posts: 823
- Joined: Tue Mar 25, 2008 12:45 am
- Hardware configuration: Core i7 3770K @3.5 GHz (not folding), 8 GB DDR3 @2133 MHz, 2xGTX 780 @1215 MHz, Windows 7 Pro 64-bit running 7.3.6 w/ 1xSMP, 2xGPU
4P E5-4650 @3.1 GHz, 64 GB DDR3 @1333MHz, Ubuntu Desktop 13.10 64-bit
Re: CPU Architecture and FAH
A PS3 would not actually be the best PPD/$. You could easily build a computer for $400 that would greatly outperform the 900 PPD of the PS3 (a dual-core Pentium w/ a 96-shader 9600 GSO would produce a few thousand PPD, for instance). Of course the PS3 is useful, and I am in no way discounting its folding capabilities.
-
- Posts: 87
- Joined: Tue Jul 08, 2008 2:27 pm
- Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers
Re: CPU Architecture and FAH
I rather suspect the best PPD/$ and PPD/W figures would likely go to one of these:
http://www.supermicro.com/products/moth ... .cfm?typ=H (£115+VAT in UK)
coupled with something like a good old 9800GX2. This mobo has a PCIe x8 slot, so it would probably work reasonably well.
Re: CPU Architecture and FAH
That was just one example. The point is that if we all start using the same thing to the exclusion of other clients, then the project suffers in the long term. So if all users go after the best PPD/whatever, we may be doing more harm than good.
And when you consider the $ you pay for power, PS3s are pretty close to the top in overall production per investment (purchase + wattage).
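If you want to run your own numbers, here's a toy calculation of production per investment. Every figure in it (prices, wattages, PPD, $0.10/kWh, a two-year lifetime) is a made-up assumption, so substitute your own:

Code: Select all

#include <stdio.h>

/* lifetime points divided by (purchase price + electricity cost) */
static double points_per_dollar(double ppd, double price_usd,
                                double watts, double days,
                                double usd_per_kwh)
{
    double kwh  = watts * 24.0 * days / 1000.0;   /* lifetime energy  */
    double cost = price_usd + kwh * usd_per_kwh;  /* purchase + power */
    return ppd * days / cost;
}

int main(void)
{
    double days = 2.0 * 365.0;  /* assume a two-year useful life */

    /* illustrative guesses, not measurements */
    printf("PS3:    %.0f points/$\n",
           points_per_dollar(900.0, 400.0, 200.0, days, 0.10));
    printf("PC+GPU: %.0f points/$\n",
           points_per_dollar(3000.0, 400.0, 250.0, days, 0.10));
    return 0;
}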
helping find cures since Dec 2004
Folding Wolves (186785)
-
- Posts: 10179
- Joined: Thu Nov 29, 2007 4:30 pm
- Hardware configuration: Intel i7-4770K @ 4.5 GHz, 16 GB DDR3-2133 Corsair Vengeance (black/red), EVGA GTX 760 @ 1200 MHz, on an Asus Maximus VI Hero MB (black/red), in a blacked out Antec P280 Tower, with a Xigmatek Night Hawk (black) HSF, Seasonic 760w Platinum (black case, sleeves, wires), 4 SilenX 120mm Case fans with silicone fan gaskets and silicone mounts (all black), a 512GB Samsung SSD (black), and a 2TB Black Western Digital HD (silver/black).
- Location: Arizona
- Contact:
Re: CPU Architecture and FAH
I wouldn't worry about PPD causing any work to be excluded, cheechi. Not everyone buys their home or work computer based solely on getting the best PPD. Sure, a lot of enthusiasts on this forum do, but many people buy a PC for many other reasons, and then just happen to also fold. Also consider that in this economy, few people are upgrading, so they keep folding on older hardware. There are still more than 100,000 CPU clients folding; they don't get the best PPD but still keep going.
How to provide enough information to get helpful support
Tell me and I forget. Teach me and I remember. Involve me and I learn.
Re: CPU Architecture and FAH
Sounds like a plan. Thanks.
helping find cures since Dec 2004
Folding Wolves (186785)
Folding Wolves (186785)
Re: CPU Architecture and FAH
The donors' machines are a large cluster, and they DO actively approach issues of efficiency just as you suggest . . . but only where it makes sense. For example, SMP, GPU and CPU assignments are managed separately, and work is assigned to take advantage of those features. Not everything is worth managing, though. For example, suppose the CPU client could detect cache size and report that back to the server. The server could then assign different WUs based on that, but the system-wide throughput improvement, weighed against the amount of server logic required to manage it, means it makes no sense to do.
cheechi wrote:I was mistakenly under the impression that they would approach the donors' machines as a large cluster, and actively approach issues of efficiency and overhead reduction by releasing cores and work that would take the best advantage of the hardware currently being used in the cluster. Instead, it sounds more like a magic box where work goes in and results come out, and they don't look inside the box. I'm not looking for Stanford to micro-manage, or trying to give them more work to do. But my idea was that since there are so many dual cores, something could be released to take particular advantage of them in addition to the existing work. I'm certain that it would shift things around (current SMP units not being done on those duals), but it might provide some reduced overhead (dual-core work could be used to 'check' other projects), allowing Stanford to work with more projects and potentially wait for fewer runs to get the same result.
The client does report how many cores you have and how much RAM you have. That information is sometimes used to customize assignments, but putting resources into managing assignments at this level is less productive than putting that same effort into making the GPU and/or SMP cores more reliable.
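Purely as a thought experiment, the kind of server-side rule being described could be as simple as the sketch below. None of this is real assignment-server code; the thresholds and WU classes are invented for illustration:

Code: Select all

#include <stdio.h>

/* what the client reports, per the post above */
typedef struct {
    int cores;
    int ram_mb;
} client_report;

/* hypothetical assignment rule */
static const char *assign_wu(client_report c)
{
    if (c.cores >= 2 && c.ram_mb >= 1024)
        return "SMP work unit";
    return "classic uniprocessor work unit";
}

int main(void)
{
    client_report quad = { 4, 4096 };
    client_report old  = { 1, 256 };

    printf("%s\n", assign_wu(quad)); /* SMP work unit                  */
    printf("%s\n", assign_wu(old));  /* classic uniprocessor work unit */
    return 0;
}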
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 9
- Joined: Thu Jul 09, 2009 5:52 pm
- Hardware configuration: i7 920 on EVGA x58 SLI Vanilla @ 3.66 GHz Watercooled
6GB DDR3
3x GTX 280 GPU
4 notfreds smp clients, 3 GPU clients
- Location: Bozeman, MT
Re: CPU Architecture and FAH
I am awed by the depth of understanding present in this forum. I stand to learn a lot! I suppose I would like, for a moment, to dumb this conversation down and ask the point-blank question:
Other than in terms of pure PPD, am I gaining anything (besides a heftier power bill) by using an Intel VT/HT-capable chip to SMP fold on (notfreds clients, since they're easy to set up and, um, I'm not the most computer-literate guy in the world), or would I be better off running some sort of simple CPU client, leaving GPU folding aside? Where, other than raw consumption of wattage and its subsequent costs (ok, and in terms of CPU cost, although that's really moot for me since I'm not buying an Extreme chip), would I see a gain? I'm considering a $50 single core without VT/HT vs. a $220 quad with VT/HT.
My purpose is to build a folding rig (just a single one, for now) with PPD as one of my goals (hence GPU folding, but as I said, that's another discussion). As new as I am to the entire folding scene, I'm not sure I know how to configure my notfreds SMP clients to do the most good (this is, after all, philanthropy on my part), but given a baseline configuration, do I gain anything besides PPD from running a chip that costs more than 4x as much?
-
- Posts: 87
- Joined: Tue Jul 08, 2008 2:27 pm
- Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers
Re: CPU Architecture and FAH
You'll need to define your goals more precisely. Are you looking for maximum PPD/$ in terms of initial investment? Or PPD/W (ongoing electricity costs)? There are threads elsewhere on this forum for PPD/$ and PPD/W figures for most commonly available hardware.
(Somebody also threatened to start PPD/m^3 lists, but thankfully, it didn't happen.)
-
- Posts: 9
- Joined: Thu Jul 09, 2009 5:52 pm
- Hardware configuration: i7 920 on EVGA x58 SLI Vanilla @ 3.66 GHz Watercooled
6GB DDR3
3x GTX 280 GPU
4 notfreds smp clients, 3 GPU clients
- Location: Bozeman, MT
Re: CPU Architecture and FAH
I'm fairly confident in my understanding of the PPD/$ ratios at these various levels of technology; my question really revolves around a lack of understanding of efficiency/architecture. What I'm gathering is that a quad core is more capable than a dual core (2x cores = 2x work), although, as you mentioned, the Core2 architecture outstrips anything previous, and i7 (assumedly) exceeds Core2. I guess I'd amend my question to read: 'Is there any benefit to be gained by running a Core2 Duo vs. a Core2 Quad?'
Any meaningful PPD I achieve will be through GPU clients once I'm set up. In terms of which architecture is most efficient (not necessarily the highest PPD, although the two may coincide), how do the available technologies rank? From the perspective of the Pande Group, for example, who don't have to worry about my electricity bill: what would they prefer, so to speak, that I use? (The duh answer being that anything is better than nothing, but beyond that...)