ICC vs GCC will make effectively no difference to the performance of Gromacs, since the innermost loops of the calculation are pure assembler (with Fortran loops as a fallback).
If you're not convinced, take a look at the comparative numbers in this mailing list entry: http://www.gromacs.org/pipermail/gmx-de ... 01473.html
Linux SMP v6 compared to Windows SMP client
-
- Site Admin
- Posts: 1288
- Joined: Fri Nov 30, 2007 9:37 am
- Location: Oxfordshire, UK
Re: Linux SMP v6 compared to Windows SMP client
The hand-coded inner loops in Gromacs make excellent use of the SSE(2) instruction sets. That is the area where a good compiler might have made some improvements, though not as many as hand-written assembly. That's quite different from what happens with SMP, which involves inter-CPU communication.
Any process consists of segments that can be parallelized and segments that must be serial.
Example: If there are 1000 atoms and you want to find the closest one to a selected atom, you can do 999 distance calculations in parallel and then you have to sort the results (mostly a serial process). If the distance calculations have to account for Einstein's curvature of space, then some distance calculations are going to take longer to compute than others, and you can't finish the sort until the last distance calculation is finished -- so 998 of the parallel processors have to wait while only #999 is busy, reducing utilization. Use all those distance values at Time=t1. Repeat at Time=t2 . . .
The outer loop steps through Time from some T1 to some T2, then a new WU is generated for T2 to T3, and so on, for Time = 0 to a very large number.
(Of course Gromacs is doing a lot more than calculating distances and curvature of space isn't really an issue, nor is the explicit sort, but this is a fictitious example.)
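The fictitious example above can be sketched in a few lines of Python. This is only an illustration of the parallel/serial split, not anything taken from Gromacs or the FAH client: the names `distance`, `closest_atom` and `amdahl_speedup` are made up for this post, the atoms are random points, and a simple minimum search stands in for the sort. Python threads are used only to show the structure; they will not give a real speedup for CPU-bound work like this. The last function is Amdahl's law, which is my addition and just puts a number on the "some segments must be serial" point.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def distance(a, b):
    """Euclidean distance between two 3-D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_atom(selected, others, workers=4):
    """Return (index, distance) of the atom in `others` nearest to `selected`."""
    # Parallel segment: each distance is independent of all the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        distances = list(pool.map(lambda atom: distance(selected, atom), others))
    # Serial segment: cannot start until the slowest distance call has returned.
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical data: 1000 random "atoms"
    atoms = [(rng.random(), rng.random(), rng.random()) for _ in range(1000)]
    selected, others = atoms[0], atoms[1:]  # 999 distance calculations
    for step in range(3):  # outer loop over time steps t1, t2, ...
        idx, dist = closest_atom(selected, others)
        print(f"step {step}: closest atom {idx} at distance {dist:.4f}")
    print(f"95% parallel on 4 cores: at most {amdahl_speedup(0.95, 4):.2f}x")
```

The point to notice is that `min(...)` cannot run until `pool.map` has delivered every result, so one slow distance calculation leaves all the other workers idle, and that wait is repeated at every time step.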
Posting FAH's log:
How to provide enough info to get helpful support.
-
- Posts: 87
- Joined: Tue Jul 08, 2008 2:27 pm
- Hardware configuration: 1x Q6600 @ 3.2GHz, 4GB DDR3-1333
1x Phenom X4 9950 @ 2.6GHz, 4GB DDR2-1066
3x GeForce 9800GX2
1x GeForce 8800GT
CentOS 5 x86-64, WINE 1.x with CUDA wrappers
Re: Linux SMP v6 compared to Windows SMP client
Fair enough. I generally don't resort to assembler if a compiler can do a reasonable job. GCC didn't, but ICC did, and if it was a choice between hand-optimizing assembler and using ICC, it was a no-brainer. For a start, it means I can still use the exact same code without ugly ifdefs on a different platform (e.g. PPC with XLC). But if innermost loops are already in good vectorized assembly, then I guess there's not much to be gained.
Thanks again for the explanation.
Re: Linux SMP v6 compared to Windows SMP client
shatteredsilicon wrote: if innermost loops are already in good vectorized assembly, then I guess there's not much to be gained.
That code is open-sourced at www.gromacs.org. Feel free to suggest improvements, but Eric is one of the best optimizing compilers that I know.
Re: Linux SMP v6 compared to Windows SMP client
Great, thanks for pointing out the OSS code. I'll go have a dig through that.