imagine how much faster F@H would be in FORTRAN...

Post by **alpha754293** » Wed Apr 15, 2009 6:21 am

Imagine how much faster F@H would be in FORTRAN...

http://www.cse.scitech.ac.uk/arc/reports/jaspa.pdf

Post by **MtM** » Wed Apr 15, 2009 7:25 am

Imagine what a delay it would mean to port everything to Fortran. I'm not sure, but what does that PDF use for comparison? The GROMACS core is the pride of one of our Dutch universities; it's not a slouch by any means.

Post by **alpha754293** » Wed Apr 15, 2009 7:32 am

MtM wrote: Imagine what a delay it would mean to port everything to Fortran. I'm not sure, but what does that PDF use for comparison? The GROMACS core is the pride of one of our Dutch universities; it's not a slouch by any means.

Considering that FORTRAN was "designed" for numerical analysis and computation, it would be a bit of a pain to port, but given that the code already runs, it wouldn't be like you're completely stopping it, but rather switching over. *shrug*. It's just an idea.

The PDF mentions the codes that it uses to do the tests/comparison. It's a bit "dated", but it's what I've been able to find so far.

Post by **uncle_fungus** » Wed Apr 15, 2009 7:32 am

The force kernels in GROMACS, which are what constitute most of the calculations, are hand-coded for performance in assembly; porting any of the code to Fortran would make no difference.

Post by **alpha754293** » Wed Apr 15, 2009 7:35 am

uncle_fungus wrote: The force kernels in GROMACS, which are what constitute most of the calculations, are hand-coded for performance in assembly; porting any of the code to Fortran would make no difference.

Hmm...that's interesting. I wonder how they would wrap the parallelization around that, then. I'm guessing that they have some kind of partitioner/grid decomposition to do the parallelization, and that the partitions may or may not have much in the way of interaction with each other.

Post by **d-con** » Wed Apr 15, 2009 11:56 am

Interesting. The article summary on page 1 said exactly what I would expect it to say without testing the hypothesis.

Namely, that there's not much difference when using a good optimizing compiler between C and Fortran90. Java, on the other hand, is not a language one would use for heavy number crunching. It really wasn't invented for that problem.

I'm not sure how you reached your conclusion in the OP. But thanks for the article link.

-David

Post by **Beberg** » Wed Apr 15, 2009 6:11 pm

Another paper that says C and FORTRAN are about the same, Java is kinda slow, and gcc is garbage. There are literally many hundreds of papers that say this.

It's kinda the "hello world" of conference paper writing.

Post by **alpha754293** » Wed Apr 15, 2009 6:25 pm

A lot of the simulation programs that I run/use are written in FORTRAN and they seem to do the computations rather quickly. And my boss has tried to make me learn how to program in it (didn't work, but he tried) and I remember that a lot of the intrinsic functions in FORTRAN needed to be specially "back-coded" in C/C++.

So while yes, you can manually tweak until you get there, imagine if you a) spent that time on porting instead of tweaking, and then b) used whatever time was left over to tweak the ported version.

I don't know. I just imagine the possibilities then.

Post by **7im** » Wed Apr 15, 2009 6:54 pm

Um, isn't Fortran 77 considered "Fortran" anymore? Can it be ported back to itself?

From the GROMACS FAQ (a good read!!!, considering most computations in FAH use Gromacs in one form or another)...

Is my system supported?

GROMACS is a recursive acronym for "GROMACS Runs On Most Of All Computer Systems"
Since we use GNU automatic configuration scripts you should in principle be able to compile GROMACS on any UNIX dialect, probably including Mac OS X. Contact us if you have any problems. At least Solaris, IRIX, Linux (both x86 and alpha), Tru64/Digital Unix, and AIX should be virtually problem-free. An ANSI C compiler is sufficient to compile GROMACS, but we definitely recommend a good Fortran 77 compiler too (performance-critical routines are available in fortran versions). You won't need Fortran on Linux/x86 where we provide even faster assembly loops!

http://www.gromacs.org/component/option ... temid,165/

Sorry Alpha, you're barking up the wrong tree pushing Fortran here. Fah is slower in Fortran.

Wish I could pull up the "Fortran" discussions in the old forum for background.

Post by **bruce** » Wed Apr 22, 2009 2:12 am

There's no doubt that a good FORTRAN optimizing compiler can make efficient code out of standard scientific calculations. When you're talking about large numbers of vector operations, though, it doesn't assemble the operations into SSE operations as effectively as hand-optimized code. That's why the inner loops are written in assembly language, where it's possible to get almost four times as many operations completed as traditional x86 code. I've never looked closely at FORTRAN 90, but my first guess is that it can perform the vector operations in SSE, giving nearly three times as many operations as traditional scalar x86 code. Transforming 3D vectors into 4-way parallel operations takes intelligence, as well as knowing exactly the optimum level of loop unrolling.
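One way to picture the "3D vectors into 4-way parallel operations" step is a transposed data layout: instead of putting one atom's x, y, z into a 4-wide register and leaving the fourth lane idle, four atoms' x values share one register, four y values another, and so on. The Python sketch below only illustrates that layout with plain lists standing in for registers. The names `pack_soa` and `squared_distances` are made up for this example; this is not how the GROMACS assembly loops are actually written.

```python
LANES = 4  # single-precision floats that fit in one SSE register


def pack_soa(coords, lanes=LANES):
    """Transpose [(x, y, z), ...] into groups of per-axis lists.

    Each group holds `lanes` x values, `lanes` y values and `lanes` z values,
    so every simulated register is completely filled. The last group is
    zero-padded if the atom count is not a multiple of `lanes`.
    """
    padding = [(0.0, 0.0, 0.0)] * (-len(coords) % lanes)
    padded = list(coords) + padding
    groups = []
    for start in range(0, len(padded), lanes):
        chunk = padded[start:start + lanes]
        xs, ys, zs = (list(axis) for axis in zip(*chunk))
        groups.append((xs, ys, zs))
    return groups


def squared_distances(ref, group):
    """Squared distance from `ref` to each atom in one packed group.

    Each list comprehension stands in for a single 4-wide instruction.
    """
    rx, ry, rz = ref
    xs, ys, zs = group
    dx = [x - rx for x in xs]
    dy = [y - ry for y in ys]
    dz = [z - rz for z in zs]
    return [a * a + b * b + c * c for a, b, c in zip(dx, dy, dz)]


if __name__ == "__main__":
    atoms = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0),
             (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
    for group in pack_soa(atoms):
        print(squared_distances((0.0, 0.0, 0.0), group))
```

With the one-vector-per-register layout only three of four lanes do useful work, which is one reading of the "nearly three times" versus "almost four times" figures above.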

The Linux assembly language loops are used in the GROMACS FahCore for Windows, too.

Post by **alpha754293** » Wed Apr 22, 2009 5:24 am

bruce wrote: There's no doubt that a good FORTRAN optimizing compiler can make efficient code out of standard scientific calculations. When you're talking about large numbers of vector operations, though, it doesn't assemble the operations into SSE operations as effectively as hand-optimized code. That's why the inner loops are written in assembly language, where it's possible to get almost four times as many operations completed as traditional x86 code. I've never looked closely at FORTRAN 90, but my first guess is that it can perform the vector operations in SSE, giving nearly three times as many operations as traditional scalar x86 code. Transforming 3D vectors into 4-way parallel operations takes intelligence, as well as knowing exactly the optimum level of loop unrolling.

The Linux assembly language loops are used in the GROMACS FahCore for Windows, too.

Uh...it depends.

Because they use MPI, I know that for CFD, for example, usually what will happen is that the mesh will be split into n partitions. Typically n is symmetrical and binary in nature, but there's really nothing that says it has to be.

During the partitioning process also, additional nodes are added at each of the interfaces for continuity and any heat and/or mass transfer.

I'm not 100% sure how the MD simulations work, but from reading the GROMACS user's manual, it does have a grid that it works off of, and therefore I presume that you can partition said grid and then use inter-partition communications to pass data between sections.

And if you look at the console output when it starts the SMP run, it actually will tell you the way that the grid is being decomposed (1x1x4) or, in my SMP8 case, 2x2x2.

A fixed number simplifies coding, but imposes somewhat artificial limits on what you can and can't do with it.
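To make the 1x1x4 and 2x2x2 notation concrete, here is a minimal sketch of how a position inside a rectangular box could be mapped to a cell of such a grid and then to a single task number. It assumes a box with its origin at (0, 0, 0) and equal-sized cells; `cell_of` and `rank_of` are hypothetical helper names, and this is not taken from the GROMACS source.

```python
def cell_of(position, box, grid):
    """Return the (ix, iy, iz) cell that contains `position`.

    `box` is the box edge lengths (Lx, Ly, Lz) and `grid` is the number of
    cells along each axis, e.g. (1, 1, 4) or (2, 2, 2).
    """
    cell = []
    for coord, length, n_cells in zip(position, box, grid):
        index = int(coord / length * n_cells)
        # Clamp so that a point exactly on the upper face stays in the box.
        cell.append(min(max(index, 0), n_cells - 1))
    return tuple(cell)


def rank_of(cell, grid):
    """Flatten a cell index into one task number (z varies fastest)."""
    ix, iy, iz = cell
    _, ny, nz = grid
    return (ix * ny + iy) * nz + iz


if __name__ == "__main__":
    box = (4.0, 4.0, 4.0)
    atom = (3.0, 1.0, 3.0)
    for grid in [(1, 1, 4), (2, 2, 2)]:
        cell = cell_of(atom, box, grid)
        print(grid, cell, rank_of(cell, grid))
```

A 1x1x4 grid slices the box into four slabs along z, while 2x2x2 cuts it in half along every axis; the same atom ends up on a different task depending on which one is used.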

But considering that there's a mix of a1 and a2 WUs, Windows, Linux, OS X, etc. that's one HUGE difference. (Because in the CFD world, it'd be one version of the solver, and the only thing that changes is the input file feeding the solver.)

That way, any solver, on any platform, can run the input file without it being input file specific. The slight caveat to that is that if there's a new solver, the formatting of the input file will likely change somewhat, but if you pass it through a pre-processor that will reformat it for the newest solver, it'll do just fine.

And I also do agree that you can't get faster than assembly. But you can probably optimize your FORTRAN code the same. It won't be 100%, but it'd be close. But the advantage would be that you gain the speed benefits from everything around those assembly loops, which in some CFD codes, can account for 50% of the wall-clock time in a parallel setting.

Post by **bruce** » Wed Apr 22, 2009 6:18 am

I suspect that the box containing the protein is partitioned just as you say. The difference between CFD and MD is that within each partition there is a specific number of atoms and the forces on each atom must be computed from each other atom (or at least any that are "nearby"). With CFD, if the geometry is similar, you can expect the number of elements to be a similar order of magnitude, and each element only has to consider a specific number of nearby points so the computation depends on N, not on N*N. Moreover, the elements that cross the boundary are limited in number so the sync of the four tasks is "easy". Each partition is more or less N/4 as fast because each of the Ns is the same order of magnitude. In MD, the four values of N/4 are not necessarily the same order of magnitude, and when you square each one, the differences become even more pronounced.
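A small worked example of the squaring argument, as a sketch only. It assumes the worst case where every atom in a partition interacts with every other atom in that partition (real MD codes use cutoffs, so the true cost sits somewhere between N and N*N), and the atom counts are invented for illustration.

```python
def pair_work(n_atoms):
    """Number of distinct atom pairs inside one partition (all-pairs case)."""
    return n_atoms * (n_atoms - 1) // 2


def imbalance(counts, cost):
    """Ratio of the busiest partition's cost to the average cost.

    1.0 means perfectly balanced; larger values mean the other tasks sit
    idle while they wait for the slowest one.
    """
    costs = [cost(n) for n in counts]
    return max(costs) / (sum(costs) / len(costs))


if __name__ == "__main__":
    # Hypothetical split of 1000 atoms over four partitions.
    counts = [400, 250, 200, 150]
    print("linear cost imbalance:  ", imbalance(counts, lambda n: n))
    print("all-pairs cost imbalance:", imbalance(counts, pair_work))
```

The same uneven split that costs the busiest task 1.6 times the average under a linear cost model costs it noticeably more than that once the per-partition work grows with the square of the atom count.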

Moreover, since many of the nearby atoms may be in a different partition, the coordinates from each V/4 partition must be interchanged with the other volumes, which demands high data flows between the four tasks.

Add to that the fact that during each step, the atoms can move from one partition to the next, so even if your initial partition is "good", later ones may not be. (I suspect that the volume may not be dynamically repartitioned with core A1, but it has been a long time since I read the GROMACS manual.)

SSE optimizations still allow four FP operations to be processed in parallel within each FPU compared to whatever FORTRAN is able to manage. A quad core can process as many as 16 scalar operations simultaneously.

Post by **alpha754293** » Wed Apr 22, 2009 7:36 pm

bruce wrote: I suspect that the box containing the protein is partitioned just as you say. The difference between CFD and MD is that within each partition there is a specific number of atoms and the forces on each atom must be computed from each other atom (or at least any that are "nearby"). With CFD, if the geometry is similar, you can expect the number of elements to be a similar order of magnitude, and each element only has to consider a specific number of nearby points so the computation depends on N, not on N*N. Moreover, the elements that cross the boundary are limited in number so the sync of the four tasks is "easy". Each partition is more or less N/4 as fast because each of the Ns is the same order of magnitude. In MD, the four values of N/4 are not necessarily the same order of magnitude, and when you square each one, the differences become even more pronounced.

Moreover, since many of the nearby atoms may be in a different partition, the coordinates from each V/4 partition must be interchanged with the other volumes, which demands high data flows between the four tasks.

Add to that the fact that during each step, the atoms can move from one partition to the next, so even if your initial partition is "good", later ones may not be. (I suspect that the volume may not be dynamically repartitioned with core A1, but it has been a long time since I read the GROMACS manual.)

SSE optimizations still allow four FP operations to be processed in parallel within each FPU compared to whatever FORTRAN is able to manage. A quad core can process as many as 16 scalar operations simultaneously.

Well, you are right. Generally, any deviation in the shape and size of elements that are within some local proximity to one another is, at most, either a power or an exponential function, and nothing too dramatic. In fact, for CFD, great care is taken to ensure that sudden jumps DON'T appear in the mesh, because they cause numerical issues in solving the governing equations and lead to convergence issues.

The one type of flow that I can think of where the accuracy would be O(N^2) would be in chemical reacting flows. Elements never cross the boundaries.

You can actually transform the coordinates such that your dx, dy, and dz spacings are all identical; where you are really remapping your grid in order to accomplish that. Then you will have equal spacing, and your problem becomes O(N) rather than O(N^2) as you mentioned; thus "simplifying" your problem.

Using a transformed coordinate system is certainly nothing new or earth-shattering by any stretch of the imagination, but it definitely makes it a heck of a lot easier to deal with computationally, and it is a perfectly acceptable practice.

As far as migration goes, you can potentially add it as either source or sink terms to each partition. Of course, I don't see how migration is in any way a FORTRAN issue rather than an MD issue (regardless of programming language).

In fact, I think that none of the points you've mentioned are related to it being a FORTRAN issue at all. Regardless of language and partitioning method, it still has to be handled/treated in some way, and seeing as how it's already currently implemented with a mix of C/C++/FORTRAN/Assembly, obviously it's already in there and it's already been taken care of. Therefore, porting it to FORTRAN will speed up the remainder of the calculations and everything else that's built around the assembly code. (I think -- I'm no programmer, but that's what I'd suspect would happen.)

Post by **7im** » Wed Apr 22, 2009 11:49 pm

alpha754293 wrote: ...Therefore, porting it to FORTRAN will speed up the remainder of the calculations and everything else that's built around the assembly code. (I think -- I'm no programmer, but that's what I'd suspect would happen.)

No offense intended, but the Gromacs people have been doing this for many years. They've gone to the trouble of hand coding just to tweak more performance out of Gromacs. If they could gain even more performance from Fortran, they would have converted a long time ago. And since they have not converted, and they ARE programmers, I disagree with your thinking about Fortran offering any kind of performance increase.

Post by **alpha754293** » Thu Apr 23, 2009 12:10 am

7im wrote:
alpha754293 wrote: ...Therefore, porting it to FORTRAN will speed up the remainder of the calculations and everything else that's built around the assembly code. (I think -- I'm no programmer, but that's what I'd suspect would happen.)
No offense intended, but the Gromacs people have been doing this for many years. They've gone to the trouble of hand coding just to tweak more performance out of Gromacs. If they could gain even more performance from Fortran, they would have converted a long time ago. And since they have not converted, and they ARE programmers, I disagree with your thinking about Fortran offering any kind of performance increase.

That's not always entirely true.

Usually, when you decide as a group to port something like GROMACS over to Fortran (almost irrespective of version), it is always a HUGE undertaking. There's a lot of time that will need to be dedicated just to debugging the port itself to make sure that a) all modules are still working like they're supposed to, and b) the results coming from the Fortran-ported version are similar, if not identical, to the ones that are generated by the current version.

Then, to make matters worse, you have limited resources as it is, and you also have to make sure that your human resources are just as versatile in FORTRAN as they are in C/C++/assembly. And then you add in the fact that nobody really likes to go backwards and back-port everything, which means that they, as a group, would have made the decision that (for example) GROMACS 5.0 will be entirely in FORTRAN (or assembly core, FORTRAN everywhere else).

Even for my work stuff, the native code is FORTRAN77 and we've talked numerous times about porting it over to f90/f95 and we still haven't done it.

I'll admit that I've never looked at the GROMACS code in detail to figure out what it is that it does exactly. But suffice it to say that even if the cores weren't hand-coded for performance, you gotta remember that FORTRAN as a language in and of itself was designed for computations. It is one of the big reasons why it is still very widely used today, much more than people would probably think/presume (especially since a lot of it stays out of the limelight). While doing it in assembly works, and is capable of exceptional performance, I wonder how much time is spent manually optimizing it (where you literally get so caught up in the details that you miss the big picture entirely while doing the optimizations). Who's to say that FORTRAN can't be close to, as quick as, or quicker than the manually optimized routines?

And like I said before too, a lot of the intrinsic routines in FORTRAN aren't intrinsic in C/C++/assembly.

From the PDF link above, we can already see that FORTRAN is faster than C. Therefore, short of actually porting it myself (which obviously I can't do), what other evidence do you need?

Folding Forum