Options for running one job across multiple clustered Pi's?

Moderators: Site Moderators, FAHC Science Team

tvmwd6
Posts: 5
Joined: Fri Sep 03, 2021 3:36 pm

Options for running one job across multiple clustered Pi's?

Post by tvmwd6 »

I'm working on a pi cluster project and trying to figure out the most effective way to use it to run f@h.

I have completed and submitted a WU from a single Pi 3B+ in around 23 hrs. Now I'd like to try distributing a WU across multiple Pi's to reduce the time to completion. I haven't found a conclusive answer on whether f@h is MPI-aware; if it is, that would be simple enough. Otherwise, what options are available to me? Has anyone done this with non-ARM f@h?

Any insight is appreciated.
MeeLee
Posts: 1339
Joined: Tue Feb 19, 2019 10:16 pm

Re: Options for running one job across multiple clustered Pi

Post by MeeLee »

Fah doesn't work like this.
If you spread a WU across multiple units, data from one thread would have to move through an immensely slow network connection to another unit.
Currently this data is shared among CPU cores through super-fast CPU cache and fast RAM.
You'd be slowing down the WU by a factor of 100 or more.
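
As a rough way to reason about that trade-off, here is a minimal sketch of a per-step cost model. It is not how FAH measures anything: the function names, the model itself (compute time divided by workers, plus a fixed cost per data exchange), and every number in it are assumptions chosen purely for illustration.

```python
# Toy cost model, NOT a measurement: all figures below are assumed.

def step_time(compute_s, workers, exchanges, latency_s):
    """Estimated time for one simulation step split across `workers`.

    compute_s : time a single worker needs for the whole step
    exchanges : data exchanges between workers needed per step
    latency_s : cost of one exchange (cache/RAM or network)
    """
    if workers == 1:
        return compute_s  # a lone worker has nobody to exchange with
    return compute_s / workers + exchanges * latency_s


def break_even_latency(compute_s, workers, exchanges):
    """Exchange cost above which splitting is slower than one worker."""
    return compute_s * (1 - 1 / workers) / exchanges


COMPUTE_S = 0.004  # assumed: 4 ms of arithmetic per step
WORKERS = 4        # assumed: four cores or four Pis
EXCHANGES = 10     # assumed: exchanges per step

limit = break_even_latency(COMPUTE_S, WORKERS, EXCHANGES)
print(f"splitting only helps if one exchange costs less than {limit * 1e6:.0f} microseconds")
```

Plug in your own measured figures for the link between the workers. Splitting only pays off while the cost of one exchange stays below the printed limit.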

The best option is to just load more WUs.
That way each WU will still take 23 hours, but you'll have multiple units each finishing one in that time frame.
The best way to do that is to run an OS on each unit and install FAH on each one with your username and password.
And let them rip!
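
The arithmetic behind "just load more WUs" can be sketched as follows. The 23-hour figure is the one reported earlier in this thread for a Pi 3B+. The cluster sizes, the function name, and the assumption that every WU takes the same time on every node are illustrative only.

```python
def cluster_throughput(nodes, hours_per_wu):
    """WUs finished per day when every node folds its own WU independently."""
    return nodes * 24.0 / hours_per_wu


HOURS_PER_WU = 23  # from the first post: one WU on a single Pi 3B+

for nodes in (1, 4, 10):  # arbitrary example cluster sizes
    per_day = cluster_throughput(nodes, HOURS_PER_WU)
    print(f"{nodes:2d} Pi(s): {per_day:.2f} WU/day, each WU still takes {HOURS_PER_WU} h")
```

Throughput grows with the number of nodes, while the time to finish any single WU stays the same.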

I use a cluster in that manner myself (2 towers of 20 units). Just x86 boards, not ARM boards, although I do have a 20-unit ARM cluster as well.
ETA_2025
Posts: 73
Joined: Mon Jan 30, 2023 10:43 am
Hardware configuration: NVIDIA RTX 4070
10 x Raspberry Pi 5 Model B 2GB RAM
10 x Raspberry Pi 4 Model B 2GB RAM
Location: VIC, Australia

Re: Options for running one job across multiple clustered Pi's?

Post by ETA_2025 »

It's more than a year since it became possible to do F@H on ARM hardware, and yet there's no ability to process a WU on a cluster of ARM hardware, to enable more projects to be completed on ARM hardware.

Is there a fundamental technical limitation to processing one WU in parallel across multiple Pi's, such as a WU having to be processed serially so that it can't be broken into smaller chunks? Or is it a case of new software needing to be created to allow it?
JimboPalmer
Posts: 2522
Joined: Mon Feb 16, 2009 4:12 am
Location: Greenwood MS USA

Re: Options for running one job across multiple clustered Pi's?

Post by JimboPalmer »

No, just the dreadfully slow data transfer.
Tsar of all the Rushers
I tried to remain childlike, all I achieved was childish.
A friend to those who want no friends
Joe_H
Site Admin
Posts: 7922
Joined: Tue Apr 21, 2009 4:41 pm
Hardware configuration: Mac Pro 2.8 quad 12 GB smp4
MacBook Pro 2.9 i7 8 GB smp2
Location: W. MA

Re: Options for running one job across multiple clustered Pi's?

Post by Joe_H »

JimboPalmer is correct. No new software would be needed to split up processing over multiple networked systems, but the rate of data transfer over that network would be much too slow.

For each step through the processing of a WU, the forces between all the atoms need to be calculated. With a single thread this is done in one pass through the system. As threads are added, the system is decomposed into separate slices, and each thread calculates the forces between the atoms in its own slice. Then the forces between the atoms in each slice and those in adjacent slices need to be calculated. That is where a large amount of data transfer occurs, in determining those inter-slice forces. It is also part of the reason behind the adoption of the QRB years ago. Those additional inter-slice computations add overhead. Thus a single thread will take a certain amount of time to complete operations, two threads will take a bit more than half that time, and so on.

Early on, some found they could make more points doing two or more separate WUs than a single WU over multiple processors. But that meant all those single-thread WUs took longer overall and delayed creation and processing of the next generation WU for that project's Run and Clone.
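
As a rough illustration of the slicing described above, here is a toy one-dimensional sketch. It is not the code any FAH core uses: the force formula, the constants, and the function names are all made up for the example. To keep it short, each slice fetches its whole neighbouring slices rather than only the atoms near the boundary. The sketch checks that the sliced calculation gives the same forces as the single pass, and counts how many atom positions have to be fetched from adjacent slices.

```python
import random

CUTOFF = 1.0   # assumed interaction range; farther pairs are ignored
BOX = 8.0      # assumed box length
N_ATOMS = 200  # assumed atom count


def pair_force(xi, xj):
    """Toy 1-D force on an atom at xi from an atom at xj (not a real force field)."""
    d = xj - xi
    if d == 0.0 or abs(d) > CUTOFF:
        return 0.0
    return (CUTOFF - abs(d)) * (1.0 if d > 0 else -1.0)


def forces_single_pass(xs):
    """One thread: every atom is checked against every other atom."""
    return [sum(pair_force(xi, xj) for xj in xs) for xi in xs]


def forces_sliced(xs, n_slices):
    """Split the box into slices; each slice also needs its neighbours' atoms."""
    width = BOX / n_slices
    assert width >= CUTOFF, "slices must be at least one cutoff wide"
    slices = [[] for _ in range(n_slices)]
    for i, x in enumerate(xs):
        slices[min(int(x / width), n_slices - 1)].append(i)

    forces = [0.0] * len(xs)
    imported = 0  # atom positions fetched from adjacent slices
    for s, own in enumerate(slices):
        halo = []
        for t in (s - 1, s + 1):
            if 0 <= t < n_slices:
                halo.extend(slices[t])
        imported += len(halo)
        for i in own:
            forces[i] = sum(pair_force(xs[i], xs[j]) for j in own + halo)
    return forces, imported


rng = random.Random(0)
positions = [rng.uniform(0.0, BOX) for _ in range(N_ATOMS)]
reference = forces_single_pass(positions)
for n in (1, 2, 4, 8):
    sliced, imported = forces_sliced(positions, n)
    same = all(abs(a - b) < 1e-9 for a, b in zip(reference, sliced))
    print(f"{n} slice(s): matches single pass = {same}, positions imported = {imported}")
```

The imported positions are the data that has to cross between threads on every step. On one machine that traffic goes through cache and RAM. Across separate Pis it would have to go over the network.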

iMac 2.8 i7 12 GB smp8, Mac Pro 2.8 quad 12 GB smp6
MacBook Pro 2.9 i7 8 GB smp3