Re: Do you need help?

Posted: Thu Apr 09, 2020 11:50 pm
by foldinghomealone2
Thanks for clarifying.

Re: Do you need help?

Posted: Fri Apr 10, 2020 12:27 am
by UofM.MartinK
I think they also mentioned that they do several runs with the same starting configuration, and those will run with slight differences if the RNG is seeded differently each time. So they might, say, run the same WU 5-10 times; comparing the results in the end serves as a validation, but more importantly it gives a statistical variation as a function of the different random numbers (or even the different CPUs it ran on, depending on which math libraries are used - believe it or not, not all CPUs calculate floating point equally!) in the algorithm. In my own circuit simulations, I apply similar techniques to get a sense of the stability of the genetic algorithm used for solving some of my problems. I thought that's the type of "regular" duplication they were referring to, but those are included in the 80k.
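
Something like this toy sketch captures what I mean (just my own illustration, not FAH code; the random-walk "work unit" is a made-up stand-in):

Code:

    import random
    import statistics

    def run_work_unit(seed, steps=1000):
        # Toy stand-in for one WU run: a random walk whose outcome
        # depends on how the RNG is seeded.
        rng = random.Random(seed)
        x = 0.0
        for _ in range(steps):
            x += rng.gauss(0.0, 1.0)
        return x

    # Repeat the "same" starting configuration with different seeds,
    # then look at the spread of the outcomes as a stability check.
    results = [run_work_unit(seed) for seed in range(8)]
    print("mean :", statistics.mean(results))
    print("stdev:", statistics.stdev(results))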

Re: Do you need help?

Posted: Fri Apr 10, 2020 1:04 am
by PantherX
UofM.MartinK wrote:I think they also mentioned that they do several runs with the same starting configuration, and those will run with slight differences if the RNG is seeded differently each time. So they might, say, run the same WU 5-10 times; comparing the results in the end serves as a validation, but more importantly it gives a statistical variation as a function of the different random numbers (or even the different CPUs it ran on, depending on which math libraries are used - believe it or not, not all CPUs calculate floating point equally!) in the algorithm...
Just to clarify, when they say "differences", they mean the following changes (that I know of):
1) The starting point of each atom/molecule will vary randomly
2) There will be different temperatures of the system

Thus, each WU download is unique. Results from one WU will be used to build a chain of WUs which is essentially a trajectory. Here are some other explanations of PRCG:
viewtopic.php?p=315888#p315888
viewtopic.php?p=314895#p314895
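
To picture the chaining, here is a conceptual sketch (my own illustration, not project code; the field and function names are made up):

Code:

    from dataclasses import dataclass

    @dataclass
    class WorkUnit:
        project: int   # P
        run: int       # R
        clone: int     # C
        gen: int       # G
        state: bytes   # starting coordinates/velocities for this generation

    def next_generation(finished: WorkUnit, result_state: bytes) -> WorkUnit:
        # The result of generation G becomes the starting state of
        # generation G+1 in the same Project/Run/Clone - that chain of
        # WUs is the trajectory.
        return WorkUnit(finished.project, finished.run, finished.clone,
                        finished.gen + 1, result_state)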

Re: Do you need help?

Posted: Fri Apr 10, 2020 1:23 am
by alxbelu
UofM.MartinK wrote:Yeah, that was some good background info. The most striking takeaway for me, confirming what until now was only a suspicion of mine: they seem to get back about 80K work units per hour, and "theoretically" assign 120-140K. This is not just because of faulty clients and clients never returning results - that rate is much better. So this is the inefficiency or scaling problem I'm trying to hunt down...
Was anything mentioned about the period over which this discrepancy of returned WUs occurred? (I.e. was it looking over the last weeks, or the last days/hours?)

About two weeks ago I noticed, similarly to others, that WUs that I had successfully completed & submitted to the CS (after failing an upload to the WS) only got base points.

I have not checked my logs since, but in one case (WU 11777 (0,4415,15)), the WU was uploaded, acknowledged and awarded base points by the CS. Two days later it was reassigned, and then it took over a week before my upload was registered in the WU page (at which point the other donor had also successfully completed & submitted the WU).

In another case (WU 11744 (0,597,7)), the WU was again uploaded, acknowledged and awarded base points by the CS. A day after my upload it was reassigned, and since it still, to this day, has not been registered in the WU page, I assume that means it was simply lost.

While I get that stats updates and credits may be delayed, the issue here was obviously that the same WU was needlessly reassigned due to backend delays and/or the backend actually losing track of a WU. If this is still an ongoing issue, it could at least partly explain the discrepancy between assigned and received WUs.

Re: Do you need help?

Posted: Fri Apr 10, 2020 1:54 am
by bruce
Other DC projects (BOINC, in particular) have had trouble with cheating which involved the creation of falsified data. As standard practice, I think they always required each WU to be processed more than once as a validation step. FAH does a certain amount of validation while the WU is being computed (if a WU is unstable, it's good to cancel the remainder of that assignment), but FAH also performs additional validation steps once the WU is uploaded, and a provisional WU may be discarded before it's considered scientifically valid. Routine duplication of WU processing is considered an unnecessary waste of Donor resources.
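
Roughly the kind of check I mean, purely as a hypothetical sketch (not FAH's actual validation code; the names and thresholds are invented):

Code:

    import math

    def looks_scientifically_valid(energies, coordinates):
        # Reject results whose energies are not finite numbers.
        if any(not math.isfinite(e) for e in energies):
            return False
        # An unstable ("exploded") simulation shows up as atoms flying
        # absurdly far from the origin.
        if any(abs(c) > 1e4 for atom in coordinates for c in atom):
            return False
        return True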

Re: Do you need help?

Posted: Fri Apr 10, 2020 2:05 am
by UofM.MartinK
PantherX wrote:
UofM.MartinK wrote:I think they also mentioned that they do several runs with the same starting configuration, and those will run with slight differences if the RNG is seeded differently each time. So they might, say, run the same WU 5-10 times; comparing the results in the end serves as a validation, but more importantly it gives a statistical variation as a function of the different random numbers (or even the different CPUs it ran on, depending on which math libraries are used - believe it or not, not all CPUs calculate floating point equally!) in the algorithm...
Just to clarify, when they say "differences", they mean the following changes (that I know of):
1) The starting point of each atom/molecule will vary randomly
2) There will be different temperatures of the system

Thus, each WU download is unique. Results from one WU will be used to build a chain of WUs which is essentially a trajectory. Here are some other explanations of PRCG:
viewtopic.php?p=315888#p315888
viewtopic.php?p=314895#p314895
These explain the trajectories well, but I was not talking about those.

Some exact copies of start states are used, as Joseph Coffland mentions (starting at 22:00 in the recording of the fireside dev chat): "a certain number of copies are used to get statistical averages".

(edit: removed unverified assumptions from the next paragraphs)
And that could make a lot of sense, as I tried to convey but failed: running the same WU twice might not yield the exact same result if random numbers are involved. If your algorithm is well designed, those results should all be similar to each other - but never the same.

My understanding is that F@H runs simulations "in the time domain", i.e. every WU represents a state the molecule is currently in, the client calculates some more time steps, and then returns the state of the molecule after that time has elapsed. Potential differences will more likely add up over time, i.e. if you run the whole sequence based on the same "start state" WU several times, the later WUs (which carry more history) will differ more than the WUs derived directly from the original start state. That is, if the result of a run is not deterministic.

edit: I think not every simulation in F@H is in the "time domain", but that's how they explained the "trajectories" with the "movie" analogy.

So you are right, every WU is truly unique (unless it is re-assigned due to technical difficulties), perhaps with the exception of the original "start state" WU - which might be assigned several times, resulting in slightly different "renderings" of the same "movie" (i.e., trajectory)?
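
A toy example of what I mean by differences adding up over generations (again just my own illustration, nothing like real MD code):

Code:

    import random

    def advance(state, rng, steps=100):
        # Toy stand-in for one WU: advance the "molecule" (a single
        # number) through some time steps.
        for _ in range(steps):
            state += rng.gauss(0.0, 0.01)
        return state

    # Two "renderings" of the same movie: identical start state,
    # differently seeded RNGs.
    rng_a, rng_b = random.Random(1), random.Random(2)
    state_a = state_b = 0.0
    for gen in range(5):
        state_a = advance(state_a, rng_a)
        state_b = advance(state_b, rng_b)
        # The gap between the two runs tends to grow with each generation,
        # because every generation carries the accumulated history of the
        # previous ones.
        print(f"gen {gen}: difference = {abs(state_a - state_b):.3f}")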

Re: Do you need help?

Posted: Sun Apr 12, 2020 12:37 pm
by knopflerbruce
Would it make any sense to reach out to other universities if the need for more work servers is still increasing? I work at the University of Oslo (in Norway), and I could always ask if they're interested in helping you out here. I don't know if they have the storage you ask for, but who knows?

Also, how good does the hardware have to be? For example, I have an old Areca 1882IX RAID controller - is that kind of stuff relevant for building a good work server? Do you want many smaller SSDs rather than a handful of 15TB ones? CPU/RAM?

(I'm not a storage guy, sorry if the above sounds like I know nothing about IO - this is pretty much the case)

Re: Do you need help?

Posted: Sun Apr 12, 2020 1:35 pm
by UofM.MartinK
knopflerbruce wrote:Would it make any sense reaching out to other universities if the need for more work servers is still increasing?
In the fireside dev chat last Thursday, it sounded like they have enough work server hardware available and also significant collaboration from industry (I guess we will hear some interesting announcements in the coming weeks).

The main bottleneck is scientists feeding enough useful work to Folding@home...
knopflerbruce wrote: I work at the University of Oslo (in Norway),
That being said - if there is a group at your university which already does (or could quickly get into) protein folding from an academic standpoint, i.e. the precursor work of generating useful Work Units (and later making sense of the results), that would be very helpful (my assumption, since I am not associated with the Folding@home project at this point). In that case, it _might_ make sense for your institution to provide a work server, preferably hosted at the university with fast internet access.

My understanding is that a decent enterprise-server-class system with 50+ TB of decent storage, 32+ cores and LOTS of RAM (128GB+, to buffer all the slowly incoming results) is sufficient for a WS connected at 1 Gbit.

I am not sure if the Areca card would be very useful - hardware RAID is kind of out in that league; just-a-bunch-of-disks (which the 1882IX could do) is preferred with ZFS, but in my experience a decent HBA works better. SSD buffering (e.g. Optane) can nicely be integrated into ZFS as a Level 2 ARC to improve performance somewhat - although since most data is write-once, read-once (in contrast to a fileserver), it helps with metadata caching but won't do much for the base throughput.

In the end, it boils down to RAM, for a simple reason that Joseph Coffland nicely explained on Thursday:

A work server connected at 1 Gbit (i.e. ~120MB/s) receives most of its data from clients on asymmetric home internet connections, so the average upload speed of a client is much smaller than its download speed. And if you receive 50 MB results from 1000 clients simultaneously, you want at least 50 GB of RAM to buffer that without hitting any disk (even SSDs would start to be challenged by that type of access pattern).

Then the WS generates a new WU from the just-received ones while they are still in RAM; ideally that new WU also stays cached in RAM so it doesn't have to be read from slow storage when it is handed out, etc. So the more RAM, the better. If there is enough RAM for primary data buffering and the metadata, the slower storage "just" has to maintain about ~100MB/s of more or less serial writes and reads in regular operation, and as much as it can when the data is bulk-analyzed, transferred, etc. (Described in my own terms of what I think is happening, all mainly extrapolated from the fireside dev chat.)
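
The back-of-the-envelope numbers in code form, just restating the reasoning above (the figures are illustrative, not official FAH server requirements):

Code:

    # Illustrative figures only - not official FAH server requirements.
    link_MB_per_s      = 1000 / 8   # 1 Gbit/s is roughly 125 MB/s into the WS
    result_size_MB     = 50         # one uploaded WU result
    concurrent_uploads = 1000       # slow asymmetric clients uploading at once

    ram_needed_GB = result_size_MB * concurrent_uploads / 1000
    drain_time_s  = result_size_MB * concurrent_uploads / link_MB_per_s

    print(f"RAM to buffer all in-flight uploads: ~{ram_needed_GB:.0f} GB")
    print(f"Time to absorb that much data at line rate: ~{drain_time_s / 60:.1f} min")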

Re: Do you need help?

Posted: Mon Apr 13, 2020 12:11 am
by EternalPainSkylar
My server currently has 2 X5650s, 48GB RAM, and 6x4TB hard drives. I could probably do a bit of upgrading without too much financial pain, or should I just go away and leave it to the pros?

Re: Do you need help?

Posted: Mon Apr 13, 2020 12:19 am
by PantherX
Welcome to the F@H Forum EternalPainSkylar,

Thanks for your offer. It's my understanding that they now have plenty of hardware available for the F@H server infrastructure. The next step is to get it all online and functioning so it can assign WUs to donors. That's a mix of development and research tasks that are currently underway.

However, if you want, you can fold on that system as is since folding doesn't require a lot of RAM or storage, just CPU and GPU. In your case, 24 CPUs folding would be a nice contribution :)

Re: Do you need help?

Posted: Mon Apr 13, 2020 12:48 am
by EternalPainSkylar
Thanks PantherX!

Looks like I'll leave the infrastructure to the pros, probably less stressful for both parties. :lol:
Hope it goes well! Y'all are doing some seriously amazing stuff.

I'll probably dedicate a VM with 20 cores or so to F@H. Already have a 3700X and 2080Ti running, but it never hurts to fold more!

Re: Do you need help?

Posted: Mon Apr 13, 2020 12:59 pm
by UofM.MartinK
Especially since most of the SARS-CoV-2 simulations are CPU at this point, I assume adding CPUs to the mix is more important than GPUs right now.

Re: Do you need help?

Posted: Mon Apr 13, 2020 11:22 pm
by JimF
UofM.MartinK wrote:Especially since most of the SARS-CoV-2 simulations are CPU at this point, I assume adding CPUs to the mix is more important than GPUs right now.
That is useful to know. I have had to give up on the GPUs due to lack of work (six cards). But I have fired up an i7-9700 (8 real cores), a Ryzen 3700X (16 virtual cores) and a Ryzen 3950X (32 virtual cores), all under Ubuntu 18.04.4. They are all getting work reliably, and it looks like I will average close to a million PPD, which is not important in itself, but they are putting out the work.