bruce - great points. On the N vs 2N question, I was thinking about that a bit - it seems to me that the data analysis time is roughly a step function of the data amount, i.e. I can probably go up to 5N, maybe even 10N, and still spend the same amount of time analyzing. The CPU time needed to build the models increases, but those are big jobs anyway and I'd be running them overnight regardless - what really costs time is 'the approach' (e.g. the extensive hyperparameter validation we started doing recently for the multiple ms datasets we have now).
So, like you say, we have to decide how far to take a project - and that's also not easy. I have some heuristics, but you have to build a model to really know. Some projects we then have to run longer; for others we'd already have collected more data than needed for a model of satisfactory quality (though more data always means a better model). This is also where adaptive sampling helps - to do it, you have to have automated analysis in place that decides whether you've collected enough and collects more if need be (roughly along the lines of the sketch below) - that should let us manage the computational effort better across projects.
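To make the "decides if you've collected enough" part concrete, something like the loop below is what I have in mind - just a rough Python sketch, where collect_batch and build_msm are placeholder names for whatever launches new trajectories and builds the model, and the convergence test on the slowest implied timescales is a placeholder criterion, not our actual pipeline:

import numpy as np

def timescales_converged(prev_its, curr_its, rel_tol=0.1):
    """Call the model converged when the slowest implied timescales
    change by less than rel_tol between rounds (placeholder criterion)."""
    prev = np.asarray(prev_its, dtype=float)
    curr = np.asarray(curr_its, dtype=float)
    rel_change = np.abs(curr - prev) / np.maximum(np.abs(prev), 1e-12)
    return bool(np.all(rel_change < rel_tol))

def adaptive_data_collection(collect_batch, build_msm, max_rounds=20):
    """Placeholder driver loop: keep collecting trajectory data until
    the model stops changing or we hit the round limit.

    collect_batch(round_idx) -- stand-in that launches and returns a new batch of trajectories
    build_msm(trajs)         -- stand-in that builds a model and returns its slowest implied timescales
    """
    trajs, prev_its = [], None
    for round_idx in range(max_rounds):
        trajs.extend(collect_batch(round_idx))
        curr_its = build_msm(trajs)
        if prev_its is not None and timescales_converged(prev_its, curr_its):
            print(f"stopping: model converged after {round_idx + 1} rounds")
            return trajs, curr_its
        prev_its = curr_its
    print("stopping: hit round limit without convergence")
    return trajs, prev_its

The point is just that the stop/continue decision is automated per project, so the compute goes where the model is still changing rather than being split evenly.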
I'm not sure the analysis ever has to be distributed, actually - these are CPU-only jobs, they take a few days at most on tens of CPUs on the cluster, and that's with ms datasets that we'll shrink down with adaptive sampling. As for the AI side - we're nowhere near the stage of worrying about how to do it in practice; development is still at the theoretical level (e.g.
https://www.nature.com/articles/s41467-017-02388-1). My hope is that we'll run on a fully automated classical MSM setup for a couple of years, then move to a neural-net version once it's robust. I guess at that stage we can start using GPUs, so perhaps a distributed-analysis opportunity sits there. But again - we already have a very strong cluster.