I will add more 32-core nodes this week, so I will let them run without slots (so the default 1x32 cores) and share with you the error when it happens, so it can help to debug if needed
Dear _r2w_ben, No, we removed the GPUs from our SMP to use them somewhere else (and also because Nvidia dropped Linux driver support for our GPUs; drivers for CUDA can only be found on Windows 10 today, so we also have Windows 10 compute nodes now, alongside our CentOS ones :roll: ). It seems FAH alw...
Dear Portella, Many thanks for this feedback. So I will concentrate on slim nodes, then fat nodes, and consider nodes without interconnect first. I will update the topic subject to solved. :) By the way, if it could help someone, we are using a home-made open-source cluster stack, based on Ansible,...
Dear All, Many thanks for these answers. I will reduce to 16-core slots (a minimal config sketch is below), since even our 32-core nodes face this issue. It seems that FAHClient retries the same WU indefinitely when it fails, randomly leaving nodes stuck in a loop. I will create a cron to detect that and restart FAHClient + erase work d...
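For reference, here is roughly what I have in mind for the 16-core slots on a 32-core node: two explicit CPU slots instead of the default single slot. This is only a sketch of the idea; the slot ids are arbitrary and the path /etc/fahclient/config.xml is simply the usual location for the Linux FAHClient package (user/team/passkey lines omitted here):

    <config>
      <!-- Illustrative example: split a 32-core node into two 16-core CPU slots -->
      <slot id='0' type='CPU'>
        <cpus v='16'/>
      </slot>
      <slot id='1' type='CPU'>
        <cpus v='16'/>
      </slot>
    </config>

After editing, the plan would be to stop FAHClient, clear any stuck work directory, then restart it so the new slots are picked up.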
Hi, We are running F@H on our cluster at the Fabrique du Loch - FabLab, Auray, France, and we have an issue with one of our nodes. While slim nodes (16 cores/32 GB RAM, I think CHARMM is running on these) run OK, our SMP node (64/32 cores/1 TB RAM) faces issues with Gromacs parameters: 14:28:54:WU00:F...