HOWTO: Linux Dashboard to monitor GPU, CPU temps and more
Posted: Wed Jun 24, 2020 5:43 pm
Overview
Dashboarding for Linux Mint 19 and Ubuntu 18.04 with NVIDIA GPUs. Also now tested on Mint 18/Ubuntu 16. The "TIG" stack can be installed on Windows too, but this guide does not cover it.
Using widely adopted open-source tooling, I'll explain how to set up and configure your Linux folding system to record system metrics and graph them in easily understandable dashboards available via your browser.
NOTE: This guide has not been tested on any version of Linux other than those stated.
I'm hoping this will be useful for the wider community. Since lockdown I've had a lot of spare time late in the evenings and have learnt how to set this up and monitor my folding system.
If it looks complex, it really isn't: 20+ commands from the command line and about 15 minutes of work end to end to install the 3 products and configure them, from step 1 to 4. It can take some time to get your dashboard exactly how you want it, but your first simple dashboard takes about 15 minutes.
NOTE: Support for AMD GPUs looks weak in Telegraf. When I set it up on a Windows PC with an AMD GPU I could not see how to enable GPU metric collection. Metrics are captured via "plugins" in Telegraf, so it's possible that will change for AMD if a plugin update is released.
Our folding systems run hot and push the limits of their components, so it makes sense to check that they are running within what we deem acceptable limits. Those limits will vary per person and system, but if you don't know what your system is doing, how can you make a judgement? And who really loves running multiple commands to get info when you can see it all on one dashboard? With a dashboard you can track what's happening now, what happened a few hours ago, or what happened weeks and months ago.
Also, if you install Telegraf on each server, you can configure each one to point to a single InfluxDB and have all your servers monitored on a single screen, or selectable via a drop-down. I don't cover that in this guide.
There are various options, but I'll document the "TIG" stack, which is Telegraf, InfluxDB and Grafana. There are alternatives - the "TICK" stack, for example, is Telegraf, InfluxDB, Chronograf and Kapacitor. In TICK, Chronograf is your dashboarding software and an alternative to Grafana. Kapacitor handles alerting should you breach thresholds, eg: temperature. Alerting is also available in Grafana and is plenty for my needs. Also... we use Grafana at work, so it was an easy decision.
Telegraf : An agent for collecting, processing, aggregating, and writing metrics
InfluxDB : An open-source time series database
Grafana : Dashboarding suite
Flow:
Telegraf collects the stats and stores them as timestamped metrics in InfluxDB. Grafana then plots graphs using InfluxDB as its data source.
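To make "timestamped metrics" a bit more concrete, here is a rough Python sketch of the text format (InfluxDB "line protocol") that a metric takes on its way into InfluxDB. This is only an illustration, not how Telegraf is implemented, and you don't need to run it for this guide. The measurement name "example_gpu", the host name "rig1" and the field names are made up for the example. The "host" tag is what lets several machines share one InfluxDB and still be told apart in Grafana.

```python
def _escape(value: str, chars: str) -> str:
    # Line protocol needs these characters backslash-escaped.
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value) -> str:
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry a trailing "i"
    if isinstance(value, float):
        return repr(value)
    # Anything else is written as a quoted string.
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    name = _escape(measurement, ", ")
    tag_part = "".join(
        f",{_escape(str(k), ',= ')}={_escape(str(v), ',= ')}"
        for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(str(k), ',= ')}={_format_field(v)}"
        for k, v in sorted(fields.items())
    )
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    # Hypothetical reading; the timestamp is an arbitrary fixed value in nanoseconds.
    print(to_line_protocol(
        "example_gpu",
        {"host": "rig1"},
        {"temperature_c": 71.5, "fan_pct": 60},
        1593020580000000000,
    ))
```

Each line is one point in time. Grafana later queries InfluxDB for these points and draws them as a graph, filtering on tags such as "host" when you want one machine at a time.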
Install:
Set up InfluxDB first, as it is required for the Telegraf configuration. Set up Grafana last, once the Telegraf feed to InfluxDB is working.