99% CPU usage? Avoid disaster by monitoring the Raspberry Pi

Published by Oliver on

Even the best smart home system does not do you much good if the controller becomes unresponsive. Mine runs on a Raspberry Pi and recently died due to a software issue. To avoid something like that happening again, I updated my monitoring of the Pi itself. Here is a simple way of monitoring the Raspberry Pi with OpenHab.

The Raspberry Pi – not a cake

The Pi might be the best thing that happened to the DIY community in the last couple of years. It is a very small, relatively powerful yet energy-efficient computer. It has been used as anything from a NAS to a media center, but it is also the centerpiece of many smart home and automation systems.

If you are thinking about getting a Raspberry Pi too, consider buying it via my affiliate link(s). It won’t cost you more and it helps pay for my server costs.
Raspberry Pi 3 kit (I am using the 3 right now)
Raspberry Pi 4 (the newest model, without any extras)

Raspberry Pi 4 power supply
SD card

I am running OpenHab via Openhabian on my Pi, but most other smart home controllers will also run on the Pi. I wrote a little bit more about this setup in my Zigbee2MQTT article. Some time ago I started integrating my Roborock S50 smart vacuum robot into my system. The result is great: I can just tell Alexa to start cleaning or link the robot’s cleaning schedule to anything else in my home (like presence detection).

Unfortunately, the binding used to control the robot had a serious bug which led to the creation of an ever-growing number of threads. More threads mean more CPU usage, which means the Pi gets hotter. As it normally runs really cool and sits right in my living room, I decided not to add a fan. All of this means that constantly high CPU usage could shorten the life span of the Pi and will definitely cause the smart home system to stop working.

CPU temperature spikes due to the software bug – restarted between each spike

Fortunately, the bug is fixed by now, so you can simply upgrade to the newest OpenHab version (and therefore binding version) to get rid of it. I only noticed the problem when some part of my smart home system stopped working. To be able to spot and fix such problems faster, I decided to update my monitoring system for the Raspberry Pi itself.

Data collection

The first step is to get as much data about the Raspberry Pi as possible and to store the important parts for debugging purposes. In OpenHab you can collect data about the system it is running on quite easily by using the Systeminfo Binding. It can be installed via PaperUI like any other binding. Once that is done, you need to create a Thing.

systeminfo:computer:raspi [interval_high=10, interval_medium=60]

Here I define a new “Thing” in OpenHab from the systeminfo binding, of the type computer, with my custom name “raspi”. I also added some configuration controlling the interval for high/medium speed updates of the data. Things like the CPU load will be updated every 10 seconds, while others, like the remaining storage capacity, are only updated every 60 seconds.

This new “Thing” provides a lot of data via different channels. These can be linked to “Items” in OpenHab, which we can then use anywhere in the system. My items file looks like this:

/* Raspi infos */

/* Network information*/
Number Network_DataSent           "Data sent"           <flowpipe>       { channel="systeminfo:computer:raspi:network#dataSent" }
Number Network_DataReceived       "Data received"       <returnpipe>     { channel="systeminfo:computer:raspi:network#dataReceived" }

/* CPU information*/
Number CPU_Load1                  "Load (1 min)"        <none> (grSensor) { channel="systeminfo:computer:raspi:cpu#load1" }
Number CPU_Load5                  "Load (5 min)"        <none>           { channel="systeminfo:computer:raspi:cpu#load5" }
Number CPU_Load15                 "Load (15 min)"       <none>           { channel="systeminfo:computer:raspi:cpu#load15" }
Number CPU_Uptime                 "Uptime"              <time>           { channel="systeminfo:computer:raspi:cpu#uptime" }

/* Storage information*/
String Storage_Name               "Name"                <none>           { channel="systeminfo:computer:raspi:storage#name" }
Number Storage_Available          "Available"           <none>           { channel="systeminfo:computer:raspi:storage#available" }
Number Storage_Used               "Used"                <none>           { channel="systeminfo:computer:raspi:storage#used" }
Number Storage_Total              "Total"               <none>           { channel="systeminfo:computer:raspi:storage#total" }
Number Storage_Available_Percent  "Available (%)"       <none>           { channel="systeminfo:computer:raspi:storage#availablePercent" }
Number Storage_Used_Percent       "Used (%)"            <none>           { channel="systeminfo:computer:raspi:storage#usedPercent" }

/* Memory information*/
Number Memory_Available           "Available"           <none> (grSensor)       { channel="systeminfo:computer:raspi:memory#available" }
Number Memory_Used                "Used"                <none>                  { channel="systeminfo:computer:raspi:memory#used" }
Number Memory_Total               "Total"               <none>                  { channel="systeminfo:computer:raspi:memory#total" }
Number Memory_Available_Percent   "Available (%)"       <none> (grSensor)       { channel="systeminfo:computer:raspi:memory#availablePercent" }
Number Memory_Used_Percent        "Used (%)"            <none> (grSensor)       { channel="systeminfo:computer:raspi:memory#usedPercent" }

/* Sensors information*/
Number Sensor_CPUTemp             "CPU Temperature"     <temperature>  (grSensor)  { channel="systeminfo:computer:raspi:sensors#cpuTemp" }
Number Sensor_CPUVoltage          "CPU Voltage"         <energy>         { channel="systeminfo:computer:raspi:sensors#cpuVoltage" }

As you can see, a lot of information about network, CPU, storage, memory and temperature is available. Most of it is self-explanatory; only the CPU usage was not entirely logical to me when I first saw it. Apparently, due to a change in the underlying library, the binding can’t detect the current CPU usage anymore. Instead, it provides load averages over the last 1/5/15 minutes. To be honest, that might even be more useful, since a sustained high load is what I actually want to catch.

All of this data is interesting, but the items I expect to have the most impact on how stable the Pi behaves have been put into the “grSensor” group, which automatically gets saved to my database. I described how that is done in my post about my battery state warning system. This information includes the 1-minute CPU load, memory consumption and CPU temperature. Available storage might also be a concern, but I am not really adding much to my Pi anymore and there is enough storage.
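For readers who have not set up persistence yet, here is a minimal sketch of what storing the group could look like. It is not my actual configuration from the other post: the file name `influxdb.persist`, the `everyMinute` strategy and its cron expression are assumptions for illustration, and the group itself has to be declared in an items file (for example with `Group grSensor`).

```
/* Sketch of a persistence file, e.g. persistence/influxdb.persist (file name is an assumption) */
Strategies {
    /* Hypothetical strategy: fire once per minute (Quartz cron syntax) */
    everyMinute : "0 * * * * ?"
    default = everyChange
}

Items {
    /* The trailing * selects all members of the grSensor group */
    grSensor* : strategy = everyChange, everyMinute
}
```

Storing a value every minute in addition to every change keeps the graphs continuous even when a value does not move for a while.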

Keeping an eye on the data – monitoring the Raspberry Pi

Now that we have this data we can also display it. I added this to my sitemap.

Frame label="Raspberry Pi" {
    Text item=CPU_Load1
    Text item=Sensor_CPUTemp
    Text item=Memory_Available_Percent
    Text item=Memory_Used_Percent
}

This will show me a quick overview with the CPU load, the temperature and the available memory. Nice 🙂

Sitemap showing resource usage of the Raspberry Pi
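If you want the sitemap itself to flag problems, the `valuecolor` option of sitemap elements can color a value depending on its state. The following is only a sketch: all thresholds are assumptions that you should adapt to your own Pi (a load above 4 roughly means that the four cores of a Pi 3 are fully busy).

```
Frame label="Raspberry Pi" {
    /* Conditions are checked from left to right, the first match decides the color */
    /* All thresholds below are assumptions, not recommendations */
    Text item=CPU_Load1           valuecolor=[>4="red", >2="orange", <=2="green"]
    Text item=Sensor_CPUTemp      valuecolor=[>70="red", >60="orange", <=60="green"]
    Text item=Memory_Used_Percent valuecolor=[>90="red", >75="orange", <=75="green"]
}
```

Because the most severe condition comes first in each list, a value of 75 for the temperature is shown in red rather than orange.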

This looks good enough already, but there is a simple way to display even more data in an even nicer way: Grafana. Again, I already described the general setup in my post about the battery level warning system.

I added a graph for the CPU temperature as well as two color-changing “singlestats” for RAM and CPU usage. This allows me to spot any unusual behavior at first glance. More details on how to create such a full dashboard in Grafana can be found in my guide.

Monitoring the Raspberry Pi in Grafana showing RAM, CPU and temperatures
My Grafana dashboard with data about my Raspberry Pi

If you are not regularly watching the dashboard, it would also be trivial to add a rule in OpenHab that sends you a message if one of these values exceeds a certain threshold. Again, a similar rule is already explained in the article about my water leak sensor.
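To give a rough idea, such a rule could look like the sketch below. It is not the rule from the water leak article: the threshold of 70 °C and the log name "raspi" are assumptions, and it only writes a warning to the log. Replace the `logWarn` line with whatever notification action you have set up.

```
rule "Raspberry Pi CPU temperature warning"
when
    Item Sensor_CPUTemp changed
then
    // Only continue if the item holds a number (it is NULL/UNDEF right after startup)
    if (Sensor_CPUTemp.state instanceof Number) {
        val limit = 70.0  // hypothetical threshold in °C, adjust to your setup
        val temp = (Sensor_CPUTemp.state as Number).doubleValue

        if (temp > limit) {
            // Placeholder action: swap in your preferred notification here
            logWarn("raspi", "CPU temperature is high: " + temp + " °C")
        }
    }
end
```

Note that this fires on every change while the temperature stays above the limit, so for real notifications you would probably want to add some kind of rate limiting.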

Categories: Openhab