Fixing Docker stats and cAdvisor CPU usage (on a Raspberry Pi)

Published by Oliver on

Prometheus works great for collecting server stats on a Raspberry Pi, but docker stats was missing some information on the Pi and cAdvisor used more CPU than the containers it was monitoring. Here is how I fixed those issues.

Cleaning up & fixing the downloaded dashboards

When I set up my monitoring via Prometheus in my last post, it mostly worked well but still had some issues. The first one was that a bunch of metrics were missing in the dashboards I downloaded, like CPU and network info in the image below.

Missing info for CPU and some other stats

For the most part this was not caused by actual missing data but by the PromQL (Prometheus Query Language) statements used to get the data. I do not know if these dashboards were built for Kubernetes or some similar tool instead of pure Docker(-compose), but they mostly contained a filter for instance="$node", which does not seem to be set by default.

Simply removing that node-related filter from the query fixed those graphs for me. For example, some_metric_collected{instance="$node",job="$job"} becomes some_metric_collected{job="$job"}. This can be done by clicking the header of each diagram and selecting the edit option.
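
If a dashboard has many panels, editing each query by hand gets tedious. As a rough sketch, the same change could be applied to an exported dashboard JSON file with sed. The function name strip_instance_filter and the file names in the usage comment are assumptions for illustration, and it only handles the instance="$node" matcher from the example above, so review the result before importing it again.

```shell
# Hypothetical helper: removes the instance="$node" matcher from PromQL
# expressions read on stdin. Handles plain quotes and the escaped quotes
# (\") found in exported dashboard JSON.
strip_instance_filter() {
  sed -E \
    -e 's/instance=\\?"\$node\\?",[[:space:]]*//g' \
    -e 's/,[[:space:]]*instance=\\?"\$node\\?"//g'
}

# Usage (file names are placeholders):
#   strip_instance_filter < dashboard.json > dashboard-fixed.json
```

The first expression covers the case where instance is followed by other matchers, the second the case where it comes last. A query that filters on instance alone is left untouched.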

Upgrading to new panel types is easy

Finally, I also removed some panels I am not really using and upgraded the remaining ones. Most were still using the old graph panel and can be upgraded to the new time series panel with one click in edit mode. No major changes, but it just looks a bit nicer.

Legend added to show some more important values and enable sorting

I also added some values (mean for CPU usage for example and last non-null value for RAM usage) to the legend. This allows you to easily sort all containers by their memory or CPU usage.

Fixing memory monitoring on a Raspberry Pi

Even after these updates my dashboard was somehow still missing data about the memory usage on my Raspberry Pi, at least for the Docker containers. It seemed like Prometheus had crashed a couple of times due to out-of-memory issues, so I really wanted to have those numbers. A quick check using docker stats showed … nothing really.

The stats command shows CPU and RAM usage of your containers, but in my case it was only showing 0 for everything memory related. After some quick googling I checked docker info, which printed a couple of warnings at the end, including these about memory:

$ docker info
...
WARNING: No swap limit support  
WARNING: No memory limit support  
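
To check for this warning without scrolling through the whole output, something like the following could work. The function name is made up for this sketch, and it only looks for the exact warning text shown above.

```shell
# Hypothetical helper: reads `docker info` output on stdin and succeeds
# (exit code 0) if the memory limit warning shown above is present.
has_memory_limit_warning() {
  grep -q 'No memory limit support'
}

# Usage (needs a running Docker daemon). 2>&1 is added because the
# warnings may be printed to stderr depending on the Docker version:
#   docker info 2>&1 | has_memory_limit_warning && echo "memory cgroup not enabled"
```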

It seems like some control group (cgroup) settings are not enabled by default on the Raspberry Pi (and on older OS versions in general). I am using Raspbian 10 (buster) right now (you can check using cat /etc/os-release) and was able to fix this problem by updating the cmdline file like this:

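# open the kernel command line file
sudo nano /boot/cmdline.txt
# add at the end of the existing line (everything must stay on one line):
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
# save the file

# reboot the Pi completely
sudo reboot

# check that memory usage is now reported
docker stats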
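
Because /boot/cmdline.txt has to stay a single line, and adding the same flag twice is easy when you follow several guides, the edit can also be scripted. This is only a sketch: add_cgroup_flags is a made-up name, and it assumes the file contains exactly one line.

```shell
# Hypothetical helper: reads one kernel command line on stdin and prints it
# with the three cgroup flags appended, skipping any that are already there.
add_cgroup_flags() {
  # read the single line, even if the file has no trailing newline
  IFS= read -r line || [ -n "$line" ]
  for flag in cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1; do
    case " $line " in
      *" $flag "*) ;;               # already present, leave as is
      *) line="$line $flag" ;;      # append the missing flag
    esac
  done
  printf '%s\n' "$line"
}

# Usage: write to a temporary file first and review it before copying
#   add_cgroup_flags < /boot/cmdline.txt > /tmp/cmdline.txt
#   sudo cp /tmp/cmdline.txt /boot/cmdline.txt
```

Running it a second time on its own output changes nothing, so it is safe to repeat.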

After adding these arguments from this very helpful GitHub issue and restarting, docker info no longer showed any warnings, docker stats was showing memory usage, and my dashboards started to show container memory usage.

After the update memory usage is now displayed

cAdvisor is eating my CPU

With all metrics available and a couple of days of data, I pretty quickly noticed one alarming trend: the cAdvisor container was using more CPU than all the containers it was monitoring combined. Not great for simple monitoring…

Incredibly high CPU usage by the cadvisor container

Fortunately I also found a fix for this, although this really seems like something to be fixed by the developers. Besides switching to the newer (and I think still maintained) build zcube/cadvisor for the Raspberry Pi, I also added a couple of command-line flags to reduce CPU usage. docker_only (restricts it to purely Docker containers, I think) and housekeeping_interval (minimal update interval for data about containers) reduced the CPU usage by a lot. I also disabled some metrics I am not really using.

  cadvisor:
    image: zcube/cadvisor:latest # switched image
    container_name: cadvisor
    command:
      - "--docker_only=true"
      - "--housekeeping_interval=30s"
      - "--disable_metrics=accelerator,cpu_topology,disk,memory_numa,tcp,udp,percpu,sched,process,hugetlb,referenced_memory,resctrl,cpuset,advtcp"
# ...

Overall this leads to much lower CPU usage! On the Pi 4 it went from around 28% down to 8% first, and then to around 4% once I also excluded those metrics. Much better! While doing this I also switched the node exporter to prom/node-exporter, but this had no visible impact.

Reduced CPU usage after the first two changes
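
For a quick check outside of Grafana, docker stats can print a single snapshot that is easy to sort. A small sketch, where sort_by_cpu is a hypothetical helper name and the input is expected as one "name percentage" pair per line:

```shell
# Hypothetical helper: sorts "name cpu%" lines (read on stdin) by CPU usage,
# highest first. LC_ALL=C keeps the decimal point handling predictable.
sort_by_cpu() {
  LC_ALL=C sort -k2,2nr
}

# Usage (needs a running Docker daemon):
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | sort_by_cpu
```

A single snapshot is noisier than the averages in the dashboard, so it is better suited as a sanity check than as a measurement.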

This also fixed some warnings I was seeing in the logs. Excluding hugetlb in particular helped, I think. My RAM usage has now also decreased significantly, although the restart might have helped here. I will keep an eye on this!

So far the server has been very stable for me after those changes, and I will keep watching the metrics. I also like my updated dashboard. Check this out if you want to build your own dashboards too!

Docker & server metrics dashboard