Prometheus metrics save us from painful kernel debugging (2022)

The author describes how their servers began suffering constant out-of-memory kills after a kernel upgrade, with no obvious cause. Looking through their Prometheus host metrics in Grafana dashboards, they found that slab memory usage was climbing steadily, which led them to suspect that a kernel command line change disabling AppArmor had triggered a memory leak. They reverted the change and scheduled server reboots, heading off a potentially disastrous situation. The author credits their Prometheus and Grafana setup with letting them identify and resolve the problem quickly, and stresses the importance of having robust monitoring in place.
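
To make the "steadily increasing slab memory" signal concrete, here is a minimal Python sketch of how such a trend could be measured outside of Prometheus. It is not the author's method: they read the trend off Grafana dashboards. The function names `parse_meminfo_kib` and `growth_per_hour` are hypothetical, and the sketch assumes samples are taken from the `Slab` (or `SUnreclaim`) line of Linux's `/proc/meminfo`.

```python
def parse_meminfo_kib(text, field):
    """Return the value of a /proc/meminfo field in KiB, or None if absent.

    `text` is the full contents of /proc/meminfo; `field` is a name such as
    "Slab" or "SUnreclaim" (assumed here to be the fields of interest).
    """
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == field:
            return int(rest.split()[0])
    return None


def growth_per_hour(samples):
    """Least-squares slope of (unix_seconds, value) samples, scaled to per hour.

    A persistently positive slope over many hours is the kind of pattern
    that would suggest a leak rather than normal cache churn.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den * 3600
```

In practice one would collect a sample every few minutes and look at the slope over a day or more. A monitoring system that already stores host memory metrics makes this unnecessary, which is the point the post makes.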

https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusHostMetricsSaveUs