# kubernetes
j
Hi folks, I'm looking for a bit of advice on monitoring the availability of my ASP.NET Core microservices deployed into Kubernetes. The first basic question I am trying to answer is, "Is the microservice running?". We use App Insights, and the standard availability monitors won't work because our microservices are in a private network. I then tried to use a HealthCheck publisher to push availability data to App Insights. That failed because "no data" does not trigger an alert. I am now staring down the barrel of the Prometheus/Grafana route, but that looks like another massive learning curve. What are people using to answer the simple question, "Is my app running?", when using Kubernetes?
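For context, the HealthCheck publisher approach mentioned here looks roughly like the sketch below, which pushes availability telemetry to App Insights on every health-check run. It assumes the Microsoft.ApplicationInsights and Microsoft.Extensions.Diagnostics.HealthChecks packages; the class name and test name are illustrative.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Sketch of an IHealthCheckPublisher that pushes availability telemetry to
// Application Insights. The caveat from the question applies: if the pod is
// dead, nothing publishes at all, so "no data" still has to be caught elsewhere.
public class AppInsightsHealthPublisher : IHealthCheckPublisher
{
    private readonly TelemetryClient _telemetry;

    public AppInsightsHealthPublisher(TelemetryClient telemetry) => _telemetry = telemetry;

    public Task PublishAsync(HealthReport report, CancellationToken cancellationToken)
    {
        _telemetry.TrackAvailability(
            name: "my-microservice",              // illustrative availability test name
            timeStamp: DateTimeOffset.UtcNow,
            duration: report.TotalDuration,
            runLocation: Environment.MachineName,  // e.g. the pod name
            success: report.Status == HealthStatus.Healthy,
            message: report.Status.ToString());

        return Task.CompletedTask;
    }
}

// Registration (in Program.cs):
// builder.Services.AddHealthChecks();
// builder.Services.AddSingleton<IHealthCheckPublisher, AppInsightsHealthPublisher>();
```

The gap, as noted in the question, is that a dead pod publishes nothing, so something still has to alert on the absence of data, which is where the probe-based approaches in the replies come in.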
t
We use Datadog with a combination of probes and monitors. You've got the standard liveness and readiness probes, then you create monitors and alarms for when they start to fail. You can also alert on side effects like rising error rates, an increasing rate of 500 responses, etc. Regardless, data about your service is just data until you hook up monitors and alarms.
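For reference, the standard liveness and readiness probes being described look roughly like this on a Deployment's container spec; the paths, port, and timings are illustrative and need to match whatever the app actually exposes.

```yaml
# Illustrative probe config on a Deployment's container spec.
livenessProbe:
  httpGet:
    path: /healthz/live      # assumed endpoint; must exist in the app
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3        # ~30s of failures -> container is restarted
readinessProbe:
  httpGet:
    path: /healthz/ready     # assumed endpoint
    port: 8080
  periodSeconds: 5
  failureThreshold: 3        # failures -> pod removed from Service endpoints
```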
j
I have the K8s probes configured. Are you able to share what alerts you use that target the probes specifically?
m
Check out https://canarychecker.io/ - it is dependency-free and has a nice RAG dashboard out of the box. The https://canarychecker.io/reference/kubernetes check will verify whether Kubernetes thinks a pod is healthy (based on liveness/readiness probes and/or deployment failures). Canary checker is 100% OSS and is embedded in our commercial IDP, which adds health tracking on every resource by default, plus change tracking, notifications, playbooks, etc. Disclaimer: I am the founder of Flanksource, which builds these products.
t
@Jacob Hodges in general, alerting should be centered around the golden signals (you should read the online SRE book, it's a really good intro to best practices for monitoring). Honestly, once you hook up Prometheus and use Grafana's default panes of glass for k8s, it'll become pretty apparent what to build monitors for. Some k8s-specific conditions that I monitor for (besides the normal alerts mentioned above) are:
• Crash loop backoffs
• No available node for a pod (indicates a capacity issue)
• A deployment with no ready pods
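To make those k8s-specific conditions concrete, they translate into Prometheus alerting rules along these lines, assuming kube-state-metrics is being scraped; the rule names, durations, and severities are illustrative.

```yaml
groups:
  - name: k8s-availability.rules
    rules:
      # Crash loop backoffs
      - alert: PodCrashLooping
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0
        for: 5m
        labels:
          severity: critical
      # Pod stuck Pending, e.g. no node with capacity to schedule it
      - alert: PodNotSchedulable
        expr: kube_pod_status_phase{phase="Pending"} > 0
        for: 15m
        labels:
          severity: warning
      # Deployment that should have replicas but has none available
      - alert: DeploymentHasNoReadyPods
        expr: kube_deployment_status_replicas_available == 0 and kube_deployment_spec_replicas > 0
        for: 5m
        labels:
          severity: critical
```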
Keep in mind that all the monitoring in the world is meaningless without notifications that actually reach you. I prefer Slack as my first line of communication when things are suspicious but not dangerous. For alerts that affect prod, like a deployment with no ready pods, I go way more scorched earth: I automatically execute a workflow that creates an incident, creates a Slack channel for communication about that incident, adds the relevant on-call SMEs to the channel, and notifies them via PagerDuty.
a
In addition to the advice above, we also use Gatus as a sanity check on everything in between the internet/VPN and the Pods, on top of the built-in liveness/readiness probes and Grafana (since Pods being healthy doesn't necessarily mean they're routable if some component in between is malfunctioning).
t
I always encourage my devs to include tests in their readiness checks for all the things that make their app work (like the ability to reach the DB). A readiness check, imo, isn't about whether or not the component is working; it's a check of whether it's capable of receiving traffic. Imagine I change the password on a DB and update the secret holding those credentials on the cluster. If the pods report as ready (even though they can't get to the DB), k8s still routes traffic to them. However, if they report as not ready for long enough, k8s stops sending them traffic and shifts it to newer pods that have the updated secret, and (if the liveness probe also fails) it will eventually restart the unhealthy pods and automatically repair the issue.
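As an illustration of that approach, a readiness-style check that verifies DB connectivity could look like the sketch below; the SQL Server client and connection-string handling are assumptions, not anything prescribed in the thread.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Readiness-style check: "can this pod actually serve traffic?"
// It fails if the database (or any other hard dependency) is unreachable.
public class DatabaseReadyCheck : IHealthCheck
{
    private readonly string _connectionString;

    public DatabaseReadyCheck(string connectionString) => _connectionString = connectionString;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await using var connection = new SqlConnection(_connectionString);
            // Fails fast if credentials, DNS, or the network path are broken.
            await connection.OpenAsync(cancellationToken);
            return HealthCheckResult.Healthy("Database reachable");
        }
        catch (SqlException ex)
        {
            return HealthCheckResult.Unhealthy("Cannot reach database", ex);
        }
    }
}

// Registration, tagged so it only runs for the readiness endpoint:
// builder.Services.AddHealthChecks()
//     .AddCheck("db", new DatabaseReadyCheck(connectionString), tags: new[] { "ready" });
```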
Liveness probes and readiness probes are very different and should be (and are) treated very differently. If a readiness probe fails, the default is "let them cook!" If a liveness probe fails, it's a shortcut to terminationville.
j
At the moment our liveness and readiness probes are just dumb checks that return 200. We do have logic in the startup check to make sure that we can connect to the appropriate resources. Got the idea from https://andrewlock.net/deploying-asp-net-core-applications-to-kubernetes-part-6-adding-health-checks-with-liveness-readiness-and-startup-probes/.
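For comparison, the split that article describes (cheap liveness/readiness endpoints, dependency checks on the startup endpoint) maps roughly to the sketch below; the endpoint paths and tag names are illustrative.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHealthChecks()
    // Expensive dependency checks live here, tagged "startup" so only the
    // startup endpoint runs them (e.g. a DB connectivity check).
    .AddCheck("dependencies",
              () => HealthCheckResult.Healthy("replace with real dependency checks"),
              tags: new[] { "startup" });

var app = builder.Build();

// Startup probe endpoint: verifies dependencies before the pod takes traffic.
app.MapHealthChecks("/healthz/startup", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("startup")
});

// Liveness and readiness endpoints: run no registered checks, so they are
// cheap "the process is responding" checks that just return 200.
app.MapHealthChecks("/healthz/live", new HealthCheckOptions { Predicate = _ => false });
app.MapHealthChecks("/healthz/ready", new HealthCheckOptions { Predicate = _ => false });

app.Run();
```

The Kubernetes startupProbe would then point at /healthz/startup, the livenessProbe at /healthz/live, and the readinessProbe at /healthz/ready.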