# kubernetes
t
Hi All, I would like to monitor a few k8s clusters with a very small team. I would like to know what tools and processes the community is using/following, and what things they are looking at day to day and at other frequencies. I also want to look at upgrades and at finding deprecated APIs, etc.
a
Standard tooling for anything k8s is to use prometheus as the scraping agent, and then some sort of tool to ingest the prometheus data, index it for you, and give you dashboards. In our case, we're using the lightweight grafana-agent, which is just prometheus but lighter, and then shipping metrics data to grafana mimir, log data to grafana loki, and trace data to grafana tempo. We then use the grafana UI to do all of our visualizations across all three pillars. The advantage of doing this is that those backends don't use a complex data store like elasticsearch or cassandra to index; they simply shove everything in S3. This keeps storage costs very low and allows us to leverage automated backup from AWS. If you have any questions about this tack let me know.
t
Thanks @Alexandre Pauwels , we do have a standard monitoring setup already using prometheus + grafana, and we also use Opensearch for logs. My question was more along the lines of understanding what to check for. The following are example scenarios I would like to stay on top of:
1. Updating Kubernetes APIs before they are removed; we need a means to find them, and one tool is kubent, which gives you the necessary details.
2. Given the few people we have are doing other activities, something that emails/slacks about failed nodes/pods etc. (a rough sketch of what I mean is below).
3. Any other things that this group does to keep their clusters and apps happy.
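For #2, this is roughly what I have in mind, as a minimal client-go sketch (it just prints to stdout; the kubeconfig path and the idea of piping this into email/slack are placeholders, not something we run today):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config); in-cluster config would work the same way.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Report nodes that are not Ready.
	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
				fmt.Printf("node %s is not Ready (%s)\n", node.Name, cond.Reason)
			}
		}
	}

	// Report pods that are neither Running nor Succeeded (i.e. Pending, Failed, Unknown).
	pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase != corev1.PodRunning && pod.Status.Phase != corev1.PodSucceeded {
			fmt.Printf("pod %s/%s is %s\n", pod.Namespace, pod.Name, pod.Status.Phase)
		}
	}
}
```

Something like this on a cron that posts to a webhook would probably do, but I suspect there's a more standard way.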
a
Ah gotcha. #1 is interesting, I'm usually manually reading the release docs and then grepping for deprecated APIs in my gitops repo lol. Is kubent able to output prometheus-like metrics that could be scraped? For point 2, grafana/prometheus has alertmanager, which can ping slack based on the indexed data.
t
Not sure, I only recently found out about kubent, when reading the docs alone was not sufficient for doing a k8s upgrade. https://github.com/doitintl/kube-no-trouble is the project
a
Doesn't look like it does, but it seems like it wouldn't be too tough to wrap the golang in a prometheus server, ingest the tool's output, and re-expose it as metrics. Could be a good PR as well.
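Something along these lines, as a sketch: shell out to kubent, parse its JSON output, and republish each finding as a gauge series with client_golang. The JSON field names below are an assumption from memory, so check them against what kubent -o json actually prints:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os/exec"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// finding holds the fields we care about from kubent's JSON output.
// Field names are assumed; adjust them to match what the tool really emits.
type finding struct {
	Name       string `json:"Name"`
	Namespace  string `json:"Namespace"`
	Kind       string `json:"Kind"`
	APIVersion string `json:"ApiVersion"`
	RuleSet    string `json:"RuleSet"`
}

var deprecatedAPIs = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kubent_deprecated_api_info",
		Help: "Objects kubent flagged as using deprecated or removed APIs (one series per finding).",
	},
	[]string{"namespace", "name", "kind", "api_version", "rule_set"},
)

// refresh runs kubent and rebuilds the gauge from its findings.
func refresh() {
	out, err := exec.Command("kubent", "-o", "json").Output()
	if err != nil {
		log.Printf("kubent run failed: %v", err)
		return
	}
	var findings []finding
	if err := json.Unmarshal(out, &findings); err != nil {
		log.Printf("could not parse kubent output: %v", err)
		return
	}
	deprecatedAPIs.Reset()
	for _, f := range findings {
		deprecatedAPIs.WithLabelValues(f.Namespace, f.Name, f.Kind, f.APIVersion, f.RuleSet).Set(1)
	}
}

func main() {
	prometheus.MustRegister(deprecatedAPIs)

	// Re-run kubent periodically in the background; scrapes just read the last result.
	go func() {
		for {
			refresh()
			time.Sleep(10 * time.Minute)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9157", nil))
}
```

An alert rule that fires whenever any of these series exist could then reuse the same alertmanager → slack route as point 2.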
t
I really find that Prometheus/Grafana comes with a big maintenance cost. I prefer Datadog or New Relic.
t
Pay your Engineers or Pay your vendor 🙂
t
Yep, and hiring more engineers for a small team isn't always an option
t
Yes, absolutely. This is a calculation we have to make when growing: what is the right time and cost that would allow us to grow the team?
t
Kubent is helpful when you're upgrading and you hit a snag, especially with vendor charts. For the most part I just pay attention and upgrade the helm charts my devs use to be compatible with API changes. If you're using Azure, they do a really good job of warning you about breaking API changes... They just don't do a great job of figuring out which charts are fucked. Now, hypothetically, let's say you go over all your charts and you still get that error on upgrade from Azure, and you've got a bunch of clusters that you need to search for incompatible APIs in previous helm configs... you might be tempted to write a monster like this https://gist.github.com/ImIOImI/8df66496548a95aa4b75bc62b7ffa446
I'm pretty sure that script is brittle and shouldn't be trusted, but it illustrates the basic workflow that kubent can't always do
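If shell isn't your thing, the same basic workflow also fits in a bit of Go shelling out to helm. This sketch only checks each release's current manifest (not older revisions the way the gist does), and the deprecated API list is just an example to extend from the deprecation guide for whatever version you're jumping to:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
)

// release matches the fields we need from `helm list -A -o json`.
type release struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

// A few API versions removed in recent Kubernetes releases, purely as an example list.
var deprecated = []string{
	"extensions/v1beta1",
	"policy/v1beta1",
	"batch/v1beta1",
	"networking.k8s.io/v1beta1",
}

func main() {
	// Enumerate every helm release in every namespace.
	out, err := exec.Command("helm", "list", "-A", "-o", "json").Output()
	if err != nil {
		panic(err)
	}
	var releases []release
	if err := json.Unmarshal(out, &releases); err != nil {
		panic(err)
	}
	for _, r := range releases {
		// helm get manifest returns what the chart last rendered into the cluster.
		manifest, err := exec.Command("helm", "get", "manifest", r.Name, "-n", r.Namespace).Output()
		if err != nil {
			fmt.Printf("could not fetch manifest for %s/%s: %v\n", r.Namespace, r.Name, err)
			continue
		}
		for _, api := range deprecated {
			if bytes.Contains(manifest, []byte("apiVersion: "+api)) {
				fmt.Printf("%s/%s still uses %s\n", r.Namespace, r.Name, api)
			}
		}
	}
}
```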
m
I used the prom stack (prom, grafana, loki, plus fluentd/fluentbit). The operators are great (except for grafana's). The cluster runs itself once tuned.
t
The biggest problem, imo, is keeping Prometheus tuned. If you fuck it up you start missing logs which is unacceptable.
n
Alternatives to Prometheus/Grafana that don't end up costing you a mint like Datadog and New Relic:
- Honeycomb (best for deep trace analysis)
- SigNoz (hosted on your own cluster, gets all 3 signals)
- TelemetryHub (SaaS but very cheap)