# kubernetes
t
Hi All, I would like to monitor a few k8s clusters with a very small team. I would like to know what tools and processes the community is using/following, and what things they are looking at day to day and at other frequencies. I also want to look at upgrades and at finding deprecated APIs, etc.
a
Standard tooling for anything k8s is to use prometheus as the scraping agent, and then some sort of tool to ingest the prometheus data, index it for you, and give you dashboards. In our case, we're using the lightweight grafana-agent, which is just prometheus but lighter, and then shipping metrics data to grafana mimir, log data to grafana loki, and trace data to grafana tempo. We then use the grafana UI to do all of our visualizations across all three pillars. The advantage of doing this is that those backends don't use a complex data store like elasticsearch or cassandra to index; they simply shove everything in S3. This keeps storage costs very low and allows us to leverage automated backup from AWS. If you have any questions about this tack let me know.
t
Thanks @Alexandre Pauwels , we do have a standard monitoring setup already using prometheus + grafana, and we also use Opensearch for logs. My question was more along the lines of understanding what to check for. The following are example scenarios I would like to stay on top of:
1. Updating Kubernetes APIs before they are removed; we need a means to find them, and one tool is kubent, which gives you the necessary details.
2. Given the few people we have are doing other activities, something that emails/slacks about failed nodes/pods etc. (a rough sketch of what I mean is below).
3. Any other things that this group does to keep their clusters and apps happy.
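For #2, this is roughly what I have in mind, as a minimal client-go sketch (it just prints to stdout; the kubeconfig path and the idea of piping this into email/slack are placeholders, not something we run today):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config); in-cluster config would work the same way.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Report nodes that are not Ready.
	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
				fmt.Printf("node %s is not Ready (%s)\n", node.Name, cond.Reason)
			}
		}
	}

	// Report pods that are neither Running nor Succeeded (i.e. Pending, Failed, Unknown).
	pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase != corev1.PodRunning && pod.Status.Phase != corev1.PodSucceeded {
			fmt.Printf("pod %s/%s is %s\n", pod.Namespace, pod.Name, pod.Status.Phase)
		}
	}
}
```

Something like this on a cron that posts to a webhook would probably do, but I suspect there's a more standard way.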
a
Ah gotcha. #1 is interesting, I'm usually manually reading the release docs and then grepping for deprecated APIs in my gitops repo lol. Is kubent able to output prometheus-like metrics that could be scraped? For point 2, grafana/prometheus has alertmanager, which can ping slack based on the indexed data.
t
Not sure, I only recently found out about kubent, when reading the docs alone was not sufficient for doing a k8s upgrade. https://github.com/doitintl/kube-no-trouble is the project
a
Doesn't look like it does, but it seems like it wouldn't be too tough to wrap the golang in a prometheus server, ingest the tool's output, and re-expose it as metrics. Could be a good PR as well.
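Something along these lines, as a sketch: shell out to kubent, parse its JSON output, and republish each finding as a gauge series with client_golang. The JSON field names below are an assumption from memory, so check them against what kubent -o json actually prints:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os/exec"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// finding holds the fields we care about from kubent's JSON output.
// Field names are assumed; adjust them to match what the tool really emits.
type finding struct {
	Name       string `json:"Name"`
	Namespace  string `json:"Namespace"`
	Kind       string `json:"Kind"`
	APIVersion string `json:"ApiVersion"`
	RuleSet    string `json:"RuleSet"`
}

var deprecatedAPIs = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kubent_deprecated_api_info",
		Help: "Objects kubent flagged as using deprecated or removed APIs (one series per finding).",
	},
	[]string{"namespace", "name", "kind", "api_version", "rule_set"},
)

// refresh runs kubent and rebuilds the gauge from its findings.
func refresh() {
	out, err := exec.Command("kubent", "-o", "json").Output()
	if err != nil {
		log.Printf("kubent run failed: %v", err)
		return
	}
	var findings []finding
	if err := json.Unmarshal(out, &findings); err != nil {
		log.Printf("could not parse kubent output: %v", err)
		return
	}
	deprecatedAPIs.Reset()
	for _, f := range findings {
		deprecatedAPIs.WithLabelValues(f.Namespace, f.Name, f.Kind, f.APIVersion, f.RuleSet).Set(1)
	}
}

func main() {
	prometheus.MustRegister(deprecatedAPIs)

	// Re-run kubent periodically in the background; scrapes just read the last result.
	go func() {
		for {
			refresh()
			time.Sleep(10 * time.Minute)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9157", nil))
}
```

An alert rule that fires whenever any of these series exist could then reuse the same alertmanager → slack route as point 2.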
t
I really find that Prometheus/Grafana comes with a big maintenance cost. I prefer Datadog or New Relic.
t
Pay your Engineers or Pay your vendor 🙂
t
Yep, and hiring more engineers for a small team isn't always an option
t
Yes, absolutely. This is a calculation we have to make when growing: what is the right time and cost that would allow us to grow the team?
t
Kubent is helpful when you're upgrading and you hit a snag, especially with vendor charts. For the most part I just pay attention and upgrade the helm charts my devs use to be compatible with API changes. If you're using Azure, they do a really good job of warning you about breaking API changes... They just don't do a great job of figuring out which charts are fucked. Now, hypothetically, let's say you go over all your charts and you still get that error on upgrade from Azure, and you've got a bunch of clusters that you need to search for incompatible APIs in previous helm configs... you might be tempted to write a monster like this https://gist.github.com/ImIOImI/8df66496548a95aa4b75bc62b7ffa446
I'm pretty sure that script is brittle and shouldn't be trusted, but it illustrates the basic workflow that kubent can't always do
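If shell isn't your thing, the same basic workflow also fits in a bit of Go shelling out to helm. This sketch only checks each release's current manifest (not older revisions the way the gist does), and the deprecated API list is just an example to extend from the deprecation guide for whatever version you're jumping to:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os/exec"
)

// release matches the fields we need from `helm list -A -o json`.
type release struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

// A few API versions removed in recent Kubernetes releases, purely as an example list.
var deprecated = []string{
	"extensions/v1beta1",
	"policy/v1beta1",
	"batch/v1beta1",
	"networking.k8s.io/v1beta1",
}

func main() {
	// Enumerate every helm release in every namespace.
	out, err := exec.Command("helm", "list", "-A", "-o", "json").Output()
	if err != nil {
		panic(err)
	}
	var releases []release
	if err := json.Unmarshal(out, &releases); err != nil {
		panic(err)
	}
	for _, r := range releases {
		// helm get manifest returns what the chart last rendered into the cluster.
		manifest, err := exec.Command("helm", "get", "manifest", r.Name, "-n", r.Namespace).Output()
		if err != nil {
			fmt.Printf("could not fetch manifest for %s/%s: %v\n", r.Namespace, r.Name, err)
			continue
		}
		for _, api := range deprecated {
			if bytes.Contains(manifest, []byte("apiVersion: "+api)) {
				fmt.Printf("%s/%s still uses %s\n", r.Namespace, r.Name, api)
			}
		}
	}
}
```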
m
I used the prom stack (prom, grafana, loki, plus fluentd/fluentbit). The operators are great (except for grafana's). The cluster runs itself once tuned.
t
The biggest problem, imo, is keeping Prometheus tuned. If you fuck it up you start missing logs which is unacceptable.
n
Alternatives to Prometheus/Grafana that don't end up costing you a mint like Datadog and New Relic:
- Honeycomb (best for deep trace analysis)
- SigNoz (hosted on your own cluster, gets all 3 signals)
- TelemetryHub (SaaS but very cheap)