This message was deleted Platform Engineering #general

Slackbot

10/30/2023, 8:18 AM

This message was deleted.

ag4ve

10/30/2023, 2:19 PM

If you want something simple, I wrote a shell script a while ago. It just relies on SSH, so if a host goes down, you'd wait for TCP teardown (or only a timeout if you're lucky). But it works fine for showing tons of Prometheus metrics etc. https://github.com/ag4ve/misc-scripts/blob/master/mon-hosts-packed
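The wait-on-a-dead-host problem mentioned above can usually be bounded with standard OpenSSH client options. The sketch below is not the linked script; it is a minimal illustration, and the function names, default values, and the example host `web1` are all hypothetical.

```python
import subprocess
from typing import List, Optional


def build_ssh_command(
    host: str,
    remote_cmd: str,
    connect_timeout_s: int = 5,
    alive_interval_s: int = 5,
    alive_count_max: int = 2,
) -> List[str]:
    """Build an ssh invocation that gives up quickly on dead hosts.

    BatchMode=yes        never prompt for a password (fail instead)
    ConnectTimeout       bound the initial TCP connect
    ServerAliveInterval  send keepalives so a dropped host is noticed
    ServerAliveCountMax  disconnect after this many missed keepalives
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={connect_timeout_s}",
        "-o", f"ServerAliveInterval={alive_interval_s}",
        "-o", f"ServerAliveCountMax={alive_count_max}",
        host,
        remote_cmd,
    ]


def poll_host(host: str, remote_cmd: str, overall_timeout_s: float = 15.0) -> Optional[str]:
    """Run remote_cmd over ssh; return stdout, or None on any failure."""
    try:
        proc = subprocess.run(
            build_ssh_command(host, remote_cmd),
            capture_output=True,
            text=True,
            timeout=overall_timeout_s,  # hard cap in case ssh itself hangs
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    return proc.stdout if proc.returncode == 0 else None
```

Something like `poll_host("web1", "uptime")` would then return the command output or `None` within roughly `overall_timeout_s` seconds, instead of blocking until the TCP connection is torn down.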

Webert Lima

10/30/2023, 2:23 PM

Thank you @ag4ve - it will not solve my business need but it will certainly serve me to improve my learning. I appreciate it.

ag4ve

10/30/2023, 2:25 PM

There’s also this: https://cloud.google.com/blog/products/devops-sre/sre-fundamentals-slis-slas-and-slos

ag4ve

10/30/2023, 2:28 PM

Not tools so much as practices. Obviously OTel is a thing, and there are some frameworks for SLIs, but it's as much about how to think about alerting as anything.
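For anyone reading along, the SLI/SLO arithmetic behind that way of thinking about alerting is small enough to sketch. This assumes a simple request-based availability SLI (good requests divided by total requests); the function names and the numbers in the usage note are illustrative, not taken from the linked article.

```python
def availability_sli(good: int, total: int) -> float:
    """Fraction of requests that were good; treat 'no traffic' as fully available."""
    if total == 0:
        return 1.0
    return good / total


def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Share of the error budget still unspent (1.0 = untouched, < 0 = overspent).

    The budget is the number of bad requests the SLO tolerates:
    (1 - slo_target) * total.
    """
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target (or zero traffic) leaves no budget to spend.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad
```

With a 99.9% target over 10,000 requests, the budget is about 10 bad requests, so 5 failures leaves roughly half of it. Alerting on how fast that budget is being spent, rather than on every individual failure, is the usual practice this arithmetic supports.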

Webert Lima

10/30/2023, 2:32 PM

Thanks! Yes, I am increasingly adopting those practices on a new team. I'm just looking for a tool to perform basic status checks (like Nagios, but more modern).

ag4ve

10/30/2023, 2:50 PM

When I was asked to look into it, I landed on: Prometheus plus basic Grafana (there are some metrics databases that look really cool and are also kinda new), and fluent to pick up normal logging and long-term storage (Elastic or Graylog or Splunk or whatever). Everything has an alerting feature that you get to tune. If you know about SLOs (especially with that Google doc and others as background), I'd be really curious to know what's more advanced than that too.

ag4ve

11/02/2023, 3:05 PM

Oh, I will mention that last time I looked into this, we started by diving into Grafana. I've talked to people at Graylog who admit that Grafana has the prettiest dashboards, so maybe you use it for that. But maybe don't start there. I'd start from the data collection and storage angle instead. Even if you get a paid solution, that's what you end up paying for. I've seen SigNoz mentioned quite a bit recently (there are a number of others), or if you can pay, maybe Datadog (basically Splunk prices, per node vs. per GB, IIRC). But I haven't gone this route. I looked into doing this at a place that had logs all over the place; they weren't ready to do telemetry (IMO) and we didn't get past "pretty Grafana dashboards". So those are my thoughts: if I had a decent level of centralized logging and had to do it again, that's where I'd go.

Webert Lima

11/02/2023, 3:24 PM

Hey @ag4ve - yes, thanks. I am using the Grafana / Prometheus stack for most of the monitoring. It has its limitations, for example when testing HTTP endpoints (such as a service's /healthcheck). In an enterprise-grade setup there is room for many types of monitoring, such as Datadog for APM and Grafana for metrics. What I am still looking for is something that monitors from an external perspective (so it sees what users see) and only checks whether HTTPS endpoints are up, healthy, and reachable.
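To make the "up, healthy and reachable" check concrete, here is a minimal sketch using only the Python standard library. It is not a replacement for a hosted external monitor, since it only sees what the machine running it sees; the `CheckResult` type, the `opener` parameter, and the example URL are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    healthy: bool
    status: Optional[int]  # None when no HTTP response was received
    latency_s: float
    error: Optional[str] = None


def check_endpoint(
    url: str,
    timeout_s: float = 5.0,
    opener: Callable = urllib.request.urlopen,  # injectable for offline testing
) -> CheckResult:
    """GET the URL once and report whether it answered with a 2xx status."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as resp:
            status = resp.status
        return CheckResult(url, 200 <= status < 300, status, time.monotonic() - start)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return CheckResult(url, False, exc.code, time.monotonic() - start, str(exc))
    except OSError as exc:
        # DNS failure, refused connection, TLS error, or timeout.
        return CheckResult(url, False, None, time.monotonic() - start, str(exc))
```

Calling `check_endpoint("https://example.com/healthcheck")` from a scheduler outside your own network would approximate the external view; the distinction between `status is None` (unreachable) and a non-2xx status (reachable but unhealthy) maps onto the "reachable" vs. "healthy" split above.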

ag4ve

11/02/2023, 3:50 PM

You might look at Pingdom if you want external checks (SolarWinds would love your support given their sec issues :P )

Webert Lima

11/03/2023, 9:24 AM

Will do, thanks for being supportive!
