# general
w
Hey, what are good recommendations for availability monitoring tools such as uptrends, pingdom, etc, regarding supported checks, modernity, setup, alerting?
a
If you want something simple, I wrote a shell script a while ago. I just rely on ssh, so if a host goes down, you’d wait for TCP teardown (or only a timeout if you’re lucky). But it works fine for showing tons of Prometheus metrics etc. https://github.com/ag4ve/misc-scripts/blob/master/mon-hosts-packed
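To illustrate the timeout point above: a minimal sketch, not taken from the linked script, of a reachability probe that sets an explicit connect timeout so a dead host fails fast instead of waiting on TCP. The function name `tcp_reachable` and the default port 22 (ssh) are assumptions made for this example.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection applies the timeout to the connect attempt itself,
        # so an unreachable host is reported after at most `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, DNS errors and timeouts.
        return False
```

This only tells you that the port accepts connections, not that the service behind it is healthy.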
w
Thank you @ag4ve - it will not solve my business need, but it will certainly help me learn. I appreciate it.
Not tools but more practices - obviously OTel (OpenTelemetry) is a thing, and there are some frameworks for SLIs, but it's as much about how to think about alerting as anything
w
Thanks! Yes, I am increasingly adopting those practices on a new team, just looking for a tool to perform basic status checks (like Nagios but more modern).
a
When I was asked to look into it, I came to: Prometheus plus basic Grafana (there are some metrics databases that look really cool and also kinda new) and fluent to pick up normal logging/long-term storage (Elastic or Graylog or Splunk or whatever). Everything has an alerting feature that you get to tune. If you know about SLOs (especially with that Google doc and others as background), I’d be really curious to know what’s more advanced than that too.
Oh, I will mention that last time I looked into this, we started by diving into Grafana. I’ve talked to people at Graylog who admit that Grafana has the prettiest dashboards - so maybe you use it for that. But maybe don’t start there. I’d start at the data collection and storage angle instead - even if you get a paid solution, that’s what you end up paying for. I’ve seen SigNoz mentioned quite a bit recently (there are a number of others) - or if you can pay, maybe Datadog (basically Splunk prices - per node vs per GB iirc). But I haven’t gone this route - I looked into doing this at a place that had logs all over the place and they weren’t ready to do telemetry (imo) and we didn’t get past “pretty Grafana dashboards”. So those are my thoughts: if I had a decent level of centralized logging and had to do it again, that’s where I’d go.
w
hey @ag4ve - yes, thanks. I am using the Grafana / Prometheus stack for most of the monitoring. It has its limitations, for example testing HTTP endpoints (say, a service's /healthcheck). In an enterprise-grade setup there is room for many types of monitoring, such as Datadog for APM, Grafana for metrics, and something else, which is what I am looking for, to monitor from an external perspective (so it sees what users see) that only checks whether HTTPS endpoints are up, healthy and reachable.
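For reference, the kind of external check described here (is an HTTPS endpoint reachable, and does it answer with a healthy status) can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions: the function name `check_endpoint`, the result keys, and the rule "2xx means healthy" are made up for the example and do not come from any particular monitoring product.

```python
import urllib.error
import urllib.request


def check_endpoint(url: str, timeout: float = 5.0) -> dict:
    """Probe one HTTP(S) URL and report whether it is reachable and healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            # Assumption for this sketch: any 2xx response counts as healthy.
            return {"url": url, "up": True,
                    "healthy": 200 <= status < 300, "status": status}
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return {"url": url, "up": True, "healthy": False, "status": exc.code}
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, TLS error, or timeout.
        return {"url": url, "up": False, "healthy": False,
                "status": None, "error": str(exc)}
```

Calling `check_endpoint("https://example.com/healthcheck")` (a placeholder URL) from a machine outside your own network on a schedule gives a rough user's-eye view. Hosted services add multiple probe locations, alert routing and history on top of this.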
a
You might look at Pingdom if you want external checks (SolarWinds would love your support given their security issues :P )
w
Will do, thanks for being supportive!