# general
a
Hi there, I'm looking for feedback on hardware observability. Does anyone here monitor data center infrastructure? How do you know when the infrastructure is down due to a fault in a physical device? Do you use observability tools like Datadog, Dynatrace, Splunk, or New Relic, or something like SolarWinds, Nagios, or Zabbix?
m
Can you give us a little more background? I'm new to these tools.
s
I would change perspectives a bit. What you usually care about most, and the reason you page people in the middle of the night, is the service or purpose going down or being impaired. So have one layer that monitors that, with paging and management dashboards.

If you are designing robust or recoverable solutions, you expect things to fail and provide redundancy and/or automatic recovery. Failure of an individual physical device should be expected, and should be an issue to address on the next normal workday.

Then you need the tools and skills to investigate and isolate these situations, which may already be present, especially if you understand how your environment breaks down into smaller subsystems.

Most of the tools you mention can help. Hopefully they give you the ability to look at a system so complex that people can't fully understand it, and then find the problem. But to do that they introduce their own layers of abstraction, which you have to set up and then later dig into to find where the problem is located. And then you often end up going back to the previous paragraph as the next step.
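To make the layering concrete, here is a minimal Python sketch of the routing idea: page only when the service itself is impaired, and open a next-workday ticket when a single device fails but redundancy still holds. Everything in it (`ServiceState`, `route`, the `min_devices` threshold) is a hypothetical illustration, not the API of Datadog, Nagios, Zabbix, or any other tool mentioned here.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE = "page on-call now"
    TICKET = "ticket for next workday"
    NONE = "no action"


@dataclass
class ServiceState:
    name: str
    healthy_devices: int  # devices currently serving
    total_devices: int    # devices provisioned
    min_devices: int      # assumed: fewest devices that can still carry the load
    service_ok: bool      # result of an end-to-end check of the service itself


def route(state: ServiceState) -> Action:
    """Decide how urgently to react, judging the service first and devices second."""
    # Service layer: the thing users depend on is down or about to be.
    if not state.service_ok or state.healthy_devices < state.min_devices:
        return Action.PAGE
    # Device layer: something failed, but redundancy absorbed it.
    if state.healthy_devices < state.total_devices:
        return Action.TICKET
    return Action.NONE
```

For example, `route(ServiceState("storage", 3, 4, 2, True))` yields `Action.TICKET`: one of four devices is down, but the service check passes and capacity is still above the assumed minimum, so nobody gets woken up.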