# observability
g
Considering the observability needs and the pillars of a reliable system (monitoring, tracing, logs, and metrics), how do DataDog and the Elastic Stack compare? Which tool would you choose and why?
a
IMO the answer depends hugely on how many Infra/SRE folks you have at your disposal to maintain the Elastic Stack, versus the budget you have to solve the problem without adding headcount.
g
My team is very small today. Would you recommend another solution? All of our projects currently run on GCP.
d
DataDog provides built-in monitoring, tracing, and log management, making it ideal for teams seeking fast setup and minimal maintenance. ELK, on the other hand, offers more flexibility and control for those who need a customizable, self-managed platform. It scales well for log aggregation and search, with Elastic APM handling distributed tracing. I personally prefer and use ELK, but if you need quick setup and out-of-the-box features, go with DataDog. If long-term cost-efficiency and control are priorities, ELK is the better choice.
a
Very well put! I usually gauge it on (some of) the following factors:
• Can we afford to pay for Datadog? (justification versus full-time-employee headcount)
• Is part/most of my job (or someone else's job) to provide/support observability for feature teams?
• How quickly do we need a comprehensive solution?
• How business critical is it that the observability stack remains reliable and available?
In most cases my or my team's job is to provide other capabilities to feature teams, so I'll often prefer a managed service like Datadog so we don't have to worry about it. But as Dimitris explained, if you need to be cost-conscious, need long-term commitment or granular control, ELK or Prometheus/Grafana are usually better choices.
Datadog can get freaking expensive, and OSS solutions can be just as (if not more) scalable/resilient if architected well, but I wouldn't underestimate the time/skill investment that may take depending on your infrastructure/growth/etc.
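For the "subscription versus headcount" question, here is a rough back-of-the-envelope sketch. Every name and number in it is hypothetical and made up for illustration (none of them are real Datadog or Elastic prices), and it only compares yearly money, not setup speed or reliability.

```python
from dataclasses import dataclass


@dataclass
class ObservabilityCosts:
    """Hypothetical yearly figures, all in the same currency."""
    managed_subscription: float  # yearly bill for a managed service
    self_hosted_infra: float     # servers/storage for a self-managed stack
    engineer_salary: float       # fully loaded yearly cost of one engineer
    engineer_fraction: float     # share of that engineer's time spent on the stack (0..1)


def self_hosted_total(costs: ObservabilityCosts) -> float:
    """Infrastructure cost plus the slice of headcount spent maintaining it."""
    return costs.self_hosted_infra + costs.engineer_salary * costs.engineer_fraction


def cheaper_option(costs: ObservabilityCosts) -> str:
    """Return 'managed' or 'self-hosted', whichever costs less per year."""
    if costs.managed_subscription <= self_hosted_total(costs):
        return "managed"
    return "self-hosted"


# Made-up example: a small team with a modest subscription bill.
small_team = ObservabilityCosts(
    managed_subscription=60_000,
    self_hosted_infra=20_000,
    engineer_salary=120_000,
    engineer_fraction=0.5,
)

# Made-up example: a larger setup where the subscription has grown a lot.
large_setup = ObservabilityCosts(
    managed_subscription=300_000,
    self_hosted_infra=80_000,
    engineer_salary=120_000,
    engineer_fraction=1.0,
)

print(cheaper_option(small_team))
print(cheaper_option(large_setup))
```

Plug in your own quotes and salaries. The point is only that the engineer time goes on the self-hosted side of the comparison, which is easy to forget.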
d
Great points, I completely agree. Speed of delivery is key, especially in startups, which is why tools like DataDog can be so appealing for their quick setup. But as you mentioned, the costs can add up quickly. For us, ELK has been a better long-term choice, offering flexibility and cost control, even though it requires more effort upfront. As a DevOps guy, the setup wasn’t a big deal for me, but even in startups, sooner or later, you'll need to build a DevOps team, and maintaining observability will become part of their daily routine anyway.
g
Based on the points we discussed, I decided to go with DataDog. As we are implementing the SRE team now, it's crucial for me to ensure efficiency and credibility in our operations. I hope this choice will help us achieve our goals. Thank you again for your help, and I hope we can continue exchanging ideas in the future!