How do people automate MTTR (recovery) if you've d...
# observability
m
How do people automate MTTR (recovery), if you've done that? I understand that how it's calculated could differ from company to company.
a
Automating the recovery itself, or collecting the data?
@JJ Tang this is all your wheelhouse 🙂
m
Collecting the data. We're looking at using some data from PagerDuty, but wanted to get other folks' views on it.
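For anyone following along, here is a minimal sketch of the calculation once incident records have been exported. The field names `created_at` and `resolved_at` and the sample records are assumptions for illustration, not a confirmed PagerDuty schema, and as noted above the definition of MTTR may differ at your company.

```python
from datetime import datetime


def mttr_minutes(incidents):
    """Mean minutes from open to resolve, skipping unresolved incidents.

    Each incident is a dict with ISO-8601 strings under the assumed keys
    "created_at" and "resolved_at" (hypothetical export format).
    """
    durations = []
    for inc in incidents:
        if not inc.get("resolved_at"):
            continue  # still open, so it has no recovery time yet
        opened = datetime.fromisoformat(inc["created_at"])
        resolved = datetime.fromisoformat(inc["resolved_at"])
        durations.append((resolved - opened).total_seconds() / 60)
    return sum(durations) / len(durations) if durations else None


# Made-up sample data: 30 minutes, 90 minutes, and one unresolved incident.
sample = [
    {"created_at": "2024-01-01T10:00:00", "resolved_at": "2024-01-01T10:30:00"},
    {"created_at": "2024-01-02T08:00:00", "resolved_at": "2024-01-02T09:30:00"},
    {"created_at": "2024-01-03T12:00:00", "resolved_at": None},
]
print(mttr_minutes(sample))  # 60.0
```

Whether unresolved incidents are skipped or counted up to "now" is one of the places where definitions tend to diverge.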
a
I built a tool a long time ago that exported their data and then built graphs off it. At this point I would just dump their data into whatever BI tools you use. I haven’t tried their analytics interface. The problem was always that we wanted to group events, so we built rules around certain pages being cascading/dependent (some explicit, some heuristic).
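A rough sketch of what that kind of grouping could look like, combining one explicit rule table with one time-window heuristic. This is not the tool described above; `DEPENDS_ON`, `WINDOW`, the service names, and the page fields are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical explicit rules: a page on the key service is treated as a
# cascading effect of a recent page on the value service.
DEPENDS_ON = {"api": "database", "web": "api"}
# Assumed heuristic: related pages must open within 15 minutes of each other.
WINDOW = timedelta(minutes=15)


def group_pages(pages):
    """Fold related pages into incidents; returns a list of incident dicts."""
    incidents = []
    for page in sorted(pages, key=lambda p: p["opened"]):
        for inc in incidents:
            related = (
                page["service"] in inc["services"]
                or DEPENDS_ON.get(page["service"]) in inc["services"]
            )
            if related and page["opened"] - inc["last_opened"] <= WINDOW:
                inc["services"].add(page["service"])
                inc["last_opened"] = page["opened"]
                inc["resolved"] = max(inc["resolved"], page["resolved"])
                break
        else:
            # No matching incident, so this page starts a new one.
            incidents.append({
                "services": {page["service"]},
                "opened": page["opened"],
                "last_opened": page["opened"],
                "resolved": page["resolved"],
            })
    return incidents


def t(hour, minute):
    return datetime(2024, 1, 1, hour, minute)


# Made-up pages: a database outage cascading to api and web, then a later web page.
pages = [
    {"service": "database", "opened": t(10, 0), "resolved": t(10, 40)},
    {"service": "api", "opened": t(10, 5), "resolved": t(10, 45)},
    {"service": "web", "opened": t(10, 10), "resolved": t(10, 50)},
    {"service": "web", "opened": t(14, 0), "resolved": t(14, 10)},
]
incidents = group_pages(pages)
durations = [(i["resolved"] - i["opened"]).total_seconds() / 60 for i in incidents]
print(len(incidents), durations)  # 2 [50.0, 10.0]
```

With grouping, the four pages count as two incidents (50 and 10 minutes) instead of four separate recoveries, which changes the resulting MTTR.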
rootly.com or the other incident management tools probably make it easier.
@Matt Wilson looks like PD just acquired Jeli, so your problem is going to get way easier.
m
Really appreciate the engagement @Andrew Fong!
a
@Matt Wilson, I would highly recommend considering the Grafana and Prometheus stack for automating MTTR (Mean Time To Recovery) in your projects. This combination has proven to be highly effective in various projects, and it comes with a wealth of community contributions and strong support. The flexibility and capabilities offered by Grafana and Prometheus make them a reliable choice for monitoring, alerting, and incident response automation, ultimately helping to streamline the recovery process and reduce downtime.
j
Thanks for the plug @Andrew Fong! Hey @Matt Wilson, this is a perfect use case for us at Rootly.com. We are helping Figma, LinkedIn, and 100s more do this. Want to drop me a line at jj@rootly.com and I’ll take you through it personally? (I am the CEO here!) And yes, I realize how late I was here!
m
Nice to meet you JJ - and congrats on your Forbes 30 under 30 selection!