# platform-leadership
Hey all, anyone have a good on-call policy they'd be willing to share with me? I'm looking to start one for my small platform team
Are you looking for policy language ("Thou shalt have a responder to thy outages ...") or an implementation template ("Oncall is an X-day rotation starting on A and ending on B")?
Both would be helpful, as I'm starting from (mostly) nothing. We have a standard incident response policy and SLAs based on incident severity, but have never had a formal on-call policy for after hours
For the policy side, I'd probably just tweak the response policy to state the expectation that severity X through Y incidents will include after-hours support from the oncall, and everything else will be handled during normal business hours.
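To make that policy line concrete, here's a rough sketch of the rule as code. Everything specific in it is an assumption for illustration: the severity numbers, the 09:00 to 17:00 Monday to Friday business hours, and the function names are made up, so swap in whatever your incident response policy already defines.

```python
from datetime import datetime

# Hypothetical values: the policy above leaves "severity X through Y" open.
AFTER_HOURS_SEVERITIES = {1, 2}
BUSINESS_START_HOUR = 9   # assumed local start of business hours
BUSINESS_END_HOUR = 17    # assumed local end of business hours


def is_business_hours(when: datetime) -> bool:
    """True on Monday to Friday between the assumed start and end hours."""
    is_weekday = when.weekday() < 5
    return is_weekday and BUSINESS_START_HOUR <= when.hour < BUSINESS_END_HOUR


def pages_oncall_after_hours(severity: int, when: datetime) -> bool:
    """True only when the incident is outside business hours AND severe
    enough to justify waking someone up. Everything else waits for the
    next business day."""
    if is_business_hours(when):
        return False  # handled by the team during the normal working day
    return severity in AFTER_HOURS_SEVERITIES
```

The point of writing it down this tightly is that it forces you to pick actual values for X and Y, which is usually where the real discussion happens.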
After that it's basically team-specific implementation details / workflow covering:
• scheduling
• rotation length / dates
• required response time
• how/when to deal with times of unavailability of the oncaller
• how/when to deal with excessive wear of the oncaller
• expectation of alerts that would trigger out-of-hours response
• expectation of actions for the oncaller
• expectation for escalation when the oncaller needs an assist
• expectation of project work during oncall
• tracking alerts / pages to the oncaller and regular lookback analysis
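For the scheduling, rotation length, and unavailability items, a minimal sketch of a round-robin rotation with per-day overrides could look like the following. The names, the 7-day shift length, and the override mechanism are all assumptions for illustration; a paging tool will normally handle this for you, so treat it as a way to reason about the rules rather than something to run in production.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, shift_days, shifts):
    """Return a list of (shift_start, shift_end, engineer) tuples.

    shift_end is exclusive. Engineers are assigned round-robin in the
    order given.
    """
    schedule = []
    people = cycle(engineers)
    for i in range(shifts):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days)
        schedule.append((shift_start, shift_end, next(people)))
    return schedule


def oncall_for(schedule, day, overrides=None):
    """Return who is on call on `day`, or None if the day is unscheduled.

    `overrides` maps a date to a covering engineer, which is one simple
    way to model planned unavailability (holiday, sick day, swap).
    """
    if overrides and day in overrides:
        return overrides[day]
    for shift_start, shift_end, engineer in schedule:
        if shift_start <= day < shift_end:
            return engineer
    return None


# Hypothetical three-person team on weekly shifts starting on a Monday.
schedule = build_rotation(["ana", "ben", "chi"], date(2024, 1, 1), 7, 4)
covering = oncall_for(schedule, date(2024, 1, 9), {date(2024, 1, 9): "chi"})
```

Keeping overrides separate from the base rotation also gives you a record of how often people needed cover, which feeds into the lookback analysis.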
Being overly prescriptive is often not helpful, because alerts/incidents are ambiguous by nature (the unambiguous ones really should have fixes prioritized quickly), so a lot of what you want to do is document expectations on outcomes and effects, and then use training and retrospectives to build the cultural muscle for how to get there.
The above is a good set of things to do if you need to throw something together and then start iterating on it. The Google SRE Workbook has a chapter on this with a lot of valuable insight on how to design alerts, the purpose of oncall, etc. Not all of it will align with your needs / purposes, but it's useful background that explains some of the why behind the points above.
I'd have a look at the guides incident.io have put together: https://incident.io/guide I've worked at some of the same companies its founders came from, and they're a really good group of people with a wealth of knowledge.
Yeah, that looks like a nice overview
The subtle point is that oncall is where your profession starts to turn into a lifestyle (e.g. now I have to have a laptop with me when I go to the gym, sometimes I need to be awake at <time I usually sleep>), so it's important to be very aware of the human impact and the cultural ramifications, and to try to balance those with what the business realistically needs to succeed. It's iterative (as are most things lol), but it's one where investing in continuous learning so you can adapt quickly pays off pretty fast.
Anyway, I happen to be thinking about this stuff for $nextjob so you get a bit of a braindump. I hope it is helpful!