Hey y'all, do platform teams typically have on-cal...
# general
e
Hey y'all, do platform teams typically have on-call?
s
Yes - as far as I could tell.
l
I would say it's more common to be on-call than not, but it depends on the size of the company. A lot use a follow-the-sun system if they are fully remote, to ensure someone is always within working hours to support the customers.
s
They should be on-call, but specifically for incidents relating to the platform. They shouldn't be removing any on-call from stream-aligned teams, which ought to respond to issues with their own applications. There is usually an escalation path from the on-call stream-aligned folks to the on-call platform folks.
e
@Louise Ogilvy What's the sun system?
e
Follow the sun - 24/7 coverage with teams around the globe
l
Hi @Elliot Partridge absolutely that! It's more common in remote teams where they will hire engineers in all time zones to be available during working hours to support customers. It can help towards removing the need to expect engineers to be on-call :)
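To make the follow-the-sun idea concrete, here is a minimal sketch of picking which region is "on shift" from the current UTC hour. The region names and the three 8-hour windows are hypothetical, purely for illustration; real rotations depend on where the engineers actually are.

```python
from datetime import datetime, timezone

# Hypothetical regions with UTC hour windows (start inclusive, end exclusive).
SHIFTS = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("AMER", 16, 24),
]


def region_on_shift(now: datetime) -> str:
    """Return the region whose working-hours window covers `now`."""
    hour = now.astimezone(timezone.utc).hour
    for region, start, end in SHIFTS:
        if start <= hour < end:
            return region
    # Only reachable if the table above leaves a gap in the 24 hours.
    raise ValueError("shift table does not cover all 24 hours")


if __name__ == "__main__":
    print(region_on_shift(datetime.now(timezone.utc)))
```

Because the windows cover the full day, there is always a team in working hours, which is what lets this setup reduce out-of-hours paging.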
b
At Slice our DevOps/Platform/SRE function is on call as we are the only ones who have a fully holistic view of all the services.
a
I wouldn’t go signing a large and expensive support contract - I’ll put it that way. It really depends on your culture. If you have a culture where everyone takes ownership of their work, and the platform is designed properly with a self-service type of setup, then most likely your platform team has also designed self-healing applications, rolling updates etc. So in this space I’d argue that one person from the Platform team on call should suffice - it really does depend on team dynamics, platform capabilities, culture etc.
Having this issue right now with a client where we have a support contract with them - but if they just built the platform properly they could get rid of us (trying to automate myself out of a job - that’s always the aim of the game!) However a lot of it is falling on deaf ears - which highlights a huge cultural problem (could be backed by financial issues that I have no visibility on).
k
We have ;)
b
"(trying to automate myself out of a job - that’s always the aim of the game!)"
Always this.
"However a lot of it is falling on deaf ears - which highlights a huge cultural problem"
People / Politics are the 8th layer of the 7-layer OSI model.
a
@Bob Eckert if I post that on LinkedIn, should I expect calls from lawyers? That is bloody terrific
j
One takeaway is also to have platform on-calls but not to page them for every outage, since they may not be the ones causing it. Especially for internal dev platforms, a team needs to distinguish the cause from the owner, i.e. a broken user flow because of a wrong config switch has to be fixed by the config owner…not the person running the platform.
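A rough sketch of that "cause vs. owner" distinction as a routing rule. Everything here (the `Alert` fields, the team names, the fallback order) is a hypothetical illustration, not how any particular paging tool models it.

```python
from dataclasses import dataclass
from typing import Dict, Optional

PLATFORM_TEAM = "platform"  # hypothetical team name


@dataclass
class Alert:
    service: str                       # where the symptom shows up
    cause_owner: Optional[str] = None  # team owning the suspected cause, if known


def team_to_page(alert: Alert, service_owners: Dict[str, str]) -> str:
    """Page the owner of the cause; fall back to the service owner, then platform."""
    if alert.cause_owner is not None:
        return alert.cause_owner
    return service_owners.get(alert.service, PLATFORM_TEAM)


if __name__ == "__main__":
    owners = {"checkout": "payments-team"}
    # A wrong config switch breaks checkout: the config owner gets paged.
    print(team_to_page(Alert("checkout", cause_owner="config-team"), owners))
```

The platform team is only the last resort here, which matches the escalation-path idea mentioned earlier in the thread.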
v
Follow-up on this: how do you manage your on-calls? Is there a tool out there that is widely adopted? Or is this just an entry in calendars?
s
I've used PagerDuty to manage schedules and to sound the alarm. It escalates if the on-call doesn't acknowledge, so you can handle issues with people being out-of-signal (or really heavy sleepers!)
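For anyone unfamiliar with the escalate-on-no-acknowledgement behaviour, here is a minimal sketch of the idea. This is not PagerDuty's API; `escalate` and the `page_and_wait_for_ack` callback are made-up names, and a real tool would also handle timeouts, retries and notification channels.

```python
from typing import Callable, Optional, Sequence


def escalate(
    chain: Sequence[str],
    page_and_wait_for_ack: Callable[[str], bool],
) -> Optional[str]:
    """Page each responder in order; return the first who acknowledges, else None.

    `page_and_wait_for_ack` is a hypothetical callback that pages one responder
    and returns True only if they acknowledge within the allowed window.
    """
    for responder in chain:
        if page_and_wait_for_ack(responder):
            return responder
    return None  # nobody acknowledged; the caller decides what happens next


if __name__ == "__main__":
    # Simulate a primary who is out of signal and a secondary who answers.
    acked_by = {"secondary"}
    print(escalate(["primary", "secondary", "manager"], lambda r: r in acked_by))
```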
j
From the bottom of my heart, do not try and build your own paging platform. There are great SaaS solutions out there. 😄 On-call work is not just alerting. It's noise management, it's incident analytics, it's an auto-recovery enabler, a great gateway to enable efficient automated incident comms, and at its core a schedule and escalation tool for your on-call teams.
v
Oh no, I am not building one 😅. Just trying to understand widely used systems so that we can integrate that in our product.
a
@Vivek Dwivedi we use PagerDuty also - though an ugly beast, a working beast 🙂