# platform-leadership
s
A question for the build-fast, break-fast teams out there: how do you ensure reliability during phases that require high feature velocity? I assume there are some cutbacks on time spent on testing. Are there any unique practices your team follows to reduce risk?
s
The research (DORA, SlashData, et al.) suggests that with good technical practices, throughput and stability can both improve. There's no trade-off. If you sacrifice stability to get throughput, it suggests a puzzle piece is missing. It's okay to "break things" in the sense of trying an idea and finding it's not helping you achieve your goal. It's not so much about bringing services to their knees with every deployment 😄 Throughput and stability both enhance your ability to run more experiments with your software. That's the kind of "move fast" that's really desirable.
s
Toil reduction through automation is one of the best technical practices to follow as it aids consistency.
j
one of the strategies i’ve seen used at scale is a production readiness scorecard / checklist that needs to be satisfied before a new service gets deployed to production. you pay for the reliability upfront, so expect an initial “slowdown” as a one-time startup cost that will pay dividends in the long run
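For anyone who wants something concrete, here is a minimal sketch of what a readiness scorecard could look like in code. Everything in it is an assumption for illustration: the check names, the `required` flag, and the 0.75 threshold are hypothetical, not taken from any particular scorecard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    passed: bool
    required: bool = False  # hypothetical flag: a failed required check blocks launch


def score(checks):
    """Return the fraction of checks that passed (0.0 when there are none)."""
    if not checks:
        return 0.0
    return sum(1 for c in checks if c.passed) / len(checks)


def ready_for_production(checks, threshold=0.75):
    """Ready when no required check fails and the score meets the threshold."""
    blockers = [c.name for c in checks if c.required and not c.passed]
    return not blockers and score(checks) >= threshold


# Hypothetical scorecard for one new service.
service_checks = [
    Check("has_runbook", True, required=True),
    Check("has_alerting", True, required=True),
    Check("has_dashboards", True),
    Check("load_tested", False),
    Check("has_rollback_plan", True),
]

print(f"score={score(service_checks):.2f}",
      f"ready={ready_for_production(service_checks)}")
```

Splitting required checks from the overall score is one possible design choice: it lets a team ship with a few gaps while still blocking on the items nobody wants to skip.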
s
That's a good starting point for sure @Jordan Chernev, do you have an example checklist or a reference that I could use?
j
you can search on google for an actual example. this one is always a great starting point - https://sre.google/sre-book/launch-checklist/
i also found a lot of results just now by searching for “production readiness checklist template” on google
some interesting results, including cached .docx files
look around…
s
Be mindful that checklists can quickly become an antipattern. They're easily gamed, ignored, or pushed to the side by teams, and overcome by events when things go sideways. @Jordan Chernev when you say at scale, are you talking about scale for a specific service, number of services brought up in a period of time, or total number of teams supported by a platform? I'm trying to think of instances from my clients where I've seen checklists provide greater value than heartache over the long term, and I can't remember any. Teams tend to see checklists as impediments rather than value-adding outputs. So building automation and templated paths to production that test go-live capabilities at agreed-upon service level objectives is more effective and efficient, and reduces latency in feedback loops. If you're talking real platform engineering, building or improving paths to production alongside your developers is the way to go. Checklists just put a wall between platform teams and developers.
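As a rough illustration of testing go-live capability against an agreed-upon service level objective, here is a minimal sketch of a gate that a path to production could run. The function names, the request counts, and the 99.9% target are all hypothetical, and a real pipeline would pull these numbers from its own monitoring rather than hard-code them.

```python
def availability(total_requests, failed_requests):
    """Fraction of requests that succeeded during the test window."""
    if total_requests <= 0:
        raise ValueError("need at least one request to measure availability")
    return 1 - failed_requests / total_requests


def slo_gate(total_requests, failed_requests, slo_target=0.999):
    """Return True when measured availability meets the agreed SLO target."""
    return availability(total_requests, failed_requests) >= slo_target


# Hypothetical results from a pre-production soak test.
print(slo_gate(100_000, 50))   # 99.95% measured against a 99.9% target
print(slo_gate(100_000, 200))  # 99.80% measured against a 99.9% target
```

The point of the sketch is that the developer gets a pass or fail from the pipeline within the same feedback loop as the push, instead of a form to fill in.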
j
"number of services brought up in a period of time, or total number of teams supported by a platform"
these two
"So building automation and templated paths to production that test go-live capabilities at agreed-upon service level objectives is more effective and efficient, and reduces latency in feedback loops."
the two aren’t orthogonal in my mind
create your checklist, start templatizing / automating against it
"If you’re talking real platform engineering, building or improving paths to production alongside your developers is the way to go. Checklists just put a wall between platform teams and developers."
somewhat agree. we are discussing SRE practices in this thread, less so platform engineering
there is a difference in terms of missions and concerns between PEs and SREs
the chart in the middle of the article is great
s
Platform Engineers at their highest level of maturity should care about all of those things. It's one of the reasons why a strong platform engineering team is not easy to build. DevOps and SRE encompass practices that platform engineering teams must also become proficient in. It's a little bit like Maslow's hierarchy of needs. For Platform Engineers, you start at technical proficiency, grow into operational efficiency, and eventually move on to a fully realized developer experience. I don't agree that mature SRE, DevOps, and Platform Engineering teams shouldn't have full overlap. There's no need for that kind of separation of duties in a performant organization. The divisions should be business lines, not technical lines. Anyhow, if you're talking about a checklist that you build more automation against over time, that makes more sense. Something that automatically validates the feature on its way to prod, in an observable way, as a dev pushes code is a Good Thing, and I understand that you can't do it all at once. Checklists that become a static wall that developers must bang their heads against are not a Good Thing. It's a little too easy for the artifact to become a monument if the platform team isn't focused on user-centricity and feedback loops.
j
i feel like we are closer to agreement here than not. in case it helps, most of my experience and lens here is colored by being a member of a multi-thousand technologist community, e.g. 4k+, made up of 620+ product teams across 5 timezones
s
Probably. And I won't get out a 📏.
a
managed delivery is the answer. Checklists are awful :)
s
Whenever I see a checklist, I get a premonition of automation! You sometimes have to create the checklist (temporarily) to map out what's happening... but most checkbox exercises can and should be automated.
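A minimal sketch of that progression, assuming a hypothetical service description stored as a plain dict: each checklist item either carries an automated check or is still marked as manual, so the list doubles as a map of what is left to automate. The item wording and the dict keys are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Item:
    description: str
    # None means the item is still a manual checkbox, not yet automated.
    automated_check: Optional[Callable[[dict], bool]] = None


def review(items, service):
    """Run every automated check; return (failed items, still-manual items)."""
    failures, manual = [], []
    for item in items:
        if item.automated_check is None:
            manual.append(item.description)
        elif not item.automated_check(service):
            failures.append(item.description)
    return failures, manual


# Hypothetical service metadata and checklist.
service = {"owner": "team-payments", "alerts": ["high_error_rate"], "runbook_url": ""}

checklist = [
    Item("Service has a named owner", lambda s: bool(s.get("owner"))),
    Item("At least one alert is configured", lambda s: len(s.get("alerts", [])) > 0),
    Item("Runbook is linked", lambda s: bool(s.get("runbook_url"))),
    Item("Rollback has been rehearsed"),  # not automated yet
]

failures, manual = review(checklist, service)
print("failed:", failures)
print("still manual:", manual)
```

As more items gain an `automated_check`, the manual list shrinks, which is one way to keep the temporary checklist from becoming permanent.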
j
steve gets it
s
Usually 5 minutes after everyone else though 😄