Excellent write-up by Reddit engineering on the Pi Day outage. This team has a strong availability culture, but even so, upgrading Kubernetes versions and components, which was the trigger for the 03/14 outage, remains a big challenge.
I'd love to hear about your experience managing upgrades.
https://www.reddit.com/r/RedditEng/comments/11xx5o0/you_broke_reddit_the_piday_outage/
Blue-green and Argo CD, using Terraform. We do not trust in-place upgrades, and even so there are cluster-wide components that have the same blast radius.