# general
👋🏼 Hey friends, I was looking at the recent Reddit and GitHub outages due to config changes, so I decided to write about the challenges of upgrades (including k8s upgrades) and their downstream impact! I'd love to hear about your experiences and what other challenges you have seen in the process.

TL;DR: Why are Kubernetes upgrades so challenging? Why isn't a Kubernetes upgrade as easy as an iPhone upgrade?
1. Kubernetes isn't, and shouldn't be, vertically integrated
2. You don't know what'll break before it breaks
3. Application performance impact is hard to quantify
4. Stateful sets are pets… in a universe of cattle
5. Rollbacks are a pain
6. Components hit end-of-support/end-of-life frequently
7. Getting an upgrade right takes a lot of time
8. There is no way to learn from and avoid each other's mistakes
9. Communication between Platform & App teams isn't streamlined
10. Safe change management is hard. It's 10x harder for infrastructure like k8s

Details of each point: https://www.linkedin.com/posts/fawadkhaliq_eks-gke-aks-activity-7049432203713269760-Bv02/
I just let the upgrade bake in our qa/stage clusters, which mostly works 😂. It's not that bad, though; granted, they're small and probably not as complicated as bigger companies', and they're also on EKS, so that helps.
Maybe I'll eventually get around to creating a testing pipeline for them: spin up a cluster, do an upgrade, and spin it down on demand.
Nice! Simple is better: fewer downstream dependencies. Great to hear you have automation in place -- it should serve as an inspiration for other folks. I'd love to hear the details and share with the rest of the world how you got there!
Mostly manual TF for now, but it's basically a blue/green cluster deploy: stand up the new cluster, drain pods into it, destroy the old cluster, let it bake a week, then do the same on stage, and if that works, prod. I want to move it to a GitLab pipeline and use that to spin up short-lived clusters and upgrade them to see if anything fails. I'll probably end up moving to Cluster API instead of TF by then though 😂
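That flow can be sketched roughly like this. The context names (`blue`/`green`), the node label, and the terraform variables are all hypothetical placeholders, and the DNS/traffic cutover step is omitted since it depends on your setup; with `DRY_RUN=1` (the default) each step is printed instead of executed:

```shell
# Rough sketch of a blue/green cluster upgrade. "blue" (old) and "green"
# (new) kubeconfig contexts, the node label, and the terraform variables
# are placeholders. DRY_RUN=1 prints each step instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# 1. Stand up the green cluster at the target version.
run terraform apply -var cluster_name=green -var k8s_version=1.29

# 2. Re-create the workloads on green.
run kubectl --context green apply -f manifests/

# 3. Cordon and drain blue's workers so workloads shift to green.
run kubectl --context blue drain -l node-role.kubernetes.io/worker \
  --ignore-daemonsets --delete-emptydir-data

# 4. After the bake period, tear down blue.
run terraform destroy -var cluster_name=blue
```

With `DRY_RUN=0` each step runs for real; in a GitLab pipeline you'd make each numbered step its own job so failures are visible per stage.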
One interesting thing I noticed in a recent upgrade was the lack of transparency around how external cloud components, such as load balancers, are managed by the cloud controller. My client experienced an outage when they found the static IP for their Ingress load balancer had changed. --- Later on, we did a blue/green cluster, tested it, and switched the DNS entry. I find the blue/green strategy much better and less risky, but obviously more expensive.
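One way to guard against that particular surprise on GKE is to reserve a named static IP out-of-band and pin the Ingress to it, so the address survives controller or cluster changes. This is a hypothetical example: the IP name `web-ip` and the host are placeholders, and other clouds use different annotations (e.g. the AWS load balancer controller's EIP allocations for NLBs):

```yaml
# Assumes a global static IP reserved up front, e.g.:
#   gcloud compute addresses create web-ip --global
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    kubernetes.io/ingress.global-static-ip-name: web-ip
spec:
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```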
Thanks for sharing this @Fawad Khaliq, I had exactly the same thing on my mind yesterday when I was talking to my friend: how to do upgrades seamlessly.
@Fawad Khaliq thanks for sharing. Our team published an article a while ago on how to gain enough confidence to do the upgrades. The challenges and the do's and don'ts are captured in this nice article written by one of our teammates. https://medium.com/scout24-engineering/how-did-we-upgrade-our-eks-clusters-from-1-15-to-1-22-without-k8s-knowledge-2c96c1a94cc1
A major point we miss during a k8s upgrade is which APIs are deprecated in the new Kubernetes version. GKE has a wonderful solution for this: it lists exactly which APIs are deprecated, which gives a proper heads-up to plan the upgrade.
Fairwinds Pluto will help with listing deprecated APIs, for those of us not on GKE 😀
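For a quick-and-dirty version of the same check, you can grep your rendered manifests. This is a rough sketch, not a substitute for Pluto, and the apiVersion list is a small illustrative subset (each was removed in a different release: `networking.k8s.io/v1beta1` Ingress in v1.22, the other two in v1.25):

```shell
# Scan rendered manifests for apiVersions a target release removes.
# The list here is a small illustrative subset, not exhaustive.
deprecated='batch/v1beta1|policy/v1beta1|networking.k8s.io/v1beta1'

check_manifests() {
  # Prints file:line for any manifest still using a removed apiVersion.
  # grep exits 0 when it finds offenders, so invert (!) in a CI gate.
  grep -rnE "apiVersion: *(${deprecated})" "$1"
}
```

Usage: `check_manifests rendered-manifests/` lists offending files, and `! check_manifests rendered-manifests/` fails a CI job when any are found.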
Yes, correct, the tool is helpful too.
With each upgrade we deploy a ClusterPolicy (using Kyverno) that flags resources matching APIs deprecated in the next version. ref: https://kyverno.io/policies/best-practices/check_deprecated_apis/check_deprecated_apis/
Closer to the upgrade time, change the policy's `validationFailureAction` from `audit` to `enforce` to block deprecated resources from getting created.
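For reference, a condensed sketch of what such a policy looks like. This is a trimmed illustration, not the full policy from the link; the single rule below only covers `batch/v1beta1` CronJobs, which are removed in v1.25:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: check-deprecated-apis
spec:
  validationFailureAction: audit  # flip to `enforce` closer to upgrade time
  background: true
  rules:
    - name: flag-batch-v1beta1-cronjob
      match:
        any:
          - resources:
              kinds:
                - batch/v1beta1/CronJob
      validate:
        message: "batch/v1beta1/CronJob is removed in v1.25; migrate to batch/v1."
        deny: {}
```

In `audit` mode matching resources are only reported in policy reports; `enforce` rejects them at admission time.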