https://platformengineering.org
#general

Fawad Khaliq

04/05/2023, 8:49 PM
👋🏼 Hey friends, I was looking at the recent Reddit and GitHub outages caused by config changes, so I decided to write about the challenges of upgrades (including k8s upgrades) and their downstream impact! I'd love to hear about your experiences and what other challenges you have seen in the process.
TL;DR: Why are Kubernetes upgrades so challenging? Why isn’t a Kubernetes upgrade as easy as an iPhone upgrade experience?
1. Kubernetes isn’t, and shouldn’t be, vertically integrated
2. You don’t know what’ll break before it breaks
3. Application performance impact is hard to quantify
4. Stateful sets are pets… in a universe of cattle
5. Rollbacks are a pain
6. Components hit end-of-support/end-of-life frequently
7. Getting an upgrade right takes a lot of time
8. There is no way to learn from and avoid each other’s mistakes
9. Communication between Platform & App teams isn’t streamlined
10. Safe change management is hard. It’s 10x harder for infrastructure k8s
Details of each point: https://www.linkedin.com/posts/fawadkhaliq_eks-gke-aks-activity-7049432203713269760-Bv02/

Hugo Pinheiro

04/05/2023, 8:58 PM
I just let the upgrade bake in our qa/stage clusters, which mostly works 😂, but it's not that bad. Granted, they're small and probably not as complicated as bigger companies' clusters, and they're also on EKS, so that helps.
Maybe I'll eventually get around to creating a testing pipeline that spins them up, does an upgrade, and spins them down on demand.

Fawad Khaliq

04/05/2023, 9:02 PM
Nice! Simple is better. Fewer downstream dependencies. Great to hear you have automation in place -- should serve as an inspiration for other folks. I'd love to hear the details of how you got there and share them with the rest of the world!

Hugo Pinheiro

04/05/2023, 9:07 PM
Mostly manual TF for now, but basically it's just a blue-green cluster deploy: drain pods into the new cluster, destroy the old cluster, let it bake for a week, then do the same on stage, and if that works, prod. I want to move it to a GitLab pipeline and then use that to spin up short-lived clusters and upgrade them to see if anything fails. I'll probably end up moving to Cluster API by then, though, instead of TF 😂
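The qa, stage, prod promotion with a one-week bake described above could be gated roughly as in the sketch below. This is only an illustration, not the pipeline from the message: the names `PROMOTION_ORDER`, `BAKE_TIME`, and `next_environment` are hypothetical, and the health flags are assumed to come from whatever checks run during the bake.

```python
from datetime import date, timedelta

# Hypothetical promotion order and bake time, loosely based on the message above.
PROMOTION_ORDER = ["qa", "stage", "prod"]
BAKE_TIME = timedelta(days=7)


def next_environment(upgraded_on, healthy, today):
    """Return the next environment to upgrade, or None if nothing is ready.

    upgraded_on: dict of environment name -> date its new cluster went live
    healthy: dict of environment name -> True if the bake found no problems
    today: the date to evaluate against
    """
    for index, env in enumerate(PROMOTION_ORDER):
        if env in upgraded_on:
            continue  # already upgraded, look at the next environment
        if index == 0:
            return env  # nothing upgraded yet, start with the first environment
        previous = PROMOTION_ORDER[index - 1]
        baked = today - upgraded_on[previous] >= BAKE_TIME
        if baked and healthy.get(previous, False):
            return env
        return None  # the previous environment is still baking or is unhealthy
    return None  # every environment is already upgraded


if __name__ == "__main__":
    status = {"qa": date(2023, 4, 1)}
    print(next_environment(status, {"qa": True}, date(2023, 4, 8)))
```

A pipeline job could call a function like this on a schedule and only trigger the next blue-green rollout when it returns an environment name.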

Anjul Sahu

04/06/2023, 4:42 AM
One interesting thing I noticed in a recent upgrade was the lack of transparency in how external cloud components, such as load balancers, are managed by the cloud controller. My client experienced an outage when the static IP of their Ingress load balancer changed. Later on, we built a blue-green cluster, tested it, and switched the DNS entry. I find the blue-green strategy much better and less risky, but obviously it's expensive.
Thanks for sharing this @Fawad Khaliq, I had very similar things on my mind yesterday when I was talking to my friend -- how to do upgrades seamlessly.

sumit nimbalkar

04/06/2023, 9:27 AM
@Fawad Khaliq thanks for sharing. Our team published an article a while ago on how to gain enough confidence to do the upgrades. The challenges and the Do’s and Don’ts are captured in this nice article written by one of our teammates. https://medium.com/scout24-engineering/how-did-we-upgrade-our-eks-clusters-from-1-15-to-1-22-without-k8s-knowledge-2c96c1a94cc1

Nitish Chauhan

04/06/2023, 10:23 AM
The major thing we miss during a k8s upgrade is which APIs are deprecated in the new version of Kubernetes. GKE has a wonderful solution for this: it lists which APIs are deprecated, which gives a proper heads-up to plan the upgrade.

Hugo Pinheiro

04/06/2023, 12:12 PM
Fairwinds Pluto will help with listing deprecated APIs, for those of us not on GKE 😀

Nitish Chauhan

04/06/2023, 12:12 PM
Yes, correct, that tool is also helpful.
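To illustrate the idea behind checking for deprecated APIs before an upgrade, here is a minimal sketch. It does not use Pluto or GKE; it only compares already-parsed manifests against a small hand-written table of API removals, and the names `REMOVED_APIS`, `parse_version`, and `find_removed` are hypothetical. A real tool ships a much fuller, maintained list.

```python
# Small hand-written subset of Kubernetes API removals, keyed by
# (apiVersion, kind) with the (major, minor) release that stopped serving them.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): (1, 25),
}


def parse_version(text):
    """Turn '1.25' or 'v1.25.3' into the tuple (1, 25)."""
    major, minor = text.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def find_removed(manifests, target_version):
    """List manifests whose API is no longer served in target_version.

    manifests: iterable of dicts, each a parsed Kubernetes manifest
    Returns tuples of (name, apiVersion, kind, removed_in).
    """
    target = parse_version(target_version)
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed_in = REMOVED_APIS.get(key)
        if removed_in is not None and target >= removed_in:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append((name, key[0], key[1], "%d.%d" % removed_in))
    return findings


if __name__ == "__main__":
    sample = [
        {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "nightly"}},
        {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
    ]
    for finding in find_removed(sample, "1.25"):
        print(finding)
```

Running a check like this against the next target version before the upgrade gives the same kind of heads-up discussed above, without touching the cluster.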

Irfan

04/06/2023, 12:34 PM
With each upgrade we deploy a ClusterPolicy (using Kyverno) that flags resources matching APIs deprecated in the next version. Ref: https://kyverno.io/policies/best-practices/check_deprecated_apis/check_deprecated_apis/
Closer to upgrade time, we change the policy from `audit` to `enforce` to block deprecated resources from being created.
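The audit-to-enforce switch could look something like the sketch below, which works on a policy already loaded into a Python dict. It assumes a Kyverno version that reads the policy-level `spec.validationFailureAction` field with the values `Audit` and `Enforce`; newer Kyverno releases set the action per rule instead, so check the docs for your version. The function name `set_failure_action` and the sample policy are hypothetical.

```python
import copy

ALLOWED_ACTIONS = ("Audit", "Enforce")


def set_failure_action(policy, action):
    """Return a copy of a Kyverno policy dict with the failure action changed.

    Assumes the policy-level field spec.validationFailureAction is in use.
    The input dict is left untouched so the change can be reviewed as a diff.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError("action must be one of %s" % (ALLOWED_ACTIONS,))
    updated = copy.deepcopy(policy)
    updated.setdefault("spec", {})["validationFailureAction"] = action
    return updated


if __name__ == "__main__":
    # Hypothetical, heavily trimmed policy; the rules are omitted here.
    audit_policy = {
        "apiVersion": "kyverno.io/v1",
        "kind": "ClusterPolicy",
        "metadata": {"name": "check-deprecated-apis"},
        "spec": {"validationFailureAction": "Audit", "rules": []},
    }
    enforce_policy = set_failure_action(audit_policy, "Enforce")
    print(enforce_policy["spec"]["validationFailureAction"])
```

Keeping the policy in `Audit` during normal operation and flipping it shortly before the upgrade matches the workflow described above: teams get warnings first, and only later are deprecated resources blocked.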