# kubernetes
t
Welcome to AKS, I guess... In theory, I'll create a new node group, commit those changes, spin up the nodes, then delete the old node group with the next PR. However, IN PRACTICE I often find that the AKS API does a woefully inadequate job of managing changes like these and gets hung up trying to migrate workloads... so before I delete the legacy node group I taint all the nodes and start killing off pods that I know are replicated on my good nodes. Workloads like redis and postgres get special attention (the actual data is stored in persistent volumes, so I won't have a catastrophic loss if all my pods are gone at once, but a lot of things rely on those stateful dbs). Basically, I'm prepping the cluster to give the AKS API as easy a job as possible. Once the legacy node group is prepped, I commit the deleted node group to IaC and watch the cluster to make sure it doesn't get hung up on any nodes. For node upgrades I commit the change to IaC and run it, then immediately taint all the nodes and "help" the AKS process when it gets stuck on nodes. Regardless of how I try to upgrade AKS or change the nodes, I find it's an awfully manual process. The Azure portal has the same issues, and if your upgrade or change times out, it's kind of a pain in the ass to fix. I just find it much easier to shepherd k8s through the process.
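For anyone following along, the taint-and-evict prep step is roughly this (a minimal sketch; the node pool name and `agentpool` label are assumptions, and `kubectl drain` stands in for the selective pod killing described above):

```bash
# Prep a legacy AKS node pool before deleting it from IaC.
# "legacypool" and the agentpool label value are placeholders for illustration.

# Taint every node in the old pool so nothing new schedules there
kubectl taint nodes -l agentpool=legacypool decommissioning=true:NoSchedule --overwrite

# Evict replicated workloads gracefully (respects PodDisruptionBudgets);
# stateful pods like redis/postgres get handled one at a time by hand instead
for node in $(kubectl get nodes -l agentpool=legacypool -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
done
```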
t
I'm on AWS EKS and the problem is the same: we often have workloads that take longer than 15 mins (the EKS node termination timeout). These workloads are not checkpointed and we don't want to restart a lot of them. Terraform (or other IaC tooling) typically isn't aware of the state on the running machines, so all it can do is a "dumb" replacement of machines or node groups, assuming it's fine to destroy any machine as needed, and a timeout means failure. AFAIK all you can do is create a new node group next to it, then ensure all your workloads migrate "manually". We set the priority of the old node group to 0 for the ClusterAutoscaler, and I cordon the nodes to ensure nothing new starts on the old nodes in the meantime. Then, usually after some hours in our case, the old node group can be removed from Terraform. I'm all ears for better ways 🙂 (it's possible to make Terraform execute arbitrary commands, and use null resources to keep track in the state file of having executed those, but that's just making it worse)
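For reference, the two pieces of that (deprioritizing the old group and cordoning its nodes) look roughly like this. It's a sketch: it assumes the cluster-autoscaler is running with the priority expander enabled, and the node group names/patterns are placeholders.

```bash
# Deprioritize the old node group for the cluster-autoscaler priority expander.
# Assumes the autoscaler runs with --expander=priority; name patterns are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    0:
      - .*old-nodes.*
    100:
      - .*new-nodes.*
EOF

# Cordon the old nodes so nothing new is scheduled on them while workloads wind down
# (eks.amazonaws.com/nodegroup is the label EKS puts on managed node group nodes).
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=old-nodes -o name); do
  kubectl cordon "$node"
done
```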
t
One way I've scripted it in the past is to:
1. Deploy the new node pool
2. Apply taints to the old node pool
3. Take the HPAs/ReplicaSets/etc. and double them
4. Wait for all the workloads to be ready
5. Take the HPAs/ReplicaSets/etc. and halve them and wait for ready
6. Delete the old node pool
Definitely kind of the brute-force method, and it costs more, but it's a pretty simple process to automate.
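A rough sketch of steps 2-5, assuming the workloads sit behind HPAs; the namespace, label, and names are all made up for illustration:

```bash
# Steps 2-5 from above, sketched with kubectl. Namespace "myapp" and the
# agentpool=oldpool label are placeholders.

# 2. Taint the old node pool so new pods land on the new pool
kubectl taint nodes -l agentpool=oldpool retiring=true:NoSchedule --overwrite

# 3. Double the HPA minimums so a full second copy of each workload comes up
for hpa in $(kubectl get hpa -n myapp -o name); do
  min=$(kubectl get "$hpa" -n myapp -o jsonpath='{.spec.minReplicas}')
  kubectl patch "$hpa" -n myapp --type merge -p "{\"spec\":{\"minReplicas\":$((min * 2))}}"
done

# 4. Wait for everything to be ready (new pods can only schedule on the new pool)
kubectl wait --for=condition=Available deployment --all -n myapp --timeout=15m

# 5. Halve the minimums again once the new pool is serving
for hpa in $(kubectl get hpa -n myapp -o name); do
  min=$(kubectl get "$hpa" -n myapp -o jsonpath='{.spec.minReplicas}')
  kubectl patch "$hpa" -n myapp --type merge -p "{\"spec\":{\"minReplicas\":$((min / 2))}}"
done
```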
j
Yikes, this is a lot more hands-on than I expected! And there I was thinking our architecture might go as far as to have a GitOps tool (ArgoCD) actually create the clusters (Crossplane etc.) in a fleet... I guess that gets even worse unless you go down the route of building your own operator to manage that reconciliation... Has anyone considered treating clusters as entirely immutable? I.e. if you want to make any change at the AKS level (so not the k8s resources on the cluster) you would actually create a new cluster, somehow drain the old one, and switch over to the new one. So I guess a blue/green deployment approach, but at the AKS level, not the app level?
t
Well it depends on the workload ofc. When only running stateless APIs it's not that painful, but in practice we need more than that. In our setup I wouldn't go blue/green at the cluster level that quickly (even though we have everything in code), just because it takes a while (10-15 mins) to get a new cluster fully up and running, but it's also about the (cloud) resources outside of the cluster (e.g. load balancer, DNS entries, storage) that are managed by operators/controllers in the cluster. AFAIK it's not trivial to "hand over" ownership of those external resources from one cluster to another in a graceful / zero-downtime way.
t
So as terrible as this process sounds... it's MUCH better than it was a year or two ago; I've noticed some big improvements in the AKS control plane. Quite frankly, I don't mess around with my clusters all that much. I upgrade versions about twice a year to keep me well within the supported version range. To me, a blue/green deployment is WAY overkill for something that might take me about an hour per cluster twice a year. Furthermore, upgrading versions often requires some manual hands-on work anyways...

That being said, it's not a terrible strategy to have a blue/green cluster setup for other reasons (like regional failover) that you can switch between during an outage or an upgrade. If you can get away with the lag created by transactional replication in a db over that distance, then having a warm cluster sitting there waiting for traffic gives me all kinds of warm and fuzzy feelings. Some tools you can have at your disposal to make this a bit easier and more economical are Istio and KEDA. With KEDA you can keep all your workloads on the failover cluster scaled to 0 until you start getting requests on the cluster (so you can keep the cluster warm without it costing an arm and a leg). With Istio, you can put both clusters in the same service mesh and automatically route requests to the available cluster. Istio is pretty fault tolerant, so even in the midst of an upgrade or node pool scaling I would expect it to be able to route requests pretty reliably (just as long as you don't destroy the whole cluster at once or something).
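To make the KEDA part concrete, here's a minimal sketch of a ScaledObject that keeps a failover Deployment at zero replicas until requests start arriving. The names, namespace, Prometheus address, and query are all placeholders, and it assumes KEDA and a Prometheus scraping Istio metrics are installed on the failover cluster.

```bash
# Keep the failover copy of "api" scaled to zero until traffic arrives.
# Everything below (names, address, query, thresholds) is illustrative only.
kubectl apply -f - <<'EOF'
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-failover
  namespace: myapp
spec:
  scaleTargetRef:
    name: api                 # Deployment to scale
  minReplicaCount: 0          # stays at zero while the primary cluster takes the traffic
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(istio_requests_total{destination_service=~"api.*"}[2m]))
        threshold: "1"
EOF
```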