# kubernetes
r
I am on a team where our platform is rapidly expanding and we are looking to improve our DR. I currently believe the best strategy is either active-active (shift to multi-cluster horizontal scaling) or active-passive (same thing as active-active, but one cluster's LB rules are higher priority). I also believe that the overhead cost of having multiple clusters would not be that bad, since we would just be spreading the same number of worker nodes across two clusters instead of one. Are these assumptions correct? (Assume all our workloads are stateless.)
Any advice appreciated! Or links to good discussions on the topic. I found https://www.suse.com/c/rancher_blog/disaster-recovery-preparedness-for-your-kubernetes-clusters/ and have used this as a jumping off point for preparedness
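
To make the two options in the question concrete, here is a minimal sketch of how the routing decision differs. It is not any particular load balancer's API: the `Cluster` type, the cluster names, and the `healthy` flag are all hypothetical, and the health check itself is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cluster:
    name: str       # hypothetical cluster name
    priority: int   # lower number = preferred, like an LB rule priority
    healthy: bool   # health-check result, assumed to be supplied by the caller


def pick_active_passive(clusters: list[Cluster]) -> Optional[Cluster]:
    """Active-passive: all traffic goes to the best-priority healthy cluster."""
    healthy = [c for c in clusters if c.healthy]
    return min(healthy, key=lambda c: c.priority) if healthy else None


def pick_active_active(clusters: list[Cluster]) -> list[Cluster]:
    """Active-active: traffic is spread across every healthy cluster."""
    return [c for c in clusters if c.healthy]
```

In both cases the surviving cluster has to absorb the full load after a failure, so "the same number of worker nodes split in two" only works if each half can scale up quickly enough.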
m
I think it's sometimes helpful to distinguish between DR (expected failure modes) and Business Continuity (unexpected failure modes). All companies need BC, but many do not need DR, i.e. you are more likely to be affected by malicious or accidental damage than by a regional/zonal failure. Your best protection against this is the ability to recover or restore infrastructure rapidly.
One of the ways to do this is through blue-green cluster deployments, i.e. instead of upgrading or changing clusters in place, you provision a new cluster, shift traffic, and then delete the old cluster. In a DR/BC scenario you are just provisioning a new cluster in a new region/zone. It can, however, be impacted by zone capacity as others evacuate an affected region.
An active-active / active-warm multi-cluster setup can have lower availability and/or perform worse in a DR event (especially if whatever broke the first cluster also broke the second). An active-cold setup combines the best of both: a secondary cluster is kept running, but maintains a completely different deployment/maintenance/IAM profile to protect against common failures, while retaining enough capacity that when you need to recover you are not reliant on new infrastructure coming up.
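
The blue-green cluster flow described above (provision, shift traffic, delete the old cluster) could be sketched roughly like this. The four callables are hypothetical placeholders for whatever IaC and traffic tooling you already have, and the health gate before the traffic shift is an added assumption, not something stated above.

```python
from typing import Callable


def blue_green_cutover(
    old_cluster: str,
    provision: Callable[[], str],           # hypothetical: creates a cluster, returns its name
    is_healthy: Callable[[str], bool],      # hypothetical: smoke test / readiness check
    shift_traffic: Callable[[str], None],   # hypothetical: repoints DNS or LB at a cluster
    delete: Callable[[str], None],          # hypothetical: tears a cluster down
) -> str:
    """Replace old_cluster with a freshly provisioned one; return the live cluster."""
    new_cluster = provision()
    if not is_healthy(new_cluster):
        # Assumed safety gate: never move traffic to a cluster that failed checks.
        delete(new_cluster)
        return old_cluster
    shift_traffic(new_cluster)
    delete(old_cluster)  # the old cluster is removed only after traffic has moved
    return new_cluster
```

Running the same function against a different region or zone is then the DR/BC path, which is why exercising it on every routine upgrade keeps it trustworthy.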
r
Great points!
We have it all coded to spin up a new cluster, configure plugins, then deploy the apps/services but right now it's in pieces vs one unified blue-green pipeline. I personally am going to advocate for a solution like what you are describing (active-cold) but I am just trying to keep an open mind to other possibilities.
c
How about specifying your goals first? It is much easier to recommend something to improve your posture if we know what the target posture is meant to be. Also, the most obvious but often overlooked point: “Everybody has a plan until people start hitting you in the face.” Any DR/BC plan needs to be proven workable with fire drills, so that you can actually execute with confidence when you need to.
r
I understand every business is different and has different needs. I guess I should not have phrased it as "the best" strategy. I'm really just looking for anyone who has run clusters in an active-active, active-passive, or some other setup, and what they found the pros and cons of those implementations to be.
b
We went through this progression, and I agree 💯 with Jay and Moshe. What are the business goals/requirements? Is it uptime or is it continuity? Both are tied together and both influence each other. Costs scale, though, as you add 9's to that availability, and it usually is a combination of DR and HA.
Availability comes down to what Moshe said. Good CD and reliable deployments with blue/green or progressive deploys help a ton here. You are bound by the uptime of your infra as well. If you're in AWS, you can only guarantee as much uptime as the combined availability of the services you are using allows. If you're using S3 and it has an SLA per region of 99.9%, that's the best you can guarantee for your own stuff. For most people this is good enough.
If you have hot/cold (IaC using Terraform, standing up a new cluster from scratch in a disaster), you are going to consider RTO and RPO. How long does it take to recover, and how much data did you lose?
For a bit more infra cost and very little engineering overhead you can set up a warm standby. All the infra is there but scaled to zero, with replication across regions. This will get you a recovery time of 30 minutes or so, depending on how long it takes to make things hot.
Active/passive was the next thing we looked at. Again, you can do this without any engineering work: your applications do not need to be built to support multi-region. There is more cost because you're paying for compute that is sitting idle. But with this you can use AWS global services to automatically switch between regions. This will get you more availability but at a higher cost.
Then you get to active/active. You usually have to do this in 3 regions, and the apps you're serving have to work across regions. You'll need to geolocate, possibly shard your database across locations, etc. Cost scales up quite a bit in terms of infra and in terms of engineering effort.
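
For the RTO/RPO questions above, a small sketch of how you might score a fire drill or a real incident against targets. The function names and the three timestamps are hypothetical; where you get them from (backup logs, alerting, status page) is up to you.

```python
from datetime import datetime, timedelta


def measure_recovery(
    last_good_copy: datetime,    # newest backup/replica point that was restorable
    outage_start: datetime,      # when service was lost
    service_restored: datetime,  # when service was back
) -> tuple[timedelta, timedelta]:
    """Return (data_loss_window, recovery_time) for one incident or drill."""
    return outage_start - last_good_copy, service_restored - outage_start


def meets_targets(
    data_loss: timedelta, recovery: timedelta, rpo: timedelta, rto: timedelta
) -> bool:
    """True when the observed data loss and recovery time fit inside RPO and RTO."""
    return data_loss <= rpo and recovery <= rto
```

Feeding real drill timestamps into something like this turns "30 minutes or so" into a number you have actually measured.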
The long and short of it is, each 9 of availability you add with a combination of reliable deployments, failover/DR, and adding in multiple regions will increase costs, sometimes exponentially.
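
To put rough numbers on the 9's, here is a back-of-the-envelope sketch. It assumes failures are independent, which is optimistic (a bad deploy or a shared IAM mistake can take out every copy at once), and the helper names are made up for illustration.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def serial_availability(*parts: float) -> float:
    """Availability of a request path that needs every listed dependency to be up."""
    return math.prod(parts)


def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` independent replicas is enough."""
    return 1 - (1 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR
```

With these, 99.9% works out to about 525.6 minutes (roughly 8.8 hours) of downtime a year and 99.999% to about 5.3 minutes. Two dependencies at 99.9% each in series give about 99.8%, so stacking services lowers your ceiling, while two truly independent 99.9% regions would give about 99.9999% on paper.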
You usually won't need multi-region active/active unless you have a big global enterprise company that needs lower latency for all of the regions or needs 99.999% uptime. I've worked with big multinational banks and a huge consumer electronics company. They needed it and could afford it. The ROI was right for them as they'd lose a ton of money from any interruptions. For most other companies it's not worth the cost.
s
Active/Active also requires applications written/designed for it. Traditional designs with a single, transactionally consistent database can't just be turned into multi-master across inter-region latencies. Along the lines of what the others were saying, you need to look at your business requirements. Active/Active is an appealing engineering target, but quite possibly not worth the cost (in $ and time) to engineer, develop, and operate, or the limits it puts on velocity (as changes will have to be made with those constraints in mind). What's likely to change that:
1. True worldwide traffic where you can't just assign customers to a region.
2. Latency (but CDN design may fix that).
3. A CEO looking at the rare, short-term AWS regional outage and saying: I don't care how much it costs, don't let it happen to us.
Even if you get the perfect Active/Active design, it doesn't protect you from:
1. Programming errors/failure to handle events
2. Human/AI/automation mistakes
3. Attackers
4. Bizarre problems
Combined, those are usually more likely than even short-term region failures and probably won't be protected by Active/Active.