# general
b
How do you think about dividing responsibility between developers and ops for cluster instances vs. app instances? To me, it makes sense that developers should manage application CPU/memory and min/max instance count. But the cluster must be able to support that with sufficient instance sizes and counts. So do you have the developers manage that too? Or do ops manage it, setting an upper bound on the limit, so that developers have to collaborate with ops to get it increased beyond that? Or something else, like automatically setting the cluster max based on the max instance counts of all the applications?
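To make that last option concrete, here is a rough sketch of deriving a cluster max from the per-app maximums. Everything in it is an assumption for illustration: the `App` fields, the `cluster_max_nodes` name, the 25% headroom, and the simplification that nodes are identical and pack perfectly (real schedulers also have system overhead and fragmentation).

```python
import math
from dataclasses import dataclass


@dataclass
class App:
    """Hypothetical per-application limits, as a developer would declare them."""
    name: str
    cpu_per_instance: float      # cores requested per instance
    mem_per_instance_gib: float  # GiB requested per instance
    max_instances: int


def cluster_max_nodes(apps, node_cpu, node_mem_gib, headroom=0.25):
    """Upper bound on node count if every app scaled to its max at once.

    Assumes identical nodes and perfect packing; takes whichever of
    CPU or memory needs more nodes, after adding headroom.
    """
    total_cpu = sum(a.cpu_per_instance * a.max_instances for a in apps)
    total_mem = sum(a.mem_per_instance_gib * a.max_instances for a in apps)
    by_cpu = total_cpu * (1 + headroom) / node_cpu
    by_mem = total_mem * (1 + headroom) / node_mem_gib
    return math.ceil(max(by_cpu, by_mem))


apps = [
    App("api", cpu_per_instance=0.5, mem_per_instance_gib=1.0, max_instances=20),
    App("worker", cpu_per_instance=2.0, mem_per_instance_gib=4.0, max_instances=5),
]
print(cluster_max_nodes(apps, node_cpu=4, node_mem_gib=16))  # 7
```

With this approach the cluster max moves whenever a developer changes an app max, so it would still need some bound (a budget, for example) on top.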
t
Aim to give developers the ability to self-serve. Set bounds on the budget, not on the cluster size limit. Set KPIs on resource utilization vs. over-provisioning and overspending, and measure them. Make sure you utilize what you pay for (plus reasonable headroom). Automate FinOps reports and supply them to the teams for their own stuff. Give them data on how much they spent, how much they actually used, and whether they met their SLOs or not. Over time, teams will build responsibility and accountability for how much they spend and learn how to manage it.
Example scenario 1: you bought a lot, used 50% of it, and SLOs are not met. You know you bought too much, and the root cause of the performance issue probably lies somewhere other than infrastructure size.
Example scenario 2: services meet SLOs, you bought just right and used just right, but the business says you spend too much. Communicate to the business that engineering needs to shift focus from new feature development to optimization: find bottlenecks, identify what's underperforming, etc. Then the business can decide whether the spending is acceptable or not, and balance between development and optimization.
And in the end, neither ops nor devs are solely responsible for the infrastructure size; it's a collaborative effort between development, infra, and the business.
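A minimal sketch of what one row of such a per-team report could look like, and how the two scenarios could be flagged automatically. The field names, the `assess` function, and the 60% utilization threshold are all assumptions for illustration, not a standard FinOps schema.

```python
from dataclasses import dataclass


@dataclass
class TeamReport:
    """Hypothetical monthly report row for one team."""
    team: str
    purchased_cpu_hours: float
    used_cpu_hours: float
    cost_usd: float
    budget_usd: float
    slo_met: bool

    @property
    def utilization(self) -> float:
        return self.used_cpu_hours / self.purchased_cpu_hours


def assess(report: TeamReport, low_utilization: float = 0.6) -> str:
    """Map a report onto the two example scenarios; threshold is an assumption."""
    underused = report.utilization < low_utilization
    if underused and not report.slo_met:
        # Scenario 1: bought too much, so size is probably not the root cause.
        return "scenario 1: look for the root cause outside infrastructure size"
    if report.slo_met and not underused and report.cost_usd > report.budget_usd:
        # Scenario 2: sized right, but the business must weigh cost vs. features.
        return "scenario 2: business decides between optimization and features"
    return "no scenario matched"


r1 = TeamReport("payments", 10_000, 5_000, 8_000, 9_000, slo_met=False)
r2 = TeamReport("search", 10_000, 8_500, 12_000, 9_000, slo_met=True)
print(assess(r1))
print(assess(r2))
```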
m
And avoid micromanaging resources; let them scale and adjust automatically instead. Then monitor aggregated metrics: SLOs, cluster utilization, overall cost.
s
It's going to depend on your organization, its culture, and its general sense of accountability and responsibility. Some immediate or advance feedback lets developers understand how all those little costs add up before the money is spent ($1/hr seems negligible; $744/month deserves a second look), instead of waiting for the monthly surprise bill. Worse, a single developer or manager with an "it's not my budget" attitude who seriously over-provisions can generate bills that visibly affect the bottom line. Catching it even a month later can mean tens or hundreds of thousands of dollars in unbudgeted waste. I don't know how people are handling it.
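The arithmetic behind that number, as a tiny sketch of the kind of projection that could be shown to a developer before anything is provisioned. The function name and the 31-day month are assumptions made to match the $744 figure.

```python
def projected_monthly_cost(hourly_rate_usd: float, instances: int = 1, days: int = 31) -> float:
    """Project what an always-on resource costs over a month of `days` days."""
    return hourly_rate_usd * 24 * days * instances


print(projected_monthly_cost(1.0))                # 744.0
print(projected_monthly_cost(1.0, instances=50))  # 37200.0
```

The second line is the over-provisioning case: the same "negligible" hourly rate across 50 always-on instances.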
m
Oh, you need to monitor aggregated metrics constantly, not just once a month. But not manually, of course: let the monitoring system alert you when something gets crazy big.
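As a rough illustration of "alert when something gets crazy big", here is a sketch that compares the latest hourly spend against a trailing average. In practice this rule would live in whatever monitoring system you already run; the function name, the 24-hour window, and the 2x factor are assumptions.

```python
from statistics import mean


def spend_alert(hourly_spend, window=24, factor=2.0):
    """Return True if the latest hour exceeds `factor` times the trailing average.

    `hourly_spend` is a list of aggregated hourly cost samples, oldest first.
    Returns False until there are more than `window` samples to compare against.
    """
    if len(hourly_spend) <= window:
        return False
    baseline = mean(hourly_spend[-window - 1:-1])  # the `window` hours before the latest
    return hourly_spend[-1] > factor * baseline


steady = [10.0] * 25
spike = [10.0] * 24 + [35.0]
print(spend_alert(steady))  # False
print(spend_alert(spike))   # True
```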