# platform-blueprints
How do you deal with failure and debugging of infrastructure?
That's always a tricky topic. Right now our Container Platform team is looped in when necessary, e.g. if an EC2 Kubernetes worker node behaves in a strange way (which happens from time to time on AWS). Teams can subscribe to cluster updates (we have >160 Kubernetes clusters) and can also stop any ongoing cluster update if they suspect issues. The Container Platform team has inspection tools to check cluster health (networking etc.) and relies on Prometheus/ZMON/Grafana for monitoring. In general we follow a continuous delivery approach for any change going to production Kubernetes clusters. There is a KubeCon talk about this:

https://www.youtube.com/watch?v=1xHmCrd8Qn8
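For context, a minimal sketch of the kind of node-health inspection that can be done with plain kubectl (generic commands, not the Container Platform team's internal tooling):

```sh
# list nodes with status, versions and instance details
kubectl get nodes -o wide

# inspect conditions and recent events of a suspicious worker node
kubectl describe node <node-name>

# check that core cluster components (CNI, kube-proxy, CoreDNS) are healthy
kubectl -n kube-system get pods -o wide
```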

We do not have advanced infrastructure debugging capabilities for engineering teams right now, but they can always request cluster API access (automated, with 4-eyes approval), e.g. to start a Pod with Busybox or do port forwarding.
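To illustrate what such access is typically used for, a minimal sketch with standard kubectl commands (pod and service names are placeholders):

```sh
# start a throwaway Busybox pod with an interactive shell (deleted on exit)
kubectl run debug-shell --rm -it --image=busybox --restart=Never -- sh

# forward a local port to a service inside the cluster ("my-service" is a placeholder)
kubectl port-forward svc/my-service 8080:80
```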
> We do not have advanced infrastructure debugging capabilities for engineering teams right now, but they can always request cluster API access (automated, with 4-eyes approval), e.g. to start a Pod with Busybox or do port forwarding.
Can I ask for some elaboration: what is the request process? (We are building a similar process right now.)
We have a wrapper around `kubectl` and engineers can do `zkubectl cluster-access request REASON` on the CLI. It works with 4-eyes approval (the second engineer does `approve`) and can take a reference to an incident ticket for emergency access.
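For anyone building a similar process, a rough, hypothetical sketch of such a flow is below; the `cluster-access` command names and flags are made up for illustration and are not the actual zkubectl interface.

```sh
# Hypothetical 4-eyes access flow; commands and flags are illustrative only.

# requesting engineer files an access request, referencing an incident ticket
cluster-access request --cluster prod-eu-1 --reason "INC-1234: node not ready"

# a second engineer reviews pending requests and approves one
cluster-access list --pending
cluster-access approve --request-id 42

# once approved, the requester fetches short-lived credentials for kubectl
cluster-access credentials --request-id 42 > ~/.kube/prod-eu-1.config
export KUBECONFIG=~/.kube/prod-eu-1.config
```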