Hello all I have got a couple of questions regarding kuberne Platform Engineering #kubernetes

Imad Hashmi

03/13/2025, 12:23 AM

Hello all, I have got a couple of questions regarding Kubernetes:
• What is everyone using to monitor their Kubernetes pods? I would like to know when a pod crashes and restarts, plus memory consumption, CPU utilisation, etc.
• I have some background jobs running within the pods, and I want the k8s cluster to wait for those jobs to finish before it replaces the pods. Currently it only waits for external requests to drain before the pod is killed and replaced, which means the background jobs are killed mid-process.

Thomas Kraus

03/13/2025, 10:40 AM

Hi, some thoughts for you:
• Prometheus -> Mimir -> Grafana is useful for monitoring and alerting.
  ◦ https://github.com/kubernetes-monitoring/kubernetes-mixin provides a lot of useful standard dashboards and alerts, but it's also good to define your own for things you want to be notified of or want to take action on, as out-of-the-box dashboards and alerts are not always actionable in your specific situation. Mimir is for longer-term storage of the metrics, so in your dashboards you can look back in time a few months.
• Note that sometimes Kubernetes has to kill/move your pod for reasons you didn't predict (e.g. due to lack of resources on the node, or the node having to be replaced). You can work with priorities to influence what will be killed, but it's hard to guarantee that k8s will always keep a pod running while also maintaining/auto-scaling the cluster. It depends on your cluster setup of course; we consider our cluster nodes to be immutable and expect them to rotate relatively quickly (either due to autoscaling, or e.g. cluster version upgrades or fresh machine images for nodes).
  ◦ Also note that not all workloads are a great match for Kubernetes, esp. very long-running jobs or those that keep in-memory state. It might be worth looking into breaking down the work of those background jobs into smaller chunks, or using checkpointing with external storage so that when such a pod restarts it can resume its work from a saved state.
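To make the checkpointing suggestion above concrete, here is a minimal sketch, not a definitive implementation. It assumes the job is a list of independent items and that the checkpoint file lives on storage that survives a pod restart (e.g. a mounted volume). The names process_items, load_checkpoint, save_checkpoint and the JSON layout are all hypothetical, made up for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next item to process (0 if no checkpoint yet)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Persist progress; write to a temp file and rename so a crash
    mid-write never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def process_items(items, path, handle, should_stop=lambda: False):
    """Process items in small chunks (here: one item at a time), resuming
    from the saved checkpoint. Returns True when all items are done,
    False if it stopped early because should_stop() returned True."""
    start = load_checkpoint(path)
    for i in range(start, len(items)):
        if should_stop():
            return False
        handle(items[i])
        # Record progress only after the item has been fully handled.
        save_checkpoint(path, i + 1)
    return True
```

If the pod is killed between handling an item and saving the checkpoint, that item is handled again after the restart, so handle should be safe to repeat (idempotent) under this scheme.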

Mykola Dzham

03/13/2025, 12:45 PM

kube-prometheus-stack. It contains a Prometheus Operator deployment, Grafana, and a preset of Grafana dashboards and Prometheus alerts that are pretty reasonable for monitoring K8s clusters.

Mykola Dzham

03/13/2025, 12:50 PM

You most likely need to increase terminationGracePeriodSeconds, and ensure your process properly handles the SIGTERM signal: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination
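A rough sketch of what handling SIGTERM could look like in a Python worker, assuming the background jobs are plain callables run one after another. GracefulWorker and install_handler are hypothetical names for illustration, not part of any library. The handler only sets a flag, so the job in progress finishes and no new job is started afterwards.

```python
import signal
import threading


class GracefulWorker:
    """Runs jobs in order; after a stop request it finishes the current
    job and does not start the next one."""

    def __init__(self):
        self.stop_requested = threading.Event()
        self.completed = []

    def request_stop(self, signum=None, frame=None):
        # Usable as a signal handler: just set a flag, never abort
        # the job that is currently running.
        self.stop_requested.set()

    def run(self, jobs):
        for job in jobs:
            if self.stop_requested.is_set():
                break  # stop picking up new work after SIGTERM
            self.completed.append(job())
        return self.completed


def install_handler(worker):
    """Route SIGTERM (sent by the kubelet on pod termination) to the worker.
    Must be called from the main thread."""
    signal.signal(signal.SIGTERM, worker.request_stop)
```

For this to help, terminationGracePeriodSeconds in the pod spec (default 30 seconds) has to be longer than the longest job, because the container is sent SIGKILL once the grace period expires. The process also needs to actually receive the signal, which may not happen if it is started through a shell wrapper that does not forward signals.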
