Hi, some thoughts for you:
• Prometheus -> Mimir -> Grafana is a useful stack for monitoring and alerting.
◦ https://github.com/kubernetes-monitoring/kubernetes-mixin provides a lot of useful standard dashboards and alerts, but it's also good to define your own for things you want to be notified of or want to take action on, as out-of-the-box dashboards and alerts are not always actionable in your specific situation. Mimir is for longer-term storage of the metrics, so in your dashboards you can look back in time a few months.
• Note that sometimes Kubernetes has to kill or move your pod for reasons you didn't predict (e.g. a lack of resources on the node, or the node having to be replaced). You can work with pod priorities to influence what gets killed first, but it's hard to guarantee that k8s will always keep a given pod running while also maintaining and auto-scaling the cluster. It depends on your cluster setup, of course; we consider our cluster nodes to be immutable and expect them to rotate relatively quickly (due to autoscaling, cluster version upgrades, or fresh machine images for nodes).
◦ Also note that not all workloads are a great match for Kubernetes, especially very long-running jobs or those that keep in-memory state. It might be worth breaking the work of those background jobs down into smaller chunks, or checkpointing to external storage so that when such a pod restarts it can resume its work from a saved state.
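
To make the chunking and checkpointing idea a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the names `run_job`, `load_checkpoint` and `save_checkpoint` are hypothetical, the "work" is just summing numbers, and the checkpoint is a local JSON file standing in for whatever external storage you use (a persistent volume, an object store, a database). The `stop_after` parameter exists only to simulate the pod being killed partway through.

```python
import json
import os
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_chunk": 0, "total": 0}


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the state atomically so a kill mid-write cannot corrupt it."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, path)  # atomic rename over the old checkpoint


def run_job(items: list, chunk_size: int, checkpoint: Path, stop_after=None) -> int:
    """Process items chunk by chunk, resuming from the last saved chunk.

    stop_after (hypothetical, for demonstration only) ends the run after
    that many chunks, as if the pod had been evicted.
    """
    state = load_checkpoint(checkpoint)
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    processed_this_run = 0
    for index in range(state["next_chunk"], len(chunks)):
        if stop_after is not None and processed_this_run >= stop_after:
            break  # simulated eviction: progress so far is already saved
        state["total"] += sum(chunks[index])  # placeholder for the real work
        state["next_chunk"] = index + 1
        save_checkpoint(checkpoint, state)
        processed_this_run += 1
    return state["total"]
```

The point of the design is that the result and the position are saved together after every chunk, so a restarted pod redoes at most one chunk. If a chunk has side effects outside the checkpoint, those would need to be idempotent for this to be safe.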