# general

I recently had a great conversation with the Red Hat team on my podcast about llm_d, a new open-source effort that's starting to tackle a problem we're seeing more and more in production ML stacks: inference workloads becoming monolithic, heavy, and hard to scale. A few highlights from the discussion, and why llm_d matters:

- Inspiration: llm_d draws a lot of inspiration from work done in projects like vLLM, which optimized inference on everything from laptops to DGX clusters (caching, speculative decoding, distribution).
- The problem: Today we often run inference as one big container (model + runtime + observability + config + pipelines). When you scale, you end up copying too much state across nodes, which is inefficient and brittle.
- The idea: Treat the model and its runtime as disaggregated, first-class components inside Kubernetes. Break the container into parts (cache, prefill/decode, GPU-bound work, CPU-bound work) and let the platform place and scale each piece independently.
- Why it's promising: Cache-aware routing and componentized serving let you avoid unnecessary duplication, match workloads to the right resources (GPU vs CPU), and enable smarter scaling across clusters, which can reduce cost and improve responsiveness (see the sketch after this list).
- The opportunity: If you're building ML infra or platform capabilities, this opens a path to far more efficient inference at scale, especially as model sizes continue to grow.
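
To make the cache-aware routing idea concrete, here's a minimal Python sketch. This is not llm_d's actual code; the replica index, the prefix-match heuristic, and the names are assumptions for illustration. The point is just the policy: send a request to the replica that can reuse the most KV cache, and fall back to the least-loaded one.

```python
# Minimal sketch of cache-aware routing (illustrative only, not llm_d's implementation).
# Assumes the serving layer tells the router which token prefixes each replica caches.

from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    # Token prefixes this replica is believed to hold in its KV cache.
    cached_prefixes: set[tuple[int, ...]] = field(default_factory=set)
    active_requests: int = 0


def longest_cached_prefix(replica: Replica, tokens: list[int]) -> int:
    """Length of the longest prefix of `tokens` already cached on this replica."""
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in replica.cached_prefixes:
            return n
    return 0


def route(replicas: list[Replica], tokens: list[int]) -> Replica:
    """Prefer the replica with the most reusable KV cache; break ties by load."""
    return max(
        replicas,
        key=lambda r: (longest_cached_prefix(r, tokens), -r.active_requests),
    )


if __name__ == "__main__":
    a = Replica("decode-a", cached_prefixes={(1, 2, 3)})
    b = Replica("decode-b")
    chosen = route([a, b], tokens=[1, 2, 3, 4])
    chosen.active_requests += 1
    chosen.cached_prefixes.add((1, 2, 3, 4))
    print(chosen.name)  # -> decode-a, since it can reuse the (1, 2, 3) prefix
```

In a real deployment the router would sit in front of separate prefill and decode pools and get its cache view from the serving layer, but even this toy policy shows why routing on cache state beats plain round-robin for long shared prompts.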

llm_d is still early, but it's a practical, infrastructure-first approach to a real industry pain point. Would love to hear from folks building inference at scale.