IBM, Red Hat, Google Cloud Donate llm-d Project to CNCF, Pioneering Scalable AI Inference Framework
March 24, 2026
The llm-d project, a Kubernetes-native distributed inference framework for large language models, has been donated by IBM Research, Red Hat, and Google Cloud to the Cloud Native Computing Foundation (CNCF) as a sandbox project at KubeCon Europe 2026, with initial contributions from NVIDIA and CoreWeave and broad industry and university support.
llm-d offers hierarchical cache offloading across GPU, CPU, and storage to enable larger context windows, along with traffic- and hardware-aware autoscaling tailored for LLM workloads.
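To make the tiering concrete, here is a minimal Python sketch of the idea, assuming made-up class names and capacities rather than anything from the llm-d codebase: a KV-cache block spills from GPU memory to CPU memory and then to storage when a faster tier fills up.

```python
# Conceptual sketch of hierarchical KV-cache offloading (hypothetical names,
# not the llm-d implementation): blocks spill GPU -> CPU -> storage when a
# faster tier is full, so long contexts can exceed GPU memory alone.
from collections import OrderedDict

class Tier:
    def __init__(self, name, capacity_blocks):
        self.name = name
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block_id -> kv data, in LRU order

    def put(self, block_id, kv):
        self.blocks[block_id] = kv
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            return self.blocks.popitem(last=False)  # evict least-recently-used
        return None

class HierarchicalKVCache:
    def __init__(self):
        # Capacities are illustrative only.
        self.tiers = [Tier("gpu", 4), Tier("cpu", 16), Tier("storage", 1024)]

    def store(self, block_id, kv):
        evicted = self.tiers[0].put(block_id, kv)
        for lower in self.tiers[1:]:          # cascade evictions downward
            if evicted is None:
                break
            evicted = lower.put(*evicted)

    def fetch(self, block_id):
        for tier in self.tiers:
            if block_id in tier.blocks:
                return tier.name, tier.blocks[block_id]
        return None, None

cache = HierarchicalKVCache()
for i in range(10):
    cache.store(f"block-{i}", kv=object())
print(cache.fetch("block-0"))  # older blocks land in the cpu/storage tiers
```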
The roadmap emphasizes expanding adoption, supporting next-generation AI architectures, multimodal workloads, and additional inference engines, and optimizing for multi-LoRA environments, while bridging inference with training concepts such as reinforcement learning and self-managing optimization.
Inference is increasingly treated as an enterprise systems problem, requiring governance, abstraction, multi-tenant model serving, request prioritization, and support for diverse accelerators.
Red Hat executives frame the initiative as aligning AI workloads with CIO-oriented Kubernetes platforms and enterprise operational practices.
A core concept is disaggregated serving, splitting prefill and decode stages into independently scalable pools to improve latency control and resource allocation.
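A rough sketch of what that split looks like, with hypothetical pool names and scaling thresholds (not llm-d's actual scheduler): prefill and decode are modeled as separate pools whose replica counts scale on different signals, prompt tokens queued for prefill versus sequences actively decoding.

```python
# Minimal sketch of disaggregated serving (hypothetical, not llm-d code):
# prefill workers process full prompts once; decode workers then generate
# tokens step by step, so each pool can be sized for its own bottleneck.
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    replicas: int
    def scale_to(self, n):
        print(f"{self.name}: {self.replicas} -> {n} replicas")
        self.replicas = n

@dataclass
class Scheduler:
    prefill: Pool = field(default_factory=lambda: Pool("prefill", 2))
    decode: Pool = field(default_factory=lambda: Pool("decode", 4))

    def autoscale(self, queued_prompt_tokens, active_sequences):
        # Illustrative policies: prompt-heavy traffic grows the prefill pool,
        # long generations grow the decode pool, independently of each other.
        self.prefill.scale_to(max(1, queued_prompt_tokens // 50_000))
        self.decode.scale_to(max(1, active_sequences // 100))

sched = Scheduler()
sched.autoscale(queued_prompt_tokens=400_000, active_sequences=250)
# prefill: 2 -> 8 replicas, decode: 4 -> 2 replicas
```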
In early testing by Google Cloud, specialized routing, disaggregation, and cache management delivered roughly a twofold improvement in time-to-first-token for use cases such as code completion, compared with traditional autoscalers and API routing.
Red Hat contributed llm-d to CNCF to advance Kubernetes-based LLM inference at scale, seeking a vendor-neutral, community-governed blueprint for production-grade deployments.
The donation establishes llm-d as a community-governed platform for scalable, vendor-neutral LLM inference, aiming to standardize deployment, governance, and interoperability across cloud environments.
This is an early-stage CNCF community effort, signaling a move from experimentation toward building enterprise-grade infrastructure and institutions for AI workloads.
The project aims to provide vendor-neutral, cloud-native, high-performance LLM inference capable of low latency and high throughput at scale, via intelligent inference scheduling, prefix-cache-aware routing, hierarchical KV-cache offloading, prefill/decode disaggregation, and traffic- and hardware-aware autoscaling.
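The prefix-cache-aware routing idea can be sketched as follows, again with invented names and a simplified hashing scheme rather than the project's real scheduler: requests whose prompts share a long prefix (for example, a common system prompt) are sent to the replica that already holds that prefix's KV cache, so only a cache-miss request pays the full prefill cost.

```python
# Conceptual sketch of prefix-cache-aware routing (hypothetical names, not the
# llm-d scheduler): requests whose prompts share a cached prefix are routed to
# the replica already holding that prefix, avoiding redundant prefill work.
import hashlib

class PrefixAwareRouter:
    def __init__(self, replicas):
        self.replicas = replicas
        self.prefix_owner = {}   # prefix hash -> replica name
        self.load = {r: 0 for r in replicas}

    @staticmethod
    def _prefix_key(prompt, prefix_tokens=64):
        # Approximate a token prefix with a character prefix for illustration.
        return hashlib.sha256(prompt[: prefix_tokens * 4].encode()).hexdigest()

    def route(self, prompt):
        key = self._prefix_key(prompt)
        replica = self.prefix_owner.get(key)
        if replica is None:                      # cache miss: pick least loaded
            replica = min(self.replicas, key=self.load.__getitem__)
            self.prefix_owner[key] = replica
        self.load[replica] += 1
        return replica

router = PrefixAwareRouter(["pod-a", "pod-b", "pod-c"])
system_prompt = "You are a code-completion assistant. " * 20
print(router.route(system_prompt + "def parse_config("))   # e.g. pod-a
print(router.route(system_prompt + "class HttpClient:"))   # same pod, warm cache
```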
CNCF is positioned to standardize how distributed inference is deployed and managed, converging on common patterns, APIs, and governance for AI infrastructure, much as Prometheus and Envoy did in their domains.
Summary based on 3 sources
Sources

IBM • Mar 24, 2026
Donating llm-d to the Cloud Native Computing Foundation
SiliconANGLE • Mar 24, 2026
Red Hat bets big on Kubernetes inference with llm-d
The New Stack • Mar 24, 2026
IBM, Red Hat, and Google just donated a Kubernetes blueprint for LLM inference to the CNCF