IBM, Red Hat, Google Cloud Donate LLM-D Project to CNCF, Pioneering Scalable AI Inference Framework

March 24, 2026
  • The roadmap emphasizes expanding adoption; supporting next-generation AI architectures, multi-modal workloads, and additional inference engines; optimizing for multi-LoRA environments; and bridging inference with training concepts such as reinforcement learning and self-managing optimization.

  • llm-d offers hierarchical cache offloading across GPU, CPU, and storage to enable larger context windows, along with traffic- and hardware-aware autoscaling tailored for LLM workloads (a toy cache-tiering sketch appears after this list).

  • The llm-d project, a Kubernetes-native distributed inference framework for large language models, has been donated by IBM Research, Red Hat, and Google Cloud to the Cloud Native Computing Foundation (CNCF) as a sandbox project at KubeCon Europe 2026, with initial contributions from NVIDIA and CoreWeave and broad industry and university support.

  • Inference is increasingly treated as an enterprise systems problem, requiring governance, abstraction, multi-tenant model serving, request prioritization (see the scheduling sketch after this list), and support for diverse accelerators.

  • Red Hat executives frame the initiative as aligning AI workloads with CIO-oriented Kubernetes platforms and enterprise operational practices.

  • A core concept is disaggregated serving: splitting prefill and decode stages into independently scalable pools to improve latency control and resource allocation (sketched in code after this list).

  • Early testing by Google Cloud showed roughly a twofold improvement in time-to-first-token for use cases such as code completion, attributed to specialized routing, disaggregation, and cache management rather than traditional autoscalers and API routing.

  • Red Hat and its partners contributed llm-d to CNCF to advance Kubernetes-based LLM inference at scale, establishing a vendor-neutral, community-governed blueprint for production-grade deployments and aiming to standardize deployment, governance, and interoperability across cloud environments.

  • This is an early-stage CNCF community effort, signaling a move from experimentation toward institution-building for enterprise AI infrastructure.

  • The project aims to provide vendor-neutral, cloud-native, high-performance LLM inference with low latency and high throughput at scale, via intelligent inference scheduling, prefix-cache-aware routing (illustrated after this list), hierarchical KV-cache offloading, prefill/decode disaggregation, and traffic- and hardware-aware autoscaling.

  • Under the CNCF, llm-d is positioned to standardize how distributed inference is deployed and managed, converging common patterns, APIs, and governance for AI infrastructure, much as Prometheus and Envoy did in their domains.
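
To make the hierarchical cache offloading above concrete, here is a minimal Python sketch of a two-tier KV-block cache that demotes least-recently-used entries from a small fast tier to a larger slow tier and promotes them back on a hit. Everything here (the class name, capacities, string keys) is invented for illustration; llm-d's actual offloading spans GPU memory, CPU memory, and storage inside the serving engine, not a Python dict.

```python
from collections import OrderedDict
from typing import Optional

class TieredKVCache:
    """Toy two-tier cache standing in for GPU -> CPU -> storage offloading."""

    def __init__(self, fast_capacity: int, slow_capacity: int):
        self.fast: OrderedDict = OrderedDict()  # "GPU" tier: small, hot
        self.slow: OrderedDict = OrderedDict()  # "CPU" tier: larger, colder
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity

    def put(self, key: str, kv_block: bytes) -> None:
        self.fast[key] = kv_block
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            # Demote the least-recently-used block instead of discarding it.
            demoted_key, demoted_block = self.fast.popitem(last=False)
            self.slow[demoted_key] = demoted_block
            while len(self.slow) > self.slow_capacity:
                # A real system would spill to storage here; the toy just evicts.
                self.slow.popitem(last=False)

    def get(self, key: str) -> Optional[bytes]:
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:
            # Promote on hit: copying is far cheaper than recomputing via prefill.
            kv_block = self.slow.pop(key)
            self.put(key, kv_block)
            return kv_block
        return None  # true miss: prefill must recompute this block
```

The payoff is that a miss in fast memory becomes a copy rather than a full prefill recomputation, which is what lets offloading support longer context windows.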
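
The request prioritization mentioned in the multi-tenant bullet can be pictured as a priority queue that drains latency-sensitive traffic before batch work. The tenant classes and numeric priorities below are hypothetical; a production scheduler would also weigh fairness, quotas, and accelerator availability.

```python
import heapq
import itertools

# Hypothetical priority classes; real policy would come from platform config.
TENANT_PRIORITY = {"interactive": 0, "batch": 10}  # lower value = served first

_order = itertools.count()  # tie-breaker keeps equal-priority requests FIFO
_pending: list = []

def submit(tenant_class: str, prompt: str) -> None:
    heapq.heappush(_pending, (TENANT_PRIORITY[tenant_class], next(_order), prompt))

def next_request() -> str:
    _, _, prompt = heapq.heappop(_pending)
    return prompt

submit("batch", "summarize last quarter's audit logs")
submit("interactive", "autocomplete this line of code")
assert next_request() == "autocomplete this line of code"  # latency-sensitive wins
```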
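
Disaggregated serving can be illustrated with a toy two-stage pipeline: a prefill pool processes the full prompt and hands a KV-state handle to a separately sized decode pool. The Request class, fake KV handle, and worker counts are all placeholders; in llm-d the two pools are separate inference-server groups exchanging KV state over the network, not Python threads.

```python
import queue
import threading
import time

class Request:
    def __init__(self, prompt: str):
        self.prompt = prompt
        self.kv_handle = None  # produced by prefill, consumed by decode

prefill_q = queue.Queue()
decode_q = queue.Queue()

def prefill_worker():
    # Prefill is compute-bound: one pass over the whole prompt builds KV state.
    while True:
        req = prefill_q.get()
        req.kv_handle = f"kv:{abs(hash(req.prompt))}"  # stand-in for real KV tensors
        decode_q.put(req)  # hand off to the decode pool

def decode_worker():
    # Decode is memory-bandwidth-bound: one token per step, reusing KV state.
    while True:
        req = decode_q.get()
        print(f"decoding with {req.kv_handle}: {req.prompt!r}")

# The point of disaggregation: size each pool to its own bottleneck.
for _ in range(1):
    threading.Thread(target=prefill_worker, daemon=True).start()
for _ in range(4):
    threading.Thread(target=decode_worker, daemon=True).start()

prefill_q.put(Request("write a haiku about Kubernetes"))
time.sleep(0.2)  # let the daemon workers drain the toy pipeline
```

Because the two stages stress different resources, splitting them lets an operator scale prefill capacity for bursty long prompts without overprovisioning decode, and vice versa.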
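
Prefix-cache-aware routing can be approximated by hashing a fixed-length prompt prefix to pick a replica, so requests that share a system prompt or few-shot header land where that KV cache is already warm. This is a deliberate simplification: llm-d's scheduler tracks actual cache state per engine instance, and the replica names and prefix length here are placeholders.

```python
import hashlib

REPLICAS = ["replica-a", "replica-b", "replica-c"]  # hypothetical pod names
PREFIX_CHARS = 16  # toy: route on leading characters; real routers use token blocks

def route(prompt: str) -> str:
    """Pin prompts sharing a prefix to one replica to maximize cache hits."""
    digest = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

system = "You are a helpful assistant. "
# Both requests share their first 16 characters, so they hit the same replica
# and reuse the cached KV state for the common system prompt.
assert route(system + "What is llm-d?") == route(system + "Explain KV caches.")
```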

Summary based on 3 sources

