llm-d

llm-d is a Kubernetes-native, high-performance framework for distributed LLM inference, built on vLLM and the Kubernetes Gateway API Inference Extension. It provides intelligent inference scheduling, prefix-cache-aware routing, prefill/decode disaggregation, hierarchical KV cache offloading, and traffic- and hardware-aware autoscaling across NVIDIA, AMD, Intel, and Google TPU accelerators.

llm-d was accepted to CNCF on March 12, 2026 at the Sandbox maturity level.

