llm-d: a Kubernetes-native high-performance distributed LLM inference framework.
llm-d is a Kubernetes-native distributed inference serving stack, providing well-lit paths for anyone to serve large generative AI models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.
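To a client, an llm-d deployment looks like an ordinary model endpoint. A minimal sketch, assuming the stack exposes an OpenAI-compatible API behind its gateway and that a model has already been deployed; the gateway address and model id below are placeholders, not values from the llm-d docs:

```python
# A minimal client sketch against an llm-d-served endpoint. The gateway address
# and model id are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://llm-d-gateway.example.com/v1",  # hypothetical gateway address
    api_key="EMPTY",  # self-hosted endpoints typically ignore the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical deployed model
    messages=[{"role": "user", "content": "Summarize llm-d in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The Kubernetes-side details (routing, autoscaling, accelerator placement) stay behind that endpoint, which is the point of the stack.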
The future of on-device AI. Deploy high-performance AI directly in your app, with no network latency, full data privacy, and no cloud inference costs.
Uzu is a high-performance inference engine for AI models on Apple Silicon.
A Datacenter Scale Distributed Inference Serving Framework.
NVIDIA Dynamo is a high-throughput, low-latency inference framework for serving generative AI and reasoning models in multi-node distributed environments. It is inference-engine agnostic (supporting TRT-LLM, vLLM, SGLang, and others) and captures LLM-specific capabilities such as disaggregated prefill and decode and KV-cache-aware routing.
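Because the engine sits behind Dynamo's frontend, a client sees the same interface regardless of backend. A streaming sketch, assuming the frontend exposes an OpenAI-compatible HTTP API on localhost:8000; the address and model id are placeholders:

```python
# Streaming sketch against a Dynamo frontend. Address and model id are
# placeholders; the backend engine (TRT-LLM, vLLM, SGLang) is interchangeable.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Explain disaggregated prefill and decode briefly."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```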
Cerebras Inference: the world's fastest inference, 70x faster than GPU clouds, with 128K context and 16-bit precision.
On Cerebras Inference, Llama 3.3 70B runs at 2,200 tokens/s and Llama 3.1 405B at 969 tokens/s, over 70x faster than GPU clouds. Get instant responses for code generation, summarization, and agentic tasks.
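Those throughput figures can be sanity-checked from a client. A rough sketch, assuming the Cerebras Inference API is OpenAI-compatible at https://api.cerebras.ai/v1 and that the Llama 3.3 70B model id is "llama-3.3-70b"; verify both against the current Cerebras docs before running:

```python
# Rough throughput check against Cerebras Inference. The base URL and model id
# are assumptions; confirm them in the Cerebras documentation.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ["CEREBRAS_API_KEY"],
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="llama-3.3-70b",  # assumed model id for Llama 3.3 70B
    messages=[{"role": "user", "content": "Write a 300-word overview of wafer-scale chips."}],
    max_tokens=1024,
)
elapsed = time.perf_counter() - start

generated = response.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.0f} tokens/s")
```

The measured rate includes network round-trip and time-to-first-token, so it is a lower bound on the quoted generation speed.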