A Datacenter Scale Distributed Inference Serving Framework.
NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLang or others) and captures LLM-specific capabilities.
Related contents:
Run your own AI cluster at home with everyday devices
Forget expensive NVIDIA GPUs, unify your existing devices into one powerful GPU: iPhone, iPad, Android, Mac, Linux, pretty much any device!
Related contents:
NVIDIA TensorRT
is an ecosystem of APIs for high-performance deep learning inference. TensorRT includes an inference runtime and model optimizations that deliver low latency and high throughput for production applications. The TensorRT ecosystem includes TensorRT, TensorRT-LLM, TensorRT Model Optimizer, and TensorRT Cloud.
Unused RAM is wasted RAM, so why not put some of that VRAM in your graphics card to work?
vramfs is a utility that uses the FUSE library to create a file system in VRAM. The idea is pretty much the same as a ramdisk, except that it uses the video RAM of a discrete graphics card to store files. It is not intented for serious use, but it does actually work fairly well, especially since consumer GPUs with 4GB or more VRAM are now available.
On the developer's system, the continuous read performance is ~2.4 GB/s and write performance 2.0 GB/s, which is about 1/3 of what is achievable with a ramdisk. That is already decent enough for a device not designed for large data transfers to the host, but future development should aim to get closer to the PCI-e bandwidth limits. See the benchmarks section for more info.