As large language models (LLMs) continue to expand in size and complexity, the need for efficient and cost-effective performance solutions becomes increasingly critical. Recently, NVIDIA announced that its H100 Tensor Core GPUs, paired with TensorRT-LLM software, have set new performance records on the industry-standard, peer-reviewed MLPerf Inference v4.0 benchmarks, according to the NVIDIA Technical Blog. This achievement highlights the capabilities of NVIDIA’s full-stack inference platform.
Mixtral 8x7B and Mixture-of-Experts Architecture
The Mixtral 8x7B model, developed by Mistral AI, employs a Mixture-of-Experts (MoE) architecture: a router activates only a subset of expert networks (two of eight in Mixtral’s case) for each token, so only a fraction of the model’s parameters participate in any single forward pass. This design offers potential advantages in model capacity, training cost, and first-token serving latency compared to traditional dense architectures. NVIDIA’s H100 Tensor Core GPUs, built on the Hopper GPU architecture, and TensorRT-LLM software have demonstrated outstanding performance with the Mixtral 8x7B model.
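To make the routing idea concrete, the sketch below shows the basic top-2 expert-routing pattern used by MoE layers: a router scores the experts for each token, only the two highest-scoring experts run, and their outputs are combined using the router weights. This is a minimal NumPy illustration of the concept, not Mixtral’s or TensorRT-LLM’s actual implementation; the toy dimensions and the plain matrix “experts” are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2          # toy sizes (Mixtral uses 8 experts with top-2 routing)
tokens = rng.normal(size=(4, d_model))        # a batch of 4 token embeddings

# Each "expert" is just a small weight matrix in this sketch.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router_w = rng.normal(size=(d_model, n_experts))

def moe_layer(x):
    logits = x @ router_w                                 # router scores, one per expert per token
    top = np.argsort(logits, axis=-1)[:, -top_k:]         # indices of the top-k experts per token
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        sel = top[i]
        weights = np.exp(logits[i, sel])
        weights /= weights.sum()                          # softmax over the selected experts only
        # Only top_k of n_experts run for this token, which is why MoE
        # keeps per-token compute low relative to total parameter count.
        out[i] = sum(w * (tok @ experts[e]) for w, e in zip(weights, sel))
    return out

print(moe_layer(tokens).shape)                            # (4, 16)
```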
Optimizing Throughput and Latency
In large-scale LLM deployments, optimizing query response times and throughput is crucial. TensorRT-LLM supports in-flight batching (also called continuous batching), which replaces completed requests with queued ones during serving rather than waiting for an entire batch to finish, keeping the GPU busy and raising throughput. Choosing the right response time budget means balancing throughput against user interactivity, and plots of throughput versus latency are a useful tool for making that choice, as the toy simulation below illustrates.
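The following sketch is a toy simulation, not TensorRT-LLM code, of why in-flight batching raises throughput: requests have varying output lengths, and refilling a slot as soon as its request completes avoids idling while the longest request in a static batch finishes. The batch size, the request-length distribution, and the one-token-per-step model are simplifying assumptions.

```python
import random

random.seed(0)

# Each request needs a random number of decode steps (output tokens).
lengths = [random.randint(10, 200) for _ in range(64)]
BATCH = 8  # number of concurrent slots on the simulated GPU

def static_batching(lengths):
    """All slots wait for the longest request in the batch before refilling."""
    steps = 0
    for i in range(0, len(lengths), BATCH):
        steps += max(lengths[i:i + BATCH])
    return steps

def in_flight_batching(lengths):
    """A finished request is replaced immediately by a queued one."""
    queue = list(lengths)
    slots = [queue.pop(0) for _ in range(BATCH)]
    steps = 0
    while slots:
        steps += 1
        slots = [s - 1 for s in slots if s > 1]   # drop requests that just finished
        while queue and len(slots) < BATCH:
            slots.append(queue.pop(0))            # refill freed slots right away
    return steps

total_tokens = sum(lengths)
for name, fn in [("static", static_batching), ("in-flight", in_flight_batching)]:
    steps = fn(lengths)
    print(f"{name:9s}: {steps} steps, {total_tokens / steps:.1f} tokens/step")
```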
FP8 Precision and Performance Gains
The NVIDIA Hopper architecture includes fourth-generation Tensor Cores that support the FP8 data type, offering twice the peak computational rate of FP16 or BF16. TensorRT-LLM supports FP8 quantization, converting model weights to FP8 and running highly tuned FP8 kernels. The result is a significant performance gain: within a 0.5-second response time limit, the H100 GPU delivers nearly 50% more throughput with FP8 than with FP16.
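As a rough, library-independent illustration of what FP8 weight quantization involves, the sketch below emulates per-tensor E4M3 quantization in NumPy: weights are scaled into FP8’s dynamic range (largest finite E4M3 value is 448) and rounded to a 3-bit mantissa. TensorRT-LLM’s actual FP8 path stores weights in 8-bit form and executes tuned FP8 Tensor Core kernels; the helper name and the simplified rounding model here are assumptions for illustration only.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_dequantize_e4m3(w):
    """Rough emulation of per-tensor FP8 (E4M3) weight quantization.

    Real FP8 kernels keep weights in 8-bit storage; this sketch only models
    the scaling and rounding so the accuracy impact can be inspected.
    """
    scale = np.abs(w).max() / E4M3_MAX              # map the tensor into FP8's dynamic range
    scaled = w / scale
    # E4M3 has a 3-bit mantissa: spacing is 2^(exponent - 3), with subnormals below 2^-6.
    exp = np.floor(np.log2(np.maximum(np.abs(scaled), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)
    q = np.clip(np.round(scaled / step) * step, -E4M3_MAX, E4M3_MAX)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)
w_q, scale = quantize_dequantize_e4m3(w)
print("per-tensor scale:", scale, "max abs error:", np.abs(w - w_q).max())
```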
Streaming Mode and Token Processing
In streaming mode, H100 GPUs and TensorRT-LLM also perform strongly. Instead of waiting for an entire inference request to complete, results are reported as soon as each output token is produced. This approach sustains high throughput even at very low average time per output token: a pair of H100 GPUs running TensorRT-LLM with FP8 precision achieves 38.4 requests per second with a mean time per output token of just 0.016 seconds.
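The two figures can be related with a quick back-of-the-envelope calculation. The average output length used below is purely an assumption for illustration and is not part of the benchmark data.

```python
# Back-of-the-envelope check of the streaming numbers quoted above.
requests_per_second = 38.4        # aggregate throughput of the two H100 GPUs
time_per_output_token = 0.016     # mean seconds between tokens seen by one request

tokens_per_stream = 1 / time_per_output_token
print(f"each active request streams ~{tokens_per_stream:.1f} tokens/s")

# If, hypothetically, the average response were 128 output tokens long
# (an assumption, not a benchmark figure), the aggregate decode rate would be:
avg_output_tokens = 128
print(f"~{requests_per_second * avg_output_tokens:,.0f} output tokens/s in aggregate")
```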
Latency-Unconstrained Scenarios
In latency-unconstrained scenarios, such as offline data labeling or sentiment analysis, the H100 GPUs deliver remarkable throughput. At a batch size of 1,024, inference throughput reaches nearly 21,000 tokens per second with FP8 precision. The Hopper architecture’s FP8 throughput and reduced memory footprint make it possible to process such large batches efficiently.
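A short calculation shows what that aggregate figure means per sequence and for an offline job. The corpus size and output length below are hypothetical assumptions chosen only to illustrate the arithmetic.

```python
# Offline-throughput arithmetic for the latency-unconstrained case above.
throughput_tokens_per_s = 21_000   # aggregate FP8 decode throughput at batch size 1,024
batch_size = 1_024

per_sequence_rate = throughput_tokens_per_s / batch_size
print(f"~{per_sequence_rate:.1f} tokens/s per sequence while 1,024 run concurrently")

# Hypothetical offline job (corpus size and output length are assumptions):
documents = 1_000_000
output_tokens_per_doc = 100
hours = documents * output_tokens_per_doc / throughput_tokens_per_s / 3600
print(f"labeling {documents:,} documents would take roughly {hours:.1f} hours of decode time")
```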
TensorRT-LLM: Open-Source and Optimized
TensorRT-LLM is an open-source library for optimizing LLM inference, providing performance optimizations for popular LLMs through a simple Python API. It includes general LLM optimizations such as optimized attention kernels, KV caching, and quantization techniques like FP8 and INT4 AWQ. Mixtral models built with TensorRT-LLM can be served with NVIDIA Triton Inference Server software.
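For a sense of what using that Python API can look like, here is a hedged sketch of serving Mixtral through TensorRT-LLM’s high-level LLM interface. Class names, constructor arguments, checkpoint handling, and output fields vary across TensorRT-LLM releases, so treat the model identifier, the tensor_parallel_size setting, and the sampling parameters below as assumptions rather than verified, copy-paste code.

```python
# Hedged sketch of Mixtral inference via TensorRT-LLM's high-level Python API.
# Exact class names and arguments depend on the TensorRT-LLM release in use.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # Hugging Face checkpoint; assumed available
    tensor_parallel_size=2,                        # split the model across two H100 GPUs
)

prompts = ["Summarize the MLPerf Inference v4.0 results in one sentence."]
params = SamplingParams(max_tokens=128, temperature=0.8, top_p=0.95)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```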
Future Innovations
NVIDIA continues to innovate, with products based on the groundbreaking Blackwell architecture expected later this year. The GB200 NVL72, combining 36 NVIDIA Grace CPUs with 72 NVIDIA Blackwell GPUs, aims to deliver significant speedups for real-time 1.8 trillion parameter MoE LLM inference.
For more information, visit the NVIDIA Technical Blog.