Accelerate Blackwell Diffusion Inference with MXFP8 and NVFP4 Quantization
Meta AI · Performance Improvement · notable
Briefing for: Engineering
What happened
Meta demonstrated end-to-end inference speedups of up to 1.26x with MXFP8 and 1.68x with NVFP4 on NVIDIA Blackwell (B200) GPUs using the TorchAO and Diffusers libraries. The work covers the Flux.1-Dev, QwenImage, and LTX-2 models and relies on microscaling formats, which group elements into small blocks that share a scale factor, to preserve accuracy at lower bit depths.
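To make the block-scaling idea concrete, here is a minimal pure-Python sketch, not TorchAO's implementation. It rounds values to the FP4 E2M1 magnitude grid with one shared scale per block and compares a single tensor-wide scale against 16-element blocks. The helper names (`quantize_block`, `fake_quantize`) and the toy data are illustrative assumptions, and the scale is kept as an ordinary Python float, whereas real NVFP4 stores block scales in FP8.

```python
import random

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0


def quantize_block(block):
    """Round one block to the E2M1 grid using a shared scale, then dequantize."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Simplification: the scale stays a Python float here; real formats
    # store it in a narrow type (an assumption-free detail omitted for brevity).
    scale = amax / E2M1_MAX
    out = []
    for x in block:
        # Pick the nearest grid magnitude for the scaled value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out


def fake_quantize(values, block_size):
    """Quantize and dequantize a flat list, one independent scale per block."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_block(values[start:start + block_size]))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy data
weights[7] = 4.0  # a single outlier

per_tensor = fake_quantize(weights, block_size=len(weights))  # one scale overall
per_block = fake_quantize(weights, block_size=16)             # one scale per 16 values

err_tensor = mean_abs_error(weights, per_tensor)
err_block = mean_abs_error(weights, per_block)
print(f"per-tensor scale error: {err_tensor:.5f}")
print(f"per-block scale error:  {err_block:.5f}")
```

With a single scale, the outlier forces every small value to round to zero. With per-block scales, only the block containing the outlier pays that cost, which is the intuition behind microscaling.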
Why it matters
As diffusion models for video and high-resolution images grow, VRAM and compute become the main constraints on production serving. The new formats shrink the memory footprint by up to 3.5x relative to BF16, so you can serve larger models or larger batches on the same Blackwell hardware without significant loss of visual quality.
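One plausible reading of the 3.5x figure is the per-element storage cost of NVFP4 weights. The sketch below does that arithmetic under stated assumptions: MXFP8 uses 8-bit elements with one 8-bit scale per 32-element block, NVFP4 uses 4-bit elements with one 8-bit scale per 16-element block, and per-tensor scales and unquantized layers are ignored. The function name `bits_per_element` is illustrative.

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Average storage per element, amortizing the shared block scale."""
    return element_bits + scale_bits / block_size


BF16_BITS = 16

formats = {
    "MXFP8": bits_per_element(element_bits=8, scale_bits=8, block_size=32),
    "NVFP4": bits_per_element(element_bits=4, scale_bits=8, block_size=16),
}

for name, bits in formats.items():
    ratio = BF16_BITS / bits
    print(f"{name}: {bits:.2f} bits/element, {ratio:.2f}x smaller than BF16")

# Prints:
# MXFP8: 8.25 bits/element, 1.94x smaller than BF16
# NVFP4: 4.50 bits/element, 3.56x smaller than BF16
```

This covers quantized weights only. Peak memory during inference also includes activations and any layers left in BF16, so end-to-end savings are typically smaller than the per-element ratio.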
What this enables
- If you are serving LTX-2 or Flux.1-Dev on B200s, you can reduce peak memory consumption from ~38GB to ~21GB for a batch size of 1.
- If you need to maintain visual fidelity, you can use the provided 'selective quantization' heuristics to skip layers where precision is critical, such as small linear layers or normalization.
- If CPU overhead limits low-batch inference, the demonstrated CUDA Graphs recipe can recover the lost performance.
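The selective quantization idea in the list above can be sketched as a simple predicate over layer names and shapes. This is a hypothetical illustration, not the heuristics from the original post: the threshold `min_features`, the keyword list, and the example layer names and sizes are all assumptions. TorchAO's `quantize_` accepts a `filter_fn(module, fqn)` callback, so a rule like this could be wrapped to read `in_features` and `out_features` from each linear module.

```python
def should_quantize(fqn, in_features, out_features,
                    min_features=1024, skip_keywords=("norm", "embed")):
    """Return True if a linear layer looks worthwhile and safe to quantize.

    All thresholds and keywords are illustrative assumptions.
    """
    name = fqn.lower()
    # Skip layers whose names suggest precision-sensitive roles.
    if any(keyword in name for keyword in skip_keywords):
        return False
    # Skip small layers: little to gain, and more to lose in accuracy.
    if min(in_features, out_features) < min_features:
        return False
    return True


# Made-up layer inventory: (fully qualified name, in_features, out_features).
layers = [
    ("transformer_blocks.0.attn.to_q", 3072, 3072),
    ("transformer_blocks.0.ff.net.2", 12288, 3072),
    ("norm_out.linear", 3072, 6144),
    ("time_text_embed.timestep_embedder.linear_1", 256, 3072),
    ("proj_out", 3072, 64),
]

selected = [fqn for fqn, n_in, n_out in layers if should_quantize(fqn, n_in, n_out)]
print(selected)
```

Keeping the rule as a plain function of name and shape makes it easy to audit which layers stay in BF16 before running any quality comparison.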