Accelerate Blackwell Diffusion Inference with MXFP8 and NVFP4 Quantization

Meta AI · Performance Improvement · 2026-04-08 · notable

Briefing for: Engineering

What happened

Meta demonstrated end-to-end inference speedups of up to 1.26x with MXFP8 and 1.68x with NVFP4 on NVIDIA Blackwell (B200) GPUs using the TorchAO and Diffusers libraries. The work covers the Flux.1-Dev, QwenImage, and LTX-2 models and relies on microscaling formats, which group elements into small blocks that each share a scale factor, to preserve accuracy at low bit widths.
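To make the block-scaling idea concrete, here is a minimal pure-Python sketch (not TorchAO's implementation) that fake-quantizes a list of values onto the FP4 E2M1 grid, once with a single scale for the whole tensor and once with one scale per 16-element block. The helper names are hypothetical, and the shared scale is kept as an ordinary Python float rather than being rounded to an 8-bit scale format as real MXFP8/NVFP4 kernels would do.

```python
import math

# Magnitudes representable in FP4 E2M1 (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Fake-quantize one block: one shared scale, values snapped to the FP4 grid."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / E2M1_GRID[-1]  # map the block's largest magnitude to 6.0
    out = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out


def fake_quantize(values, block_size):
    """Quantize then dequantize, using an independent scale per block."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_block(values[start:start + block_size]))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# 64 small values plus a single large outlier.
values = [0.01 * ((i % 7) - 3) for i in range(64)]
values[5] = 8.0

err_per_tensor = mean_abs_error(values, fake_quantize(values, block_size=len(values)))
err_per_block = mean_abs_error(values, fake_quantize(values, block_size=16))
print(f"one scale for the tensor: {err_per_tensor:.5f}")
print(f"one scale per 16 values:  {err_per_block:.5f}")
```

With a single scale, the outlier forces every small value to round to zero. With per-block scales, only the block that contains the outlier pays that cost, which is the intuition behind the accuracy claim above.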

Why it matters

As diffusion models for video and high-resolution images grow, VRAM and compute become the primary blockers for production serving. These formats shrink quantized weights by up to roughly 3.5x relative to BF16 (the figure matches 4-bit NVFP4 once its per-block scales are counted; 8-bit MXFP8 is closer to 2x), allowing you to serve larger models or larger batches on the same Blackwell hardware without significant visual quality degradation.
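The memory figure can be sanity-checked from the format definitions. The sketch below assumes the standard layouts (MXFP8: 8-bit elements with one 8-bit scale per 32-element block; NVFP4: 4-bit elements with one 8-bit scale per 16-element block) and ignores NVFP4's additional per-tensor scale and any layers left unquantized. The function name is illustrative only.

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Average storage per element once the shared block scale is amortized."""
    return element_bits + scale_bits / block_size


BF16_BITS = 16

# MXFP8: 8-bit elements, one 8-bit (E8M0) scale per 32-element block.
mxfp8_bits = bits_per_element(8, 8, 32)
# NVFP4: 4-bit elements, one 8-bit (FP8 E4M3) scale per 16-element block.
nvfp4_bits = bits_per_element(4, 8, 16)

mxfp8_ratio = BF16_BITS / mxfp8_bits
nvfp4_ratio = BF16_BITS / nvfp4_bits

print(f"MXFP8: {mxfp8_bits} bits/element, {mxfp8_ratio:.2f}x smaller than BF16")
print(f"NVFP4: {nvfp4_bits} bits/element, {nvfp4_ratio:.2f}x smaller than BF16")
```

Under these assumptions NVFP4 weights come out at about 3.56x smaller than BF16 and MXFP8 weights at about 1.94x. End-to-end peak memory typically falls by less than the weight ratio, presumably because activations and any unquantized components remain in higher precision.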

What this enables

If you are serving LTX-2 or Flux.1-Dev on B200s, you can reduce peak memory consumption from ~38 GB to ~21 GB (about 45% lower) at a batch size of 1.
If you need to maintain visual fidelity, you can apply the provided 'selective quantization' heuristics, which skip layers where precision is critical, such as small linear layers and normalization layers.
If CPU overhead dominates during low-batch inference, the demonstrated CUDA Graphs recipe, which captures the kernel sequence once and replays it with a single launch, can recover the lost performance.
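As an illustration of what a selective quantization heuristic might look like, the sketch below defines a layer predicate in plain Python. The threshold, the name pattern, and the function name are hypothetical and are not taken from Meta's recipe. The `(module, fqn) -> bool` shape is chosen to match the kind of `filter_fn` callback that TorchAO's `quantize_` accepts, but check the exact signature against the TorchAO version you use.

```python
MIN_FEATURES = 1024           # hypothetical cutoff for a "small" linear layer
SKIP_NAME_PARTS = ("norm",)   # hypothetical name patterns to leave in BF16


def should_quantize(module, fqn):
    """Return True only for large linear-like layers outside the skip list."""
    in_f = getattr(module, "in_features", None)
    out_f = getattr(module, "out_features", None)
    if in_f is None or out_f is None:
        return False  # not linear-like, e.g. a normalization layer
    if any(part in fqn.lower() for part in SKIP_NAME_PARTS):
        return False  # linear layer that feeds or lives inside a norm block
    return min(in_f, out_f) >= MIN_FEATURES


class FakeLinear:
    """Stand-in for torch.nn.Linear so the sketch runs without PyTorch."""

    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features


class FakeNorm:
    """Stand-in for a normalization layer (no in/out feature attributes)."""


layers = {
    "transformer.blocks.0.attn.to_q": FakeLinear(3072, 3072),
    "transformer.blocks.0.norm1.linear": FakeLinear(3072, 18432),
    "transformer.time_embed.proj": FakeLinear(256, 3072),
    "transformer.blocks.0.norm2": FakeNorm(),
}
selected = [name for name, mod in layers.items() if should_quantize(mod, name)]
print(selected)
```

In this toy model only the large attention projection is selected. The small projection, the linear layer inside the norm block, and the normalization layer itself all stay in higher precision.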
