Implement MXFP8 Training to Boost MoE Throughput on Blackwell GPUs

Meta AI · Performance Improvement · 2026-03-12 · notable

Briefing for: Engineering

What happened

Meta AI reported a 30.2% training speedup over BF16 for Llama 4 Scout MoE models by applying MXFP8 (microscaling FP8) dynamic quantization in the TorchAO and TorchTitan frameworks. The optimization runs on NVIDIA's 5th-generation Tensor Cores (tcgen05) on GB200 clusters and targets the grouped GEMMs in the routed experts of Mixture-of-Experts (MoE) architectures.
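
To make "microscaling" concrete, here is a minimal pure-Python sketch of block-scaled FP8 quantization. It assumes the common MX convention of 32-element blocks that share one power-of-two scale, with elements stored as FP8 E4M3, and it derives the scale as floor(log2(amax)) minus the largest E4M3 exponent. The function names are hypothetical, and this is not the TorchAO implementation, whose scale-rounding mode and kernels may differ.

```python
import math

BLOCK_SIZE = 32      # assumed MX block size: elements sharing one scale
E4M3_MAX = 448.0     # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8        # exponent of the top E4M3 binade (448 = 1.75 * 2**8)
E4M3_EMIN = -6       # smallest normal E4M3 exponent
MANTISSA_BITS = 3    # E4M3 stores 3 mantissa bits


def round_to_e4m3(x):
    """Round a float to the nearest E4M3-representable value, saturating at +-448."""
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    if x == 0.0:
        return 0.0
    # frexp gives abs(x) = m * 2**e with m in [0.5, 1), so floor(log2) is e - 1.
    exp = max(math.frexp(abs(x))[1] - 1, E4M3_EMIN)
    step = 2.0 ** (exp - MANTISSA_BITS)   # spacing of representable values
    return round(x / step) * step          # Python's round() is half-to-even


def quantize_block(block):
    """Return (shared_exp, elements): one power-of-two scale plus E4M3 values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # range of an 8-bit exponent scale
    scale = 2.0 ** shared_exp
    return shared_exp, [round_to_e4m3(v / scale) for v in block]


def dequantize_block(shared_exp, elements):
    """Undo the shared scale to recover approximate original values."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


if __name__ == "__main__":
    block = [0.1 * i for i in range(BLOCK_SIZE)]
    shared_exp, elements = quantize_block(block)
    restored = dequantize_block(shared_exp, elements)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"shared exponent: {shared_exp}, worst absolute error: {worst:.4f}")
```

Because every block carries its own scale, an outlier only costs precision within its own 32 elements rather than across the whole tensor, which is the main appeal over a single per-tensor FP8 scale.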

Why it matters

Training MoE models at scale is compute-intensive. This result shows that moving from BF16 to MXFP8 delivers the reported 30.2% speedup while keeping loss curves equivalent to bfloat16. The `_to_mxfp8_then_scaled_grouped_mm` prototype API is the entry point for this optimization, improving memory-bandwidth use and compute throughput on Blackwell hardware.
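
For readers unfamiliar with grouped GEMMs, the sketch below shows in plain Python what such an operation computes for routed experts: token rows are sorted by expert, and each contiguous group is multiplied by that expert's weight matrix. The function names and the cumulative end-offset convention are assumptions for illustration only; this is not the signature of `_to_mxfp8_then_scaled_grouped_mm`, and a real kernel fuses the per-expert products into one launch instead of looping.

```python
def matmul(a, b):
    """Plain (M, K) x (K, N) matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_mm_reference(tokens, expert_weights, group_ends):
    """Reference grouped GEMM (hypothetical helper, not a library API).

    tokens:         (T, K) rows, already sorted so each expert's tokens are contiguous
    expert_weights: E matrices of shape (K, N), one per expert
    group_ends:     E cumulative end offsets into tokens, one per expert
    """
    out = []
    start = 0
    for weight, end in zip(expert_weights, group_ends):
        # Rows start..end-1 were routed to this expert; an empty group adds nothing.
        out.extend(matmul(tokens[start:end], weight))
        start = end
    return out


if __name__ == "__main__":
    tokens = [[1, 2], [3, 4], [5, 6]]
    identity = [[1, 0], [0, 1]]
    doubled = [[2, 0], [0, 2]]
    # Expert 0 receives the first row; expert 1 receives the remaining two rows.
    print(grouped_mm_reference(tokens, [identity, doubled], [1, 3]))
```

In the MXFP8 path, each per-expert product would consume block-quantized inputs and their scales instead of BF16 operands.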

What this enables

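If you run large-scale MoE training, you can cut wall-clock time by roughly 23% for the same token count (the equivalent of a 30.2% throughput speedup) by switching routed experts to the MXFP8 primitive, assuming a setup comparable to the reported one.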
If you are using Blackwell GPUs, you can improve hardware utilization by applying the specialized layout transformations that block-scaled GEMMs rely on.
If you manage training pipelines in TorchTitan, you can enable the MXFP8 configs to get higher token throughput with loss curves equivalent to bfloat16.
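
A throughput speedup and a wall-clock reduction are not the same number, so it helps to convert between them. The short sketch below does so, assuming the reported 30.2% figure is a throughput gain over the BF16 baseline; the function name is hypothetical.

```python
def wall_clock_reduction(speedup_pct):
    """Fraction of wall-clock time saved when throughput rises by speedup_pct percent."""
    return 1.0 - 1.0 / (1.0 + speedup_pct / 100.0)


if __name__ == "__main__":
    # A 30.2% throughput gain means the same work takes 1 / 1.302 of the time.
    print(f"{wall_clock_reduction(30.2):.1%}")  # prints 23.2%
```

The saving applies to the training steps that were measured; data loading, checkpointing, and other work outside the step loop are not covered by the speedup figure.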
