918 tokens per second.
That’s not some lab fantasy — it’s the real-world throughput PyTorch and Nebius squeezed out of a 256-GPU NVIDIA B200 cluster training DeepSeek-V3’s colossal 671B Mixture-of-Experts model. Up 41% from the BF16 baseline. And yeah, they did it with open-source tools you can replicate yourself. But here’s the thing: in the endless GPU arms race, who’s actually cashing in? Nebius on cloud rentals? NVIDIA on Blackwell sales? Or the indie researchers dreaming of frontier models?
I’ve chased these benchmarks for two decades now, from the CUDA 1.0 days when NVIDIA promised the moon and mostly delivered vaporware. This one feels different — tangible gains on actual MoE workloads — but don’t pop the champagne yet. Let’s unpack the smoke and mirrors.
Why Bother with MXFP8 and DeepEP on DeepSeek-V3?
MoE models like DeepSeek-V3? They’re beasts. Experts scattered across GPUs, tokens routed dynamically — it’s a communication nightmare wrapped in compute hunger. Standard all-to-all collectives choke on variable sizes; GEMMs eat cycles but scream for precision tweaks.
Enter MXFP8: NVIDIA’s Blackwell-native FP8 format, microscaling every 32 elements to keep math stable while doubling TFLOPS on tensor cores. No emulation BS — pure hardware speed. TorchAO makes it drop-in for linears and those grouped GEMMs in expert layers.
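To make "microscaling every 32 elements" concrete, here's a toy sketch of the idea in plain PyTorch: one power-of-two scale (the E8M0 part) shared by each block of 32 values, with the payload cast to E4M3. This is not TorchAO's kernel and the function name is mine; it only illustrates the block-scaling arithmetic.

```python
import torch

def mxfp8_quantize_blocks(x: torch.Tensor, block: int = 32):
    """Toy illustration of MX-style block scaling: one power-of-two scale
    (E8M0-like) shared by every 32 contiguous elements, data cast to E4M3.
    Assumes x.numel() is a multiple of `block`. Not TorchAO's implementation."""
    orig_shape = x.shape
    x = x.reshape(-1, block).float()
    # Largest magnitude in each block decides the shared exponent.
    amax = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # E4M3 tops out around 448; pick a power-of-two scale so the block fits.
    scale = torch.exp2(torch.floor(torch.log2(448.0 / amax)))
    q = (x * scale).to(torch.float8_e4m3fn)   # quantized payload
    return q.reshape(orig_shape), scale       # scales travel alongside the data

# Round-trip check: dequantize and compare against the original.
w = torch.randn(256, 1024)
q, s = mxfp8_quantize_blocks(w)
w_hat = (q.float().reshape(-1, 32) / s).reshape(w.shape)
print((w - w_hat).abs().max())
```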
Then DeepEP: swaps vanilla all-to-all for GPU-direct NVLink/RDMA kernels. Less CPU meddling, lower latency for MoE’s shuffle fest. Orthogonal wins, they claim — compute from MXFP8, comms from DeepEP.
DeepEP alone yields 859 tokens/sec, a 32% gain over the 651 tokens/sec BF16 baseline. Layering MXFP8 onto the grouped GEMMs on top of DeepEP pushes throughput to 918 tokens/sec, a 41% total gain.
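Quick sanity check on those percentages, using nothing but the figures quoted above:

```python
baseline, deepep, combined = 651, 859, 918   # tokens/sec reported in the post
print(f"DeepEP alone:   +{deepep / baseline - 1:.1%}")    # ≈ +32%
print(f"DeepEP + MXFP8: +{combined / baseline - 1:.1%}")  # ≈ +41%
```

So the relative gains are at least internally consistent.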
Numbers don't lie. Or do they? These are synthetic pre-training steps on Nebius Cloud: controlled, optimized, no real-world data mess.
Loss curves match BF16 for the 16B version over 1,500 steps. No degradation. Good sign — low-precision doesn’t tank convergence here.
But.
Scale it to 10,000 GPUs? Real datasets? That’s where fairy tales crumble.
Does MXFP8 Really Unlock 2x GEMM Speed Without the Usual Precision Drama?
Blackwell's tcgen05.mma instructions? Native MXFP8 glory, up to 2x BF16 flops. TorchAO quantizes inputs dynamically for all three GEMMs per linear layer: the forward matmul, the input-gradient matmul, and the weight-gradient matmul. For grouped ops in MoE experts, where matrices balloon, the quantization overhead amortizes to almost nothing.
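If "all three GEMMs" sounds abstract, here's a minimal BF16 sketch of where those matmuls actually live; these are the operations whose inputs get quantized. Shapes are illustrative, not anyone's exact model dims.

```python
import torch

# The three GEMMs behind one linear layer, written out by hand in BF16.
X = torch.randn(4096, 7168, dtype=torch.bfloat16)   # activations
W = torch.randn(2048, 7168, dtype=torch.bfloat16)   # weight

Y = X @ W.T                    # 1. forward GEMM
grad_Y = torch.randn_like(Y)   # stand-in for the upstream gradient
grad_X = grad_Y @ W            # 2. input-gradient GEMM (dgrad)
grad_W = grad_Y.T @ X          # 3. weight-gradient GEMM (wgrad)
```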
Sounds perfect. Except history whispers caution. Remember AMP in Volta? Tensor cores hyped, then NaN explosions everywhere. Or FP8 pilots that needed heroic scaling hacks.
MXFP8's block scales help: one shared E8M0 exponent per 32 elements preserves dynamic range far better than a single per-tensor scale over plain E4M3. And Blackwell hardware? First time FP8 sings without software crutches.
Our unique twist: this isn’t just faster flops. It’s a stealth pivot for NVIDIA. BF16 ruled Hopper; now Blackwell forces FP8 adoption to hit peak specs. Software like TorchTitan catches up quick — PyTorch/Nebius collab proves maturity — but expect MXFP8 mandates in future frameworks. Who’s locked in? Cloud giants renting B200s at premium.
Skeptical me sees parallel to 2017’s V100 rush: everyone chased DGX pods, only FAANG afforded ‘em. Today, Nebius bills by the hour; your $10k cluster run? Pocket change for them.
Is DeepEP the MoE Communication Savior We’ve Waited For?
All-to-all in MoE: two per layer (dispatch and combine), with routing that changes every step. Stock NCCL collectives assume fixed patterns, and CPU-driven launch overhead kills latency.
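For concreteness, here's roughly what that vanilla dispatch path looks like with torch.distributed: exchange the per-rank token counts first, then the payloads with variable splits. The function name and structure are mine, a baseline sketch rather than anyone's production code, and it assumes an already-initialized process group.

```python
import torch
import torch.distributed as dist

def vanilla_moe_dispatch(tokens: torch.Tensor, dest_rank: torch.Tensor):
    """Baseline expert-parallel dispatch over NCCL-style all-to-all.
    tokens:    [num_tokens, hidden] activations on this rank
    dest_rank: [num_tokens] int64, the EP rank each token is routed to
    This is the path DeepEP replaces with GPU-direct NVLink/RDMA kernels."""
    world = dist.get_world_size()

    # Sort tokens by destination so each rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    tokens = tokens[order]
    send_counts = torch.bincount(dest_rank, minlength=world)

    # Routing changes every step, so split sizes must be exchanged first.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Then move the actual token payloads with variable splits.
    recv = tokens.new_empty((int(recv_counts.sum()), tokens.shape[1]))
    dist.all_to_all_single(
        recv, tokens,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return recv, order  # 'order' is reused to un-permute in the combine step
```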
DeepEP? GPU-direct sends, NVLink/RDMA bliss. Variable sizes? Handled. EP scales? Check.
+32% alone on 671B. Composes with MXFP8 for 41%. TorchTitan integrates smoothly — expert-parallel pipeline purrs.
Yet, cynicism kicks in. Communication was 30-40% of cycles in prior MoE runs; fix it, sure, but compute scales worse. DeepSeek-V3 at 671B? Still needs absurd power; 256 B200s pull hundreds of kilowatts before you even count cooling.
Prediction: DeepEP becomes table stakes for Blackwell MoEs. Open-source? Repro recipes given. But proprietary clouds like Nebius win the rental wars. Indie labs? Stick to H100s or pray for spot instances.
Reproducibility section screams legitimacy — full TorchTitan recipes, Nebius access. No black box.
The Money Trail: Who Profits from This 41% Bump?
NVIDIA: Blackwell sales velocity. B200 clusters justify $3M+ pods.
Nebius: Cloud margins on optimized runs. PyTorch tie-in? Ecosystem lock.
DeepSeek? Free speed for their V3 — but they’re already ahead.
You? If you’re training, yes — if you can afford it. Otherwise, it’s benchmark porn.
My bold call: this accelerates the MoE explosion, but widens the chasm. Hyperscalers train 10T+ params; open models lag further. Remember GPT-3 era? Open-source clawed back with Llama. Here, hardware moats grow.
Experiments pristine — 256 B200s, BF16 base, orthogonal opts. 16B convergence solid. But ‘frontier-scale’? 671B is big, not GPT-4o territory.
Bottom Line for Builders
TorchTitan + TorchAO + DeepEP: repro magic. Try the 16B first.
Cynic's advice: benchmark your workload (a rough timing sketch follows below). MXFP8 shines on large GEMMs; small experts? Meh.
B200 arrival? Game-changer for MoE, if you ignore power bills.
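On that benchmark-your-workload point, here's a minimal timing harness (mine, not from the post) you can point at the GEMM shapes your experts actually see. Run it in BF16 first, then swap the body for whatever MXFP8 path you're evaluating.

```python
import torch

def time_gemm(m: int, n: int, k: int, iters: int = 50) -> float:
    """Rough per-GEMM latency in ms for an (m,k) @ (k,n) BF16 matmul on CUDA.
    Swap the matmul for your MXFP8 path to see where the 2x actually shows up."""
    a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
    for _ in range(10):            # warmup
        a @ b
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Big dense shapes benefit from low precision; tiny per-expert slices often don't.
for m in (256, 4096, 16384):
    print(m, f"{time_gemm(m, 7168, 7168):.3f} ms")
```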
We’ve seen peaks before. This one sticks — for now.
Frequently Asked Questions
What is MXFP8 training and how does it work on NVIDIA B200?
MXFP8 is a block-scaled FP8 format natively supported by Blackwell tensor cores, roughly doubling GEMM throughput over BF16 via TorchAO quantization in linear and grouped ops; in this test, loss curves matched BF16 on the 16B variant over 1,500 steps.
How much faster is DeepSeek-V3 training with DeepEP and MXFP8?
Up to a 41% total gain, reaching 918 tokens/sec on the 671B model over the BF16 baseline, with DeepEP alone delivering +32% by fixing the MoE all-to-all bottleneck.
Is TorchTitan ready for production MoE pre-training?
Yes, fully open-source PyTorch framework with repro recipes; scales to 256 B200s smoothly in this Nebius test.