theAIcatchup

Meta's GDPA Kernels Deliver 2x RecSys Training Speedups

Meta engineers just unveiled GDPA kernels that slash training times for massive RecSys models. Up to 3.5x forward speedups on production traffic—real numbers from B200 clusters.

4 min read 1 month ago

🔧

AI Hardware

41% Faster DeepSeek-V3 Training on B200s: Real Speedup or NVIDIA Sales Pitch?

918 tokens per second. That's the blistering pace for pre-training DeepSeek-V3's 671B monster on 256 NVIDIA B200s, thanks to MXFP8 and DeepEP tweaks in TorchTitan. Hype or hardware reality?

5 min read 1 month ago

#nvidia-b200

Meta's GDPA Kernels Deliver 2x RecSys Training Speedups

41% Faster DeepSeek-V3 Training on B200s: Real Speedup or NVIDIA Sales Pitch?