Meta cracked RecSys training bottlenecks.
And it’s not hype—it’s kernels hitting 97% Tensor Core utilization on NVIDIA B200s, under a 750W cap no less. Their Generalized Dot-Product Attention (GDPA) redesign, built on Flash Attention 4, tackles the mess of real-world data: short sequences, jagged batches, massive scales. We’re talking GEM, Meta’s biggest RecSys foundation model, where standard dot-product attention falls flat.
Look, RecSys isn’t like LLMs. Users don’t feed models uniform tokens; they click ads in bursts, and sequences vary wildly. Standard kernels, optimized for synthetic benchmarks, choke here. Meta’s team measured a 2.6x forward gap, 1.6x backward, with worst cases hitting 4x. That’s production pain, straight from the clusters.
What Even Is GDPA—and Why Skip Softmax?
GDPA swaps softmax for activations like GELU or SiLU. Think Kunlun’s PFFN blocks or HSTU’s sequential modeling: they preserve magnitudes better and dodge probability pitfalls. It’s everywhere in Meta’s RecSys stack: self-attention in GEM; PMA and PFFN in InterFormer.
“GDPA captures a broad class of attention-like modules used in production RecSys—such as self-attention, PMA, and PFFN—which share a common pattern of two matrix multiplications with an optional activation in between.”
Unify ’em under one kernel? Smart. They use GELU as the running example, but the pattern generalizes to other activations.
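To ground the quote, here’s a minimal PyTorch sketch of the GDPA pattern (the math it implements, not Meta’s actual Triton kernel), with the softmax version for contrast:

```python
import torch
import torch.nn.functional as F

def gdpa(q, k, v, act=F.gelu):
    """Generalized dot-product attention: two matmuls with an
    optional elementwise activation (GELU here) in between."""
    scores = q @ k.transpose(-2, -1)   # first matmul: [..., Lq, Lk]
    return act(scores) @ v             # activation, then second matmul

def softmax_attention(q, k, v):
    """Standard attention for contrast: softmax squashes each row of
    scores into probabilities, losing magnitude information."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v
```

Swap `act` (SiLU, identity, whatever) and you cover the shared two-matmul-plus-activation pattern the quote describes.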
Here’s the thing: starting from Tri Dao’s FA4, the SOTA for LLMs, they reworked it for RecSys chaos. Large batches, variable lengths, non-softmax. Result? Forward: 1,145 BF16 TFLOPs (2x over the Triton original). Backward: 702 TFLOPs (1.6x). Vs. FA4 on production shapes: 3.5x forward, 1.6x backward.
Why Do Benchmarks Lie for Real RecSys Workloads?
Synthetic data? Normal distributions, fixed max lengths. Real traffic? User-driven spikes, asymmetry everywhere. Pipelines stall, compute-memory overlap tanks. Fig. 2 in their post shows it starkly: the real-world kernel lags CUTLASS FMHA by those multiples on B200s.
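A toy back-of-the-envelope (my numbers, not theirs) shows why: pad a bursty batch to its max length and most of the attention FLOPs land on padding.

```python
import torch

# Hypothetical jagged batch: bursty, user-driven sequence lengths.
lengths = torch.tensor([3, 7, 12, 5, 96, 4, 9, 6])
max_len = int(lengths.max())

# Attention cost scales with L^2, but padding pays max_len^2 per row.
useful = (lengths.float() ** 2).sum()
padded = len(lengths) * max_len ** 2
print(f"useful FLOP fraction: {useful / padded:.0%}")  # ~13% on this toy batch
```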
But Meta didn’t whine. They redesigned. Tiled for short seqs, fused ops tighter, handled jaggedness head-on. Pushed to roofline—97% utilization screams it. Full model? Over 30% throughput jump.
Impressive.
Now, my take, and it’s sharper than their PR: this echoes FlashAttention’s 2022 debut. LLM kernels ignored RecSys; now RecSys flips the script. Bold prediction: competitors like ByteDance or Google’s RecSys teams will scramble, shaving weeks off training cycles. Meta’s edge in ads revenue? Baked into the hardware stack now. (Their GEM trains faster, iterates quicker: pure market moat.)
How’d They Pull Off the Optimizations?
Rethought from FA4, workload-driven. Explicit short-sequence handling: dynamic tiling kills wasted work. Better compute-memory balance for large batches. And non-softmax means no log-sum-exp hacks; custom activations slot in smoothly (sketch below).
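That last point is worth unpacking. Flash attention’s online softmax has to carry running-max and log-sum-exp state across K/V tiles to renormalize; with an elementwise activation, every tile’s contribution is independent and just accumulates. A minimal sketch of the idea in plain PyTorch rather than their Triton, so an illustration, not their kernel:

```python
import torch
import torch.nn.functional as F

def gdpa_tiled(q, k, v, tile=128, act=F.gelu):
    """Blockwise GDPA forward. The activation is elementwise, so each
    K/V tile's partial output simply adds up; there is no running max
    or log-sum-exp rescaling, unlike flash attention's online softmax."""
    out = torch.zeros(*q.shape[:-1], v.shape[-1],
                      dtype=q.dtype, device=q.device)
    for start in range(0, k.shape[-2], tile):
        k_t = k[..., start:start + tile, :]
        v_t = v[..., start:start + tile, :]
        out += act(q @ k_t.transpose(-2, -1)) @ v_t  # independent contribution
    return out
```

Numerically equivalent to the `gdpa` sketch earlier, just streamed tile by tile.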
Evaluated on deployed B200s with real power caps and real traffic. Not lab toys. Code’s open: github.com/facebookresearch/ads_model_kernel_library. Triton-based, but production-hardened.
And it generalizes. Beyond attention, irregular shapes elsewhere. That’s the gem—principles for any kernel fighting real data.
Pause. RecSys training’s a beast: billions of params, daily retrains. 30% throughput? That’s not incremental; it’s deploy-or-die territory.
Does This Matter Beyond Meta?
Absolutely. The open kernel library means anyone with B200s (or A100/H100 equivalents) can grab it. RecSys players (think TikTok algos, Amazon recs) face the same pains. LLMs get the glory, but ad dollars flow from these models. Meta’s sharing? PR win, sure, but it cements their kernel lead.
Critique time: they nod to FA4, but the Triton original? Buried. It’s production-driven design winning; benchmark chasers lose.
Game on for kernel wars.
Deeper: historical parallel to cuDNN evolutions. Early days, generic GEMMs ruled; now bespoke attention variants dominate. GDPA’s the RecSys cuBLAS moment—tailored, dominant.
The Numbers Don’t Lie: Raw Performance Breakdown
- Forward: 1,145 BF16 TFLOPs at 97% Tensor Core utilization; 2x over the prior Triton kernel, 3.5x over FA4 on jagged production shapes.
- Backward: 702 TFLOPs; 1.6x over both the Triton original and FA4.
- Full stack: 30%+ training throughput, all under the 750W power cap. Cluster realities.
- Variable lengths? Handled (jagged-layout sketch below). Large batches? Optimized.
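On that variable-lengths point: a common padding-free layout (a generic sketch; I’m not claiming it’s Meta’s exact data layout) is a flat values tensor plus per-sequence offsets:

```python
import torch

# Jagged batch as flat values + offsets: no pad rows, no wasted FLOPs.
lengths = torch.tensor([3, 7, 12, 5])
offsets = torch.cat([torch.zeros(1, dtype=torch.long), lengths.cumsum(0)])
values = torch.randn(int(lengths.sum()), 64)  # all tokens, back to back

# A kernel (or this toy loop) slices each sequence via its offsets.
seq_means = [values[offsets[i]:offsets[i + 1]].mean(dim=0)
             for i in range(len(lengths))]
```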
This isn’t vaporware. Deployed in Meta clusters.
Will GDPA Kernels Reshape GPU Training?
They should, for RecSys at least. Prediction: within a year, forks everywhere, with 20-50% gains becoming the norm in adtech. But LLMs? Less so; softmax stays king there. Still, the principles bleed over.
Meta’s move screams confidence. Open-sourcing post-deployment? Rare. Signals maturity.
Final thought: if you’re training big RecSys, drop everything. Grab the repo.
Frequently Asked Questions
What is Generalized Dot-Product Attention?
GDPA replaces the softmax in attention with elementwise activations like GELU or SiLU. It powers attention-like modules across Meta’s RecSys models, such as GEM and Kunlun, where it preserves signal magnitudes better on real-world interaction data.
How much faster is Meta’s GDPA kernel on NVIDIA B200?
Up to 2x forward and 1.6x backward vs. the prior Triton kernel, and 3.5x forward vs. FlashAttention 4 on production shapes, while hitting 97% Tensor Core utilization.
Where can I get the GDPA kernel code?
Open-source at https://github.com/facebookresearch/ads_model_kernel_library/blob/main/gdpa/README.md—ready for your clusters.