theAIcatchup

#deepseek-v3

🔧 AI Hardware

41% Faster DeepSeek-V3 Training on B200s: Real Speedup or NVIDIA Sales Pitch?

918 tokens per second. That's the blistering pace for pre-training DeepSeek-V3's 671B-parameter monster on 256 NVIDIA B200s, thanks to MXFP8 and DeepEP tweaks in TorchTitan. Hype or hardware reality?

5 min read 4 weeks, 1 day ago
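For readers unfamiliar with the MXFP8 format the teaser name-drops: per the OCP Microscaling (MX) spec, a block of 32 FP8 elements (typically E4M3, max normal magnitude 448) shares a single power-of-two scale. Here's a rough illustrative sketch of that idea in plain Python; it is not TorchTitan's implementation, and the rounding is a simplification of real E4M3 behavior:

```python
import math

E4M3_MAX = 448.0   # largest normal E4M3 magnitude
BLOCK = 32         # MX block size: 32 elements per shared scale

def quantize_block(vals):
    """Quantize 32 floats with one shared power-of-two scale (MXFP8-style)."""
    assert len(vals) == BLOCK
    amax = max(abs(v) for v in vals) or 1.0
    # Pick a power-of-two scale so the largest value lands near E4M3 range.
    scale = 2.0 ** math.floor(math.log2(E4M3_MAX / amax))
    # Round scaled values to integers as a stand-in for E4M3 rounding
    # (real E4M3 has variable spacing; this is only illustrative).
    q = [max(-E4M3_MAX, min(E4M3_MAX, round(v * scale))) for v in vals]
    return q, scale

def dequantize_block(q, scale):
    return [x / scale for x in q]

vals = [i / 10 for i in range(-16, 16)]   # 32 sample values in [-1.6, 1.5]
q, scale = quantize_block(vals)
recon = dequantize_block(q, scale)
err = max(abs(a - b) for a, b in zip(vals, recon))
```

The appeal on Blackwell-class hardware is that the shared scale keeps metadata overhead tiny while the 8-bit elements halve memory traffic versus bf16.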
[Image: Diagram comparing DeepSeek V3 MLA and GQA in LLM architectures]
Large Language Models

DeepSeek V3's Latent Attention Crushes KV Cache Bloat

DeepSeek V3 just compressed the LLM memory crisis. Its Multi-Head Latent Attention shrinks KV caches without killing performance—here's the data.

5 min read 1 month ago
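As context for the headline claim, here's a back-of-envelope sketch of why Multi-Head Latent Attention shrinks the KV cache: instead of caching full per-head keys and values, MLA caches one compressed latent vector plus a small decoupled RoPE key per token. The dimensions below are my recollection of the DeepSeek-V2/V3 papers, so treat them as assumptions rather than verified specs:

```python
# Back-of-envelope KV-cache comparison: standard multi-head attention (MHA)
# vs. DeepSeek-style Multi-Head Latent Attention (MLA).
# All figures are assumptions recalled from the DeepSeek-V2/V3 papers.

n_layers = 61      # DeepSeek-V3 transformer layers (assumed)
n_heads = 128      # attention heads (assumed)
head_dim = 128     # per-head dimension (assumed)
d_latent = 512     # MLA compressed KV latent dimension (assumed)
d_rope = 64        # decoupled RoPE key dimension (assumed)

# Elements cached per token, per layer:
mha_per_token = 2 * n_heads * head_dim   # full K and V for every head
mla_per_token = d_latent + d_rope        # one latent + one shared RoPE key

ratio = mha_per_token / mla_per_token    # cache shrink factor

# Cache size in GB for a 128K-token context at 2 bytes/element (bf16):
ctx = 128 * 1024
mha_gb = mha_per_token * n_layers * ctx * 2 / 1e9
mla_gb = mla_per_token * n_layers * ctx * 2 / 1e9
```

Under these assumed dimensions the cache shrinks by roughly 57x, which is the kind of headroom that makes long contexts and large batch sizes practical at inference time.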


© 2026 theAIcatchup. All rights reserved.
