Meta Muse Spark Benchmarks: Closed AI Shift

$14.3 billion. That’s what Meta dropped in June 2025 for a 49% stake in Scale AI—not for the data labeling biz, but to snag Alexandr Wang as its first Chief AI Officer.

And here’s Muse Spark, the fruit of that labor: Meta’s debut closed-source model, locked behind an API, no weights released. Zuckerberg’s 2024 manifesto swore open-source AI was the future. Reality hit different.

Look, developers built empires on Llama. Fine-tuned it, forked it, deployed it everywhere. Now? Muse Spark strips that away, offering free trials on meta.ai but zero self-hosting. It’s proprietary—more so than OpenAI’s setups, as The Register snarked.

What Do the Muse Spark Benchmarks Actually Say?

Muse Spark clocks a 52 on the Artificial Analysis Intelligence Index (v4.0). Solid fourth place, trailing Gemini 3.1 Pro and GPT-5.4 at 57, Claude Opus 4.6 at 53. Not shabby for a nine-month sprint from the new Meta Superintelligence Labs.

But slice it by category, and the specialist shines—or flops.

Medical AI? HealthBench Hard: 42.8. Smokes GPT-5.4’s 40.1, obliterates Gemini’s 20.6 and Grok’s 20.3. No contest.

Scientific reasoning: 50.2% on Humanity’s Last Exam (no tools), topping Gemini Deep Think’s 48.4% and GPT-5.4 Pro’s 43.9%. FrontierScience Research? 38.3% to GPT’s 36.7%.

Visuals too: CharXiv Reasoning at 86.4%, edging GPT’s 82.8%.

Coding, though. Terminal-Bench 2.0: 59.0. GPT-5.4 laps it at 75.1, Gemini at 68.5. Sixteen points back. Developers, that’s your red flag.

Abstract reasoning? ARC-AGI-2: 42.5 versus ~76 for the leaders. A chasm.
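To see the category split at a glance, here is a minimal Python sketch that tabulates the scores quoted above and computes Muse Spark's gap to its best quoted rival on each benchmark. The numbers come straight from this article. The dictionary layout and the `gap_to_best_rival` helper are illustrative assumptions, not anything Meta or Artificial Analysis publishes. ARC-AGI-2 is left out because the leaders' score is only given as an approximate ~76.

```python
# Scores exactly as quoted in this article. Model labels are kept as loose
# as the article gives them (e.g. "GPT", "Gemini"); nothing is filled in.
SCORES = {
    "HealthBench Hard": {
        "Muse Spark": 42.8, "GPT-5.4": 40.1, "Gemini": 20.6, "Grok": 20.3,
    },
    "Humanity's Last Exam (no tools)": {
        "Muse Spark": 50.2, "Gemini Deep Think": 48.4, "GPT-5.4 Pro": 43.9,
    },
    "FrontierScience Research": {"Muse Spark": 38.3, "GPT": 36.7},
    "CharXiv Reasoning": {"Muse Spark": 86.4, "GPT": 82.8},
    "Terminal-Bench 2.0": {"Muse Spark": 59.0, "GPT-5.4": 75.1, "Gemini": 68.5},
}


def gap_to_best_rival(scores, model="Muse Spark"):
    """Return {benchmark: (best_rival, model_score - best_rival_score)}.

    A positive gap means `model` leads; a negative gap means it trails.
    """
    gaps = {}
    for bench, row in scores.items():
        rivals = {name: s for name, s in row.items() if name != model}
        best = max(rivals, key=rivals.get)
        gaps[bench] = (best, round(row[model] - rivals[best], 1))
    return gaps


if __name__ == "__main__":
    for bench, (rival, gap) in gap_to_best_rival(SCORES).items():
        print(f"{bench:32s} {gap:+6.1f} vs {rival}")
```

Run it and the pattern is hard to miss: small positive gaps on medical, science, and chart reasoning, and one large negative gap on Terminal-Bench 2.0.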

“Opening Llama doesn’t undercut our revenue, sustainability, or ability to invest in research like it does for closed providers.”

Zuckerberg, July 2024 manifesto. Still up on Meta’s blog. Words intact. Actions? Not so much.

The efficiency angle grabs you, though. Muse Spark spit out 58 million tokens across evals: roughly half of GPT-5.4’s 120M and a bit over a third of Claude’s 157M. Meta claims it trained on “over an order of magnitude less compute” than Llama 4 Maverick.
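The token math is quick to check. A small Python sketch, using only the three token counts quoted above (in millions); the `token_ratio` helper name is just an illustration:

```python
# Eval token counts in millions, as quoted in this article.
EVAL_TOKENS_M = {"Muse Spark": 58, "GPT-5.4": 120, "Claude": 157}


def token_ratio(model, baseline, tokens=EVAL_TOKENS_M):
    """Fraction of the baseline's eval tokens that `model` used."""
    return tokens[model] / tokens[baseline]


if __name__ == "__main__":
    for rival in ("GPT-5.4", "Claude"):
        ratio = token_ratio("Muse Spark", rival)
        print(f"Muse Spark used {ratio:.0%} of {rival}'s eval tokens")
```

That works out to about 48% of GPT-5.4's count and about 37% of Claude's, so "half" and "a third" are fair roundings, with the Claude figure slightly generous.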

If legit — and early signs say yes — this lab just cracked cheaper paths to frontier performance. Free access on meta.ai doesn’t bankrupt them. Smart.

Why Did Chinese Models Steal Meta’s Open-Source Crown?

Meta invented the Llama playbook. Developers flocked. Then Qwen happened.

By January 2026, Alibaba’s Qwen clan hit 700 million Hugging Face downloads. December 2025 alone? Outpaced the next eight models combined: Meta, DeepSeek, OpenAI, Mistral, Nvidia, Zhipu, Moonshot, MiniMax.

February 2026: Qwen derivatives at 69% share. Llama’s? Plummeted from 25% in late 2023 to 11%. China: 1.15 billion total downloads. Meta’s open ecosystem: 723 million. The reports cut off mid-sentence there, but you get it.

Here’s my take — the one nobody’s yelling yet. This mirrors Netscape’s browser wars. Open-source evangelists built the web’s foundations, only for proprietary giants (ahem, Microsoft) to commoditize and dominate via integration. Meta’s going proprietary not just for control, but because open-source became a race to the bottom against state-backed Chinese compute floods. Bold prediction: Expect Llama 5 weights delayed indefinitely. Open-source AI? Niche for hobbyists now.

Llama 4 flopped hard in April 2025: Fortune called it a “dud,” and its benchmark results were accused of being fudged. Egg on face. Combine that with Qwen’s steamroll, and Wang’s hire screams pivot: build closed, specialized monsters to claw back medical and science dollars where regs favor incumbents.

But developers? You’re collateral. No more Llama forks for your stack. Muse Spark’s API might juice meta.ai traffic, feed ad data back to the beast — classic Zuckerberg.

And Wang? The $14.3B prodigy. Scale AI labeled the world’s data; now he’s got Meta’s war chest. MSL isn’t iterating Llama. It’s forging something new, efficient, domain-crushing.

Is it betrayal? Nah. Survival. Open-source democratized AI until it didn’t — for Meta. Chinese firms scaled it better, faster, with Beijing’s invisible hand.

How Does This Reshape Developer Workflows?

Short term: Skip Muse Spark for code. It’s medical rocket science, not your CLI buddy.
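In practice, "skip it for code" is a routing decision. Here is a hedged Python sketch of that idea: pick whichever model has the top quoted score for a task category. The scores are the ones cited in this article. The `pick_model` function, the category names, and the idea of routing on a single benchmark are all assumptions for illustration. No real API is called, and a production router would also weigh price, latency, and your own evals.

```python
# One benchmark per task category, with scores as quoted in this article.
# The category-to-benchmark mapping is an illustrative assumption.
QUOTED = {
    "coding": {
        "benchmark": "Terminal-Bench 2.0",
        "scores": {"Muse Spark": 59.0, "GPT-5.4": 75.1, "Gemini": 68.5},
    },
    "medical": {
        "benchmark": "HealthBench Hard",
        "scores": {"Muse Spark": 42.8, "GPT-5.4": 40.1, "Gemini": 20.6, "Grok": 20.3},
    },
}


def pick_model(task):
    """Return the model label with the highest quoted score for `task`."""
    if task not in QUOTED:
        raise ValueError(f"no quoted benchmark for task category: {task!r}")
    scores = QUOTED[task]["scores"]
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for task in QUOTED:
        print(f"{task}: {pick_model(task)} (by {QUOTED[task]['benchmark']})")
```

On these numbers the router sends coding work to GPT-5.4 and medical queries to Muse Spark, which is the short-term advice in one line.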

Longer? Efficiency wins mean Meta can flood niches. Imagine healthcare apps powered by this, locked in, no escape.

The architectural shift? From generalist behemoths guzzling tokens to lean specialists. Why burn 157M tokens on evals when 58M suffices? That’s the how: distillation, smarter architectures under Wang’s watch. Why? Profit without OpenAI’s investor begging.

Skeptical? Me too on the “hope to open-source future versions” tease. Smells like PR vapor.

China’s dominance forces hands. Meta built the playground; rivals brought the crowds. Now it’s pay-to-play.

Frequently Asked Questions

What are Muse Spark benchmarks?

Muse Spark scores 52 overall on the Artificial Analysis Intelligence Index and leads on medical (HealthBench Hard: 42.8) and science benchmarks, but lags on coding (59.0 on Terminal-Bench 2.0) and abstract reasoning (42.5 on ARC-AGI-2).

Why did Meta release a closed-source model?

After Llama 4 flopped and Qwen derivatives grabbed a 69% share, Meta paid $14.3B for a stake in Scale AI, brought in Alexandr Wang as Chief AI Officer, and set out to build efficient, proprietary specialists through the new Meta Superintelligence Labs (MSL).

Does Muse Spark kill open-source AI?

Not yet, but it signals Meta’s pivot: efficiency lets it compete closed while Chinese open models rule downloads. Expect more closures ahead.
