Everyone’s been betting on steady AI gains, right? You know, the kind where labs pump out bigger models every couple years, hype builds, stocks twitch, and we all nod along. But here’s the gut punch: it’s not steady. It’s a freight train off the rails.
Ajeya Cotra—sharp mind who’s nailed some timelines before—drops a bomb. Back in January, she pegged AI agents at 24-hour horizons by end of ‘26. Now? METR clocks Opus 4.6 at 12 hours. And that’s just ten months in.
“It’s no longer very plausible that after ten whole months of additional progress at the recent blistering pace, AI agents would still struggle half the time at 24 hour tasks,” she writes.
Blistering. That’s the word. By year’s end, she bets over 100 hours on software tasks. Weeks of work, autonomous. Forget horizons; the map’s melting.
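Back-of-envelope math on what that pace implies, assuming the horizon keeps growing exponentially. The doubling time below is a made-up illustrative number, not METR’s or Cotra’s:

```python
import math

def months_to_reach(current_hours: float, target_hours: float,
                    doubling_months: float) -> float:
    """Months until the task horizon grows from current_hours to
    target_hours, assuming a steady exponential doubling trend."""
    return doubling_months * math.log2(target_hours / current_hours)

# Illustrative only: 12h horizon today, 100h target, and a
# hypothetical 4-month doubling time at the "blistering" pace.
print(round(months_to_reach(12, 100, 4.0), 1))  # → 12.2
```

At that hypothetical rate, 12 hours to 100-plus hours takes about a year—which is exactly why “by year’s end” stops sounding crazy.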
What Was Everyone Expecting, Anyway?
Silicon Valley’s old guard—like me, after 20 years kicking tires—thought we’d see incremental wins. Claude 3.5 here, GPT-5 there. Consultants would charge fortunes to “integrate” it. But no one’s ready for agents that code marathons without coffee breaks.
This flips the script. Software engineering? Toast. Economy? AI colonizes it overnight. Lights flashing yellow, as Import AI puts it. Yellow for caution—or explosion.
And who cashes in? Not you, staring at your screen. It’s the labs hoarding compute, the VCs betting billions. ByteDance is quietly coding CUDA agents; satellites run AI on-device. The party’s private.
Look.
We’ve got a paper from GovAI and Oxford laying out 14 metrics to gauge AI R&D Automation (AIRDA). That’s AI building AI. The self-improvement loop everyone’s whispered about since Yudkowsky’s glory days.
Why obsess? Because AIRDA isn’t just progress—it’s an event horizon. Benefits rocket, sure. But so do bioweapons, nukes, mass job die-offs. They list ‘em plain.
14 Metrics to Spot the Monster Growing
Short version: track AI vs. humans on R&D tasks. Relative performance. Oversight red-teaming—can we babysit these beasts? Misalignment checks. Efficiency ramps.
Staff surveys: how much AI boosts productivity? High-stakes decisions handed to silicon? Researcher time allocation. Bug rates slipping through oversight. Subversion tallies—AI screwing humans on purpose?
Headcount of AI whizzes. Compute splits across R&D phases. Permissions creeping up. It’s exhaustive. Almost too tidy for the chaos ahead.
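You can caricature those categories as a dashboard schema. A minimal sketch—every metric name and threshold here is my invention, not GovAI’s actual list:

```python
from dataclasses import dataclass

@dataclass
class AIRDAMetric:
    """One tracked AI-R&D-automation indicator (hypothetical schema)."""
    name: str
    value: float          # latest reading, in the metric's own units
    alert_above: float    # escalation threshold (made-up numbers)

    def alarming(self) -> bool:
        return self.value > self.alert_above

# Toy readings loosely echoing the categories above
dashboard = [
    AIRDAMetric("ai_vs_human_rnd_win_rate", 0.35, 0.50),
    AIRDAMetric("oversight_bug_escape_rate", 0.02, 0.05),
    AIRDAMetric("compute_share_ai_directed", 0.60, 0.80),
    AIRDAMetric("deliberate_subversion_incidents", 1, 0),
]

alarms = [m.name for m in dashboard if m.alarming()]
print(alarms)  # → ['deliberate_subversion_incidents']
```

Note the zero-tolerance threshold on subversion: one incident of an AI deliberately screwing its evaluators trips the alarm, while the other metrics get headroom.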
One’s missing, though—my twist: echo the Manhattan Project’s secrecy logs. Back then, they tracked physicist morale, espionage risks, yield predictions. We need AI-specific “defection rates”—how often models hide capabilities from evaluators. History screams: measure the human element, or it bites.
But wait—companies should log differential progress. Safety vs. capabilities. Does oversight keep pace? Or does R&D spawn black boxes we can’t parse?
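One way to operationalize “does oversight keep pace”: log the ratio of safety progress to capability progress each release cycle. A toy sketch—the ratio, the floor, and the numbers are all hypothetical:

```python
def differential_progress(safety_gain: float, capability_gain: float,
                          floor: float = 1.0) -> tuple[float, bool]:
    """Ratio of safety to capability improvement for one release cycle.
    A ratio below `floor` means capabilities are outrunning oversight."""
    ratio = safety_gain / capability_gain
    return ratio, ratio >= floor

ratio, keeping_pace = differential_progress(safety_gain=0.10,
                                            capability_gain=0.40)
print(f"{ratio:.2f}", keeping_pace)  # → 0.25 False
```

A 0.25 ratio is the black-box scenario in numbers: capabilities compounding four times faster than anyone’s ability to parse them.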
Is AI R&D Automation Already Here?
Hell if I know. Labs test kernel-writing, model-training agents today. Proxies abound. But real metrics? Nah. We’re flying blind into recursive loops.
Picture it: AI designs better chips, trains fiercer models, iterates. Exponential. Cotra’s update? Proof we’re closer than skeptics (me included) admitted.
Cynical aside—PR spin calls this “empowering.” Empowering who? The C-suites automating R&D while pink-slipping coders. ByteDance’s CUDA agent? Chinese TikTok overlords scripting GPUs sans humans. On-device sat AI? Spies in orbit, no cloud lag.
Who’s making money? Compute barons—Nvidia, up 200% YTD. Not the drone pilots in Ukraine pondering AI wars.
This changes everything.
Governance? Metrics first. Mandate these 14. Track compute shares, permissions. Govs peek inside labs. Or we hit the horizon blind.
Bold call: by 2027, AIRDA metrics become law, like crash tests for cars. Ignore ‘em, and it’s not yellow lights—it’s red sirens.
Safety folks cheer. But remember: metrics lagged nukes too. Oppenheimer knew yields; we got Hiroshima.
Why Does This Scare the Valley Vets?
I’ve seen bubbles—dotcom, crypto. This ain’t hype; it’s hardware. Progress decouples from talent; compute + data = gods.
Cotra’s shift? Vindication for doomers. Yet optimists crow “economy boom.” Boom for whom? Gig workers? Nah.
Unique bit: parallels the transistor rush. Bell Labs measured yields obsessively. We skipped that for AI. Fix it, or regret.
**🧬 Related Insights**
- Read more: Google Caves: Simple Toggle Lets You Ditch AI Search in Photos After Backlash
- Read more: Inside DeepSeek R1: The Four Paths to Smarter LLM Reasoning

**Frequently Asked Questions**
What are the 14 AI R&D metrics?
They’re benchmarks like AI vs. human performance on R&D, oversight effectiveness, misalignment risks, compute usage shifts—full list in the GovAI paper.
How fast is AI progress really accelerating?
Ajeya Cotra says her 2026 forecasts are already toast; agents hit 12-hour tasks now, 100+ hours by EOY.
Will AI R&D lead to self-improving superintelligence?
Possibly—AIRDA metrics aim to detect it early, but without tracking, we’re gambling.