Large Language Models
Claude Opus 4.6 Tackles 12-Hour Coding Marathons—But the Metrics Are Wobbling
Imagine an AI chewing through software tasks that bury humans for 12 hours straight. That's Claude Opus 4.6 on the METR chart—but the numbers' confidence interval spans 5 to 66 hours. Is progress exploding, or just a mirage?