Large Language Models
Sandbox Bug Turns LLM Judge into Model Blamer: The Postmortem
Everyone figured autonomous LLM-as-judge setups were ready for prime time — plug-and-play truth machines for coding benchmarks. Then a sandbox hiccup delivered two rock-solid wrong verdicts, exposing how infra ghosts haunt even the sharpest evals.