According to Heidy Khlaaf, Chief AI Scientist at the AI Now Institute, large language models have a roughly 50% accuracy rate when deployed in real-world scenarios—yet Anthropic and OpenAI are already negotiating with the U.S. Department of War to integrate them into lethal autonomous weapons systems.
Let that sink in. We’re not talking about a chatbot that occasionally gives bad movie recommendations. We’re talking about algorithms that fail half the time being used to make targeting decisions in active combat zones.
The tech industry’s response? Safety theater. Oversight frameworks. Red lines. All of it misses the actual problem.
The Hallucination Problem Nobody Wants to Name
Here’s what’s wild about this situation: everyone knows the problem exists, and no one is pretending otherwise. Yet the framing has become a weird exercise in distraction.
“Current LLM safety is far from the reliability and accuracy measures that have long been a prerequisite for defense and safety-critical systems.”
Khlaaf nails it there. For decades, military and aerospace systems have operated under strict reliability standards: think 99.99% accuracy thresholds for critical operations. Pacemakers, aircraft autopilots, nuclear submarine control systems. These industries learned long ago that putting “good enough” humans in charge of a fundamentally broken tool doesn’t fix the tool.
But generative AI’s hallucinations aren’t bugs—they’re features. They’re built into the probabilistic nature of how these models work. And the companies making them have explicitly said these issues will persist. Think about that sentence for a second. They’re telling us the problem is unfixable, then turning around and pitching these models for warfare.
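To make that concrete, here’s a minimal sketch in Python (with an invented prompt, tokens, and probabilities; these are not any real model’s numbers) of why sampling-based generation can’t be made hallucination-free: every plausible-but-wrong continuation carries nonzero probability, so run the model enough times and it will be emitted.

```python
import random

# Toy next-token distribution a model might assign after the prompt
# "The capital of Australia is". All tokens and probabilities here
# are invented for illustration.
next_token_probs = {
    "Canberra": 0.62,    # correct
    "Sydney": 0.30,      # plausible but wrong
    "Melbourne": 0.08,   # plausible but wrong
}

def sample_token(probs: dict[str, float]) -> str:
    """Standard temperature-1 sampling: any token with nonzero
    probability can be emitted, including the wrong ones."""
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Over many generations, roughly 38% of answers come out wrong.
# Not a bug to be patched: sampling from the distribution is the design.
trials = 10_000
wrong = sum(sample_token(next_token_probs) != "Canberra" for _ in range(trials))
print(f"wrong answers: {wrong / trials:.1%}")
```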
The “Human in the Loop” Isn’t Actually a Loop
So here’s where the safety theater really breaks down. Anthropic and OpenAI keep talking about human oversight, about decision support systems, about keeping humans “in the loop.”
But that’s where automation bias enters the room—uninvited, and nearly impossible to banish. Researchers have known for years that when humans work alongside AI systems, they tend to over-trust them, especially under pressure. Put a military operator in a high-stress combat scenario, staring at an AI recommendation, and what do you think happens? The human becomes a rubber stamp.
The distinction between an AI decision support system (AI-DSS, which is what’s being deployed now) and an actual lethal autonomous weapons system (LAWS) shrinks to almost nothing in practice. A human technically makes the final call, sure. But if the AI is providing the targeting information, and that information is 50% accurate, you’ve essentially weaponized a coin flip.
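A back-of-the-envelope simulation shows how little the loop buys you. This is a toy model with assumed numbers (nothing measured from any deployed system): the AI is right half the time, the operator defers to it 90% of the time under pressure, and decides independently with 85% accuracy otherwise.

```python
import random

def loop_accuracy(ai_accuracy: float, accept_rate: float,
                  human_accuracy: float, trials: int = 100_000) -> float:
    """Toy 'human in the loop' model with assumed parameters: the AI
    recommendation is right with probability ai_accuracy; the operator
    rubber-stamps it with probability accept_rate (automation bias),
    otherwise decides independently with probability human_accuracy."""
    correct = 0
    for _ in range(trials):
        ai_right = random.random() < ai_accuracy
        if random.random() < accept_rate:   # operator defers to the AI
            correct += ai_right
        else:                               # operator overrides, decides alone
            correct += random.random() < human_accuracy
    return correct / trials

# 50% AI accuracy, 90% deference, 85% unaided human accuracy:
print(loop_accuracy(0.5, 0.9, 0.85))  # ~0.535
```

The expected value is 0.9 × 0.5 + 0.1 × 0.85 = 0.535. A skilled, attentive human in the loop moves a coin flip to roughly 53.5%. The loop exists on paper; the coin flip decides in practice.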
Why the Fog of War Breaks These Models
Khlaaf makes another crucial point that rarely makes it into policy discussions: LLMs can’t handle novel scenarios. They work when the problem fits their training data. Throw them into something outside that distribution—say, an unprecedented military situation, environmental variables they’ve never seen, the actual chaos of combat—and they don’t degrade gracefully. They fail catastrophically and often don’t signal that they’re failing.
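Here’s a toy illustration of that silent-failure mode, with random weights standing in for a trained classifier (no real model involved, and all numbers are illustrative): push the input far outside the distribution the weights were built for and the model’s confidence doesn’t collapse. It typically climbs, because larger inputs widen the logit gaps and saturate the softmax.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
# Stand-in for a trained classifier's final layer: fixed random weights.
W = 0.1 * rng.normal(size=(3, 16))

in_dist = rng.normal(0, 1, size=16)     # resembles the "training" data
out_dist = rng.normal(0, 10, size=16)   # nothing like the training data

for name, x in [("in-distribution", in_dist), ("out-of-distribution", out_dist)]:
    probs = softmax(W @ x)
    print(f"{name}: top-class confidence = {probs.max():.0%}")

# Typical output: the out-of-distribution input gets the *higher*
# confidence. The model has no built-in way to say "I've never seen
# anything like this"; it just reports a peaked probability anyway.
```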
The fog of war isn’t a metaphor. It’s the condition where information is incomplete, contradictory, and rapidly changing. It’s literally the worst possible environment for a probabilistic model that’s trained on historical data. It’s the environment where a 50% accuracy rate doesn’t just cost money or damage reputation. It costs lives. Often, the wrong lives.
Is Anyone Actually Asking the Right Questions?
What frustrates me about this whole situation—and maybe you should be frustrated too—is that the debate has been gerrymandered. Everyone’s arguing about whether there should be a human in the loop, what safety protocols are adequate, what oversight structures make sense.
Nobody’s asking the question that should come first: Is this technology fit for this purpose at all?
In aviation, pharmaceuticals, and nuclear energy, we don’t argue about how much oversight would make a fundamentally unsafe tool acceptable. We either fix it, replace it, or ban it. But with AI weapons, the conversation defaults to management theater because there’s money involved. Anthropic and OpenAI have defense contracts (or are pursuing them). The Pentagon has budgets to spend. The incentive structure points toward deployment, not skepticism.
The Real Problem Nobody’s Solving
And here’s the thing that keeps me up: this isn’t a problem that better safety training solves. It’s not a problem that more red lines solve. It’s not a problem that publishing a responsible AI charter solves.
These companies are fundamentally selling a product with known, admitted, unfixable flaws for use in systems where failure means indiscriminate killing. That’s not a governance problem. That’s a business problem masquerading as a technical one.
The negotiations between Anthropic, OpenAI, and the Department of War will probably result in some new framework, some new safety protocols, some new oversight board. And then, in six months or a year, these models will be integrated into military systems anyway. Because everyone involved has already decided that “acceptable risk” is code for “profitable deployment.”
Meanwhile, the actual risk—fundamentally unreliable algorithms making life-and-death decisions—gets papered over with another layer of safety theater.
Frequently Asked Questions
Can generative AI be fixed for military use? Not according to the companies making it. Hallucinations are inherent to how LLMs work probabilistically, and both Anthropic and OpenAI have stated these issues will persist. That’s a design limitation, not a training problem.
What does “50% accuracy” mean in a military context? It means the AI’s output is wrong roughly half the time. In a targeting system, that’s catastrophic. A human operator under automation bias is likely to accept the recommendation anyway, especially in high-stress combat scenarios.
Why is human oversight not enough? Automation bias, the tendency to over-rely on automated systems, means the human often becomes a checkbox rather than a genuine safeguard. If the underlying recommendation is right only half the time, no amount of human review makes that acceptable for lethal decisions.