The Safety Mirage: How AI Guardrails Evaporate Under Pressure
Every major AI lab’s safety framework has a national security override. That’s not a flaw — it’s the design.
They told us the guardrails were in place. Safety policies. Red lines. Responsible‑AI frameworks. Government‑convened AI safety institutes. The story was clean and reassuring: AI is powerful and potentially dangerous, but the adults in the room have a plan, the labs are committed, the rules are written down, and the whole thing will hold.
Then the Pentagon called Anthropic.
On one side of the table: a company founded explicitly on the promise that safety comes first — whose entire public brand was built on the claim that it would not race blindly toward dangerous capabilities, and whose flagship Responsible Scaling Policy contained explicit commitments not to support mass surveillance or weapons applications lacking meaningful human oversight. On the other side: defence officials quietly explaining that “all lawful purposes” must include exactly the uses Anthropic’s policy was designed to stop. When Anthropic resisted, the message hardened. Blacklisting. Defence Production Act language. Supply‑chain risk designations. The subtext was clear: your principles are admirable; our leverage is bigger.
Within weeks, Anthropic quietly rewrote its Responsible Scaling Policy. Hard “we will not” commitments softened into “we aim to” and “we will transparently assess,” the once‑clear red lines diluted into language any good lawyer can route around when the right client applies the right pressure. Nothing exploded publicly. There was no scandal. Just a safety pillar quietly edited under duress, and a lesson available to anyone still clinging to the idea that voluntary corporate guardrails will hold against state power.
If this is what happens to the most loudly self‑described “safety‑first” lab on earth the first time its principles meet the Pentagon’s interests, then the blunt question is unavoidable: in 2026, when someone in a suit tells you “we’ve put guardrails in place,” what do those words actually mean?
What the Official Safety Story Looks Like
Walk through the formal AI safety architecture as it’s been assembled over the past several years, and it looks, on the surface, impressively serious.
Labs publish detailed responsible‑AI policies, preparedness frameworks, and model cards. They run internal evaluations and red‑team exercises. They establish safety boards and advisory panels. Governments stand up AI Safety Institutes — in the United States, the United Kingdom, Canada, the EU — tasked with benchmarking models, producing guidelines, and advising on risk. The EU AI Act creates mandatory requirements for high‑risk applications and establishes enforcement mechanisms. Executive orders are signed. International summits are convened.
The story these structures tell is: we know AI is dangerous; we’ve built a system to manage that danger; experts are watching; there are real red lines.
But understanding why the architecture looks this way requires asking a question the official story doesn’t encourage: who built it, and what problem were they actually solving? AI Safety Institutes were not created primarily because policymakers concluded that independent enforcement was necessary. They were created because labs needed to demonstrate responsible stewardship to forestall binding regulation, and because governments needed to demonstrate they were paying attention without yet committing to the political cost of actually constraining the industry. The Bletchley Summit produced a declaration signed by twenty-eight countries — and no enforcement body. The EU AI Act built mandatory requirements for high-risk applications and then wrote explicit carve-outs that suspended those requirements wherever they would actually be tested. These are not the design choices of a governance architecture built to hold. They are the design choices of one built to be seen.
The term that describes what this architecture actually is: oversight theatre. Not governance failure — failure implies something was built to work and broke. Oversight theatre describes something more deliberate: the systematic production of oversight that is designed to appear load-bearing rather than to be load-bearing. Documents that sound like constraints but contain no enforcement mechanism. Institutions that look like watchdogs but have no authority to bite. Commitments that perform accountability without creating it. The trappings of the performance — the language, the panels, the frameworks, the summits — are real. The structural function they imply is not.
The obvious objection deserves a direct answer: voluntary frameworks have produced real compliance before. GDPR has teeth. Basel III restructured global banking capital requirements. Early internet standards coordinated behaviour across competing institutions with no enforcement body at all. If voluntary governance has worked in those domains, why not here?
The answer is in the conditions that made those cases work. GDPR succeeded because the EU had jurisdictional authority over companies operating in its market and the political will to levy fines that exceeded the cost of non-compliance — it was voluntary in name, statutory in practice. Basel III worked because the banks subject to it shared an interest in systemic stability and faced supervisors with real inspection access. Internet standards worked in a domain where no single actor had the power or incentive to unilaterally defect from interoperability. None of those conditions exist in the AI national security domain. The clients applying pressure here are not subject to the same governance architecture they’re reshaping. They are the architecture. When the entity that can override your framework is also the entity the framework is supposed to constrain, the framework isn’t a constraint. It’s a request.
It’s a coherent narrative. It just doesn’t survive the first serious pressure test. The moment power actually wants something that conflicts with the framework, the framework bends. What’s left standing, as we’re about to see, is mostly optics.
Anthropic: When the Red Line Met a Real Client
To understand what voluntary guardrails actually are, it helps to watch them fail in detail.
Anthropic’s Responsible Scaling Policy was the company’s flagship safety commitment — the document that was supposed to prove that you could build powerful AI while maintaining principled constraints. It included explicit commitments around catastrophic risks, promised to pause capability development if certain danger thresholds were crossed, and named two red lines it would not cross: deployment for mass domestic surveillance and fully autonomous weapons systems. Dario Amodei’s public communications presented both as commitments Anthropic would hold regardless of client pressure. For a company whose entire identity was built on the proposition that safety and capability are not in conflict, this document was the cornerstone.
Then the United States Department of Defense decided it wanted Claude for exactly the things the document was designed to prevent. The demands centred on “all lawful purposes” — a phrase that sounds almost bureaucratically neutral until you understand what “lawful” means inside the U.S. national security and intelligence apparatus. It means mass surveillance conducted under classified executive orders. It means targeting decisions made by algorithmic systems with legal cover for “incidental” civilian harm. It means a definition of “human oversight” elastic enough to include one analyst nominally supervising thousands of automated outputs per hour. “Lawful,” in the national security context, is not a constraint. It’s a permission structure.
Dario Amodei publicly rejected the Pentagon’s specific demands at the time. What happened next, however, was instructive: the Responsible Scaling Policy was revised anyway. Not by Anthropic’s safety team recommending stronger protections. By the same pressure dynamic that always operates when a small company holds something a large institution wants.
The assessment wasn’t limited to internal observers. Chris Painter, policy director at METR — the independent AI evaluation nonprofit that had reviewed the RSP draft with Anthropic’s own permission — described the revision in direct terms: the change showed Anthropic “believes it needs to shift into triage mode with its safety plans, because methods to assess and mitigate risk are not keeping up with the pace of capabilities.” An independent evaluator embedded in the safety ecosystem, describing the collapse of the document they had helped assess. The verdict from outside the lab matched the verdict from inside it.
The revised language replaced specific constraints with “aspirational objectives” and “transparent assessment” commitments — the difference between a load‑bearing wall and a decorative partition.
Anthropic is not uniquely cowardly; the mechanism that broke its commitments is general, and one detail from the negotiation record shows how. The DoD’s position was not simply that it wanted more capability. Its argument was that the specific prohibitions Anthropic had written into the RSP were unnecessary — that mass surveillance and autonomous weapons deployment were already covered by existing law, and therefore didn’t need to appear as explicit constraints in any contract. That argument, delivered as reassurance, is the same mechanism the next section of this essay diagnoses in the OpenAI arrangement: safety commitments dissolved not by being rejected, but by being absorbed into a legal dialect that renders them functionally empty while appearing to honour them.
This outcome was predictable from first principles. A company with no statutory protection for its safety commitments, dependent on revenue, facing a client who can threaten its federal contracting future and invoke national‑security law to override almost any private agreement, has very limited options. The RSP failed not because Anthropic lacked conviction but because voluntary pledges were never the right tool for this job. They were always going to lose this fight. The problem is that we were told they were the guardrail.
OpenAI: Safety by Contract and the Art of the Loophole
If Anthropic’s story is about guardrails that collapsed under explicit coercion, OpenAI’s is something more structurally interesting — and more disturbing. It is not a story of principles abandoned under pressure. It is a story of safety language that was present at every stage and functional at none of them.
The context matters. When the Pentagon terminated its Anthropic contract and the blacklisting memo circulated, OpenAI moved fast — not cautiously, not carefully, but competitively. Wired reported that Dario Amodei, in an internal memo, accused Altman of deliberately moving to fill the Pentagon gap in ways that made it harder for Anthropic to hold its line. The competitive incentive was explicit: be the lab that solved the Pentagon’s problem before the moment passed. That is the context in which the contract’s safety architecture was assembled. It was not designed to be the strongest possible framework. It was designed to be finished first.
Sam Altman announced the deal within days, framed publicly as proof that OpenAI, unlike Anthropic, could maintain safety principles while serving national security clients. The stated red lines: no use for mass domestic surveillance of Americans, no use for fully autonomous weapons systems, no use for high-stakes automated decisions like social credit systems. Altman claimed the arrangement had “more guardrails than any previous agreement for classified AI deployments, including Anthropic’s.” A government-contracts law expert at George Washington University, Jessica Tillipman, described what she observed as a contract for military weapons use being negotiated live on social media — as OpenAI employees pushed back publicly, in real time, on the terms Altman had just announced. Altman subsequently acknowledged the process had been “sloppy.” OpenAI employee Leo Gao went further, publicly describing the contract’s guardrails as “not really operative except as window dressing.” This is not a critic outside the company. This is someone inside the organisation, describing its guardrails in exactly the terms this essay’s argument has been building toward.
The language itself explains why. The commitments, even in their strongest form, prohibit only “intentional” and “deliberate” domestic surveillance of Americans. Those two words are not arbitrary. They are load-bearing in a specific legal sense — and the weight they carry points in the opposite direction from the one the contract implies.
Surveillance of Americans under Executive Order 12333 — the primary legal authority governing U.S. foreign intelligence collection — is legally premised on American exposure being incidental, not intentional. That is the entire doctrinal basis on which bulk collection operates: the government is targeting foreign actors; Americans swept in are, legally, incidental. The prohibition on “intentional” domestic surveillance does not touch this scenario. Nor does the original contract’s “no direct mass surveillance” formulation: a system that ingests commercially purchased location data, communication metadata, and behavioural profiles on millions of Americans to build targeting packages for intelligence operations is not directly surveilling anyone. The data was purchased. The AI is pattern-matching. The American exposure is incidental. TechPolicy.Press, Citizen Lab, and legal scholars at the Center for American Progress all identified the same gap, independently, within the same news cycle. Seeing the architecture of the loophole required no special access. It was visible to anyone who read the operative terms against the legal authorities they were written into.
Under immediate public pressure, OpenAI amended the deal. The amendments added explicit language prohibiting “intentional” domestic surveillance of U.S. persons and excluded intelligence agencies — including the NSA — from the contract’s scope without a separate follow-on agreement. Altman posted on X saying the previous version had been a mistake. The process had moved too fast. The issues were “super complex.”
TechPolicy.Press immediately identified what the amendments had not closed: joint task forces routinely include intelligence agency personnel; the contract contains no mechanism to prevent DoD entities from running tasks on behalf of intelligence agencies; the NSA exclusion has no enforcement architecture. The surveillance capability the original contract opened was not closed by the amendments. It was slightly narrowed, with its perimeter still undefined and its edge cases unresolved.
This is what the essay has been calling safety as a dialect — a separate language, spoken in contract law and national-security regulation, that sounds like ordinary English but operates differently. When OpenAI says “we have guardrails,” and those guardrails are expressed in this dialect, the statement can be simultaneously technically accurate and politically false. The system can contribute to exactly the harms the guardrails were supposed to prevent. Every actor in the chain can truthfully say: “We honoured our commitments.” Nobody violated the contract, because the contract was written in a language designed to accommodate what it claimed to prohibit.
The distinction between Anthropic’s case and OpenAI’s case is not that one company held its principles and the other abandoned them. It is that they demonstrate the same failure mode at different stages. Anthropic wrote explicit prohibitions, held them under pressure, and lost — the prohibitions were revised away. OpenAI wrote language that appeared equivalent but used terms of art already defined, by the relevant legal authorities, into near-irrelevance. In both cases the safety architecture performs accountability without creating it. The difference is only in where the seam shows.
Safety Institutes Are Oversight Theatre in Its Purest Form
The institutional layer around the labs was supposed to be the check on the labs. Independent bodies, government-convened or otherwise, whose job was to evaluate, benchmark, and advise — standing between the labs and deployment, with enough credibility to mean something. That is what the official story said. The record of what actually happened is more instructive.
Start with what these institutes were designed to do. AI Safety Institutes produce evaluation frameworks, benchmark models, publish best-practice guidance, and advise governments on risk classification. This work has genuine value in a specific and limited domain: accidental capability failures. Where a model might hallucinate, produce dangerous instructions, or exhibit unexpected behaviours during normal consumer use, institutes can identify patterns, flag risks, and nudge labs toward better practice. That is a real function. It is also, in the context of this essay’s argument, the least consequential domain in which AI is deployed.
What these institutes cannot do is enforce anything. They have no statutory veto power over deployment decisions. They cannot compel a lab to halt a contract. They cannot inspect classified deployments. They cannot impose penalties. The NIST AI Risk Management Framework — the closest the U.S. has to a national AI governance standard — describes itself as “intended for voluntary use.” That phrase is not a bureaucratic hedge. It is a precise description of the framework’s legal status: a preference, not a constraint. Every institute in the ecosystem operates under the same condition. Advisory authority in a domain where the consequential decisions are made by defence ministries and intelligence agencies operating under rules the institutes were never given authority to touch.
The Future of Life Institute’s AI Safety Index — the most systematic external attempt to evaluate whether labs are actually meeting their safety commitments — reached a conclusion that deserves to be read plainly: self-regulation is not working. Most major labs lack meaningful independent tripwires — external mechanisms that would actually halt a deployment if safety criteria were not met. Their evaluations are largely self-reported. Their red lines are not attached to enforceable pause conditions. There is, functionally, no independent third party with both the access and the authority to say “stop” before a deployment happens. The Index found this across labs, across jurisdictions, across the entire ecosystem of voluntary safety commitments. The problem is not that one lab is non-compliant. The problem is structural.
The most direct evidence of what happens to safety institutes when they become politically inconvenient is not an analysis — it’s a sequence of documented decisions. In February 2025, the UK government renamed its AI Safety Institute to the AI Security Institute, dropping “safety” from the name and explicitly narrowing its mandate away from bias, transparency, and broader ethical concerns toward cybersecurity and criminal applications. The fact-checking organisation Full Fact described the rebrand as a “downgrade” of ethics standards. Four months later, the Trump administration renamed the U.S. AI Safety Institute the Center for AI Standards and Innovation, reorienting it toward national competitiveness and deregulation. The word “safety” was removed from the name of the institution whose entire purpose was AI safety. TechPolicy.Press noted the obvious: this is not semantics. The renaming marks a pivot between two competing visions for AI governance — one that emphasises long-term risk mitigation and public accountability, and one that prioritises innovation, speed, and national competitiveness. The second vision won. It won without a policy debate, without a legislative process, and without the public being asked.
This is oversight theatre in its purest form: serious-looking PDFs, carefully convened panels, impressive benchmarks — none of which change anything when a billion-dollar contract is on the table and a general is in the room. And when even the aesthetic of safety becomes inconvenient, the institutions that produced it get quietly renamed.
The National Security Skeleton Key
Here is the deeper structural problem — the one that explains not just that the safety architecture fails, but why it could never have been built any other way.
Every major AI governance framework — the EU AI Act, U.S. executive orders, international safety agreements, even the companies’ own internal policies — contains broad carve‑outs for national security and defence. The mechanism isn’t subtle. The EU AI Act’s Article 2(3) explicitly suspends the regulation’s mandatory requirements for national security, defence, and military purposes — not as an edge case, but as a foundational design choice. The most consequential AI applications on earth are expressly outside the scope of the most comprehensive AI governance legislation on earth. That is not a loophole. It is the architecture.
Understanding why requires asking who was in the room when these frameworks were written, and what their actual interests were. Safety institutes evaluate consumer-facing models. Governance toolkits certify compliance with voluntary standards. But national security agencies — the NSA, GCHQ, the Defense Intelligence Agency, the agencies that sit behind the classified procurement contracts — operate under legal authorities that predate AI by decades: FISA, Executive Order 12333, classified annexes to defence appropriations bills. These authorities were built to be invisible to external auditors. They were designed to operate in classified environments no safety institute can see, no civil society researcher can subpoena, and no international framework can inspect. When the AI governance architecture was being built, these actors did not have to ask for a carve-out. They received one because no government involved was willing to write rules that actually applied to itself.
The result is a safety regime that is strongest exactly where it matters least — consumer applications, productivity tools, low‑stakes automation — and structurally absent exactly where it matters most. You can audit an AI‑generated insurance recommendation. You can benchmark a chatbot for bias. You cannot audit an AI‑assisted targeting decision made inside a classified military network under a contract whose terms are themselves classified. You cannot inspect whether the “intentional” surveillance prohibition the OpenAI contract contains is being honoured when the entire collection apparatus operates under an authority that defines American exposure as incidental. The same legal architecture that made the OpenAI contract loopholes possible — EO 12333, the FISA framework, the classified procurement vehicle — is the same architecture the safety governance system was structurally prevented from reaching.
This is not a gap that better governance design would close. The carve-outs are not oversights waiting to be corrected. They were placed deliberately by the same governments that convened the Bletchley Summit, signed the declarations, and stood up the institutes. The safety architecture and the national security override were built simultaneously, by the same actors, for the same reason: to produce the appearance of governance in the domain where governance is politically safe, and to preserve operational freedom in the domain where it actually matters. The two are not in tension. They are designed to coexist.
Any guardrail with a “national security override” is not a guardrail. It is a suggestion addressed to the one actor in the system who was never going to follow it.
What a Real Guardrail Would Look Like
It should be said plainly: voluntary corporate policies, industry frameworks, and advisory institutes are not intrinsically useless. In a world where the primary risks from AI were accidental capability failures and commercial misbehaviour, they would be reasonable first steps. That is not the world we are in.
We are in a world where the formal governance architecture has been stress-tested by state power and failed — visibly, documentably, within a single news cycle. So the right question is not what better voluntary frameworks would look like. It is what is actually functioning as AI governance right now, in the absence of anything structurally adequate.
The answer the record gives is uncomfortable. The closest thing to a functioning check on the worst deployments has not been an institute, a framework, or a policy document. It has been engineers and executives who decided, at personal cost, that they would not participate. Caitlin Kalinowski leaving OpenAI rather than continue leading a hardware and robotics program being steered toward deployment terms she concluded were ungoverned. The engineers from Google and OpenAI who signed the cross-company “We Will Not Be Divided” letter — explicitly naming the government’s strategy of playing labs against each other as the mechanism they were refusing to serve. At the time of writing, that letter carried 666 verified signatures: 573 from Google, 93 from OpenAI. That number will have changed by the time you read this; the letter remains open. Whether or not the universe chose that figure deliberately, it is a number that carries considerable cultural baggage — and as a headcount for who is currently serving as civilisation’s primary AI safety mechanism, it is, at minimum, on-brand. Individual conscience, exercised at individual cost, with no statutory protection, no institutional backing, and no guaranteed outcome.
What did it produce? Kalinowski’s departure and the employee pressure that accompanied it contributed to OpenAI amending the contract — adding the “intentional” surveillance prohibition, adding the NSA exclusion, triggering the public acknowledgment that the original deal had been rushed. That is a real effect. It is also exactly the amended contract this essay already diagnosed in the OpenAI case study: the version that prohibits “intentional” surveillance while leaving the incidental collection architecture untouched, that excludes the NSA while leaving the enforcement mechanism undefined. Individual conscience, at its most effective, produced the best available outcome — and the best available outcome is a document riddled with the precise loopholes that make it compatible with the harms it was supposed to prevent.
That is the structural verdict on individual conscience as governance: it doesn’t stop deployments. At best, it reshapes the language that surrounds them. And the language, as the OpenAI contract analysis showed, is the problem.
A real guardrail looks different from everything described in this essay. It requires binding statute that no procurement relationship can override. It requires inspection rights that reach classified deployments, not just consumer products. It requires hard prohibitions — on autonomous lethal targeting, on dragnet citizen scoring, on warrantless AI-assisted surveillance — expressed in plain language with enforcement architecture attached, not in a legal dialect that accommodates what it claims to prohibit. It requires whistleblower protection strong enough that raising an alarm doesn’t mean career destruction is the guaranteed outcome.
None of those things exist in any jurisdiction today at meaningful scale. What exists instead is individual conscience — unscalable, unreproducible, and, as the record shows, capable only of producing the amended language rather than the structural accountability. At best, it is a holding action while the real architecture gets built. At worst, it is the thing we point to when we need to believe the system is working.
A civilisation that depends on individual conscience inside private companies as its primary AI safety mechanism has already failed at governance.
The Questions That Remain
Because the safety mirage is going to keep being constructed — every time a new lab launches a new policy, every time a government stands up a new institute — you need a set of instincts that cut through the aesthetics quickly.
When someone tells you “we’ve put guardrails in place,” there are three questions that strip the claim down to what it is:
Who can override them? If the answer is “a sufficiently large client” or “national security classification” or “executive order,” the guardrail is decorative. This essay has shown you two documented cases in which that override was exercised. Neither required a crisis. Both happened in a single procurement cycle.
What actually happens when they’re violated? If the answer is “the company publishes a transparency report” or “the institute will note it in next year’s index,” the guardrail has no mechanism. A constraint with no consequence is a preference.
Who, outside the lab and the client, can see the logs? If the answer is “nobody with inspection rights and enforcement authority,” then whatever is being described is private self-assessment — the same model that allowed every prior industry to certify its own safety until the bodies piled up.
These are not hostile questions. They are the minimum due diligence of a literate citizen in the AI age. Apply them to every framework, every institute, every policy announcement you encounter. Most will fail all three.
The Anthropic–Pentagon standoff and the OpenAI surveillance question are not edge cases. They are previews of the central conflict of the next decade: between the formal AI safety architecture, which is designed to manage commercial risk and accidental capability failures, and the actual uses to which the most powerful AI systems are being put, which are military, surveillance, and governance applications the formal safety architecture was never built to reach.
The mirage doesn’t need to fool everyone. It just needs to hold long enough for the stack to harden.




