Summary
On May 5, 2026, the Center for AI Standards and Innovation (CAISI), the US Department of Commerce laboratory formerly known as the US AI Safety Institute, announced pre-deployment evaluation agreements with Google DeepMind, Microsoft, and xAI. The agreements bring the total number of frontier AI developers under formal CAISI evaluation arrangements to five, adding to existing partnerships with Anthropic and OpenAI that had been renegotiated to reflect CAISI's revised directives from the Secretary of Commerce and America's AI Action Plan. CAISI has completed more than 40 evaluations to date, including assessments of models not yet publicly released.
What Happened
CAISI published its announcement on May 5, 2026, describing new agreements with Google DeepMind, Microsoft, and xAI to conduct pre-deployment evaluations and targeted research aimed at assessing frontier AI capabilities with a focus on national security risks. The agreements explicitly authorize testing within classified government environments and were drafted with provisions allowing the three companies to provide versions of their models with reduced or removed safety guardrails specifically to facilitate evaluation of dual-use and adversarial capabilities.
The announcement noted that evaluators from multiple federal agencies may participate in reviews through the CAISI-convened TRAINS Taskforce, a standing interagency group focused on AI national security concerns. It described the new agreements as building on prior partnerships with Anthropic and OpenAI, both of which had been renegotiated following the Trump administration's reorganization of the office. The prior agreements had originated under the Biden administration's US AI Safety Institute; CAISI's mandates from the Secretary of Commerce and America's AI Action Plan, issued after the change of administration, modified the program's scope and directives while preserving the core pre-deployment access mechanism.
Kevin Hassett, director of the National Economic Council, had publicly characterized the CAISI evaluation model as analogous to FDA review of pharmaceuticals, a framing the administration was using to justify pre-deployment access without formally designating it as regulatory oversight. CAISI has no statutory authority to compel lab participation; all five agreements are voluntary. Axios reported that the White House's posture had shifted materially in the weeks preceding the announcement, partly in response to Anthropic CEO Dario Amodei's May 5 public statements citing Mythos Preview's ability to autonomously identify tens of thousands of high-severity vulnerabilities across major operating systems and browsers.
The expansion follows a sequence of policy actions during the spring of 2026. On April 7, Anthropic announced Project Glasswing, giving approximately 40 organizations, including CAISI's interagency partners at the NSA, access to the Mythos Preview model for defensive security work. On May 1, the Department of War (DoD) announced formal AI deployment agreements with eight companies for use on classified IL6/IL7 networks, explicitly excluding Anthropic. The CAISI expansion on May 5 is distinct from the DoD deployment agreements: it covers evaluation before release, not operational deployment, and is administered by the Commerce Department rather than the Pentagon.
As of May 5, 2026, the confirmed participants in CAISI evaluation agreements are Anthropic, OpenAI, Google DeepMind, Microsoft, and xAI. This set covers the five organizations most frequently identified as developers of frontier models with significant national security implications. No formal evaluation agreement had yet been announced with DeepSeek, Mistral, or other international frontier labs.
Why It Matters
The May 5 agreements constitute the fullest expression to date of what a voluntary US pre-deployment AI evaluation regime looks like in operation. Five of the largest frontier AI developers, representing the vast majority of the most capable publicly available and restricted models, have now formally agreed to provide CAISI with pre-release model access for national security assessments. The mechanism is entirely voluntary and relies on no statutory compulsion, and the terms of each agreement are not publicly disclosed. What is known is that developers provide models with reduced safeguards for government testing, that classified evaluation environments are available, and that the interagency TRAINS Taskforce participates in the reviews.
The expansion is notable for the specific addition of xAI. Elon Musk's AI company is led by a CEO who has been publicly hostile to AI safety governance frameworks, and its founder was simultaneously a plaintiff in federal litigation against OpenAI. That such a company has entered a pre-deployment evaluation agreement with the federal government represents an accommodation that would have appeared unlikely twelve months prior. xAI's inclusion, alongside the continuation of Anthropic's and OpenAI's agreements, effectively eliminates any credible argument that CAISI evaluations target a politically selective subset of the industry.
The broader policy context is a documented reversal. The Trump administration had, in early 2025, dismantled the Biden-era US AI Safety Institute by renaming it CAISI and restructuring its mandate. Critics characterized this as gutting federal AI safety oversight. The May 5 expansion — extending formal pre-deployment agreements to three additional companies — partially complicates that narrative, though it does not resolve the underlying questions: whether the agreements produce actionable safety assessments, whether evaluation findings lead to deployment conditions or delays, and whether voluntary pre-deployment testing is sufficient given that the five companies retain full discretion over what modifications, if any, to make before public release. The mechanism creates a documented record that a model was evaluated; it does not create a documented obligation to act on the findings.
How to read the metadata
- Landmark: Fundamentally alters the trajectory; 2–5 per year.
- Major: Meaningfully shifts the landscape; 2–4 per month.
- Notable: Worth documenting; significance can be upgraded later.
- Confidence: High = primary sources corroborate. Medium = credible secondary only. Low = provisional. Disputed = credible sources disagree.
- Contestation: Uncontested = no formal challenge. Contested = at least one challenge open. Superseded = replaced by a later entry. Unresolved = dispute still open.