March 12, 2026

As the Pentagon increasingly relies on artificial intelligence, a question has arisen: How can one be sure that the AI models are working the way they should?

The best way is to test new AI before users get their hands on it. So, the Defense Department — along with the Office of the Director of National Intelligence — is seeking a system that can test whether AI models meet specified criteria.

“As artificial intelligence (AI) capabilities evolve at an extraordinary pace, the government requires evaluation infrastructure that can keep pace by continuously assessing new models against mission-specific benchmarks as they are released,” according to an Area of Interest announcement from the Defense Innovation Unit.

DOD also wants to ensure that AI and humans work well together. “Evaluation must assess not only whether AI systems can perform tasks in isolation, but whether human-AI teams achieve better mission outcomes than either humans or AI alone,” the announcement said.

DIU envisions a “harness” with a standard, pluggable architecture that can test any AI — developed by any contractor — and provide a consistent, structured evaluation. This includes studying workflows across different environments, safely auditing AI agents and allowing human experts to assess “human workload, usability, and mission performance across human-only, AI-only, and human-AI team scenarios.”

The harness should also test whether the AI can function amid chaotic, low-information conditions. The system must simulate “operational stress and network degradation in a controlled, reproducible environment,” DIU said.

The harness will also evaluate whether enemy AI can hijack or confuse friendly AI models. The system must support “automated red-teaming, including the execution of adversarial prompts and attack patterns.”

AI will be assessed against a variety of benchmarks. Building them involves “identifying what capabilities matter for a given mission context” and breaking down complex AI capabilities into smaller, measurable tasks. Results should be clear, establish what constitutes a good score for an AI, and be delivered in a format that is “easily understood and can be acted upon by decision makers.”

DIU was also careful to note that the evaluation system must be fair, with “no systemic advantage to particular architectures or vendors.”

The deadline for responses to the announcement is March 24.

Michael Peck is a correspondent for Defense News and a columnist for the Center for European Policy Analysis. He holds an M.A. in political science from Rutgers University. Find him on X at @Mipeck1. His email is mikedefense1@gmail.com.