February 21, 2024, 2:12 pm UTC+0
Written by Maxwell Ellis
Summary:
- Pentagon’s CDAO partners with Scale AI for a one-year project.
- Aims to develop a robust testing and evaluation (T&E) framework for large language models (LLMs).
- Focus on deploying AI safely for military planning with real-time feedback.
- Unique “holdout datasets” and model cards to be created for iterative evaluation.
- Emphasis on automation for continuous assessment.
- CEO Alexandr Wang highlights the importance of responsible deployment and understanding generative AI’s strengths and limitations.
Scale AI to Develop T&E Framework for Pentagon’s Large Language Models
In a strategic move, the Pentagon’s Chief Digital and Artificial Intelligence Office (CDAO) has chosen Scale AI to pioneer a robust testing and evaluation (T&E) framework for large language models (LLMs). The initiative responds to the potential of generative AI to shape military planning and decision-making. The San Francisco-based company exclusively shared details of the one-year contract with DefenseScoop.
The comprehensive framework aims to provide the CDAO with a reliable means to deploy AI safely, offering real-time feedback for warfighters and creating specialized evaluation sets for military applications. These evaluation sets will specifically focus on organizing findings from after-action reports, contributing to the refinement of AI models tailored for military support.
The generative AI field, encompassing large language models, holds promise for the Department of Defense but presents challenges due to the lack of universally accepted AI safety standards and policies. To tackle this, the Pentagon established Task Force Lima within the CDAO’s Algorithmic Warfare Directorate last year.
Scale AI intends to apply a T&E approach similar to traditional methods but adapted to the unique characteristics of large language models. Because these models generate open-ended text and evaluating natural-language output is inherently complex, the company plans to build “holdout datasets” from prompts written by DOD insiders. The models’ responses will then be adjudicated through layers of review to ensure their quality is comparable to that of a human military expert.
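To make the idea concrete, here is a minimal Python sketch of what a holdout-dataset evaluation pass with layered review could look like. The names (HoldoutExample, evaluate_holdout, review_layers) are illustrative assumptions, not details of Scale AI’s or the CDAO’s actual tooling.

```python
# Hypothetical sketch of a holdout-dataset evaluation with layered review.
from dataclasses import dataclass, field

@dataclass
class HoldoutExample:
    prompt: str        # prompt written by a DOD domain expert
    reference: str     # expert-authored reference answer
    notes: list = field(default_factory=list)

def evaluate_holdout(model, examples, review_layers):
    """Run each held-out prompt through the model, then pass the response
    through successive review layers (e.g. automated checks followed by
    human adjudication) and collect the verdicts."""
    results = []
    for ex in examples:
        response = model(ex.prompt)                 # call the model under test
        verdicts = [review(ex.prompt, ex.reference, response)
                    for review in review_layers]    # apply each review layer in order
        results.append({"prompt": ex.prompt,
                        "response": response,
                        "verdicts": verdicts})
    return results
```

Because the prompts in a holdout set are never used for training or tuning, scores on them give a cleaner read of how a model will behave on unseen military material.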
The iterative process includes refining datasets relevant to the DOD, assessing existing large language models against them, and creating model cards: documents that describe the contexts in which a machine learning model is best used and the information needed to measure its performance.
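For illustration, a model card can be as simple as a structured record. The fields below follow the common model-card convention, and every value is a hypothetical placeholder rather than a DOD or Scale AI template.

```python
# Illustrative model card; all names and numbers are placeholders.
model_card = {
    "model_name": "example-llm-v1",                          # hypothetical model identifier
    "intended_use": "Summarizing after-action reports for planning support",
    "out_of_scope": ["Autonomous targeting decisions"],
    "evaluation_data": ["dod-holdout-set-v1"],               # hypothetical holdout dataset
    "metrics": {"adjudicated_pass_rate": None,               # filled in after review
                "refusal_rate": None},
    "limitations": "Performance is measured only on the domains listed above.",
}
```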
Automation is a key focus, with the goal of establishing a baseline understanding of new models’ performance characteristics. The intent is for models to send signals to CDAO officials, alerting them if they deviate from the domains they have been tested against.
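One way such a signal might be automated is sketched below, assuming a hypothetical classify_domain function that maps an incoming prompt to one of the domains covered during testing; nothing here is drawn from the actual CDAO tooling.

```python
# Minimal sketch of an automated "outside tested domains" signal.
TESTED_DOMAINS = {"after-action-report-summarization", "logistics-planning"}  # hypothetical

def check_domain(prompt: str, classify_domain) -> dict:
    """Flag prompts whose predicted domain was never covered during T&E,
    so reviewers can be alerted before relying on the model's output."""
    domain = classify_domain(prompt)
    return {
        "prompt": prompt,
        "domain": domain,
        "within_tested_domains": domain in TESTED_DOMAINS,
    }
```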
Scale AI’s statement emphasizes that this work will contribute to maturing T&E policies for generative AI, enabling the adoption of large language models in secure environments. The company said it is honored to partner with the DOD on this groundbreaking framework. Beyond the CDAO, Scale AI has established partnerships with Meta, Microsoft, the U.S. Army, OpenAI, and other industry leaders.
Alexandr Wang, Scale AI’s Founder and CEO, highlighted the importance of testing and evaluating generative AI to understand its strengths and limitations responsibly. The collaboration with the DOD underscores Scale AI’s commitment to advancing the field and supporting the deployment of AI technologies in a secure and controlled manner.