The Chief Digital and Artificial Intelligence Office (CDAO) Test and Evaluation (T&E) Directorate supports the testing of a variety of Artificial Intelligence (AI) and Machine Learning (ML) applications throughout the Department of Defense (DoD). To enable AI testing throughout the DoD, CDAO T&E is funded to develop the Joint AI Test Infrastructure Capability (JATIC), a suite of interoperable software tools for comprehensive AI model T&E. By providing deeper understanding of AI model performance through testing, JATIC will support the DoD's broader missions of accelerating the adoption of AI and ensuring the development of Responsible AI.
20 December 2022 Update: See changes/clarifications highlighted below in 'How You Can Participate'
15 December 2022 Update: The government stakeholders will hold an open call for all Industry to ask questions regarding the CDAO T&E Software Tool Prototyping challenge.
The call will be open from 1430-1530 ET on 19 December 2022; dial-in information is below. Again, this call is open to any questions from Industry relating to the challenge.
The questions posted in the challenge via the Portal will continue to be monitored and answered as well, up until the closing of the challenge.
There will not be a summary provided of the 19 Dec call's questions and answers, but if the challenge changes based on the call's Q&A, an update will be posted in the challenge on the Portal.
Dial-in Information:
+1 410-874-6749,,566330287#
Phone Number: 410 874 6749; Phone Conference ID: 566 330 287#
The following DSN numbers may also be used: East DSN 322-874-6739 | West DSN 322-874-6749
The purpose of this Call to Industry is to select vendors for the development of prototypes for JATIC in FY23. Successful prototypes will be matured and productionized.
This Call to Industry is focused on the T&E of AI models for Computer Vision (CV) classification and object detection problems. All of the information below refers exclusively to T&E of AI models for CV classification and object detection. CDAO acknowledges that there are capability gaps within other areas of T&E, but these other areas are considered out of scope for this call. In particular:
- Other AI modalities besides CV, such as autonomous agents, natural language processing, etc., are out of scope.
- Other areas of T&E, including systems integration T&E, human-machine T&E, and operational T&E, are out of scope.
- Monitoring of AI model performance at the edge is out of scope.
- Modeling & simulation environments are out of scope.
These other areas will be addressed through other programs or in future years of the JATIC program.
There is widespread interest throughout the DoD for enterprise-level T&E capabilities to address the novel challenges posed by the T&E of AI. While many T&E capabilities have been developed by previous DoD AI Programs to meet the specific needs of those programs, their usage and distribution throughout the Department has been limited by several key factors, including:
- Lack of maturity, scalability, reproducibility, ease of use, or cyber-hardening of capabilities.
- Difficulty deploying and using capabilities within DoD environments.
- Difficulty using capabilities within ML pipelines and technology stacks.
- Difficulty using capabilities in an integrated T&E pipeline, as each individual capability often requires custom model formats or data wrangling.
- Inability for capabilities to evolve with developments in AI research.
These limitations have inhibited the ability of DoD AI programs to perform comprehensive T&E of their AI models.
Within its minimum viable product (MVP), JATIC seeks to develop capabilities across the following dimensions of AI T&E:
Dataset Analysis
Dataset analysis capabilities allow T&E stakeholders to understand held-out T&E datasets, including their quality and similarity to operational data. This dimension is focused on assessing properties of a dataset that are independent of a particular AI model under test. Example functionalities include (but are not limited to) the computation of the following (a minimal illustrative sketch follows this list):
- Dataset quality (e.g., label errors, missing data)
- Dataset sufficiency for a given test (e.g., number of samples, variation)
- Biases, outliers, and anomalies in the dataset, which may be naturally occurring or intentionally inserted (i.e., data poisoning)
- Comparison of two datasets (e.g., divergence of dataset distributions)
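As a concrete illustration of the dataset-comparison item above, the minimal sketch below contrasts two image datasets via the Jensen-Shannon distance between a simple per-image statistic. It is a sketch under stated assumptions rather than a JATIC interface: the function names (brightness_histogram, dataset_divergence), the choice of per-image brightness as the feature, and the array layout are all illustrative.

```python
# Illustrative sketch only; names and feature choice are assumptions.
import numpy as np
from scipy.spatial.distance import jensenshannon


def brightness_histogram(images: np.ndarray, bins: int = 32) -> np.ndarray:
    """Histogram of mean pixel intensity per image (images: N x H x W x C, values in [0, 1])."""
    means = images.reshape(len(images), -1).mean(axis=1)
    hist, _ = np.histogram(means, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()


def dataset_divergence(test_images: np.ndarray, operational_images: np.ndarray) -> float:
    """Jensen-Shannon distance between the brightness distributions of two datasets."""
    return float(jensenshannon(brightness_histogram(test_images),
                               brightness_histogram(operational_images), base=2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    test = rng.uniform(0.0, 1.0, size=(500, 32, 32, 3))          # held-out T&E set
    operational = rng.uniform(0.2, 1.0, size=(500, 32, 32, 3))   # shifted "operational" data
    print(f"JS distance between datasets: {dataset_divergence(test, operational):.3f}")
```

A fielded capability would operate on richer representations (e.g., learned embeddings) and cover the quality, sufficiency, and bias checks listed above.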
Model Performance
Model performance capabilities allow T&E stakeholders to assess how well an AI model performs on a labeled dataset and across its population subclasses. Example functionalities include (but are not limited to):
- A comprehensive set of well-established CV metrics (e.g., precision, recall, mean average precision)
- Metrics to assess probability calibration, i.e., the reliability of the model's measure of predictive uncertainty (e.g., expected calibration error, entropy, reliability diagrams)
- Metrics to assess fairness and bias in model output across the test dataset (e.g., statistical parity)
In addition to pre-defined metrics, capabilities will support the creation of custom, user-defined metrics.
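To make the metric and calibration examples above concrete, the sketch below computes macro-averaged precision and recall with scikit-learn and a simple expected calibration error by hand. It is illustrative only: the equal-width binning and the expected_calibration_error name are assumptions, not a prescribed JATIC API.

```python
# Illustrative sketch only; the ECE binning scheme is an assumption.
import numpy as np
from sklearn.metrics import precision_score, recall_score


def expected_calibration_error(confidences, predictions, labels, n_bins: int = 10) -> float:
    """Weighted average gap between confidence and accuracy over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 3, size=1000)
    preds = np.where(rng.random(1000) < 0.8, labels, rng.integers(0, 3, size=1000))  # ~80% correct
    confs = rng.uniform(0.5, 1.0, size=1000)                                         # reported confidence
    print("macro precision:", precision_score(labels, preds, average="macro"))
    print("macro recall:   ", recall_score(labels, preds, average="macro"))
    print("ECE:            ", expected_calibration_error(confs, preds, labels))
```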
Model Computational Performance
Model computational performance capabilities allow T&E stakeholders to measure the computational efficiency, resource usage, and scalability of an AI model. Example functionalities include (but are not limited to) the computation of the following (a brief timing sketch follows this list):
- Model throughput, latency, resource usage, and scalability, as well as constraints on these properties
- Model performance across different hardware configurations
- Optimal batch sizes for model inference
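The timing sketch referenced above is shown below: it measures the latency and throughput of a placeholder PyTorch model across several batch sizes. The toy model, iteration count, and batch sizes are stand-ins; a fielded capability would also capture memory use and accelerator utilization.

```python
# Illustrative sketch only; the model, input size, and batch sizes are placeholders.
import time

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)).eval()


@torch.no_grad()
def benchmark(batch_size: int, n_iters: int = 20, image_size: int = 224):
    """Return (mean latency in seconds per batch, throughput in images per second)."""
    x = torch.randn(batch_size, 3, image_size, image_size)
    model(x)  # warm-up pass
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    latency = (time.perf_counter() - start) / n_iters
    return latency, batch_size / latency


if __name__ == "__main__":
    for bs in (1, 8, 32):
        lat, thr = benchmark(bs)
        print(f"batch={bs:3d}  latency={lat * 1000:8.1f} ms  throughput={thr:8.1f} img/s")
```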
Model Analysis
Model analysis capabilities expand upon the model performance dimension, allowing T&E stakeholders to gain a deeper understanding of a model's capabilities and limitations by uncovering hidden trends, patterns, or insights across the input space. Example functionalities include (but are not limited to) algorithms that do the following (a clustering sketch follows this list):
- Detect trends in model performance across the entire dataset, such as distinct types of model errors
- Identify clusters of the model input space based on model predictions, feature values, network activations, etc.
- Identify potentially mislabeled ground truth data based on sets of model predictions
- Determine under-tested or high-value regions of the input space (for the model under test) to inform future test data collection or labeling
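As one example of the cluster-based analysis above (the sketch referenced in the list), the code below groups samples by K-means clusters of their embeddings and reports the error rate within each cluster, a simple way to surface systematic failure modes. The embedding source, cluster count, and synthetic data are assumptions.

```python
# Illustrative sketch only; embeddings, labels, and cluster count are synthetic placeholders.
import numpy as np
from sklearn.cluster import KMeans


def error_rate_by_cluster(embeddings, predictions, labels, n_clusters: int = 8, seed: int = 0):
    """Cluster samples by their embeddings and return the error rate within each cluster."""
    clusters = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    errors = np.asarray(predictions) != np.asarray(labels)
    return {c: float(errors[clusters == c].mean()) for c in range(n_clusters)}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(2000, 64))  # e.g., penultimate-layer activations of the model under test
    labels = rng.integers(0, 5, size=2000)
    preds = np.where(rng.random(2000) < 0.85, labels, rng.integers(0, 5, size=2000))
    for cluster, err in sorted(error_rate_by_cluster(embeddings, preds, labels).items()):
        print(f"cluster {cluster}: error rate {err:.2%}")
```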
Natural Robustness
Natural robustness capabilities enable T&E stakeholders to determine how natural corruptions in data can impact model performance. These corruptions emulate realistic noise that may be encountered within the operational deployment environment. Example corruptions include (but are not limited to):
- Pre-sensor, environmental or physical corruptions (e.g., fog, snow, rain, changes in target shape or dimensions)
- Sensory corruptions (e.g., out-of-focus, glare, blur)
- Post-sensor, in-silico corruptions (e.g., Gaussian noise, digital compression)
Capabilities for creating these corruptions may leverage synthetic data generation techniques. Capabilities will provide users with the ability to control the severity level of the corruption, so that performance can be reported as a function of severity level and T&E stakeholders can use these results to set or assess requirements on performance and robustness.
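A minimal sketch of a severity-controlled corruption is shown below: additive Gaussian noise (a post-sensor, in-silico corruption) applied at increasing severity levels, with accuracy reported per level. The severity-to-noise mapping, stand-in classifier, and synthetic data are placeholders for a real corruption suite.

```python
# Illustrative sketch only; severity scale, data, and the stand-in classifier are placeholders.
import numpy as np


def add_gaussian_noise(images: np.ndarray, severity: int) -> np.ndarray:
    """Post-sensor corruption: additive Gaussian noise whose std grows with severity (1-5)."""
    sigma = 0.02 * severity
    noise = np.random.default_rng(severity).normal(0.0, sigma, size=images.shape)
    return np.clip(images + noise, 0.0, 1.0)


def accuracy(predict_fn, images: np.ndarray, labels: np.ndarray) -> float:
    return float(np.mean(predict_fn(images) == labels))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.uniform(0.0, 1.0, size=(200, 32, 32, 3))
    labels = rng.integers(0, 10, size=200)
    # Stand-in for a real CV classifier's predict function.
    predict_fn = lambda x: (x.reshape(len(x), -1).mean(axis=1) * 10).astype(int) % 10
    print(f"clean accuracy: {accuracy(predict_fn, images, labels):.2%}")
    for severity in range(1, 6):
        corrupted = add_gaussian_noise(images, severity)
        print(f"severity {severity}: accuracy {accuracy(predict_fn, corrupted, labels):.2%}")
```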
Adversarial Robustness
Adversarial robustness capabilities enable T&E stakeholders to assess how adversarial corruptions on data inputs may impact model performance. The primary focus of this dimension will be on evasion-style attacks (i.e., attacks in which the adversary aims to manipulate the input data to produce an error in the model output, often covertly). Example types of attack methods include (but are not limited to):
- White-box and black-box methods
- Mathematical (e.g., Lp norm-constrained) and physically realizable (e.g., patch) attacks
- Empirical and certified attacks
In addition to pre-defined attacks, capabilities will provide lower-level building blocks to enable users to create their own adaptive attacks that can be customized to the specific AI model under test. Capabilities will provide users with the ability to control the attack severity and assess performance given varying assumptions on adversary knowledge and adversary capabilities.
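For illustration only, the sketch below implements a basic white-box, L-infinity-constrained evasion attack (the fast gradient sign method) in PyTorch to show how attack severity (epsilon) can be exposed as a parameter. It is not a proposed JATIC attack API, and the placeholder model is untrained.

```python
# Illustrative sketch only; the linear model is an untrained placeholder.
import torch
import torch.nn as nn


def fgsm_attack(model: nn.Module, images: torch.Tensor, labels: torch.Tensor,
                epsilon: float = 8 / 255) -> torch.Tensor:
    """Perturb each input by epsilon in the direction of the sign of the loss gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()


if __name__ == "__main__":
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()
    x, y = torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,))
    x_adv = fgsm_attack(model, x, y)
    clean_acc = (model(x).argmax(dim=1) == y).float().mean().item()
    adv_acc = (model(x_adv).argmax(dim=1) == y).float().mean().item()
    print(f"clean accuracy {clean_acc:.2%}, adversarial accuracy {adv_acc:.2%}")
```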
JATIC will comprise several independent but mutually compatible software libraries for AI T&E that provide capabilities in the above dimensions. JATIC software will be usable within many different T&E workflows and ML technology stacks to support wide adoption throughout the DoD. For instance, JATIC libraries will be usable independently within Jupyter Notebooks to enable quick model testing and metrics visualization (e.g., gradio, streamlit), and also within production-level ML pipelines - leveraging features such as test automation, experiment tracking, and visualization dashboards - to enable powerful new insights into model performance. In addition, while JATIC is primarily focused on supporting T&E of models after training, the software will be applicable to many other phases of the AI lifecycle, such as model training and hyperparameter optimization.
Below, we provide technical details on the desired end state of these software libraries.
Software Specifications
- Software will be usable as libraries within Python environments.
- Software will be straightforward to run on a wide range of deployment environments, including all major operating systems, cloud deployments, etc.
- Where appropriate, software will allow users to add functions (e.g., custom metrics, perturbations, adversarial attacks) in an extensible way (one possible registration pattern is sketched after this list).
- Where appropriate, software may build upon existing open-source libraries, increasing their maturity, compatibility, or relevancy for DoD usage.
- Software will be designed to enable continual evolution with the state-of-the-art in AI T&E and updates in ML technologies. In particular, software will be designed to enable straightforward contributions from developers in the community.
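One possible shape for the metric extensibility described above (the registration pattern referenced in the list) is a simple registry that user code populates with a decorator. The sketch below is hypothetical and not a JATIC specification.

```python
# Hypothetical registry sketch; not a JATIC interface.
from typing import Callable, Dict, Sequence

Metric = Callable[[Sequence[int], Sequence[int]], float]
_METRICS: Dict[str, Metric] = {}


def register_metric(name: str):
    """Decorator that adds a user-defined metric to the registry under the given name."""
    def decorator(fn: Metric) -> Metric:
        _METRICS[name] = fn
        return fn
    return decorator


@register_metric("accuracy")
def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def evaluate(predictions: Sequence[int], labels: Sequence[int], metric_names: Sequence[str]):
    """Run every requested metric, whether built in or user registered."""
    return {name: _METRICS[name](predictions, labels) for name in metric_names}


if __name__ == "__main__":
    print(evaluate([1, 0, 1, 1], [1, 0, 0, 1], ["accuracy"]))  # {'accuracy': 0.75}
```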
Software Compatibility
- Software must address compatibility and integration. Example ML technologies to consider compatibility with include (but are not limited to):
- ML frameworks: PyTorch, TensorFlow
- ML pipelines: Databricks, SageMaker, Vertex
- Compute engines: Kubernetes, Spark, Ray
- Orchestrators: Airflow, Kubeflow, Pachyderm
- Scalability frameworks: Horovod, PyTorch Lightning
- Development shall place heavy emphasis on diligent API design in order to enable straightforward compatibility with a wide range of technologies.
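As one way to approach the API-design emphasis above, the sketch below uses Python structural typing (typing.Protocol) so that T&E code depends only on a minimal predict() interface rather than on a specific ML framework. The ImageClassifier protocol and TorchWrapper adapter are hypothetical, not JATIC interfaces.

```python
# Hypothetical protocol/adapter sketch; not a JATIC specification.
from typing import Protocol, Sequence, runtime_checkable

import numpy as np


@runtime_checkable
class ImageClassifier(Protocol):
    """Anything exposing predict() on a batch of images can be tested."""
    def predict(self, images: np.ndarray) -> np.ndarray: ...


class TorchWrapper:
    """Adapter that lets a PyTorch module satisfy the ImageClassifier protocol."""
    def __init__(self, module):
        self.module = module.eval()

    def predict(self, images: np.ndarray) -> np.ndarray:
        import torch
        with torch.no_grad():
            return self.module(torch.from_numpy(images)).numpy()


def top1_accuracy(model: ImageClassifier, images: np.ndarray, labels: Sequence[int]) -> float:
    """Metric code written against the protocol, with no framework dependency."""
    return float(np.mean(model.predict(images).argmax(axis=1) == np.asarray(labels)))


if __name__ == "__main__":
    import torch.nn as nn
    model = TorchWrapper(nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)))
    images = np.random.rand(8, 3, 32, 32).astype(np.float32)
    print("top-1 accuracy:", top1_accuracy(model, images, labels=[0] * 8))
```

A comparable adapter could wrap a TensorFlow or ONNX model without any change to the metric code.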
Intellectual Property
The Government will have at least full rights to software developed under this award, including access, distribution, and modification. The Government is not interested in licensing; any solution that requires licensing or carries more limited intellectual property rights will not be considered.
Collaboration
Interested parties will collaborate with the government and other developers in order to create a set of interoperable T&E capabilities. In particular, API design and dependency management will require detailed collaboration to ensure compatibility.
The total award amount is projected to be $12MM.
This amount will be distributed at the government's discretion across all the AI T&E dimensions mentioned above, to the strongest vendor(s) within each area. It is not necessarily the case that a single vendor will be selected for each dimension. The government may select a single vendor for multiple dimensions, multiple vendors for a single dimension, or may choose not to select any vendors for a given dimension. In particular, if no responses demonstrate improvements over existing state-of-the-art capabilities in a given dimension (see the value proposition topic above), no vendors may be chosen.