The Chief Digital and Artificial Intelligence Office (CDAO) Test and Evaluation (T&E) Directorate supports the testing of a variety of Artificial Intelligence (AI) and Machine Learning (ML) applications throughout the Department of Defense (DoD). To enable AI testing throughout the DoD, CDAO T&E is funded to develop the Joint AI Test Infrastructure Capability (JATIC), a suite of interoperable software tools for comprehensive AI model T&E. By providing deeper understanding of AI model performance through testing, JATIC will support the DoD's broader missions of accelerating the adoption of AI and ensuring the development of Responsible AI.
20 December 2022 Update: See changes/clarifications highlighted below in 'How You Can Participate'
15 December 2022 Update: The government stakeholders will hold an open call for all Industry to ask questions regarding the CDAO T&E Software Tool Prototyping challenge.
The call will be open from 1430-1530 ET on 19 December 2022; dial-in information is below. Again, this call is open to any questions from Industry relating to the challenge.
The questions posted in the challenge via the Portal will continue to be monitored and answered as well, up until the closing of the challenge.
There will not be a summary provided of the 19 Dec call's questions and answers, but if the challenge changes based on the call's Q&A, an update will be posted in the challenge on the Portal.
Dial-in Information:
+1 410-874-6749,,566330287#
Phone Number: 410 874 6749; Phone Conference ID: 566 330 287#
The following DSN numbers may also be used: East DSN 322-874-6739 | West DSN 322-874-6749
The purpose of this Call to Industry is to select vendors for the development of prototypes for JATIC in FY23. Successful prototypes will be matured and productionized.
This Call to Industry is focused on the T&E of AI models for Computer Vision (CV) classification and object detection problems. All of the information below refers exclusively to T&E of AI models for CV classification and object detection. CDAO acknowledges that there are capability gaps within other areas of T&E, but these other areas are considered out of scope for this call. In particular:
- Other AI modalities besides CV, such as autonomous agents, natural language processing, etc., are out of scope.
- Other areas of T&E, including systems integration T&E, human-machine T&E, and operational T&E, are out of scope.
- Monitoring of AI model performance at the edge is out of scope.
- Modeling & simulation environments are out of scope.
These other areas will be addressed through other programs or in future years of the JATIC program.
There is widespread interest throughout the DoD for enterprise-level T&E capabilities to address the novel challenges posed by the T&E of AI. While many T&E capabilities have been developed by previous DoD AI Programs to meet the specific needs of those programs, their usage and distribution throughout the Department has been limited by several key factors, including:
- Lack of maturity, scalability, reproducibility, ease of use, or cyber-hardening of capabilities.
- Difficulty deploying and using capabilities within DoD environments.
- Difficulty using capabilities within ML pipelines and technology stacks.
- Difficulty using capabilities in an integrated T&E pipeline, as each individual capability often requires custom model formats or data wrangling.
- Inability for capabilities to evolve with developments in AI research.
These limitations have inhibited the ability of DoD AI programs to perform comprehensive T&E of their AI models.
Within its minimum viable product (MVP), JATIC seeks to develop capabilities across the following dimensions of AI T&E:
Dataset Analysis
Dataset analysis capabilities allow T&E stakeholders to understand held-out T&E datasets, including their quality and similarity to operational data. This dimension is focused on assessing properties of a dataset that are independent of a particular AI model under test. Example functionalities include (but are not limited to) the computation of the following (a minimal illustrative sketch follows this list):
- Dataset quality (e.g., label errors, missing data)
- Dataset sufficiency for a given test (e.g., number of samples, variation)
- Biases, outliers, and anomalies in the dataset, which may be naturally occurring or intentionally inserted (i.e., data poisoning)
- Comparison of two datasets (e.g., divergence of dataset distributions)
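As a concrete illustration of the dataset-comparison item above, the minimal sketch below contrasts two image datasets via the Jensen-Shannon distance between a simple per-image statistic. It is a sketch under stated assumptions rather than a JATIC interface: the function names (brightness_histogram, dataset_divergence), the choice of per-image brightness as the feature, and the array layout are all illustrative.

```python
# Illustrative sketch only; names and feature choice are assumptions.
import numpy as np
from scipy.spatial.distance import jensenshannon


def brightness_histogram(images: np.ndarray, bins: int = 32) -> np.ndarray:
    """Histogram of mean pixel intensity per image (images: N x H x W x C, values in [0, 1])."""
    means = images.reshape(len(images), -1).mean(axis=1)
    hist, _ = np.histogram(means, bins=bins, range=(0.0, 1.0))
    return hist / hist.sum()


def dataset_divergence(test_images: np.ndarray, operational_images: np.ndarray) -> float:
    """Jensen-Shannon distance between the brightness distributions of two datasets."""
    return float(jensenshannon(brightness_histogram(test_images),
                               brightness_histogram(operational_images), base=2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    test = rng.uniform(0.0, 1.0, size=(500, 32, 32, 3))          # held-out T&E set
    operational = rng.uniform(0.2, 1.0, size=(500, 32, 32, 3))   # shifted "operational" data
    print(f"JS distance between datasets: {dataset_divergence(test, operational):.3f}")
```

A fielded capability would operate on richer representations (e.g., learned embeddings) and cover the quality, sufficiency, and bias checks listed above.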
Model Performance
Model performance capabilities allow T&E stakeholders to assess how well an AI model performs on a labeled dataset and across its population subclasses. Example functionalities include (but are not limited to):
- A comprehensive set of well-established CV metrics (e.g., precision, recall, mean average precision)
- Metrics to assess probability calibration, i.e., the reliability of the model's measure of predictive uncertainty (e.g., expected calibration error, entropy, reliability diagrams)
- Metrics to assess fairness and bias in model output across the test dataset (e.g., statistical parity)
In addition to pre-defined metrics, capabilities will support the creation of custom, user-defined metrics.
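To make the metric and calibration examples above concrete, the sketch below computes macro-averaged precision and recall with scikit-learn and a simple expected calibration error by hand. It is illustrative only: the equal-width binning and the expected_calibration_error name are assumptions, not a prescribed JATIC API.

```python
# Illustrative sketch only; the ECE binning scheme is an assumption.
import numpy as np
from sklearn.metrics import precision_score, recall_score


def expected_calibration_error(confidences, predictions, labels, n_bins: int = 10) -> float:
    """Weighted average gap between confidence and accuracy over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 3, size=1000)
    preds = np.where(rng.random(1000) < 0.8, labels, rng.integers(0, 3, size=1000))  # ~80% correct
    confs = rng.uniform(0.5, 1.0, size=1000)                                         # reported confidence
    print("macro precision:", precision_score(labels, preds, average="macro"))
    print("macro recall:   ", recall_score(labels, preds, average="macro"))
    print("ECE:            ", expected_calibration_error(confs, preds, labels))
```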
Model Computational Performance
Model computational performance capabilities allow T&E stakeholders to measure the computational efficiency, resource usage, and scalability of an AI model. Example functionalities include (but are not limited to) the computation of the following (a brief timing sketch follows this list):
- Model throughput, latency, resource usage, and scalability, as well as constraints on these properties
- Model performance across different hardware configurations
- Optimal batch sizes for model inference
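The timing sketch referenced above is shown below: it measures the latency and throughput of a placeholder PyTorch model across several batch sizes. The toy model, iteration count, and batch sizes are stand-ins; a fielded capability would also capture memory use and accelerator utilization.

```python
# Illustrative sketch only; the model, input size, and batch sizes are placeholders.
import time

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)).eval()


@torch.no_grad()
def benchmark(batch_size: int, n_iters: int = 20, image_size: int = 224):
    """Return (mean latency in seconds per batch, throughput in images per second)."""
    x = torch.randn(batch_size, 3, image_size, image_size)
    model(x)  # warm-up pass
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    latency = (time.perf_counter() - start) / n_iters
    return latency, batch_size / latency


if __name__ == "__main__":
    for bs in (1, 8, 32):
        lat, thr = benchmark(bs)
        print(f"batch={bs:3d}  latency={lat * 1000:8.1f} ms  throughput={thr:8.1f} img/s")
```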
Model Analysis
Model analysis capabilities expand upon the model performance dimension, allowing T&E stakeholders to gain a deeper understanding of a model's capabilities and limitations by uncovering hidden trends, patterns, or insights across the input space. Example functionalities include (but are not limited to) algorithms that do the following (a clustering sketch follows this list):
- Detect trends in model performance across the entire dataset, such as distinct types of model errors
- Identify clusters of the model input space based on model predictions, feature values, network activations, etc.
- Identify potentially mislabeled ground truth data based on sets of model predictions
- Determine under-tested or high-value regions of the input space (for the model under test) to inform future test data collection or labeling
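As one example of the cluster-based analysis above (the sketch referenced in the list), the code below groups samples by K-means clusters of their embeddings and reports the error rate within each cluster, a simple way to surface systematic failure modes. The embedding source, cluster count, and synthetic data are assumptions.

```python
# Illustrative sketch only; embeddings, labels, and cluster count are synthetic placeholders.
import numpy as np
from sklearn.cluster import KMeans


def error_rate_by_cluster(embeddings, predictions, labels, n_clusters: int = 8, seed: int = 0):
    """Cluster samples by their embeddings and return the error rate within each cluster."""
    clusters = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    errors = np.asarray(predictions) != np.asarray(labels)
    return {c: float(errors[clusters == c].mean()) for c in range(n_clusters)}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(2000, 64))  # e.g., penultimate-layer activations of the model under test
    labels = rng.integers(0, 5, size=2000)
    preds = np.where(rng.random(2000) < 0.85, labels, rng.integers(0, 5, size=2000))
    for cluster, err in sorted(error_rate_by_cluster(embeddings, preds, labels).items()):
        print(f"cluster {cluster}: error rate {err:.2%}")
```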
Natural Robustness
Natural robustness capabilities enable T&E stakeholders to determine how natural corruptions in data can impact model performance. These corruptions emulate realistic noise that may be encountered within the operational deployment environment. Example corruptions include (but are not limited to):
- Pre-sensor, environmental or physical corruptions (e.g., fog, snow, rain, changes in target shape or dimensions)
- Sensory corruptions (e.g., out-of-focus, glare, blur)
- Post-sensor, in-silico corruptions (e.g., Gaussian noise, digital compression)
Capabilities for creating these corruptions may leverage synthetic data generation techniques. Capabilities will provide users with the ability to control the severity level of the corruption, so that performance can be reported as a function of severity level and T&E stakeholders can use these results to set or assess requirements on performance and robustness.
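A minimal sketch of a severity-controlled corruption is shown below: additive Gaussian noise (a post-sensor, in-silico corruption) applied at increasing severity levels, with accuracy reported per level. The severity-to-noise mapping, stand-in classifier, and synthetic data are placeholders for a real corruption suite.

```python
# Illustrative sketch only; severity scale, data, and the stand-in classifier are placeholders.
import numpy as np


def add_gaussian_noise(images: np.ndarray, severity: int) -> np.ndarray:
    """Post-sensor corruption: additive Gaussian noise whose std grows with severity (1-5)."""
    sigma = 0.02 * severity
    noise = np.random.default_rng(severity).normal(0.0, sigma, size=images.shape)
    return np.clip(images + noise, 0.0, 1.0)


def accuracy(predict_fn, images: np.ndarray, labels: np.ndarray) -> float:
    return float(np.mean(predict_fn(images) == labels))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.uniform(0.0, 1.0, size=(200, 32, 32, 3))
    labels = rng.integers(0, 10, size=200)
    # Stand-in for a real CV classifier's predict function.
    predict_fn = lambda x: (x.reshape(len(x), -1).mean(axis=1) * 10).astype(int) % 10
    print(f"clean accuracy: {accuracy(predict_fn, images, labels):.2%}")
    for severity in range(1, 6):
        corrupted = add_gaussian_noise(images, severity)
        print(f"severity {severity}: accuracy {accuracy(predict_fn, corrupted, labels):.2%}")
```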
Adversarial Robustness
Adversarial robustness capabilities enable T&E stakeholders to assess how adversarial corruptions on data inputs may impact model performance. The primary focus of this dimension will be on evasion-style attacks (i.e., attacks in which the adversary aims to manipulate the input data to produce an error in the model output, often covertly). Example types of attack methods include (but are not limited to):
- White-box and black-box methods
- Mathematical (e.g., Lp norm-constrained) and physically realizable (e.g., patch) attacks
- Empirical and certified attacks
In addition to pre-defined attacks, capabilities will provide lower-level building blocks to enable users to create their own adaptive attacks that can be customized to the specific AI model under test. Capabilities will provide users with the ability to control the attack severity and assess performance given varying assumptions on adversary knowledge and adversary capabilities.
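For illustration only, the sketch below implements a basic white-box, L-infinity-constrained evasion attack (the fast gradient sign method) in PyTorch to show how attack severity (epsilon) can be exposed as a parameter. It is not a proposed JATIC attack API, and the placeholder model is untrained.

```python
# Illustrative sketch only; the linear model is an untrained placeholder.
import torch
import torch.nn as nn


def fgsm_attack(model: nn.Module, images: torch.Tensor, labels: torch.Tensor,
                epsilon: float = 8 / 255) -> torch.Tensor:
    """Perturb each input by epsilon in the direction of the sign of the loss gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()


if __name__ == "__main__":
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()
    x, y = torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,))
    x_adv = fgsm_attack(model, x, y)
    clean_acc = (model(x).argmax(dim=1) == y).float().mean().item()
    adv_acc = (model(x_adv).argmax(dim=1) == y).float().mean().item()
    print(f"clean accuracy {clean_acc:.2%}, adversarial accuracy {adv_acc:.2%}")
```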
JATIC will comprise several independent but mutually compatible software libraries for AI T&E that provide capabilities in the above dimensions. JATIC software will be usable within many different T&E workflows and ML technology stacks to support wide adoption throughout the DoD. For instance, JATIC libraries will be usable independently within Jupyter Notebooks to enable quick model testing and metrics visualization (e.g., gradio, streamlit), and also within production-level ML pipelines - leveraging features such as test automation, experiment tracking, and visualization dashboards - to enable powerful new insights into model performance. In addition, while JATIC is primarily focused on supporting T&E of models after training, the software will be applicable to many other phases of the AI lifecycle, such as model training and hyperparameter optimization.
Below, we provide technical details on the desired end state of these software libraries.
Software Specifications
- Software will be usable as libraries within Python environments.
- Software will be straightforward to run on a wide range of deployment environments, including all major operating systems, cloud deployments, etc.
- Where appropriate, software will allow users to add functions (e.g., custom metrics, perturbations, adversarial attacks) in an extensible way (one possible registration pattern is sketched after this list).
- Where appropriate, software may build upon existing open-source libraries, increasing their maturity, compatibility, or relevancy for DoD usage.
- Software will be designed to enable continual evolution with the state-of-the-art in AI T&E and updates in ML technologies. In particular, software will be designed to enable straightforward contributions from developers in the community.
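One possible shape for the metric extensibility described above (the registration pattern referenced in the list) is a simple registry that user code populates with a decorator. The sketch below is hypothetical and not a JATIC specification.

```python
# Hypothetical registry sketch; not a JATIC interface.
from typing import Callable, Dict, Sequence

Metric = Callable[[Sequence[int], Sequence[int]], float]
_METRICS: Dict[str, Metric] = {}


def register_metric(name: str):
    """Decorator that adds a user-defined metric to the registry under the given name."""
    def decorator(fn: Metric) -> Metric:
        _METRICS[name] = fn
        return fn
    return decorator


@register_metric("accuracy")
def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def evaluate(predictions: Sequence[int], labels: Sequence[int], metric_names: Sequence[str]):
    """Run every requested metric, whether built in or user registered."""
    return {name: _METRICS[name](predictions, labels) for name in metric_names}


if __name__ == "__main__":
    print(evaluate([1, 0, 1, 1], [1, 0, 0, 1], ["accuracy"]))  # {'accuracy': 0.75}
```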
Software Compatibility
- Software must address compatibility and integration. Example ML technologies to consider compatibility with include (but are not limited to):
- ML frameworks: PyTorch, TensorFlow
- ML pipelines: Databricks, SageMaker, Vertex
- Compute engines: Kubernetes, Spark, Ray
- Orchestrators: Airflow, Kubeflow, Pachyderm
- Scalability frameworks: Horovod, PyTorch Lightning
- Development shall place heavy emphasis on diligent API design in order to enable straightforward compatibility with a wide range of technologies.
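As one way to approach the API-design emphasis above, the sketch below uses Python structural typing (typing.Protocol) so that T&E code depends only on a minimal predict() interface rather than on a specific ML framework. The ImageClassifier protocol and TorchWrapper adapter are hypothetical, not JATIC interfaces.

```python
# Hypothetical protocol/adapter sketch; not a JATIC specification.
from typing import Protocol, Sequence, runtime_checkable

import numpy as np


@runtime_checkable
class ImageClassifier(Protocol):
    """Anything exposing predict() on a batch of images can be tested."""
    def predict(self, images: np.ndarray) -> np.ndarray: ...


class TorchWrapper:
    """Adapter that lets a PyTorch module satisfy the ImageClassifier protocol."""
    def __init__(self, module):
        self.module = module.eval()

    def predict(self, images: np.ndarray) -> np.ndarray:
        import torch
        with torch.no_grad():
            return self.module(torch.from_numpy(images)).numpy()


def top1_accuracy(model: ImageClassifier, images: np.ndarray, labels: Sequence[int]) -> float:
    """Metric code written against the protocol, with no framework dependency."""
    return float(np.mean(model.predict(images).argmax(axis=1) == np.asarray(labels)))


if __name__ == "__main__":
    import torch.nn as nn
    model = TorchWrapper(nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)))
    images = np.random.rand(8, 3, 32, 32).astype(np.float32)
    print("top-1 accuracy:", top1_accuracy(model, images, labels=[0] * 8))
```

A comparable adapter could wrap a TensorFlow or ONNX model without any change to the metric code.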
Intellectual Property
The Government will have at least full rights to software developed under this award, including access, distribution, and modification. The Government is not interested in licensing; any solution that requires licensing or carries more limited intellectual property rights will not be considered.
Collaboration
Interested parties will collaborate with the government and other developers in order to create a set of interoperable T&E capabilities. In particular, API design and dependency management will require detailed collaboration to ensure compatibility.
The total award amount is projected to be $12MM.
This amount will be distributed at the government's discretion across all the AI T&E dimensions mentioned above, to the strongest vendor(s) within each area. It is not necessarily the case that a single vendor will be selected for each dimension. The government may select a single vendor for multiple dimensions, multiple vendors for a single dimension, or may choose not to select any vendors for a given dimension. In particular, if no responses demonstrate improvements over existing state-of-the-art capabilities in a given dimension (see the value proposition topic above), no vendors may be chosen.