Machine Learning for Safety-Critical Applications: Opportunities, Challenges, and a Research Agenda (2025)

Suggested Citation: "3 System Engineering with Machine Learning Components for Safety-Critical Applications." National Academies of Sciences, Engineering, and Medicine. 2025. Machine Learning for Safety-Critical Applications: Opportunities, Challenges, and a Research Agenda. Washington, DC: The National Academies Press. doi: 10.17226/27970.

3

System Engineering with Machine Learning Components for Safety-Critical Applications

In recent years, the rapid advancement of machine learning (ML) has led to its early integration into safety-critical systems. As noted in Chapter 2, these technologies offer significant potential for enabling new features, improving efficiency, and enhancing safety; yet failures in safety-critical systems can often have severe, even catastrophic, consequences.

The emerging risks described in Chapter 2 make clear that safety considerations are not always apparent until a system operates in the real world. Analyzing diverse applications can surface shared challenges, risks, and best design practices, and this use-case development can guide the initial development of formal risk management methods and standards for ML integration in safety-critical systems. Formal safety development practices, frameworks, and standards, if widely recognized, can then influence domain-specific design, assessment, and regulatory obligations that ensure the safety of people and property.

This chapter describes needed changes to the system engineering process to specify, implement, verify, validate, and operate safety-critical systems that incorporate ML components. The next section discusses the current state of practice in building and certifying safety-critical systems. Section 3.2 turns to the challenges posed for existing practice both by the unique properties of ML technology and by the novel aspects of the applications discussed in Chapter 2. The implications of these for both the engineering process and system architectures are discussed and summarized in several findings. Section 3.3 steps through the V-model engineering process and instantiates each phase for creating components with ML. This chapter focuses on the engineering process challenges, whereas Chapter 4 discusses the needs for additional research on the ML technology itself.


3.1 STATE OF PRACTICE IN SAFETY-CRITICAL SYSTEM DESIGN

Reliable and functionally safe operation is achieved in safety-critical systems through a systematic and proactive approach to identifying and mitigating hazards by implementing safety measures. Some of the basic tools used include hazard and risk analysis, system specification, redundancy, failure mode and effect analysis (FMEA), rigorous verification and validation, code analysis, and real-time monitoring and maintenance. A functional safety assessment, introduced in Chapter 1, can provide a structured approach to hazard mitigation and safety-critical system development.

Safety-critical systems are designed to actively respond to faults, potential threats, or dangerous situations and to mitigate those risks through application of a formal risk management system (described below). These system designs may incorporate a combination of inherently safe elements, active or passive safeguards, and predictable human interaction to ensure safe outcomes within a defined safety tolerance for the system.1 To this end,

  • Potential hazards must be identified, and their risk assessed, and
  • Safeguards needed to eliminate or mitigate each hazard must be identified and implemented.

A typical autonomous machine or robotic system that includes programmable components is most often structured with a group of sensors (input), decision and control logic systems, communications systems, and actuators or activation components that all work within an operational environment to achieve the design’s safety objectives. Safety analysis of a system depends on the correct and complete knowledge of all aspects of the system and the system constraints. Because risk associated with the safety of persons and property often depends on unique applications and use environments, there are both basic “horizontal” risk management standards that apply to particular technologies (e.g., IEC 61508 series)2 and “vertical” standards specific to individual domains (see Section 3.5). These define the risk management practices and actualization methods expected within these highly regulated safety-engineering domains.

Functional safety assessments provide a systematic approach to ensuring the reliability and safety performance of the system in the intended real-world environment. These approaches, tools, and risk management processes are well defined within the framework standards that support the design, development, and deployment of a safety system within a given operation.

___________________

1 ISO, 2024, “Artificial Intelligence—Functional Safety and AI Systems,” ISO/IEC TR 5469:2024, https://www.iso.org/standard/81283.html.

2 IEC, 2010, “Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems—Parts 1 to 7,” IEC 61508:2010 CMW, https://webstore.iec.ch/en/publication/22273.


The basic steps in a functional safety assessment for most safety-critical applications are the following:

  1. Hazard analysis: Identifying and examining potential hazards and risks associated with the system. This may include assessing the potential consequences of system or component failures and identifying the causes of those failures.
  2. Risk assessment: Evaluating the likelihood and consequences of the hazards and risks identified in the hazard analysis. This helps to prioritize safeguards to be implemented and controlled.
  3. Mitigation development: Developing safety risk mitigations for all safety risks identified during hazard and risk analysis and clearly explaining them.
  4. Safety requirements specification: Defining the safety requirements for the system, including the performance and reliability requirements. This includes specifying the safety measures to be implemented and the requirements for testing and verifying each measure. There can be system-level, hardware-specific, and software-specific requirements specifications. Software-specific requirements include a requirements analysis to identify the safety-critical functions that the software needs to perform, as well as any safety standards or regulations that apply.
  5. Design and development: Designing and developing the system to meet the safety requirements specified in step 4. This includes implementing the safety measures and validating that the system is designed to be safe and reliable.
  6. Verification3 and validation4: Verifying and validating that the system meets the safety requirements. This includes testing the system to ensure that it performs as intended and that the safety measures are effective. Techniques include the following:
    a. Safety cases providing assurance,
    b. Software testing, and
    c. Model-based verification.
  7. Maintenance and monitoring: Maintaining the system and monitoring its performance over time for anomalies and degradation. Performing maintenance and updating the system as needed to ensure that it continues to meet the safety requirements.
  8. Certification and documentation: While evidence of compliance is always necessary to demonstrate that the system meets relevant requirements and

___________________

3 Verification confirms that the components (including software) meet their specifications and are deployed correctly (i.e., the system is built correctly).

4 Validation confirms that the system or component meets its intended safety purpose (i.e., it is fit for purpose; the right system was built).

    regulations, in some domains, sectors, or regulatory jurisdictions, specific certification and documentation are normative. This may include obtaining the necessary certification and documenting the functional safety assessment process and results, including all changes (such as maintenance) made to the system.

A good example of tools used to develop safety cases is UL 4600, “Standard for Evaluation of Autonomous Products,”5 which guides the development of a detailed and evolving set of safety cases documenting all aspects of an autonomous system’s safety, including design decisions, risk assessments, testing procedures, and mitigation strategies. Each case is based on a safety argumentation approach that requires that each safety claim be supported by evidence and logical reasoning, thus creating a clear chain of justification for why the system should be considered safe.

Requirements for software components generally follow the same system-level approach: a software safety specification ensures that the architecture, tools, programming language, and code elements are properly integrated, updated, and maintained so that the system remains safe and functional over its lifetime. Traditional approaches to software or code verification generally include static and dynamic analysis for vulnerabilities, model checking, and code or software requirements tracing. These approaches may not be sufficient for ML components, where newer techniques such as model interpretability, uncertainty quantification, and runtime monitoring take on greater importance.

In addition to traditional software safety expectations, the interconnected nature of safety-critical systems introduces new safety risks from adversarial attacks. While tools such as redundancy, formal verification, human oversight, and configuration controls are helpful in deterministic systems, integrating ML components will require new defense strategies and adaptations.

3.2 CHALLENGES OF INTEGRATING MACHINE LEARNING INTO THE SAFETY ENGINEERING PROCESS FOR INCREASED FUNCTIONALITY

As discussed in Chapters 1 and 2, exciting advances in ML have motivated engineers to extend the functionality of existing safety-critical systems (e.g., increasing the level of autonomy of automobiles) and to enter entirely new application domains (e.g., infrastructure, health care, manufacturing, and human–robot interaction). This combination of a novel technology—ML—and novel domain properties poses many challenges for the safety engineering process.

___________________

5 UL Solutions, 2023, “UL 4600 Edition 3 Updates Incorporate Autonomous Trucking,” https://www.ul.com/news/ul-4600-edition-3-updates-incorporate-autonomous-trucking.



High Error Rates for Sensing the State of the Environment

Computer vision methods for sensing the environment are limited in their accuracy by two factors: (1) the visual signal itself can be ambiguous, especially for objects that are distant from the camera and under low light or weather conditions such as rain, fog, snow, smoke, and dust storms; and (2) the performance of these vision systems, because they are constructed using ML, is not perfect.

Many aspects can contribute to the second factor. First, if the training data contains gaps or otherwise does not cover the entire operational domain (OD), the object detectors and classifiers may reach incorrect decisions. Second, ML methods tend to favor the more common cases and commit more mistakes in less common situations. Third, statistical learning may come to rely on spurious correlations. For example, a system for distinguishing between bears and dogs was discovered to be relying on snow in the background as critical evidence for bears. Finally, the training data may contain errors in the human annotations. For example, in object recognition, a person must annotate each training image by placing a box around each object in the image and assigning a class label to the contents of that box. If a box is missing or poorly positioned, or if its label is incorrect, performance can suffer. In summary, perception systems built with ML will never be perfect, and system design must take this into consideration.

Detecting Whether the System Is Operating Within the Operational Design Domain Relies on High-Error-Rate Sensing

To provide a safety guarantee, safety engineering must require that a system is operated within its OD. Ideally, the system has an “OD detector” that can tell whether it is in the OD and refuse to operate if it is not. For example, a Level 4 automated driving system might be designed to operate only on freeways during daytime and in good weather. An OD detector for such a system might use the Global Positioning System to determine position and computer vision to measure ambient lighting and check for fog or precipitation. If the OD detector relies on an ML component that itself will make errors, then the OD detector will also make errors that could threaten system safety.
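
To make this concrete, the sketch below shows the skeleton of such an OD detector for the freeway example. The helper functions (gps_position, road_type_at, ambient_lux, precipitation_score) and the thresholds are illustrative assumptions, not interfaces from any real system; note that the precipitation check would itself typically be an ML-based estimate, so its errors propagate into the OD decision.

```python
# Minimal sketch of an OD detector for a freeway-only, daytime,
# good-weather system. All helper callables and thresholds are
# hypothetical assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class ODDecision:
    in_od: bool
    reasons: list = field(default_factory=list)

def check_operational_domain(gps_position, road_type_at,
                             ambient_lux, precipitation_score) -> ODDecision:
    reasons = []
    lat, lon = gps_position()                    # GPS fix
    if road_type_at(lat, lon) != "freeway":      # map lookup
        reasons.append("not on a freeway")
    if ambient_lux() < 1000:                     # rough daylight cutoff (assumed)
        reasons.append("insufficient daylight")
    if precipitation_score() > 0.2:              # ML-based estimate; itself error-prone
        reasons.append("rain or fog detected")
    return ODDecision(in_od=not reasons, reasons=reasons)
```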


Controller Stability

Reinforcement learning methods—especially ones based on neural networks—can discover highly effective controllers for complex, nonlinear systems. However, such learned controllers generally lack formal stability guarantees, and relying on them may lead to unstable behavior that results in crashes.

Distribution Shift

Distribution shift is a mismatch between the training data distribution and the distribution of data encountered after deployment. It can result from three different causes. First, if the training data does not cover the full operational design domain (ODD) and the deployed system enters a part of the ODD that was poorly covered, then performance will drop. This is known as data-to-real distribution shift. Figure 3-1 shows images of stop signs from the CURE-TSR data set6 collected under different environmental and imaging conditions. A system trained on the top row would have difficulty recognizing the signs in the bottom row, which were collected under more extreme conditions. An example from robot grasping is shown in Figure 3-2, where the success rate for grasping drops drastically when changes occur in the scene or in the robot gripper geometry.

A second cause of distribution shift occurs when the training data are constructed from simulations. This is often necessary when it is hard to collect real data for rare but dangerous scenarios. This is known as sim-to-real shift. Third, statistical learning tends to achieve overall low error rates by trading lower performance on rare cases to attain higher performance on more common cases. If the distribution of situations changes so that rare cases become more common, error rates will increase. This problem is frequently encountered when a medical imaging system is trained on data from a university hospital and then deployed in a hospital with a very different mix of patients (e.g., military veterans, people living in rural areas).
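
One simple, widely used way to monitor for such shifts after deployment is to compare summary statistics of incoming data against the training data. The sketch below, a minimal example assuming per-example feature vectors are available, flags features whose deployment distribution differs from training using a two-sample Kolmogorov–Smirnov test; production systems would use richer detectors, but the structure is similar.

```python
# Sketch: flag distribution shift by comparing a deployment batch against
# training data, feature by feature, with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

def detect_shift(train_feats: np.ndarray, deploy_feats: np.ndarray,
                 alpha: float = 0.01) -> list:
    """Return indices of features whose deployment distribution differs
    significantly from training (Bonferroni-corrected)."""
    n_features = train_feats.shape[1]
    threshold = alpha / n_features        # correct for multiple tests
    shifted = []
    for j in range(n_features):
        _, p_value = ks_2samp(train_feats[:, j], deploy_feats[:, j])
        if p_value < threshold:
            shifted.append(j)
    return shifted
```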

FIGURE 3-1 STOP signs under challenging conditions in real environments (top row) and harsh environments (bottom row).
SOURCE: D. Temel, G. Kwon, M. Prabhushankar, and G. Al-Regib, 2020, “CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign Recognition,” Zenodo, https://doi.org/10.5281/zenodo.3903066. CC BY 4.0.

___________________

6 D. Temel, G. Kwon, M. Prabhushankar, and G. AlRegib, 2017, “CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign Recognition,” arXiv preprint arXiv:1712.02463.

FIGURE 3-2 Machine learning–based object grasp under normal conditions and various harsh environment variations.
SOURCE: Courtesy of Ryan Julian.

Machine Learning Is Vulnerable to Adversarial Attacks

As mentioned in Chapter 1 (see Figure 1-5), deep learning methods in computer vision can be damaged through adversarial attacks. While the school bus example in Figure 1-5 involves direct manipulation of the image, groups have demonstrated that stickers and other physical modifications can fool computer vision systems in the real world. Figure 3-3 shows recent progress on RobustBench, a benchmark for testing the robustness of deep learning to adversarial attacks. While there has been progress, the ML community has not been able to reach robust accuracy much above 70 percent on this well-established benchmark.

Systems Are Difficult to Specify and Verify Formally

It is harder to formalize precise safety requirements when a system includes ML components. For example, for the very large ML models used in computer vision and perception, it is not clear how to specify properties at a semantic level (e.g., how would one formally specify that there is a car in an image?) or at the model level, owing to the high-dimensional nature of deep networks.

Formal methods have been developed for certifying robustness to some classes of adversarial attacks (e.g., attacks that change the Euclidean norm of the image pixels by less than some specified amount). But these classes are highly restrictive, and the methods do not scale to large, deep neural networks. A consequence is that verification is forced to rely on testing. This includes testing for coverage of the OD, testing to estimate error rates, and testing to detect learning failures such as spurious correlations and vulnerability to distribution shifts and adversarial attacks.

FIGURE 3-3 Progress on machine learning robustness.
SOURCE: F. Croce, M. Andriushchenko, V. Sehwag, E. Debenedetti, N. Flammarion, M. Chiang, P. Mittal, and M. Hein, 2020, “Robustbench: A Standardized Adversarial Robustness Benchmark,” arXiv abs/2010.09670.

Open Worlds

Many of the application domains discussed in Chapter 2 are open worlds, where any system operating in them will encounter novel phenomena (e.g., novel objects, novel vehicles, or novel diseases) even when the system is within the OD. This has two extremely important consequences.

First, safety-critical systems operating in open worlds must detect novelty and respond appropriately. Object detectors must be able to detect novel objects, object classifiers must be able to determine that they are looking at a novel type of object, and so on.

Second, systems must adapt continually as novelties are detected. For example, if the system encounters a novel kind of vehicle, there needs to be a process for collecting training (and verification) data for that vehicle type, retraining the perceptual system to handle that type, and modifying the control system (e.g., path planning) to behave appropriately in the presence of this new type of vehicle. This outer, continuous improvement loop is a requirement for systems operating in open worlds.


Novelty Detection Is Also Never Perfect

Novelty detection works by comparing a new object to the known types of objects that the system was trained on. If a system is to detect a novelty, it must map the new object to an internal representation that is different from the representations of known objects. Unfortunately, deep learning systems are “lazy” in the sense that they only learn to represent those aspects of images that are required for the training task. Supervised classifiers only learn features sufficient to discriminate each class of object from every other class. Unsupervised or self-supervised methods learn features sufficient to capture the variability of the training data. If a novel class of objects varies “along a direction” that did not vary in the training data, then the novel class will not be detected as novel (along that direction). Current work addresses this problem by training on as diverse a data set as possible, but there is no guarantee that the training data will have exhibited all of the possible directions of variation required to guarantee successful novelty detection.
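
A common baseline design, subject to exactly the representation caveat above, scores novelty by distance to the training data in a learned feature space. The sketch below assumes an upstream embedding model; if that embedding never learned the direction along which a novel class varies, the distances will not reveal the novelty.

```python
# Sketch: embedding-distance novelty detector. An input is flagged as novel
# if its distance to the k-th nearest training embedding exceeds a threshold
# chosen on held-out known (non-novel) data. The embedding model is assumed.
import numpy as np

class KnnNoveltyDetector:
    def __init__(self, train_embeddings: np.ndarray, k: int = 5):
        self.bank = train_embeddings           # (N, d) features of known data
        self.k = k
        self.tau = None                        # set by fit_threshold()

    def score(self, x_embedding: np.ndarray) -> float:
        d = np.linalg.norm(self.bank - x_embedding, axis=1)
        return float(np.sort(d)[self.k - 1])   # distance to k-th neighbor

    def fit_threshold(self, holdout: np.ndarray, quantile: float = 0.99):
        # ~1% false alarms on held-out known data at the 0.99 quantile
        scores = [self.score(e) for e in holdout]
        self.tau = float(np.quantile(scores, quantile))

    def is_novel(self, x_embedding: np.ndarray) -> bool:
        return self.score(x_embedding) > self.tau
```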

Machine-Learned Components Are Trained, Calibrated, and Verified Using Data

Fundamentally, the correct functioning of an ML component is specified by its data. Because it is not feasible to collect data for every possible combination of system inputs (e.g., every input image combined with every lidar point cloud), this specification will always be incomplete. Safety depends on assuming a form of smoothness such that any input obtained by “interpolating” between training data points should result in the response obtained by “interpolating” between the labels of those data points, which is inherently difficult for some ML algorithms. “Interpolate” is used here informally, but efforts have been made to formalize this notion.

The Optimal Response of the System Depends on Uncertainty

Traditional safety engineering seeks to remove as much uncertainty as possible. But when sensing is imperfect, the system should hedge against the uncertainty. For example, suppose the vision system cannot determine the identity of an object that is 100 m ahead of the vehicle. In this case, the vehicle should slow down and move closer, which may allow the vision system to recognize the object. At night, the vehicle could turn up its headlights to improve the illumination of the object. If the object is novel and appears to be capable of motion, the vehicle should give it a very large margin of safety. In short, the methods of decision making under uncertainty should be applied, and the optimal control policy will typically combine additional sensing with more cautious actions.
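
The following minimal sketch illustrates this style of decision making under uncertainty: the action is chosen to minimize expected cost under the perception system's posterior over object types. The object types, cost table, and probabilities are illustrative assumptions; a real system would use a far richer model (e.g., a POMDP).

```python
# Sketch: choose among {proceed, slow down, stop} by minimizing expected
# cost under the perception posterior. Costs are illustrative only.
import numpy as np

ACTIONS = ["proceed", "slow_down_and_observe", "stop"]

# COST[action][object_type]: penalty if the object turns out to be of that
# type. Movable objects make proceeding very costly.
COST = np.array([
    # debris  pedestrian  novel_movable
    [  1.0,    1000.0,      500.0],   # proceed
    [  5.0,      20.0,       20.0],   # slow down, gather more sensing
    [ 50.0,       1.0,        1.0],   # stop
])

def best_action(posterior: np.ndarray) -> str:
    """posterior: P(object type | sensor data), shape (3,)."""
    expected = COST @ posterior        # expected cost of each action
    return ACTIONS[int(np.argmin(expected))]

# An uncertain perception result (e.g., an unidentified object 100 m ahead)
# favors the hedging action:
print(best_action(np.array([0.4, 0.3, 0.3])))  # -> slow_down_and_observe
```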

Overall, this section’s inventory of the challenges introduced by integrating ML methods into safety-critical systems and applying them in open worlds resulted in the following observations.


Finding 3-1: System components constructed with ML-enabled capabilities exhibit significant error rates both because of incomplete sensing of the environment (e.g., occluded objects in computer vision) and because of their inherently statistical nature.

Finding 3-2: Decision making based on the outputs of ML components inherently involves uncertainty. The system thus needs to take actions to hedge against uncertainty or reduce uncertainty.

Finding 3-3: In many cases, ML-enabled safety-critical systems operate in an open world where novelty will be encountered. These systems need to implement an “outer loop” in which novelties are detected and characterized and the system is extended, through data collection and retraining, to properly handle the discovered novelties.

3.3 INTEGRATING MACHINE LEARNING INTO SAFETY ENGINEERING

The V-model of safety engineering needs to be modified to incorporate methodologies better suited to ML components. The V-model (see Figure 1-2) is considered here in three phases: (1) requirements and design (the left, descending branch of the V), (2) implementation (the vertex of the V), and (3) verification and validation (the right, ascending branch of the V). The subsections that follow consider some possible engineering approaches in each of these phases. These approaches are active explorations within the research and development communities, and other approaches are certainly possible.

Incorporating Machine Learning into the Requirements and Design Phase

One promising approach to ML safety engineering is known as scenario-driven development. In this approach, the requirements phase begins with the construction of a scenario catalog that breaks the OD into an exhaustive set of operational scenarios. While an industrial robot or a warehouse moving device can have reasonably predictable environments, a self-driving vehicle has a much less predictable operating environment. The self-driving automobile must consider making a right turn across a pedestrian crosswalk, merging onto an expressway, and so forth. Scenarios may specify weather conditions and properties of the scene (e.g., the presence of parked vehicles, pedestrians, bicyclists).7 Scenarios should also include any adversarial attacks and other factors to which the system must be robust.

Once the scenarios are developed, the ML-based subsystems must be defined and their requirements specified. For ML-based computer vision, there will typically be at least three components—(1) object detection and recognition, (2) novelty detection, and (3) operational domain detection. Figure 3-4 shows the typical architecture for an autonomous vehicle. Notice that the perception subsystem contains eight separate computer vision components specialized for various sensors and detection tasks. The outputs of these systems go to data fusion and simultaneous localization and mapping (SLAM) modules, which may also be trained via ML. In the decision and planning subsystem, separate world trajectory and local trajectory components plan the vehicle trajectory.

FIGURE 3-4 Modern systems architecture for autonomous vehicles with machine learning components.
NOTE: GPS, Global Positioning System; OS, Operating System; RTK, Real-Time Kinematic; SLAM, Simultaneous Localization and Mapping; UDP, User Datagram Protocol.
SOURCE: J. Ren and D. Xia, 2023, “Autonomous Driving Software Architecture,” Pp. 263–281 in Autonomous Driving Algorithms and Its IC Design, Springer Nature.

___________________

7 B. Kramer, C. Neurohr, M. Büker, E. Böde, M. Fränzle, and W. Damm, 2020, “Identification and Quantification of Hazardous Scenarios for Automated Driving,” IMBSA: Model-Based Safety and Assessment 163–178.


These could be constructed via reinforcement learning methods or by traditional path planning techniques. Similarly, the control module could be built with standard control synthesis methods, with ML methods, or with a combination of the two.

Unlike in traditional system design where each component has a written functional specification, the specification for each learning component is a set of training examples. Hence, for each scenario and each ML component, the designers need to specify the amount and properties of the training data required to train the component. This typically involves some exploratory data collection and training to assess the variability of the data (e.g., environmental variation, sensor measurement noise, and human annotation error rates) and the learning curve for the component. The learning curve plots the learned accuracy as a function of the amount of training data. These curves often follow scaling laws that can be applied across domains. From the learning curve or scaling law, the designers can set an accuracy target for the component and estimate the amount of data needed to achieve that target. The choice of the accuracy target reflects a trade-off between the cost of data collection, the resulting error rate of the component, and the cost and complexity of mitigating errors elsewhere in the design.
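
As a concrete illustration of this sizing exercise, the sketch below fits a power-law learning curve, err(n) ≈ a · n^slope, to pilot results and inverts it to estimate the data set size needed for a target error rate. The pilot numbers are invented for illustration, and real learning curves should be checked against the fitted form before extrapolating.

```python
# Sketch: fit a power-law learning curve to pilot experiments, then
# estimate the training-set size required to reach a target error rate.
import numpy as np

n_pilot   = np.array([1_000, 2_000, 5_000, 10_000, 20_000])   # examples used
err_pilot = np.array([0.20, 0.16, 0.12, 0.095, 0.078])        # measured error

# Power law err = a * n**slope is linear in log space:
# log err = slope * log n + log a   (slope will be negative)
slope, log_a = np.polyfit(np.log(n_pilot), np.log(err_pilot), 1)
a = np.exp(log_a)

def n_required(target_err: float) -> int:
    # Invert err = a * n**slope  =>  n = (err / a)**(1 / slope)
    return int(np.ceil((target_err / a) ** (1.0 / slope)))

print(n_required(0.05))  # estimated examples needed for 5% error
```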

Each ML component in the system should also have its own novelty detector that is responsible for raising an alarm if the input is novel and cannot be reliably processed by the learned component. Specifying the data collection for a novelty detector is inherently challenging. The data collected for training the ML component provides examples of non-novel inputs. But the idea of collecting novel inputs is paradoxical: If one can collect novel inputs, one will usually want to convert them to non-novel inputs by training on them. Nonetheless, the designers can specify “dimensions of variability” that can be expected to be encountered in the environment, and then data can be collected or synthesized by varying inputs along these dimensions. For example, one could vary the height and width, color, and speed of other vehicles. For similar reasons, it is difficult to determine the accuracy of a novelty detector or to compute a learning curve to make sample size decisions. One approach is to hold out a collection of novel and non-novel inputs and evaluate novelty detection performance on these data points.

To specify the training data for the OD detector, the designers must collect data illustrating all the ways that the system could depart from the OD. The goal is to “surround” the OD with training examples and then train a classifier to decide when the system is inside the OD. A challenge is to select an operating point such that essentially all departures from the OD are detected while achieving an acceptable false alarm rate.
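
Operating-point selection of this kind can be sketched directly: given detector scores on held-out in-OD and out-of-OD validation data (with higher scores indicating out-of-OD, an assumption of this sketch), choose the lowest threshold that keeps the miss rate within budget and then report the false alarm rate that choice implies.

```python
# Sketch: pick the OD-detector operating point so that (nearly) all
# out-of-OD validation cases are caught, then report the implied
# false-alarm rate on in-OD data.
import numpy as np

def pick_threshold(scores_out_od: np.ndarray, scores_in_od: np.ndarray,
                   miss_rate_budget: float = 0.001):
    # Threshold low enough that <= miss_rate_budget of out-of-OD scores
    # fall below it (i.e., slip past the detector).
    tau = float(np.quantile(scores_out_od, miss_rate_budget))
    false_alarm_rate = float(np.mean(scores_in_od >= tau))
    return tau, false_alarm_rate
```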

Each learning component must quantify its uncertainty, and additional data are needed to calibrate these uncertainty values. Experience has shown that comparatively little data are needed for calibration, but it is important that these data reflect as closely as possible the distribution of data points encountered after deployment.


A final activity within the requirements phase is to develop requirements for verification and validation. As with traditional safety engineering, the verification and validation (V&V) team should be separate from the requirements and implementation teams. In the ML case, this means that the V&V team should build its own requirements for data collection for each of the ML-based components. Indeed, one important way to validate the entire requirements process is to have multiple teams develop requirements independently. If there is good agreement across those teams, this is an important sign of requirements validity.

Finding 3-4: ML components need to be specified in terms of the amount of training data and the dimensions of variability that the data must exhibit. New data engineering tools are needed to support the specification process.

Machine Learning in the Implementation Phase

The main activities during the implementation phase are collecting the data and training the ML components. If the estimated data set size does not result in ML components with sufficiently high accuracy, additional data may need to be collected. It is important to decompose the error rate of the classifier into epistemic and aleatoric uncertainty. Epistemic uncertainty results from insufficient data, whereas aleatoric uncertainty can result from many causes, including measurement noise in the data, noise in the annotations, and unmeasured aspects of the environment (hidden confounders). Epistemic uncertainty is ideally estimated via ensemble or Bayesian methods. But an examination of the learning curve can show—if it is still increasing—that more data will improve performance. Most ML models have parameters that estimate the aleatoric uncertainty. For example, the output probabilities of a classifier on held out data—to the extent they depart from 0 and 1—can be interpreted as estimates of aleatoric uncertainty. It is important to track down the sources of aleatoric uncertainty to see if they can be reduced or eliminated. Can better sensors improve measurement accuracy and reproducibility? Can the data annotation process be improved? Some uncertainty, however, cannot be eliminated. For example, the cameras on a vehicle cannot see what is around a corner or behind another vehicle.
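
With an ensemble, this decomposition can be computed directly from the members' predictive distributions: total predictive entropy splits into the mean per-member entropy (the aleatoric part) plus the mutual information between prediction and ensemble member (the epistemic part). A minimal sketch:

```python
# Sketch: split an ensemble classifier's predictive uncertainty into
# aleatoric and epistemic parts for one input.
# probs has shape (n_members, n_classes).
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def decompose_uncertainty(probs: np.ndarray):
    mean_p = probs.mean(axis=0)
    total = entropy(mean_p)              # predictive entropy
    aleatoric = entropy(probs).mean()    # expected per-member entropy
    epistemic = total - aleatoric        # mutual information >= 0
    return total, aleatoric, epistemic

# Members that agree on a 50/50 call -> aleatoric; members that disagree
# confidently -> epistemic (more data would likely help).
agree    = np.array([[0.50, 0.50], [0.50, 0.50]])
disagree = np.array([[0.99, 0.01], [0.01, 0.99]])
print(decompose_uncertainty(agree))     # epistemic ~ 0
print(decompose_uncertainty(disagree))  # epistemic ~ ln 2
```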

In addition to training the ML components, the implementation team will need to calibrate the uncertainty representations produced by each component. For an object classifier, the uncertainty representation is simply the probability that the object belongs to each object class. But for other ML components, the representation can be more complex. For example, an object detector might provide a probability distribution over the potential locations of an object. And a component that predicts the future behavior of workers, trainers, pedestrians, animals, and other vehicles might provide a set of possible near-term destinations for each. Such representations are well calibrated if, when the learning component claims that the ground truth matches the representation with probability p, the ground truth does in fact satisfy the representation with frequency p in the actual scenario. Hence, if the system claims that the set of near-term destinations for a pedestrian contains the true destination with probability 0.99, this must hold 99 percent of the time.
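
Checking this kind of claim on held-out data is straightforward in principle, as the sketch below illustrates for a component that emits prediction sets with a claimed 0.99 coverage level; pred_sets and truths are assumed inputs.

```python
# Sketch: empirical coverage check for a component that outputs prediction
# sets claimed to contain the ground truth with probability 0.99.
# pred_sets: list of sets of candidate answers; truths: matching labels.
def empirical_coverage(pred_sets, truths) -> float:
    hits = sum(1 for s, y in zip(pred_sets, truths) if y in s)
    return hits / len(truths)

# Well calibrated at the 0.99 level means this should be close to 0.99
# on data drawn from the deployment distribution.
```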

Traditionally, learning algorithms seek to minimize the average loss (error rate) over the entire training data set. However, in the scenario-based approach, different scenarios may lead to more severe harms, and therefore, error rates on some scenarios should be much smaller than error rates on other scenarios. Cost-sensitive learning methods can be applied to achieve this.
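
A minimal sketch of one such cost-sensitive scheme, assuming PyTorch and an illustrative scenario-to-severity weight table, is to weight each example's loss by its scenario severity so that the optimizer cannot trade away rare, high-harm scenarios for average accuracy:

```python
# Sketch: scenario-weighted (cost-sensitive) training loss in PyTorch.
# The scenario names and severity weights are illustrative assumptions.
import torch
import torch.nn.functional as F

SCENARIO_WEIGHT = {"freeway_cruise": 1.0,
                   "crosswalk_turn": 10.0,    # errors here cause more harm
                   "school_zone":    20.0}

def scenario_weighted_loss(logits, targets, scenario_ids, weights=SCENARIO_WEIGHT):
    per_example = F.cross_entropy(logits, targets, reduction="none")
    w = torch.tensor([weights[s] for s in scenario_ids],
                     dtype=per_example.dtype, device=per_example.device)
    return (w * per_example).sum() / w.sum()   # weighted mean loss
```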

In addition to collecting data for the training process, the separate V&V team also needs to collect data for the V&V phase. As with all aspects of V&V, this work should be performed by a separate data management team.

Finding 3-5: To support decision making under uncertainty, ML components need to provide calibrated representations of their uncertainty.

Machine Learning in the Verification and Validation Phase

The primary goal of the verification process for ML systems is to verify that the system is achieving the required performance in each scenario. This includes not only the basic functional performance, but also novelty detection performance and OD detection.

Each ML component can be evaluated on the verification data to measure error rates and uncertainty calibration within each scenario. Novelty detection accuracy must be measured on held out data points that were not considered during the training process. By varying the number of known novelties used during training, engineers can map out a learning curve for the novelty detector. By extrapolating this curve, they can obtain an estimate of the error rate of the novelty detector. Similarly, held out novelties can be applied to calibrate the estimated probability of novelty. However, the validity of all these estimates depends critically on the assumption that the known novelties provide a representative sample of future novelties. Unfortunately, this assumption cannot be tested prior to system deployment.

Each ML component must also be tested for robustness to adversarial attacks. The verification team will have designed adversarial test data for this purpose. In addition, optimization methods can be applied to automatically design attacks.
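
As one concrete example of an optimization-based probe, the sketch below measures accuracy under the fast gradient sign method (FGSM) in PyTorch. FGSM is a deliberately weak attack; a verification suite would use stronger attacks, such as the AutoAttack ensemble behind RobustBench, but the measurement structure is the same. Pixel values are assumed to lie in [0, 1], and the model is assumed to be in evaluation mode.

```python
# Sketch: FGSM robustness probe. Returns accuracy on adversarially
# perturbed copies of a batch of images.
import torch
import torch.nn.functional as F

def fgsm_robust_accuracy(model, images, labels, epsilon=8/255):
    images = images.clone().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()                                   # gradient w.r.t. pixels
    adv = (images + epsilon * images.grad.sign()).clamp(0.0, 1.0).detach()
    with torch.no_grad():
        preds = model(adv).argmax(dim=1)
    return (preds == labels).float().mean().item()
```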

Evaluating the OD detector is more straightforward. However, the validity of the estimated error rate of the OD detector depends on the assumption that the verification data are a representative sample of the system/environmental states that lie outside the OD. This is easier to assure by having independent teams specify and collect test data for the OD detector.

As discussed above, statistical ML methods can be fooled by spurious correlations in the training and testing data. The validation process must check that the ML components have not relied on so-called “short cuts” in achieving high accuracy. For this purpose, it is important that the ML components have an explanation capability that permits engineers to examine which input variables or image regions are responsible for the classifier’s predictions. Mutation tests that modify image backgrounds or randomly alter irrelevant inputs provide another means of detecting short cuts. Similar methods can be applied to the training data to break accidental correlations so that they are not learned in the first place.
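
A mutation test of this kind can be sketched as follows, assuming a hypothetical paste_on_background helper that composites a foreground object onto a chosen background: if predictions flip when only the background changes, the model is likely exploiting a spurious background correlation.

```python
# Sketch: background-mutation probe for shortcut learning. The model and
# paste_on_background helper are assumed interfaces, not real APIs.
def background_sensitivity(model, foregrounds, backgrounds,
                           paste_on_background) -> float:
    flips, total = 0, 0
    for fg in foregrounds:
        preds = [model(paste_on_background(fg, bg)).argmax()
                 for bg in backgrounds]
        flips += len(set(preds)) > 1   # prediction changed with background only
        total += 1
    return flips / total               # fraction of objects affected
```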

Finding 3-6: ML components need to be tested to ensure that they have not learned spurious correlations. Advances in test data design and in ML explanation capabilities are needed to achieve this goal.

Finding 3-7: The validity of ML components rests on assumptions, particularly for novelty detection and OD detection, which cannot be checked prior to deployment. Specifically, validity assumes that the training data exercises all important directions of variation along which novel and out-of-operational domain data will arise.

This scenario-based approach is a fundamental change in the way ML systems are developed. Instead of assuming a stationary probability distribution from which both training and test data are drawn, the training data are engineered to achieve different levels of accuracy within each scenario. This prevents the machine learning process from sacrificing performance in rare parts of the input space to achieve higher performance in more common input regions. Calibration of uncertainty representations is still necessarily based on an assumed deployment distribution. Methods for online adaptation of uncertainty calibration can be applied to track changes in the deployment distribution.

Finding 3-8: Current ML components fall short of safety-critical standards because they rely on statistical assumptions that discount rare events and cannot guarantee consistent performance across all operating conditions. To close this gap, ML components must be redesigned to maintain verified levels of accuracy throughout their entire OD.
