In a session moderated by Leo Chiang, senior research and development fellow at the Dow Chemical Company, and Sylvain Costes, former National Aeronautics and Space Administration data officer for space, biological, and physical sciences, four speakers discussed issues of bias, ethics, and regulation related to the use of artificial intelligence (AI) in radiation therapy, medical diagnostics, and occupational health and safety.
Matt Dennis, data scientist in the Nuclear Regulatory Commission’s (NRC’s) Office of Nuclear Regulatory Research, outlined the agency’s comprehensive approach to AI integration.
Dennis started by discussing how the nuclear industry, traditionally characterized as slow to adopt new technologies, faces a unique paradox. Most U.S. nuclear power plants currently operate using 1960s and 1970s analog technology—complete with dials, gauges, and those distinctive green control rooms that defined the pre-digital age. He warned that the industry should not ignore AI’s transformative potential.
The regulated entities under NRC oversight are eager to harness AI primarily for efficiency gains, seeing immense potential to make operations “better, faster, and cheaper.” However, the NRC’s fundamental principle remains unwavering: efficiency cannot come at the sacrifice of safety and security. This tension between innovation and caution defines the current regulatory landscape, where the nuclear industry must pivot more quickly than ever before while maintaining its exemplary safety record.
NRC’s strategic response centers on enabling the safe and secure deployment of AI in the nuclear sector while promoting stakeholder engagement and encouraging research and development. This dual approach recognizes that the agency is attempting to simultaneously prepare to regulate AI applications and leverage AI to become more efficient regulators themselves.
Dennis noted that NRC performed a comprehensive analysis that identified eight potential gap categories where current regulations, originally written with human-centric operations in mind, may need to be adapted for AI applications (Pensado et al., 2024). Two areas particularly relevant to radiation protection illustrate these challenges.
The first area Dennis raised is that radiation safety support programs traditionally rely on radiation safety officers, committees, and support staff who manually collect dosimetry data, develop reports, and submit documentation. The prospect of AI tools that could collect, report, predict, and manage entire radiation protection programs raises fundamental questions about how such systems would integrate with existing human-centric regulations and guidance.
Second, the gap analysis revealed concerns about training and human factors engineering, particularly regarding how operators should be trained to use AI aids and tools. This parallels challenges in medical imaging, where radiologists could learn to work with AI-generated reports while maintaining critical human oversight. The nuclear industry’s commitment to keeping “humans in the loop” reflects a broader ethical imperative to ensure AI serves as a tool that enhances rather than replaces human judgment in safety-critical decisions.
The NRC’s approach extends beyond isolated domestic efforts, emphasizing collaboration with federal counterparts, international partners, and professional organizations. The agency maintains memoranda of understanding with the Electric Power Research Institute and the Department of Energy, ensuring alignment with industry developments and research initiatives.
Internationally, the NRC collaborated with the Canadian Nuclear Safety Commission (CNSC) and the United Kingdom’s Office for Nuclear Regulation to develop the “CANUKUS AI Principles,” which mirror similar efforts by health regulators like the Food and Drug Administration (FDA), Health Canada, and the Medicines and Healthcare Products Regulatory Agency. These principles provide high-level consensus guidance that transcends specific applications while addressing the unique challenges of nuclear regulation (CNSC, U.K. Office for Nuclear Regulation, and U.S. Nuclear Regulatory Commission, 2024). The alignment with FDA’s predetermined change control processes for AI-enabled medical devices demonstrates how nuclear regulation can adapt existing frameworks rather than creating entirely new regulatory structures.
This collaborative approach proves particularly valuable given the lag in professional standards development. Traditional standards organizations struggle to keep pace with AI’s rapid evolution, creating a timing mismatch between technological capabilities and regulatory guidance. The NRC’s proactive engagement with state counterparts through the Conference of Radiation Control Program Directors addresses this challenge, particularly for radioactive materials regulation, in which states play a primary role.
Dennis continued by noting that the practical implementation of AI in nuclear applications presents both technical and organizational challenges. Given the industry's analog heritage, integrating AI tools involves significant infrastructure investments and cultural shifts. However, specific use cases demonstrate AI's transformative potential. Nondestructive examination of weld defects exemplifies this opportunity: a task that currently requires human inspectors to analyze thousands of images is an ideal application for computer vision algorithms that could enhance both efficiency and accuracy.
The NRC’s internal adoption strategy identified 61 potential use cases, with 36 showing clear AI application potential (NRC, 2024). These range from basic productivity tools like AI-assisted document writing to sophisticated predictive analytics for operational findings and experience analysis. This dual focus—regulating external AI applications while adopting AI internally—positions the NRC to understand the technology from both regulatory and user perspectives.
Looking forward, Dennis stated that the NRC faces the challenge of maintaining appropriate governance structures while ensuring regulatory staff possess the technical competence to review AI applications. The agency’s annual public workshops and symposiums serve as critical touchpoints with industry, academia, and research institutions, fostering the collaborative environment necessary for responsible AI integration.
The NRC’s approach to AI integration reflects a careful balance between innovation and responsibility. By maintaining safety and security as nonnegotiable principles while actively engaging with technological advancement, the agency positions itself to navigate the complex intersection of AI and nuclear regulation. Dennis stated that the success of this approach will depend on continued collaboration, adaptive regulation, and unwavering commitment to the public interest that defines nuclear regulatory oversight (NRC, 2023).
Dennis ended by stating that as the nuclear industry prepares for a potential renaissance driven by AI’s energy demands, the NRC’s approach to AI integration could serve as a model for how regulatory agencies can embrace technological transformation while fulfilling their fundamental mission of protecting public health and safety.
Etta Pisano, chief research officer of the American College of Radiology, spoke about the use of AI for breast cancer screening, which could offer insights into the use of AI in other areas. She focused on what AI can do, how it is being evaluated, and how it should be evaluated.
Pisano began by talking about the fundamentals of breast cancer screening. Breast cancer screening in the United States follows a well-established protocol for asymptomatic women, with screening typically conducted annually or biennially starting around age 40 and frequency determined by individual risk factors. Pisano’s presentation outlined how this process creates a complex cascade of medical decisions that presents both opportunities and challenges for AI integration.
She described how the current screening process reveals significant inefficiencies that could be addressed with AI intervention. Of all women screened, 7–15 percent are called back for additional workup, which includes extra mammography views, ultrasound, and magnetic resonance imaging. Of those recalled, only 8–10 percent ultimately receive biopsies, and of those biopsied, merely 20 percent have cancer—translating to only 5–10 cancer diagnoses per 1,000 women screened. Perhaps most concerning, approximately 40 percent of breast cancers identified through screening were actually visible in retrospect on earlier mammograms but were missed at the time of initial interpretation.
Pisano noted that this statistical reality underscores why breast cancer screening has become the most common type of AI-related radiology application submitted to FDA for review. The technology’s primary goals focus on reducing false positives at both the callback and biopsy levels while simultaneously reducing false negatives to ensure that no woman returns with a cancer that was previously visible on mammography.
FDA’s approach to evaluating AI technologies in breast cancer screening emphasizes multiple assessment criteria beyond traditional accuracy measures. Pisano stressed that while mortality outcomes represent the gold standard for technology evaluation—as demonstrated in the definitive screening mammography trials of the 1970s and 1980s—such comprehensive studies are no longer practical given ethical constraints around control groups and the extended time frames required.
Contemporary evaluation focuses on four key metrics: safety (i.e., primarily accuracy in terms of sensitivity, specificity, and overall performance), cost-effectiveness compared to standard approaches, patient convenience, and enhancement of radiologist performance. Notably, the fourth metric transcends simple accuracy measures, recognizing that AI’s value lies not just in diagnostic precision but in workflow optimization and professional support.
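To make these accuracy terms concrete, the minimal sketch below computes sensitivity, specificity, positive predictive value, and recall rate from a two-by-two table of screening outcomes. The counts are hypothetical and chosen only for illustration; they are not figures from the presentation.

```python
# Minimal sketch of the accuracy measures referenced above (sensitivity,
# specificity, and related quantities) computed from a 2x2 confusion matrix.
# The counts used in the example call are hypothetical.

def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return basic performance measures for a screening test."""
    sensitivity = tp / (tp + fn)                   # fraction of cancers correctly flagged
    specificity = tn / (tn + fp)                   # fraction of cancer-free exams correctly cleared
    ppv = tp / (tp + fp)                           # probability of cancer given a positive call
    recall_rate = (tp + fp) / (tp + fp + fn + tn)  # fraction of screens called back
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "recall_rate": recall_rate}

# Hypothetical cohort of 1,000 screens: 5 cancers, 100 callbacks.
print(screening_metrics(tp=4, fp=96, fn=1, tn=899))
```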
As of January 2025, FDA had approved 46 AI breast-imaging products for sale in the United States, although federal law and regulations maintain the requirement for human oversight—autonomous AI remains prohibited. The regulatory framework categorizes AI applications into five broad claim categories, adapted from earlier computer-assisted diagnosis terminology: triage, detection/localization, diagnosis/characterization, combined detection and diagnosis, and acquisition/optimization.
She noted that FDA approval requires testing across three distinct categories, each with inherent limitations. Standalone performance testing, the most common and practical approach, evaluates AI models against historical cases with known outcomes. While useful for training algorithms and demonstrating performance equivalency or superiority to human readers, this method faces significant challenges regarding case characterization, population representativeness, and adequate sampling of diverse breast tissue densities and lesion types.
Reader studies represent the second category, involving human readers of varying skill levels who interpret case subsets both with and without AI assistance. These studies face their own limitations: small sample sizes (typically 16–20 readers) cannot represent the full spectrum of radiologist capabilities, artificial study environments may alter reader behavior, and cancer case enrichment creates unrealistic prevalence rates that may not reflect real-world performance.
Clinical trials, the third category, remain largely impractical in the current regulatory environment due to ethical constraints around control groups denied screening. While cancer registry studies can approximate randomized trial outcomes, they lack the causal reliability of controlled studies.
Pisano highlighted two instructive examples where FDA-approved technologies encountered real-world implementation challenges despite rigorous approval processes. The first involved radiological AI for large vessel occlusion detection, which was approved as a triage device but subsequently used off-label for diagnosis and detection in clinical practice. This usage expansion proved problematic when the software demonstrated missed posterior circulation thromboses, illustrating the risks of off-label application beyond approved indications.
The second example involved computer-assisted diagnosis products for digital breast cancer screening, which received insurance reimbursement based on promising reader study data that suggested dramatic performance improvements. However, a comprehensive analysis published 20 years post-adoption (Lehman et al., 2015) revealed no difference in actual patient outcomes within large registry populations, demonstrating the potential disconnect between controlled study results and real-world clinical impact. Pisano said that these cases underscore the fundamental challenge that “even despite preliminary evidence from reader studies, in real life there was no difference in patient outcomes.” This highlights the inherent risks when clinical implementation occurs without comprehensive clinical trial validation.
Looking forward, Pisano suggested that real-world evidence might provide a pathway toward autonomous or semi-autonomous AI implementation in breast cancer screening. The current radiologist shortage—with approximately 10 available positions for every candidate—creates urgent pressure for AI solutions that can meaningfully reduce interpretation time, particularly for images deemed negative by AI algorithms.
Proposed implementation strategies involve installing AI algorithms that meet specific performance thresholds in standalone testing and then monitoring real-world radiologist performance through comparative studies. These might include dual-reader protocols comparing AI-assisted with unassisted interpretations, with patients retaining opt-out rights for AI applications.
Beyond screening applications, AI development focuses on short-term breast cancer risk assessment, potentially enabling more intensive screening protocols and risk-reduction pharmaceutical interventions for high-risk populations. Pisano predicted such tools would become available within 5 years. Additional applications include real-time image quality assessment by AI to alert technologists about inadequate images that require immediate retakes, same-visit workup protocols to reduce patient loss to follow-up, and direct-to-biopsy pathways for high-suspicion cases.
Pisano ended by noting significant international developments that may influence U.S. AI adoption. The European Union has advanced considerably ahead of the United States in implementing and testing AI for clinical practice. Most notably, the United Kingdom is funding a comprehensive breast cancer screening AI trial using cluster-randomization to evaluate AI as a second reader replacement, potentially eliminating the need for dual human interpretation that currently characterizes UK screening protocols.
These international initiatives may provide crucial real-world evidence that could inform future U.S. regulatory decisions and clinical implementation strategies, potentially accelerating the transition from current human-centric screening models to AI-augmented or AI-autonomous systems while maintaining patient safety and diagnostic accuracy.
Amber Simpson, Canada Research Chair in Biomedical Computing and Informatics at Queen’s University, spoke about best practices and pitfalls in addressing bias in AI-driven medical imaging. She began by briefly describing her background working at and with several institutions: The Memorial Sloan Kettering Cancer Center, the Canadian Cancer Trials Group, the Ontario Health Data Platform, and the Imaging Data Commons at the National Cancer Institute. She is a classically trained computer scientist, but after working on various projects at these different institutions, she describes herself as a “data generalist.”
Simpson discussed how imaging AI models hold transformative potential for healthcare, offering the promise of predictive and prognostic biomarkers that are quantitative, noninvasive, and cost-effective. Simpson’s presentation, drawing from comprehensive research conducted with colleagues at Memorial Sloan Kettering Cancer Center (Moskowitz et al., 2022), illuminated both this tremendous potential and the critical challenge of embedded biases that can fundamentally skew model results. Her systematic analysis reveals that understanding sources of bias and implementing effective mitigation strategies represents one of the most crucial aspects of responsible AI development in medical imaging.
The significance of addressing these biases extends beyond academic concern—biased models can perpetuate healthcare disparities, lead to misdiagnosis, and ultimately compromise patient care. Simpson’s work provides a comprehensive framework for identifying, understanding, and mitigating these biases across the entire AI development pipeline, from initial study design through final model deployment.
The architecture of research studies creates the first critical juncture where bias can infiltrate imaging AI models. Simpson identified four primary sources of study design bias that researchers should navigate carefully. Incorporation bias occurs when outcome measures rely on information from the predictors (data used to make predictions) themselves, creating a circular dependency that can artificially inflate model performance. Simpson’s laboratory faces this challenge when predicting treatment response from computed tomography (CT) images, where response is measured by diameter changes of the treated lesion, which themselves depend on post-treatment imaging. While acknowledging this limitation, she noted that such approaches often represent necessary compromises when gold standard measures like pathological assessment involve invasive procedures unavailable for all patients and researchers are forced to accept imaging as a “bronze standard.”
Verification bias presents another significant challenge, arising when analyses include only cases where outcomes can be determined, creating a nonrepresentative subset of the target population. Simpson noted that this bias commonly manifests in studies that include only patients with biopsies, where biopsy decisions were based on imaging findings. Such selection creates a fundamental problem: It risks underestimating false negatives within the broader population while overestimating test sensitivity, potentially leading to overconfident model performance assessments (a toy simulation following the four bias categories below illustrates this effect).
Spectrum bias refers specifically to lack of diversity in disease severity or case types within the study population, significantly affecting diagnostic test performance across different clinical settings with varying patient populations. This bias frequently occurs when models are developed using only extreme cases—very sick or very healthy individuals—and is often driven by practical constraints such as expensive assays that limit focus to high-risk patients only.
Sampling bias, the fourth category, refers more broadly to who gets included in the study at all: entire groups of people may be missing, regardless of disease status. This can result from access barriers that prevent certain patient populations from receiving complete follow-up imaging, creating systematic gaps in data representation that can fundamentally skew model development and validation.
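A toy simulation can make the verification bias described above concrete: when an AI test is evaluated only on biopsied patients, and biopsy referral was driven by the same imaging finding the AI relies on, apparent sensitivity is inflated and false negatives disappear from view. All prevalence and error rates below are illustrative assumptions rather than figures from Simpson's talk.

```python
# Toy simulation of verification bias: the AI test is evaluated only on
# patients whose disease status was verified by biopsy, and biopsy was
# triggered by the same imaging finding the AI relies on. All rates are
# assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
disease = rng.random(n) < 0.05                        # assumed 5% prevalence

# Imaging finding that drives both biopsy referral and the AI's prediction.
finding = np.where(disease, rng.random(n) < 0.75, rng.random(n) < 0.08)
biopsied = finding                                     # only flagged patients get verified
ai_positive = finding | (rng.random(n) < 0.02)         # AI mostly mirrors the finding

def sens_spec(pred, truth):
    sens = pred[truth].mean()                          # true positive rate
    spec = (~pred[~truth]).mean()                      # true negative rate
    return f"sensitivity={sens:.2f}, specificity={spec:.2f}"

print("whole population:", sens_spec(ai_positive, disease))
print("biopsied subset :", sens_spec(ai_positive[biopsied], disease[biopsied]))
```

In the biopsied subset the sensitivity appears perfect because every diseased patient in that subset was, by construction, already flagged by imaging, while the unbiopsied false negatives are never counted.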
The image acquisition process introduces multiple layers of potential bias that can significantly impact model performance and generalizability. Simpson demonstrated this challenge through a striking example: Two CT images were taken of the same patient just 10 days apart at Memorial Sloan Kettering Cancer Center and another institution using different protocols and dose reduction strategies. Despite both being portal venous phase CT scans, the images appeared dramatically different, illustrating how technical variations can introduce substantial bias into model training and testing.
To address these acquisition-related challenges, Simpson’s team conducted a prospective scan-rescan study involving patients from Memorial Sloan Kettering Cancer Center and MD Anderson Cancer Center, focusing on liver parenchyma imaging. The team added a second portal venous phase scan, varying the timing by plus or minus 15 seconds from the standard protocol. This approach recognized that contrast fluid administration creates dynamic liver attenuation changes over time, with images taken at different time points showing distinct characteristics due to varying contrast uptake patterns. These temporal variations represent just one example of the numerous technical factors that can introduce statistical bias into prediction models.
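One common way to quantify scan-rescan repeatability of an extracted feature is Lin's concordance correlation coefficient. The study described above is not documented here as using that particular statistic, so the sketch below is only an assumed illustration with synthetic paired values.

```python
# Minimal sketch of a scan-rescan repeatability check for an imaging feature
# (e.g., mean liver attenuation) using Lin's concordance correlation
# coefficient (CCC). The paired values below are synthetic placeholders,
# not data from the study described above.
import numpy as np

def concordance_ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Lin's concordance correlation coefficient between paired measurements."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

rng = np.random.default_rng(1)
scan1 = rng.normal(100, 10, size=30)          # feature from the standard-timing scan
scan2 = scan1 + rng.normal(3, 4, size=30)     # rescan shifted by contrast-timing effects

print(f"CCC = {concordance_ccc(scan1, scan2):.2f}")   # values below 1 indicate imperfect repeatability
```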
Beyond technical considerations, data acquisition can embed social biases that reflect and perpetuate healthcare disparities. Simpson highlighted the problematic history of race-specific correction factors in medical measurements, particularly the race-adjusted estimated glomerular filtration rate (eGFR) calculations that influenced kidney transplant eligibility for Black patients. While such adjustments have been discontinued for eGFR calculations, the historical data containing these biased measurements remain embedded in medical records and research datasets (Tsai et al., 2021; Vyas et al., 2020). This creates ongoing challenges for researchers on how to handle and compensate for these biased historical data in current studies.
The preprocessing stage introduces additional variability that can significantly impact model performance and reproducibility. Different filtering approaches, threshold selections, and discretization methods can yield substantially different results, even when applied to identical source data. Simpson emphasized that image features can vary dramatically based on discretization strategies, whether using fixed bin widths or fixed numbers of bins. This variability underscores the critical importance of detailed methodological documentation, which enables other researchers to replicate work and understand potential sources of variation.
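The sketch below illustrates the discretization point with a simple histogram-entropy feature: the same synthetic intensity values binned with a fixed bin width versus a fixed number of bins produce different feature values. The feature choice and numbers are illustrative assumptions, not drawn from Simpson's pipeline.

```python
# Sketch of how a texture-style feature shifts with the discretization choice:
# identical intensity values binned with a fixed bin width versus a fixed
# number of bins yield different gray-level histograms and therefore
# different feature values (here, simple histogram entropy). Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
roi = rng.normal(60, 20, size=5000)            # synthetic region-of-interest intensities

def entropy_from_counts(counts):
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Option 1: fixed bin width of 10 intensity units.
edges_width = np.arange(roi.min(), roi.max() + 10, 10)
ent_width = entropy_from_counts(np.histogram(roi, bins=edges_width)[0])

# Option 2: fixed number of bins (32), regardless of the intensity range.
ent_bins = entropy_from_counts(np.histogram(roi, bins=32)[0])

print(f"entropy (fixed bin width) = {ent_width:.3f}")
print(f"entropy (fixed bin count) = {ent_bins:.3f}")
```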
Operator variability represents another significant preprocessing concern, as human factors influence both manual and semi-automated measurements, particularly in crucial tasks like image segmentation. These variations can introduce systematic biases that propagate through the entire modeling pipeline, affecting final model performance and reliability.
Perhaps the most critical risk in prediction model building is overfitting, which Simpson described as a model “capturing spurious associations in the training data in addition to associations that would be replicated in similar datasets.” To demonstrate the danger of overfitting, her team conducted experiments with publicly available imaging datasets and showed that essentially any data could be manipulated to support virtually any conclusion. This demonstration highlighted a fundamental principle: when datasets are sufficiently small relative to the number of variables being extracted, overfitting becomes inevitable, leading to models that can “predict anything” but generalize poorly. Even large datasets can quickly become effectively small when researchers ask highly specific questions, leading to the same overfitting problems (Chakraborty et al., 2024). This relationship among dataset size, question specificity, and model reliability represents a critical consideration in imaging AI development.
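A small experiment in the spirit of the response-permutation idea cited above, though not a reproduction of that study's pipeline, shows how easily overfitting arises: with few samples and many candidate features, selecting features on the full dataset before cross-validation produces impressive-looking accuracy even for randomly permuted labels that carry no signal.

```python
# Sketch (under assumed, synthetic data) of how a leaky analysis pipeline
# "predicts" randomly permuted labels: features are selected using all samples
# before cross-validation, so the selection step leaks the labels into the test folds.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features = 40, 2000
X = rng.normal(size=(n_samples, n_features))           # pure-noise "radiomic" features
y = rng.permutation([0, 1] * (n_samples // 2))          # permuted labels: no real signal

# Leaky step: pick the 10 features most correlated with y using ALL samples.
corr = np.abs(np.corrcoef(X.T, y)[-1, :-1])
top = np.argsort(corr)[-10:]

leaky_auc = cross_val_score(LogisticRegression(max_iter=1000), X[:, top], y,
                            cv=5, scoring="roc_auc").mean()
honest_auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=5, scoring="roc_auc").mean()

print(f"leaky selection AUC ~ {leaky_auc:.2f}  (looks predictive, but labels are random)")
print(f"no-selection AUC    ~ {honest_auc:.2f}  (near chance, as it should be)")
```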
Simpson concluded her presentation with a pragmatic acknowledgment of data limitations and suggested balanced risk assessment. She emphasized that data are inherently “incorrect and biased,” regardless of quantity, with uncertainty embedded throughout. However, she argued that extensive knowledge about statistical bias provides researchers with the tools necessary to address many of these problems effectively.
Crucially, Simpson highlighted that the risks of biased data should be weighed against the risks of inaction. Significant risks are associated with leaving data “locked up in repositories for no one to use,” she noted, and substantial risks arise in avoiding AI altogether. This perspective reframes the bias challenge not as a reason to avoid AI development but as a call for thoughtful, systematic approaches to bias identification and mitigation.
The key lies in implementing comprehensive bias mitigation strategies throughout the AI development lifecycle. This includes careful study design that acknowledges and accounts for potential biases, standardized data acquisition protocols that minimize technical variation, transparent preprocessing documentation that enables reproducibility, and robust model validation approaches that test performance across diverse populations and settings.
By embracing this systematic approach to bias identification and mitigation, the medical imaging community can harness AI’s transformative potential while maintaining the rigorous standards necessary for responsible healthcare innovation. Simpson’s framework provides a roadmap for navigating these challenges while ensuring that imaging AI models serve to enhance rather than compromise healthcare equity and quality.
Judy Wawira Gichoya, associate professor in the Department of Radiology and Imaging Sciences at the Emory University School of Medicine, discussed the various ways that AI models can take “shortcuts” to get to an answer, how that can lead to bias in the models, and ways to overcome that bias. In her estimation, the presence of shortcuts is the biggest hurdle to overcome before AI can be used routinely in medical imaging. “These shortcuts are everywhere,” she said, and overcoming them will likely require serious work.
Showing examples from Geirhos and colleagues (2020), Gichoya explained that shortcuts arise when AI models find patterns that produce correct-looking predictions but not for the right reasons. In one example, an AI model captioned a hilly terrain photo as containing grazing sheep when no sheep were present; the model made this “intelligent guess” because most similar training images contained grazing animals. In another, medical example, an AI model examining lung images concluded that a patient had pneumonia not because of actual disease indicators, but because a radiographic marker showed the patient was from the intensive care unit, where pneumonia is common.
To illustrate how shortcuts develop, Gichoya described how a convolutional neural network (CNN) might confuse camels and cows. In datasets where cow images typically include grass backgrounds and camel images include sand backgrounds, the CNN might rely on background rather than animal features for classification. This becomes particularly problematic with limited datasets where background provides a strong signal, resulting in cows on sand being identified as camels and vice versa.
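The camel-and-cow example can be reduced to a toy sketch in which each synthetic "image" is summarized by two features, a weak animal-shape cue and a strong background cue that tracks the label in nearly all training images; a simple classifier then leans on the background and misclassifies a cow photographed on sand. Everything below is synthetic and purely illustrative.

```python
# Toy illustration of a background shortcut: because the background feature is
# almost perfectly correlated with the label in training, the classifier
# relies on it and mislabels a cow on sand as a camel. All data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
label = rng.integers(0, 2, size=n)                 # 0 = cow, 1 = camel

# Feature 0: weak, noisy animal-shape cue; feature 1: strong background cue
# (grass vs. sand) that tracks the label in 98% of training images.
animal_cue = label + rng.normal(0, 2.0, size=n)
background = np.where(rng.random(n) < 0.98, label, 1 - label) + rng.normal(0, 0.1, size=n)
X_train = np.column_stack([animal_cue, background])

clf = LogisticRegression().fit(X_train, label)

# A cow (animal_cue near 0) photographed on sand (background near 1).
cow_on_sand = np.array([[0.0, 1.0]])
print("predicted class:", clf.predict(cow_on_sand)[0])   # likely 1 ("camel")
```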
Gichoya discussed a 2018 study that examined CNN performance across three hospital systems—Mount Sinai Hospital, the National Institutes of Health (NIH) Clinical Center, and Indiana University Network (IU; Zech et al., 2018). Performance varied dramatically depending on training and testing data sources. Mount Sinai had 34.2 percent pneumonia prevalence versus 1.2 percent at the NIH Clinical Center and 1.0 percent at IU. Models trained on Mount Sinai performed well internally but generalized poorly to NIH and IU due in part to overfitting to high prevalence and hospital-specific features. CNNs frequently exploited shortcuts, such as hospital-specific artifacts, as proxies for disease prediction, and sometimes outperformed models trained to detect true pathology. This highlights the danger of “Frankenstein datasets”—that is, combining datasets without accounting for underlying characteristics. Gichoya emphasized the need to rethink dataset construction and testing, suggesting that even rural hospitals could contribute to AI testing with just 20 cases to identify model failures.
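A side calculation, not drawn from the study itself, shows one mechanical reason site-specific performance transfers poorly: even with sensitivity and specificity held fixed (here assumed to be 0.80 and 0.90), the positive predictive value collapses when prevalence drops from the Mount Sinai level to the NIH and IU levels.

```python
# Illustrative calculation only: how positive predictive value (PPV) changes
# with prevalence when sensitivity and specificity are held fixed. The 0.80
# sensitivity and 0.90 specificity are assumed values, not from the study.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)

for prev in (0.342, 0.012, 0.010):   # prevalences reported for the three systems
    print(f"prevalence {prev:.3f}: PPV = {ppv(0.80, 0.90, prev):.2f}")
```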
She went on to discuss how an AI model for COVID-19 prediction appeared accurate within its training hospitals, but external validation revealed a significant drop in the model’s predictive capabilities (DeGrave et al., 2021). However, that same model could predict patient sex with nearly 100 percent accuracy from chest images alone in both internal and external datasets, demonstrating a case of shortcut learning in which the algorithm had learned to identify sex-related imaging patterns from the training dataset. As Gichoya noted, “If your data encodes specific care characteristics, the model is going to still figure it out and use it.”
Gichoya went on to provide a series of examples where AI models produced results from shortcuts. Research on hip fracture detection revealed that models were good at predicting technical details, such as what type of X-ray machine was used (with perfect accuracy) and hospital procedures, but when researchers removed these shortcuts by testing on more balanced patient groups, the models’ ability to actually detect fractures dropped to essentially random guessing (Badgeley et al., 2019). This shows that the AI was mainly learning to recognize hospital patterns and equipment differences rather than the bone fractures it was supposed to identify. Gichoya highlighted this example to warn that very high statistical scores (such as an area under the curve of 0.99) should trigger error investigations rather than celebration.
Examining knee X-ray images, Gichoya and her students could predict hospital origin based on slight differences in metal rod placement—details radiologists would not notice (Hill et al., 2024).
She then discussed how foundation models suffer similar shortcomings. Deep learning models in chest radiography exhibited sex- and race-related bias across subgroups (Glocker et al., 2023). Vision language models in histopathology achieved near-perfect accuracy when text labels were accurate but performed randomly without labels and failed completely with misleading labels (Clusmann et al., 2024).
Addressing claims that shortcuts only affect poorly developed software, Gichoya described an experiment with FDA-approved commercial osteoarthritis assessment software (Lenskjold et al., 2024). When researchers flipped knee images so right knees appeared as left knees, the model produced different severity grades for identical anatomical structures, indicating reliance on position-based shortcuts rather than actual joint pathology. This occurs because training datasets encode real-world patterns (for example, right knees typically show more wear because of dominant-leg usage) that models learn as predictive features.
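One way to probe for this kind of position-based shortcut is a simple flip-consistency audit: run the model on each image and on its horizontally mirrored copy and flag cases where the predicted grade changes. The sketch below assumes a generic `model` object with a `predict` method and an image array; both are placeholders, not the commercial software discussed above.

```python
# Sketch of a flip-consistency audit for position-based shortcuts. `model`
# and `images` are hypothetical placeholders for whatever grading model and
# image array are being audited.
import numpy as np

def flip_consistency_audit(model, images: np.ndarray) -> float:
    """Return the fraction of images whose predicted grade changes when the
    image is mirrored left-to-right (last axis assumed to be image width)."""
    original = model.predict(images)
    flipped = model.predict(images[..., ::-1])     # horizontal mirror of every image
    changed = np.asarray(original) != np.asarray(flipped)
    return float(changed.mean())

# Hypothetical usage: a change rate well above the model's own noise level
# suggests it is keying on laterality or positioning rather than joint pathology.
# rate = flip_consistency_audit(osteoarthritis_model, knee_images)
```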
Gichoya’s team demonstrated that standard deep learning models can identify patient race from chest X-rays, CT images, and mammograms with high accuracy (Gichoya et al., 2022). After 3 years, the mechanism remains unknown and isn’t explained by imaging-related race surrogates like body mass index or disease distribution. This racial inference capability could unpredictably bias AI interpretations.
She ended by stating that given AI’s susceptibility to shortcuts, the field should fundamentally rethink dataset construction, especially for multimodal modeling. “We need to think about testing more than training,” Gichoya emphasized, “because if foundation models that see more data are still susceptible to the same pitfalls, then we have to change something.” The field needs to make model auditing easier and to identify failure points systematically. Without addressing these challenges, she said, “AI is not going to work for anyone.” She stated that the pervasive nature of shortcuts, from simple background associations to complex institutional patterns, demonstrates that robust AI involves not only better models but also fundamentally different approaches to data curation, testing, and validation across diverse real-world contexts.
The discussion session following the presentations on ethics was moderated by Chiang and Costes. Costes asked the panelists about classifying data by level, noting that at the National Aeronautics and Space Administration, people classified data into different levels—that is, raw data, processed data, and so on. He wanted to know how this level of classification might be helpful in medical imaging and nuclear regulation.
Simpson responded that raw CT images in sinogram space hold about 1 terabyte of data each, making them impossible for humans to interpret easily. “Those images are typically deleted from a CT scanner within a few days or a week because there just isn’t space to store them,” she said. This creates concern for doctors who want to see what they are working with, although attempts at using raw imaging data have proven challenging.
Pisano provided an example from a trial following 108,000 women at 133 sites worldwide, explaining different levels of evidence that a woman does not have breast cancer. A clean mammogram represents the highest level of evidence, followed by physical examination by a physician, then asking the woman directly, with the lowest level being absence from a cancer registry. She noted that unlike physics, where measurements and calculations provide comfortable truth, breast imaging has no absolute answers. Even a double mastectomy with pathology showing no cancer evidence isn’t foolproof, she said, as “even the pathologist can miss things.”
Dennis said nuclear plant regulation faces similar evidence-level issues. For weld inspections, debate continues over whether synthetic data generation and physics-based modeling suffice or if expensive experiments are needed. “My thought on this is for us in the nuclear industry, there is a graded approach of using different levels of data depend[ing] upon the risk profile that’s being applied,” he said. Gichoya agreed that thinking in terms of different data types or levels is useful, starting with CT scanner data but recognizing that anonymized data and public domain data represent different levels due to aggregation or metadata removal. Synthetic data and foundation models with embeddings create additional levels, introducing “this concept of models as a dataset,” she explained.
Costes elaborated on his question by describing how his students sometimes autoscale microscopy images to see them better, which can be disastrous. When looking at DNA damage in cells where sometimes no damage exists, autoscaling brings up everything in the nucleus instead of maintaining consistent levels. This human-introduced bias, often unconscious, causes training failures. That is why microscopy researchers return to raw, untouched, full-bit-range images; “You never put any human interpretation in the image,” he stressed.
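The autoscaling pitfall can be shown in a few lines: per-image min-max scaling stretches a nearly empty (undamaged) nucleus to the full display range so that it looks as bright as a genuinely damaged one, whereas keeping a fixed, detector-defined range preserves the real difference. The images and bit range below are synthetic assumptions.

```python
# Sketch of the autoscaling pitfall: per-image min-max scaling hides the
# difference between a high-signal (damaged) and low-signal (undamaged)
# nucleus, while fixed full-bit-range scaling preserves it. Synthetic data.
import numpy as np

rng = np.random.default_rng(0)
BIT_RANGE = 65535                                            # assumed 16-bit detector range

damaged = rng.poisson(2000, size=(64, 64)).astype(float)     # strong damage-staining signal
undamaged = rng.poisson(30, size=(64, 64)).astype(float)     # mostly background counts

def autoscale(img):
    return (img - img.min()) / (img.max() - img.min())       # per-image min-max stretch

def fixed_scale(img):
    return img / BIT_RANGE                                    # raw, full-bit-range scaling

for name, img in [("damaged", damaged), ("undamaged", undamaged)]:
    print(f"{name:>9}: autoscaled mean = {autoscale(img).mean():.2f}, "
          f"fixed-range mean = {fixed_scale(img).mean():.4f}")
```

After autoscaling, the two nuclei produce nearly identical mean intensities, while the fixed-range values differ by well over an order of magnitude, which is the distinction a trained model needs to see.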
Costes brought up classic COVID-19 cases in which AI studies mixed European and American image data. One country had many ventilator cases while the other did not, and the two datasets used different labeling conventions and captured different clinical situations in the images. “So the training was perfect,” he said, “but it turns out that the training was on the label of the image, not the actual medical case.” He shared lessons about poor training dataset design and the potential for explainable AI (XAI) to avoid such problems by explaining model classification methods.
Gichoya noted many examples of problems with “Frankenstein datasets,” in which different datasets with varying characteristics are combined into large training datasets, making them challenging to curate and use downstream. However, she cautioned that XAI may not be the complete answer, referencing Jayashree Kalpathy-Cramer’s work showing that “if you even change the order of how you bring the chest X-rays, the XAI is inconsistent.”
Dennis said standards and guidance can help by increasing reproducibility. Unlike medical imaging, where radiologists vary in their image-taking approaches, the nuclear industry is highly regulated and prescriptive about protocols for examining welds with computer vision. “That does create a slightly cleaner dataset that is reproducible among a variety of applications and use cases,” he said.
Pisano suggested that this explains why breast imaging has become an AI development target—it is the most standardized radiology field, with images used to look for just one disease. Reporting standards exist for breast imaging and are being developed for other body areas, with image acquisition becoming relatively standardized throughout radiology.
Chiang introduced responsibility and accountability questions: “When an AI model makes a significant mistake with a negative outcome to the patient, for example, who is responsible? Is it the AI model developer who may be introducing all the biases in the model, and therefore the model is not very useful, or is it the . . . human who makes the final decision, the doctor who makes the final decision . . . versus the software developer who actually launches software for the doctor to use?”
Gichoya distinguished between AI not working and AI being biased, where bias means “your AI does not work equally for everyone.” When AI makes significant mistakes with humans in the loop, the burden tends to fall on that person, particularly since medical practitioners can get malpractice insurance. She referenced FDA guidance warning that the agency cannot do all necessary testing to ensure AI model safety, which means that “healthcare institutions that are deploying these models should be assuming responsibility and governance to make sure that [they are] safe.” Ultimately, responsibility still falls on the human in the loop.
When Simpson asked whether companies creating AI applications should be held accountable, Pisano mentioned companies willing to take liability for people whom AI labels as negative for a disorder. “But I’ll just point out [that] if they take all the negatives, it is a completely different job for the radiologist,” she said. “And it is going to be a much harder job because every case is something you have to scrutinize and decide what to do about [it].” Gichoya commented that radiologists become good at their job by learning negatives; she questioned what happens when they no longer have those to learn from.
Dennis said that from the NRC perspective, people and companies operating the technology are responsible entities even with AI use. Neither nuclear power nor medical imaging is ready for fully autonomous AI. “So, even with agentic AI,” he said, “we’re in this weird in-between where we still have to, for a lot of these safety critical applications, have a human who is double checking that.”
Pisano responded that human-in-the-loop necessity depends on the task. Safety-critical nuclear tasks require human checking, but some medical tasks may not. She cited breast density rating on a four-point scale as an example where machine misclassification would not be catastrophic, indicating that “you could probably trust that task to a machine without terrible damage to patients.”
Costes raised the question of whether fully autonomous AI systems might become viable, given the industry’s push toward developing comprehensive safeguards. The panelists’ responses revealed deep skepticism about replacing human expertise entirely. Gichoya steered the conversation away from autonomy altogether, arguing instead for a more nuanced approach that leverages human strengths while addressing human limitations. She pointed to cancer detection as an ideal example, where the real challenge lies in those ambiguous cases in which masses fall into gray areas between clearly benign and clearly malignant. Rather than building systems to replace radiologists, she suggested that developers would create more value by understanding how radiologists actually work and designing AI to complement their existing workflows.
Simpson reinforced this perspective by highlighting something that current AI simply cannot replicate—the intuitive pattern recognition that expert radiologists develop through years of practice. She described how highly trained specialists can spot subtle abnormalities that would escape both novice eyes and machine algorithms—a kind of professional intuition that emerges from deep experience and that remains uniquely human.
Dennis brought a regulatory perspective to the discussion, acknowledging that the nuclear industry does see opportunities for AI deployment, particularly in lower-risk applications. Even in more critical situations, he explained, AI could potentially operate within carefully designed safety frameworks, similar to how the industry already uses passive safety systems and redundant controls to maintain security even when primary systems fail.
Pisano turned to the economic realities of AI development. While she agreed that starting with simpler applications made sense, she observed that such “low-hanging fruit does not attract the investment” that more ambitious projects do. Investors, she noted, tend to favor comprehensive solutions like tests that can detect all cancers rather than more modest innovations targeting specific types of cancer, even when the latter might be more technically feasible and clinically useful.
Gichoya expanded on this theme, cautioning against the assumption that every healthcare challenge requires a technological solution. She emphasized the importance of examining entire systems rather than focusing narrowly on what individual algorithms might accomplish and suggested that many improvements might come from better integration of existing tools rather than revolutionary new capabilities.
To end the discussion, the moderators turned to questions regarding training future workforces given AI’s rapidly growing power. Simpson noted her computer science department’s intense focus on training approaches since AI can now generate code and replace programmers. “What does AI mean now for every discipline?” she asked. “What does it mean to be literate in AI? And how do we take those tools and those thoughts across all of these disciplines?”
Dennis said even with AI, subject matter experts highly trained in specific areas remain necessary, with AI as a layered tool. Subject matter experts may benefit from questioning mindsets and familiarity with AI for comfortable AI usage. “I will say from my own experience at our agency in the past 3 years, people have gone from being naysayers of AI to wanting to adopt it to being more comfortable with it, so I think we’re in a transition place,” he said.
Simpson said current undergraduates and new graduate students are “built differently.” They take many online courses, learn across various areas, and do not see traditional disciplines as discretely as previous generations. “They are just getting so much more integration across all of what were historically silos,” she said, predicting that in 10 or 20 years, many disciplines will become “one big discipline because they’ve figured out what this has to look like.”
Badgeley, M. A., J. R. Zech, L. Oakden-Rayner, B. S. Glicksberg, M. Liu, W. Gale, M. V. McConnell, B. Percha, T. M. Snyder, and J. T. Dudley. 2019. Deep learning predicts hip fracture using confounding patient and healthcare variables. npj Digital Medicine 2:31.
Chakraborty, J., A. Midya, B. F. Kurland, M. L. Welch, M. Gonen, C. S. Moskowitz, and A. L. Simpson. 2024. Use of response permutation to measure an imaging dataset’s susceptibility to overfitting by selected standard analysis pipelines. Academic Radiology 31(9):3590–3596.
Clusmann, J., S. Schulz, D. Ferber, I. Wiest, A. Fernandez, M. Eckstein, F. Lange, N. Reitsam, F. Kellers, M. Schmitt, P. Neidlinger, P.-H. Koop, C. Schneider, D. Truhn, W. Roth, M. Jesinghaus, J. Kather, and S. Foersch. 2024. A pen mark is all you need—Incidental prompt injection attacks on vision language models in real-life histopathology. medRxiv. https://doi.org/10.1101/2024.12.11.24318840.
CNSC (Canadian Nuclear Safety Commission), U.K. Office for Nuclear Regulation, and U.S. Nuclear Regulatory Commission. 2024. Considerations for developing artificial intelligence systems in nuclear applications. https://www.nrc.gov/docs/ML2424/ML24241A252.pdf.
DeGrave, A. J., J. D. Janizek, and S. I. Lee. 2021. AI for radiographic COVID-19 detection selects shortcuts over signal. Nature Machine Intelligence 3:610–619.
Geirhos, R., J. H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann. 2020. Shortcut learning in deep neural networks. Nature Machine Intelligence 2:665–673.
Gichoya, J. W., I. Banerjee, A. R. Bhimireddy, J. L. Burns, L. A. Celi, L. C. Chen, R. Correa, N. Dullerud, M. Ghassemi, S. C. Huang, P. C. Kuo, M. P. Lungren, L. J. Palmer, B. J. Price, S. Purkayastha, A. T. Pyrros, L. Oakden-Rayner, C. Okechukwu, L. Seyyed-Kalantari, H. Trivedi, R. Wang, Z. Zaiman, and H. Zhang. 2022. AI recognition of patient race in medical imaging: A modelling study. Lancet Digital Health 4(6):e406–e414.
Glocker, B., C. Jones, M. Roschewitz, and S. Winzek. 2023. Risk of bias in chest radiography deep learning foundation models. Radiology: Artificial Intelligence 5(6):e230060.
Hill, B. G., F. L. Koback, and P. L. Schilling. 2024. The risk of shortcutting in deep learning algorithms for medical imaging research. Scientific Reports 14(1):29224.
Lehman, C. D., R. D. Wellman, D. S. Buist, K. Kerlikowske, A. N. Tosteson, D. L. Miglioretti, and the Breast Cancer Surveillance Consortium. 2015. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Internal Medicine 175(11):1828–1837.
Lenskjold, A., M. W. Brejnebøl, M. H. Rose, H. Gudbergsen, A. Chaudhari, A. Troelsen, A. Moller, J. U. Nybing, and M. Boesen. 2024. Artificial intelligence tools trained on human-labeled data reflect human biases: A case study in a large clinical consecutive knee osteoarthritis cohort. Scientific Reports 14(1):26782.
Moskowitz, C. S., M. L. Welch, M. A. Jacobs, B. F. Kurland, and A. L. Simpson. 2022. Radiomic analysis: Study design, statistical analysis, and other bias mitigation strategies. Radiology 304(2):265–273.
NRC (Nuclear Regulatory Commission). 2023. Artificial intelligence strategic plan: Fiscal years 2023–2027. NUREG-2261. Washington, DC: Nuclear Regulatory Commission. https://www.nrc.gov/reading-rm/doc-collections/nuregs/staff/sr2261/index.html.
NRC. 2024. Advancing the use of artificial intelligence at the U.S. Nuclear Regulatory Commission. SECY-2409935. https://www.nrc.gov/docs/ML2408/ML24086A002.pdf.
Pensado, O., P. LaPlante, M. Hartnett, and K. Holladay. 2024. Regulatory framework gap assessment for the use of artificial intelligence in nuclear applications. Southwest Research Institute paper prepared for the Office of Nuclear Regulatory Research, Nuclear Regulatory Commission. https://www.nrc.gov/docs/ML2429/ML24290A059.pdf.
Tsai, J. W., J. P. Cerdeña, W. C. Goedel, W. S. Asch, V. Grubbs, M. L. Mendu, and J. S. Kaufman. 2021. Evaluating the impact and rationale of race-specific estimations of kidney function: Estimations from U.S. NHANES, 2015–2018. EClinicalMedicine 42:101197.
Vyas, D. A., L. G. Eisenstein, and D. S. Jones. 2020. Hidden in plain sight—Reconsidering the use of race correction in clinical algorithms. New England Journal of Medicine 383(9):874–882.
Zech, J. R., M. A. Badgeley, M. Liu, A. B. Costa, J. J. Titano, and E. K. Oermann. 2018. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine 15(11):e1002683.