This session focused on trustworthiness and transparency of artificial intelligence (AI) models examined from multiple angles. The session was moderated by Anyi Li, Memorial Sloan Kettering Cancer Center, and Ceferino Obcemea of the National Cancer Institute.
Matthew Rosen, associate professor of radiology at Harvard Medical School, discussed how machine learning (ML) and AI enable new classes of imaging devices, specifically low-field magnetic resonance imaging (MRI). A key focus of his discussion was to address uncertainty quantification and trustworthiness in AI-driven MRI, including methods to parameterize reconstruction accuracy, assess model generalizability, and identify failure cases. He highlighted both the promise and the limitations of ML in portable MRI, as well as challenges in robustness, interpretability, and clinical deployment.
Twenty years ago, Rosen built his first low-field MRI scanner using a 6.5-millitesla electromagnet instead of the powerful superconducting magnets found in commercial MRI machines. His initial goal was to create an MRI scanner capable of imaging patients both standing up and lying down, driven by his and student Leo Tsai’s interest in studying how gravity affects the heart–lung system. Later discussions with the Department of Defense revealed broader applications—the portable low-field MRI could be deployed on battlefields because its weak magnetic field would not tear shrapnel from soldiers’ bodies.
Today, low-field MRI is gaining acceptance among clinicians and professional societies as a potentially valuable tool. Drawing on ML and AI, Rosen’s work focuses on MRI applications that genuinely benefit from low-field operation.
Commercial MRI uses high-field superconducting magnets for good reason—this represents the most straightforward path to high-quality images. The magnetic moments of the protons (hydrogen nuclei) in water molecules, which MRI manipulates to create images, are vanishingly small. As Rosen noted, “The only thing you can do is turn up the magnetic field to make nice pictures.”
However, Rosen and his students developed an alternative approach to improving low-field MRI images. They combined physics approaches—high-efficiency sampling strategies and low-noise detectors—to maximize the signal-to-noise ratio (SNR) at the detector and then applied computational strategies, including magnetic resonance fingerprinting and deep learning reconstruction, to extract more information from images. Deep learning became essential to this process.
Eventually, Rosen’s laboratory reached the Johnson noise limit on its detectors with maximally efficient signal acquisition strategies, making further signal improvement seem impossible. Postdoc Bo Zhu then asked, “Can we just somehow delete the noise?” This question launched Rosen and his team’s work on noise reduction. When an MRI scans a brain or other body part, it acquires data that get “reconstructed” into clinically recognizable images through an inverse Fourier transform. Multiple reconstruction methods exist, but reconstructed images invariably contain noise that can make recognition difficult.
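To make the reconstruction step concrete, the following minimal sketch (not from Rosen’s talk) reconstructs a simulated noisy k-space acquisition with an inverse fast Fourier transform; the phantom, noise level, and array sizes are purely illustrative.

```python
import numpy as np

# Illustrative 2D "phantom": a bright square on a dark background.
image = np.zeros((128, 128))
image[48:80, 48:80] = 1.0

# Forward model: MRI acquires the Fourier transform of the object (k-space).
# Complex Gaussian noise mimics a low-SNR, low-field acquisition.
rng = np.random.default_rng(0)
kspace = np.fft.fft2(image)
kspace += 40.0 * (rng.standard_normal((128, 128)) + 1j * rng.standard_normal((128, 128)))

# Conventional reconstruction: inverse Fourier transform of the noisy data.
recon = np.abs(np.fft.ifft2(kspace))

# Crude SNR estimate: mean over the object divided by the background spread.
snr = recon[48:80, 48:80].mean() / recon[:32, :32].std()
```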
Rosen’s team drew inspiration from biological vision systems. “Our biological neural networks have been trained since birth on what makes an image—edges, patterns, [and] texture—versus noise,” he explained. Human perception refines through exposure to and training on stimuli (Sasaki et al., 2010), with this perceptual learning proving critical for robust performance in low SNR settings (Lu et al., 2011).
Building on this learning-to-recognize-images concept, Zhu and others (2018) developed AUTOMAP (Automated Transform by Manifold Approximation). “The idea is basically to learn to invert an arbitrary forward model,” Rosen said, “and it does it by identifying sparsity in both domains—that is, the signal domain and the image domain—to improve SNR and accuracy.”
Sparsity provides a way to separate signal from noise. Natural images have special structure, and high-dimensional data can be represented with fewer coefficients in sparse domains. While images are not sparse in the image domain or Fourier domain, they become sparse when transformed to domains like the wavelet domain. AUTOMAP operates between data-defined sparse domains, hallucinating images from learned sparse convolutional feature maps. Rosen continued by explaining that AUTOMAP takes raw k-space data on one side and produces reconstructed images on the other side. Importantly, AUTOMAP-produced images have SNRs two to three times higher than conventional reconstructions.
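The following is a minimal PyTorch sketch of an AUTOMAP-style network in the spirit of Zhu et al. (2018): fully connected layers approximate the domain transform from flattened k-space to the image domain, and convolutional layers refine the result through learned feature maps. The layer sizes, activations, and the omission of the sparsity penalty and training loop are simplifications, not the published architecture.

```python
import torch
import torch.nn as nn

class AutomapLike(nn.Module):
    """Maps flattened complex k-space (real/imag stacked) to an image estimate."""
    def __init__(self, n=64):
        super().__init__()
        self.n = n
        # Fully connected layers approximate the learned domain transform.
        self.fc = nn.Sequential(
            nn.Linear(2 * n * n, n * n), nn.Tanh(),
            nn.Linear(n * n, n * n), nn.Tanh(),
        )
        # Convolutional layers refine the image via learned feature maps.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 64, 5, padding=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 7, padding=3),
        )

    def forward(self, kspace):                 # kspace: (batch, 2, n, n)
        x = self.fc(kspace.flatten(1))         # learned transform to image domain
        x = x.view(-1, 1, self.n, self.n)      # reshape to an image-sized tensor
        return self.conv(x)

model = AutomapLike(n=64)
dummy_kspace = torch.randn(1, 2, 64, 64)       # illustrative input only
image_estimate = model(dummy_kspace)           # (1, 1, 64, 64)
```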
When applied to noisy low-field MRI data, AUTOMAP boosts SNR by factors of 1.5 to 2.5, with greater boosts at lower SNRs (Koonjoo et al., 2021). The team discovered that AUTOMAP removed “zipper artifacts” (caused by scanner issues creating alternating white-to-black pixels) from images despite not being specifically trained for artifact removal.
Since this AI system learns reconstruction by seeing many examples and hallucinating final images, similar to other AI applications, concerns arise about potentially missing tumors. However, Rosen noted that the particular mathematical problem AUTOMAP solves during reconstruction “is particularly immune to this kind of issue.”
A study found that low-field MRI augmented with “super-resolution” AI worked as well for segmentation as high-field MRI (Iglesias et al., 2023). While acknowledging that such images can have problems—one study showed the super-resolution low-field technique creating nonexistent pathology—Rosen cautioned about diagnostic use: “If you’re trying to use these super-resolution images for diagnostic purposes, you had better be careful.” However, he added, “these tools in fact have their place,” as demonstrated by segmentation work. Success depends on careful task selection and result verification.
Rosen continued by highlighting the work by Sorby-Adams and colleagues (2024), which added super-resolution and quantitative morphological measurements at low field. Portable scanners placed in Massachusetts General Hospital’s memory unit scanned patients brought in by caregivers who noticed cognitive impairment. Calculated brain volumes of hippocampal regions, ventricular regions, whole brain, and white matter hyperintensity lesions proved comparable between low-field scanners with super-resolution and high-field scanners.
In imaging, numerous methods exist for forward encoding objects. X-ray computed tomography typically uses centric or radial encoding, while MRI employs Cartesian, spiral, and radial encoding—methods chosen
because of familiarity with Fourier transforms. But better approaches might exist. Since AUTOMAP can learn to convert arbitrary forward encoding, Rosen suggested it could develop new forward encodings for imaging applications. “Maybe we could change the world or cut scan time by 50 percent or better,” he said.
Drawing inspiration from Google DeepMind’s approach to teaching AI to play Tetris—not by teaching rules but by providing game scores and asking for maximization—Rosen’s team applied model-free reinforcement learning to forward encoding used in MRI scanner image acquisition. The team built an MRI simulator with an AI agent tasked with comparing ground truth digital phantoms to their encoded and inverse-transformed results. This approach yielded interesting slice selection methods and led to experiments where AI algorithms designed new effective pulse sequences for nuclear magnetic resonance.
Rosen ended his talk stating that MRI is now clearly possible in the low-field, millitesla regime through combining physics, computation, and deep learning. AUTOMAP provides a unified reconstruction framework trained on forward models, learning to solve arbitrary inverse problems while boosting SNR and image quality. Super-resolution plus segmentation can provide accurate quantitative morphological measurements, and his group has developed AI-discovered pulse sequences for quantitative magnetic resonance.
Roozbeh Jafari, principal staff member in the Biotechnology and Human Systems Division at the Massachusetts Institute of Technology’s Lincoln Laboratory, offered a case study of blood pressure modeling using physics-informed neural networks (PINNs) and ways to consider trustworthiness of the model.
He began by talking about digital twins, which are virtual representations of the human body or parts of it. They get information about the body through sensors and then model what is happening in the body with such things as ML algorithms. They can be used to predict what is going to happen. They are similar to simulations, Jafari said, but, unlike simulations, they have to remain tightly coupled with what they are representing through a steady flow of information.
Referring to the saying that “all models are wrong, but some are useful,” Jafari shared the following key questions concerning a digital twin: “What level of granularity are we going to use? How are we going to construct the models? How do we make sure that our models are actionable? How can we make sure that the models provide trustworthy information to us?”
Jafari believes that the models used in digital twins should be built from images or other high-fidelity data. In building a model, Jafari said, one problem that often arises is that one can understand the physics of the model but cannot really estimate all the hidden parameters of the model. One approach is to use scientific machine learning (SciML), a type of ML that, instead of learning the patterns in data, learns the laws of physics. “Can I use SciML to serve as a proxy for this model?” he asked. “That’s what this talk is going to explain.”
By learning the rules of physics, SciML models have several advantages over the more typical AI that focuses on patterns in data, Jafari said. For one, they can make accurate predictions in situations where not enough data exist for standard AI models to do so. SciML models establish physical consistency and constraint satisfaction, making it possible to quantify uncertainties. Additionally, SciML models are generally interpretable and explainable, and are often generalizable and robust. They also have various properties that make a model more trustworthy, such as helping with data quality and bias mitigation as well as domain-specific validation and benchmarking.
The premise of a PINN, which is a type of SciML, is that it incorporates physical laws in learning from data, Jafari said. Thus, unlike a convolutional neural network, when a PINN sets out to find the best curve that explains data, it is also constrained so that the curve obeys physical laws.
In the remainder of his talk, Jafari spoke about his team’s work to develop a digital twin that captures blood flow and pressure in an individual. The system consists of a wearable sensor that captures blood flow–related
data from an individual plus a PINN that is knowledgeable about various physics equations describing the flow of blood through the bloodstream. In particular, Jafari characterized his team’s contributions as personalizing the model to better reflect the characteristics of different individuals, building interpretability into the model (and thus increasing its trustworthiness), and developing ways to determine uncertainty in the data and the model.
As context, Jafari said that blood pressure is the “Holy Grail in medicine.” It is the most important metric that cardiologists use, he added, and it can provide important information about risks to cardiovascular health. Jafari then discussed some basic information about blood pressure, beginning with the fact that the heart pumps blood through the body at a particular rate and a particular volume per beat, called the stroke volume. The product of the rate and stroke volume is the cardiac output—or, in essence, the power of the heart.
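As a point of reference (the numbers below are typical textbook resting values, not figures from the talk), the relationship can be written as:

```latex
\mathrm{CO} = \mathrm{HR} \times \mathrm{SV}
\approx 70~\tfrac{\text{beats}}{\text{min}} \times 70~\tfrac{\text{mL}}{\text{beat}}
\approx 4.9~\tfrac{\text{L}}{\text{min}}
```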
Artery walls have a certain stiffness that determines how much resistance the arteries have to the blood pumped from the heart. The blood pressure is a function of this stiffness. Ideally, one would like to be able to measure the stiffness of the arterial wall directly, but in practice one measures blood pressure, which, when graphed over time, results in a waveform where the minimum is the diastolic blood pressure and the maximum is the systolic pressure. Measuring this waveform is challenging, Jafari said, but the waveform contains information that would be diagnostic.
The blood pressure waveform can be approximated by the so-called Windkessel model, which is analogous to an electric circuit model in which an artery’s resistance to accepting blood corresponds to electrical resistance and the artery’s compliance corresponds to electrical capacitance. Using the Windkessel model, one can write and solve equations representing blood flow for a general reference physiological system. However, for the model to be actionable, Jafari said, it needs to be personalized, with its parameters corresponding to the arterial features of a given individual.
Jafari’s team uses two different variants of the Windkessel model to model blood pressure. The two-element model uses arterial compliance and peripheral resistance, while the three-element model adds characteristic impedance. Both models have equations that are essentially identical to the physics equations describing electrical flow in a circuit with capacitance, resistance, and, in the latter case, impedance.
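For reference, the standard textbook forms of these equations (not reproduced from Jafari’s slides) relate the inflow Q(t) to the pressure P(t) through the compliance C, the peripheral resistance R, and, in the three-element case, the characteristic impedance Z_c:

```latex
% Two-element Windkessel
Q(t) \;=\; C\,\frac{dP(t)}{dt} \;+\; \frac{P(t)}{R}

% Three-element Windkessel (adds the characteristic impedance Z_c)
\left(1 + \frac{Z_c}{R}\right) Q(t) \;+\; C\,Z_c\,\frac{dQ(t)}{dt}
  \;=\; \frac{P(t)}{R} \;+\; C\,\frac{dP(t)}{dt}
```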
To illustrate how his team uses the Windkessel model to set up a PINN, Jafari provided an example using the three-element model. The model incorporates a physics-based penalty into the loss function and takes into account some characteristics of the participants—age, sex, height, and weight. The loss function has two components, he said. One is related to the data, “because you’re using data to train the network and you try to minimize the error,” while the other is related to the physics; “you optimize your neural net based on the objective function.”
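A minimal PyTorch sketch of how such a loss might be assembled is shown below, assuming (for illustration only) a two-element Windkessel residual as the physics term and a small fully connected network; the architecture, feature set, and learnable R and C are hypothetical stand-ins rather than Jafari’s implementation.

```python
import torch
import torch.nn as nn

class PressureNet(nn.Module):
    """Maps time and bioimpedance-derived features to a pressure estimate, with
    personalized Windkessel parameters learned alongside the network weights."""
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + n_features, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )
        self.R = nn.Parameter(torch.tensor(1.0))   # peripheral resistance (per person)
        self.C = nn.Parameter(torch.tensor(1.0))   # arterial compliance (per person)

    def forward(self, t, x):
        return self.net(torch.cat([t, x], dim=-1))

def pinn_loss(model, t, x, p_measured, q_flow, lam=1.0):
    """Data misfit plus the two-element Windkessel residual C dP/dt + P/R - Q."""
    t = t.clone().requires_grad_(True)
    p = model(t, x)
    data_loss = torch.mean((p - p_measured) ** 2)
    dp_dt = torch.autograd.grad(p.sum(), t, create_graph=True)[0]
    residual = model.C * dp_dt + p / model.R - q_flow
    physics_loss = torch.mean(residual ** 2)
    return data_loss + lam * physics_loss
```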
The data the team uses in the model are collected via bioimpedance, with four-point sensors the team has developed injecting current into the skin and measuring the impedance. This makes it possible to measure blood pressure not only in the arteries but also in the capillaries. To use the model with an individual, the team induces changes in blood pressure by having the person exercise, for instance, or by putting a hand in an ice bucket to cause vasoconstriction. The team also continuously measures the person’s overall blood pressure using a Finapres device, which provides continuous measurements.
To train and then test their PINN, the team members worked with systolic blood pressure, which, in the example Jafari showed, ranged from 100 to 140 mmHg as the person’s blood pressure was manipulated. To train the model, they used only data from when the blood pressure was relatively high (above around 125 mmHg) or relatively low (below 110 mmHg). “The assumption was that if the model learns the physics of it, it should do a good job for the middle, too, because it looked at the two extremes,” Jafari explained. And the model did perform well when tested on data from the middle, having a smaller error than a standard convolutional neural network. “What is more important,” Jafari added, “is that it works better for extreme conditions, and that’s because we learned the physics of it.”
In situations involving disorders, chronic stress, or injury, such as hemorrhaging from a combat injury, blood flow goes into a different mode. In the case of a wound, for instance, blood is lost and cardiac output goes down, but then the heart begins to compensate, blood vessels constrict, the heart rate increases, and blood pressure goes back up. And what really remains elevated is the peripheral resistance, Jafari said. “That’s the metric that I want to measure,” he said. “I do not want to measure blood pressure because blood pressure is not giving me causation; it is giving me correlation.”
The approach that Jafari’s team has developed makes it possible to measure such things as peripheral resistance and arterial compliance. “In terms of understanding the interpretation of this,” he said, “you could look into things like the characteristic impedance with the blood pressure in the presence of a specific maneuver” such as applying cold or pain. “You would be able to characterize this,” he asserted.
And the interesting part of this, Jafari added, is that when the model is working well, the confidence on the physics side is going to be higher, which provides an indication of the trustworthiness of the model. “We need more work to be done on this topic,” he said, “but we are going in the direction of showing some of the confidence coming directly from the physics of the neural net.”
In closing, Jafari listed several things that still need to be done. He would like to model the different sources of noise, such as noise in the bioimpedance, and to have a better understanding of the uncertainties associated with estimating the parameters and with the hemodynamics. He would also like the PINN to be less restrictive—for example, able to capture how the parameters change from beat to beat.
Then, offering some takeaway messages, he said that real-time updating of the virtual representation for the digital twin is a key feature of his work. Prediction is a critical feature for digital twin applications and real-time decision support. Traditional data-driven AI models struggle with extrapolation beyond observed data and may violate physical constraints. And SciML and PINNs play a critical role, ensuring that digital twins remain trustworthy, physically consistent, and data-efficient.
Heidi Hanson, senior scientist and group lead of biostatistics and biomedical informatics at Oak Ridge National Laboratory, discussed the sources of uncertainty in indoor radon exposure estimates and what that uncertainty means for radiation epidemiology. As context, she said that radon exposure is the second leading cause of lung cancer in the United States, but it is not factored into any recommendations for patients to undergo screening, in large part because it is so hard to measure exposure to radon.
In particular, she said, her focus would be on the data used to estimate indoor radon exposure as well as on feature engineering and how that might affect exposure estimates. Currently, she said, radon estimates are based mainly on the location of people’s residences, since much of an individual’s radon exposure comes from radon gas leaking up from the ground into houses, with basements typically having the greatest radon levels. However, Hanson noted, information on radon levels within homes is not available across the entire United States, so developing models for estimating this residential exposure is necessary. Unfortunately, such models are based on data with a lot of error. “So, yes, I can do very cool things,” she said. “I can do machine learning algorithms [and] I can get awesome answers, but are those awesome answers right?”
Hanson talked about two different datasets. One has point-level data for approximately 60,000 residences in the state of Utah. All of the measures are pre-mitigation—that is, before anything was done to lessen the amount of radon in the home—and they were all taken from the basement or bottom level. For this dataset, the data are considered point-level because the actual address where each test was taken is available.
The second dataset contains individual-level estimates with zip code for approximately 720,000 residents of Pennsylvania. These data are also pre-mitigation and taken from the basement or bottom level of a home. The
main difference from the Utah dataset is that while the individual readings were taken from homes, only the zip codes are available (not the addresses for those homes). The fact that the exact point locations are not available leads to problems, Hanson said.
Several factors can lead to uncertainties in the modeling of radon exposure, Hanson said, including such familiar factors as temporal variation, spatial variation, and measurement variation. But, she continued, she focused on the choices that are made in geoprocessing the data for a model and how those choices lead to uncertainty as well.
In creating a prediction model, Hanson explained, one needs to use various sorts of data, such as soil concentrations of radon, information about housing characteristics, information about the water table, elevation, and so on. Unfortunately, she said, these data come in many different shapes and sizes: “So I have a really big problem stacking different polygons—not everything comes into the point location.” To deal with that, people generally aggregate up to the lowest common unit. For instance, since the exposure estimates in Pennsylvania are available only at the zip code level, she would take all of the more detailed information and aggregate it to the zip code.
Using the Utah dataset to illustrate, Hanson began by explaining how she does spatial indexing. The dataset has point locations for every house with a radon measure, and those appear on a map as individual dots. However, she said, she does not have similar location information for all of her data, so to get her data to a common unit she overlays a hexagonal grid over the entire space. In that way, she said, “I can always use the lowest level of information that makes sense and tie everything to one of those hexes. . . . I’m scaling up and down depending on the dataset that I’m using, but these grids allow me to do this. So this is truly just a data engineering trick.”
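A bare-bones sketch of this kind of hexagonal spatial indexing follows, assuming projected planar coordinates and a pure-Python hex lookup; production pipelines typically use a dedicated library such as H3, and the cell size and sample readings here are invented.

```python
import math
from collections import defaultdict

def hex_index(x, y, size):
    """Map a planar point (x, y) to the axial coordinates of the containing
    pointy-top hexagon with circumradius `size` (standard cube rounding)."""
    q = (math.sqrt(3) / 3 * x - 1.0 / 3 * y) / size
    r = (2.0 / 3 * y) / size
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)

# Aggregate point-level radon readings (hypothetical values) to hex cells.
readings = [(425300.0, 4512800.0, 4.1), (425410.0, 4512950.0, 2.7)]  # x, y, pCi/L
cells = defaultdict(list)
for x, y, value in readings:
    cells[hex_index(x, y, size=5000.0)].append(value)
hex_means = {cell: sum(v) / len(v) for cell, v in cells.items()}
```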
When it makes sense, Hanson continued, she can aggregate multiple hexagons getting values for the new combined hexagons by using weighted averages of the values of the individual hexagons. Such aggregation has various weaknesses, and Hanson described one technique—population masking—that can be used to mitigate some of those weaknesses. For example, she said, there are various places in Pennsylvania where no one lives, so no home radon test values are available in those areas. So if she wants to run an analysis comparing the characteristics of the soil—particularly the percent of clay—against home radon values, she does not want to include soil information for those areas where people do not live. The answer is to take the map of soil characteristics across Pennsylvania and then “population mask” it. “I mask out everything that has no possible chance that I’m going to get indoor home radon test value from that area,” she said. “Because we use this spatial indexing, we can easily stack and get the information that we want across all the different layers that we’re putting in.”
Aggregating spatial data makes it impossible to see all of the heterogeneity in the data, Hanson noted, and it also hides some of the bias. For instance, in Salt Lake County, Utah, areas of lower socioeconomic status do not generally have many home radon detectors, so little or no data exist in these areas. More generally, she said, she knows her data have various issues that she may not be able to solve, but because she does not want that to stop her from learning more about her data and creating better models, she works with what she has. But to do that, she needs to characterize the uncertainty in the data, which she does with what she called a triangulation method. “I know there are things I’m not going to be able to correct,” she said, “but I want to use multiple models so I can have an in-depth understanding of what I’m really looking at.”
Her triangulation method involves looking at three things—an average model, a quantile regression model, and a volatility model—and combining the results to get a better idea of what is happening in her data. She began by describing an average model at the zip code tabulation area (ZCTA) level, noting small differences between ZCTA and zip code that were not important for the discussion. Using random forests, she models the average radon concentration and the volatility of the estimates and computes the permutation importance of the top 10 variables for the average model. Not surprisingly, the model feature that ranked highest in importance was soil permeability.
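A schematic version of the average and volatility models is sketched below, using scikit-learn random forests and permutation importance on an invented ZCTA-level feature matrix; the features and targets are placeholders, not Hanson’s data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Invented ZCTA-level features (soil permeability, clay fraction, housing age,
# elevation, ...) with a mean radon target and a volatility target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
y_mean = 2.0 + 1.5 * X[:, 0] + rng.normal(scale=0.5, size=500)          # average radon
y_vol = np.abs(0.3 + 0.4 * X[:, 1] + rng.normal(scale=0.1, size=500))   # volatility

X_tr, X_te, ym_tr, ym_te, yv_tr, yv_te = train_test_split(X, y_mean, y_vol, random_state=0)

avg_model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, ym_tr)
vol_model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, yv_tr)

# Permutation importance ranks the features driving the average model.
imp = permutation_importance(avg_model, X_te, ym_te, n_repeats=20, random_state=0)
top_10 = np.argsort(imp.importances_mean)[::-1][:10]
```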
Showing a map of Pennsylvania indicating the average radon levels and the coefficient of variation of radon concentration by ZCTA, Hanson pointed out those areas with very high variance. If one only looked at average levels, one would miss some of the places with high levels of radon exposure in areas smaller than the ZCTA. “So, by averaging I really am missing some things in my data,” she said. “That is concerning if I’m trying to come up with an estimate that I want to use for precision health.”
Examining her variability model, Hanson then identified the importance of the top 10 variables in explaining the model’s variability. This helps one predict where the most variation will take place. “This is important,” she said, “because it can tell me what areas I should not trust when I’m looking at those average level predictions.”
Then she noted that by using a quantile regression forest—so now predicting the distribution rather than just looking at the average—she gets even more and different insights into the data. One takeaway, she said, is that one gets very different results when looking at the fiftieth percentile than when looking at the average. “So when I want to know what somebody in a zip code has been exposed to and incorporate it into the clinic,” she said, “I should be very cautious when there’s a lot of variation in my data.”
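Because scikit-learn does not ship a quantile regression forest, the sketch below uses quantile gradient boosting as a stand-in to convey the general idea of predicting a distribution rather than a single average; the data are again invented.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 12))                        # placeholder ZCTA features
y = 2.0 + 1.5 * X[:, 0] + rng.gamma(2.0, 1.0, 500)    # skewed, radon-like target

# One model per quantile; together they describe the predicted distribution.
models = {q: GradientBoostingRegressor(loss="quantile", alpha=q, random_state=0).fit(X, y)
          for q in (0.1, 0.5, 0.9)}

median = models[0.5].predict(X)                             # fiftieth percentile
spread = models[0.9].predict(X) - models[0.1].predict(X)    # flags where to be cautious
```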
Moving on from discussing errors in concentration, Hanson briefly touched on errors in measuring exposure. No one spends all of his or her time at home in the basement, she noted, so getting a better idea of individuals’ exposure based on their movements throughout the day is important. Her laboratory is working on some ways that agent-based models can be used to simulate this. “This just helps me understand the exposure part of this at a much more detailed level than residential location,” she said.
Jie Yu, head of digital and data science product management at Johnson & Johnson, described his company’s efforts to use AI in building a more resilient global healthcare supply chain. His talk addressed how his company has developed an AI model-based platform to quantify different types of healthcare supply chain risks and simulate various scenarios of risk propagation and impact throughout the entire supply chain network—from raw materials through intermediates to finished products.
Since the COVID-19 pandemic, Yu said, the healthcare sector and most other sectors have experienced many challenges in the global supply chain network. For instance, a study Johnson & Johnson did with the consulting firm KPMG found that more than 75 percent of global healthcare organizations report having experienced supply chain disruptions over the past few years. Furthermore, Johnson & Johnson has estimated that a global or regional supply chain disruption can cost a company of its size roughly $500 million to $1 billion. Yu described the global events contributing to supply chain disruptions over the past few years.
Thus, improving the global supply chain has become a priority at Johnson & Johnson, Yu said, and over the past few years the company has developed a cross-sector, cross-functional global initiative to try to build a more resilient supply chain through the use of digital, AI, and ML capabilities. To do that, Yu continued, the company is trying to build more dynamic, nimble, and digitized intelligent solutions to proactively manage risk factors that may affect the global healthcare supply chain system. Johnson & Johnson chose to take advantage of current AI and ML technologies to build a system that can examine supply chain risk factors and vulnerabilities and make recommendations about how to mitigate those risks, not only to the company’s different business units but also to the upstream suppliers, vendors, and external manufacturers in its network.
He then detailed how Johnson & Johnson is doing this, beginning with getting a clear idea of exactly what risks the company might be facing in case of supply chain disruption. To do this, it looks at “adjusted value at risk,” which Yu described as “a powerful measure of vulnerability at all levels—node, product, portfolio, [and] network.” The adjusted value at risk can be thought of as the product of the residual risk exposure (i.e., the likelihood of something happening) and the value at risk. Each of these takes into account a variety of factors. For instance, to estimate the residual risk exposure, one looks at geographic factors such as global security and climate issues as well as supplier factors and factors internal to Johnson & Johnson. To calculate the value at risk, one considers the time to recover, the time to survive (essentially how long the inventory on hand would last in case of a disruption of supply chains), and the effect on sales. These two factors, the residual risk exposure and the value at risk, are multiplied to get the adjusted value at risk, which, Yu said, “is a quantitative measure of the vulnerability of your entire supply chain network.”
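A toy numerical illustration of that product follows; all figures are invented, and the way value at risk is derived from time to recover, time to survive, and sales is a simplification of what Yu described rather than the company’s formula.

```python
# All figures are invented; the value-at-risk decomposition is a simplification.
residual_risk_exposure = 0.12      # likelihood of a disruptive event at this node
time_to_recover_weeks = 26         # weeks to restore normal supply
time_to_survive_weeks = 8          # weeks the on-hand inventory would last
weekly_sales_at_risk = 3.0e6       # USD of sales exposed per week of outage

# Sales lost during the gap between exhausting inventory and recovering supply.
value_at_risk = max(time_to_recover_weeks - time_to_survive_weeks, 0) * weekly_sales_at_risk

adjusted_value_at_risk = residual_risk_exposure * value_at_risk   # 0.12 * 54e6, about $6.5M
```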
Yu next described the digital platform that Johnson & Johnson has built in this initiative. It has three basic parts, he said. The first looks into value streams—essentially the materials from raw materials to finished goods and then on to the distribution network and the customers. These value streams are captured in digitally rendered maps, he said, “so we have all [those] structured and unstructured data to be able to use for the next level of AI and machine learning and data science modeling work.”
The core part of the platform, Yu said, is a vulnerability detection technology that uses an AI and ML model to quantify the vulnerabilities. On top of that, he added, “we do different kinds of scenario simulations to simulate when different disruptive events occur and what kinds of consequences they might have.” Finally, the company develops risk mitigation plans based on the different scenarios that have been simulated and develops prescriptions on which investments to prioritize and sequence.
Showing a complicated diagram of the architecture of the system Johnson & Johnson created to do this work, Yu offered a few high-level observations. Many different data types are incorporated, he said, including the company’s internal data on manufacturing and operations, procurement, and so on. The company gets extensive market data from external vendors to get a macroeconomic view of the risks to the company from supply chain disruption, and it also looks at climate and environmental data regarding risk to the supply chain.
A graph database model is structured to mimic how the information in Johnson & Johnson’s databases flows from raw materials to intermediates to finished products, and simulations are run on that graph database model. The simulations are designed to understand the risks of supply chain disruptions and to test various mitigation strategies. These simulations look for how to minimize the impact of any disruptions on the company’s supply chain, both from suppliers and to customers.
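The sketch below conveys the general idea of Monte Carlo disruption propagation on such a graph, using networkx and a toy three-tier network; the node names, probabilities, and values are invented, and the propagation rule (a disrupted node affects everything downstream) is a simplification, not the platform’s logic.

```python
import random
import networkx as nx

# Toy three-tier supply network: raw materials -> intermediate -> finished product.
G = nx.DiGraph()
G.add_edges_from([
    ("raw_A", "intermediate_1"), ("raw_B", "intermediate_1"),
    ("intermediate_1", "product_X"), ("raw_C", "product_X"),
])
value_at_risk = {"product_X": 5.0e6}                    # value tied to the finished good
disruption_prob = {"raw_A": 0.10, "raw_B": 0.05, "raw_C": 0.02,
                   "intermediate_1": 0.01, "product_X": 0.0}

def simulate_once():
    """One Monte Carlo draw: disrupt nodes independently, then propagate downstream."""
    hit = {n for n, p in disruption_prob.items() if random.random() < p}
    affected = set(hit)
    for n in hit:
        affected |= nx.descendants(G, n)                # everything downstream is affected
    return sum(value_at_risk.get(n, 0.0) for n in affected)

losses = [simulate_once() for _ in range(10_000)]
expected_loss = sum(losses) / len(losses)
```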
In closing, Yu described Johnson & Johnson’s innovation in moving from the traditional supply chain network analysis, which was done in a more manual and qualitative fashion, to a much more systematic and quantitative approach. This makes it possible, he said, to manage the different risks predictively and to design corresponding plans for how to mitigate those risks in a predictive and proactive manner. “And,” he added, “the whole process is automated as well, with all the automated datalink and data pipelines to flow the data from one end to the other end.”
He finished with a list of next steps for Johnson & Johnson. The company is improving its technology in order to, for instance, support more automated scenario modeling and simulation, develop and scale a resilience investment recommendation model, and develop new dashboards and user interfaces. The company also wants to build expertise in supply chain resilience by, for instance, creating a dedicated communication portal for training, awareness, and internal and external communications. And the company is working to integrate the various elements in the whole supply chain network to bring the individual components together so that it can respond in a coordinated way to any future supply chain disruptions.
The discussion opened with Obcemea asking about uncertainty quantification in the models the speakers had presented, a question that revealed the complex nature of assessing AI reliability in medical applications. Rosen acknowledged that determining whether his low-field MRI was “good enough” for clinical practice fell outside his expertise, but he explained that his AUTOMAP reconstruction approach offered ways to parameterize accuracy using local Lipschitz uncertainty methods. He stated that his team had confidence in the
reconstructions’ accuracy regarding radiological concerns.
While his team’s quantitative morphological analysis work remained too new for extensive uncertainty quantification studies, Rosen explained that the team could evaluate performance against ground truths. The data consistently aligned with these ground truths across the team’s studies, though he acknowledged potential limitations: “If we have some very large clinical outlier, will it break? Maybe . . . . If someone had a massive surgical resection, will we get the wrong answer? Maybe. We haven’t looked at those cases yet.”
Rosen contextualized his work within broader healthcare challenges, pointing to new Alzheimer’s drugs that showed promising results but required frequent MRI monitoring for side effects. The potential patient population for these drugs far exceeded global MRI capacity for monthly or bi-weekly screening, creating an ideal application for low-field MRI technology.
Jafari explained that uncertainty in his sensor and PINN blood pressure modeling could arise from multiple sources: data noise, missing data, or model inability to handle specific data types. Different approaches exist for addressing these issues, including examining noise in data sources through available models and analyzing latent space distributions in neural networks to identify unfamiliar or overly noisy distributions. His work’s specific goal involved learning physics and physiology laws through hemodynamic monitoring data to understand blood pressure. When models deviated from expected patterns—whether due to model changes or data that no longer fit because of noise—this deviation helped identify uncertainty and establish appropriate limits.
The question of how to report uncertainty led Hanson to describe her work using large language models for near real-time disease classification in patients. Her team trained models not only to classify diseases but also to learn uncertainty, enabling predictions that specified what could be trusted versus what required manual review or questioning. She explained her approach as identifying “spaces where there should be more questions or maybe more experimentation on what’s happening in the data.”
Li raised an audience question about the biggest gaps in current regulatory frameworks related to AI uncertainty, prompting responses that highlighted the rapid pace of AI development and its regulatory challenges. Hanson expressed the difficulty of keeping pace with AI developments, noting that she lacked exact answers for necessary policy changes because “I can barely figure out what to do in my own group. Every day a new paper comes out, and every day I am already behind.” She emphasized that not every new development represents genuine signal, with much being noise, leading to challenges in parsing “what is real, what’s fake, what’s shiny, [and] what’s reasonable to do.” Her practical approach focused on thoroughly vetting models before use, and she acknowledged that this represents a practitioner’s reality, not a direct policy answer.
Yu provided an industrial perspective, describing AI-related regulation as a hot topic in healthcare. While healthcare remains heavily regulated overall, AI technology regulation lags far behind actual field development. Johnson & Johnson’s response involves understanding the global regulatory landscape, recognizing that each nation has different evolving approaches. The company heavily engages with government agencies and academia to offer its perspective and help shape policy development affecting its operations.
An audience member’s question about verification and validation practices revealed different approaches across medical imaging, academic research, and industrial applications. Rosen explained that medical imaging is typically built upon currently accepted standards. Newer developments, like 256-channel head coils, demonstrated performance improvements over existing state-of-the-art systems. Reconstruction advances followed similar patterns, with Rosen citing compressed sensing as an example where claims of accurate reconstruction from substantially undersampled data initially caused skepticism despite being theoretically sound. The Food and Drug Administration eventually approved compressed sensing after proper demonstration.
Jafari noted that classic validation involves repeating techniques with new participant groups, though this is not always possible, particularly for academic work. He argued that verification and validation approaches should align with end applications and intended uses. For his blood pressure measurement work targeting hemorrhage detection, validation needed to mimic specific standards examining coronary cases, which required
benchmarks that rigorously examined potential failure points. He suggested increased use of digital evaluations, comparing the approach to validating airplane chips through simulation rather than jumping directly to expensive randomized controlled trials. For image segmentation approaches, phantom testing could occur before reaching clinical trials, offering more efficient validation pathways.
Yu described his industrial project’s two-part validation approach: large-scale simulation for better metric understanding, followed by Monte Carlo simulation when calculating probabilistic information flow and propagation in the supply network. This helped his group understand how metrics matched across different scenarios the group attempted to mimic. His team implemented real-time or near real-time monitoring by building pipelines that incorporated external risk event data from third parties into their internal systems. This enabled near real-time calculation of events as they occurred, with back-validation to assess whether metrics provided reasonable recommendations for different scenarios.
Iglesias, J. E., R. Schleicher, S. Laguna, B. Billot, P. Schaefer, B. McKaig, J. N. Goldstein, K. N. Sheth, M. S. Rosen, and W. T. Kimberly. 2023. Quantitative brain morphometry of portable low-field-strength MRI using super-resolution machine learning. Radiology 306(3):e220522.
Koonjoo, N., B. Zhu, G. C. Bagnall, D. Bhutto, and M. S. Rosen. 2021. Boosting the signal-to-noise of low-field MRI with deep learning image reconstruction. Scientific Reports 11(1):8248.
Lu, Z.-L., T. Hua, C.-B. Huang, Y. Zhou, and B. A. Dosher. 2011. Visual perceptual learning. Neurobiology of Learning and Memory 95:145–151.
Sasaki, Y., J. E. Nanez, and T. Watanabe. 2010. Advances in visual perceptual learning and plasticity. Nature Reviews Neuroscience 11(1):53–60.
Sorby-Adams, A. J., J. Guo, P. Laso, J. E. Kirsch, J. Zabinska, A.-L. Garcia Guarniz, P. W. Schaefer, S. Payabvash, A. de Havenon, M. S. Rosen, K. N. Sheth, T. Gomez-Isla, J. E. Iglesias, and W. T. Kimberly. 2024. Portable, low-field magnetic resonance imaging for evaluation of Alzheimer’s disease. Nature Communications 15:10488.
Zhu, B., J. Z. Liu, S. F. Cauley, B. R. Rosen, and M. S. Rosen. 2018. Image reconstruction by domain-transform manifold learning. Nature 555(7697):487–492.