The second session of the workshop series was held on March 15, 2023. Kimberly Lomis, the vice president of undergraduate medical education innovations for the American Medical Association, served as moderator for the session. To pave the way for a smoother implementation of artificial intelligence (AI) training across the health professions, she said, it is important to explore the social, cultural, policy, legal, and regulatory considerations of AI in health practice and education. This session was designed to explore these topics, with a focus on discussing the combination of computational abilities and human abilities that can best serve patients.
Alison Whelan, the chief academic officer at the Association of American Medical Colleges, was asked to fill in for a speaker who had to cancel suddenly. She opened her remarks by saying that when thinking about AI in health professions education, there are two different aspects to consider. First, students need to be educated to be competent in their practice and prepared to use AI-based technologies in that role. Second, AI can be used to enhance the educational experience of students while they are in school. Whelan provided examples of ways in which AI can be used to improve education:
While AI holds promise in these areas, she said, it also presents ethical, moral, and legal challenges. Whelan then revealed that the examples she had just shared were created by ChatGPT (Chat Generative Pre-trained Transformer), albeit with minor edits. She asked the audience if she had an obligation to disclose this information and whether her obligation might differ depending on whom she was talking to (e.g., patients, students). The way that students and scholars gather information to prepare a speech or a paper has changed dramatically in the past several decades, Whelan said. She spoke about how, years ago, she used card catalogs to find sources and accessed books or journals directly in the library or through interlibrary loans. Soon she had collected file cabinets full of articles and would pull them out as she needed information. Now she has an electronic standing bibliography full of articles that were found via an internet search. Search results and resources such as subscription-based clinical decision-making tools can make it easy to find evidence, but what, she asked, is the obligation of scholars to check beyond the “black box” to look at the evidence and think critically about what is and is not represented? There are many black boxes in medicine, from magnetic resonance imaging machines to lab tests; clinicians understand the inputs and outputs but may not be aware of the algorithms used to transform inputs into outputs, she added.
Whelan closed with a series of questions to the audience. In health care, she said, there has always been an implicit belief that health professionals are subject matter experts and that this expertise gives them the responsibility of making decisions about patient care. With the advent of AI, will health professionals retain this role, or will their role shift? Tools such as smartwatches provide data; will health professionals have the expertise necessary to understand these data and help patients make informed decisions?
Building on Whelan’s questions about the future role for clinicians in the context of AI, speaker Nathaniel Hendrix, a researcher and data scientist at the American Board of Family Medicine, discussed the role
that AI may play in health care moving forward and what this means for clinicians. The cost of health care has increased at a much faster rate over the past 20 years than the costs of other goods and services, he noted. One theory of why, he said, is that other parts of the economy are better able to take advantage of new technologies to improve productivity and that improved productivity can lead to slower growth in prices. In health care, there are many barriers to adopting new technologies, and it can be difficult to increase productivity. Hendrix shared an analogy made by economists William Baumol and William Bowen (1965), who noted that the number of musicians needed to perform a given symphony has not decreased over time (i.e., there has been no increase in productivity), yet the wages of these musicians have gone up. This increase in cost has been absorbed, in part, by turning some of the services of musicians into goods; that is, musical performances have been put on streaming services and made available for purchase. Health care is similar: it takes a certain amount of time and labor to perform its tasks, so reducing costs may require taking some health care services and turning them into goods. This transformation, Hendrix said, may require using AI systems to perform some of the traditional tasks of clinicians.
Hendrix proceeded to discuss different ways that AI can interact with clinicians (Figure 2-1):
Hendrix gave several examples of how these different modes of AI use might look in the clinical setting. An AI system could be used to make a recommendation or to offer an opinion to the clinician. If the AI system and clinician disagree, the decision could be referred to a third party. AI could also be used to identify weaknesses in human clinical decision making and to counteract those weaknesses. AI could be used to make triage decisions for very high-risk or very low-risk patients and, in cases of uncertainty, defer to a human clinician.
One of the major benefits of AI, Hendrix said, is that there is no limit on the number of factors it can use to estimate risk. Studies have shown that clinicians tend to model risk inaccurately; for example, Mullainathan and Obermeyer (2022) found that physicians gave too much weight to age and certain symptoms when assessing the risk of heart attack. Furthermore, many clinicians show “left-digit bias,” in which the first digit of a patient’s age affects their assessment of risk and appropriate care; as
a result, a 39-year-old and 40-year-old would be assessed differently, even though they are almost exactly the same age. To counteract these types of biases, Hendrix said, AI could consider multiple factors and give a numerical prediction, and a threshold would be set at which a certain action would be taken (e.g., screening). The appropriate threshold would vary depending on the characteristics of the condition (Figure 2-2). For a condition in which the benefits of detection outweigh the risks of false positives, the threshold would be low in order to improve the chances of detecting all true positives. For a condition in which the risks of false positives are high, the threshold would be set higher so that fewer people would be tested. It may be appropriate to set thresholds on both sides and have a human clinician make decisions when the risk assessment is in the grey area. Physicians, patients, ethicists, payers, and other stakeholders should play an active role in setting these thresholds and revisiting them as technology advances, Hendrix said.
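The two-threshold logic Hendrix described can be summarized with a brief illustration. The following Python sketch is not from the workshop; the function name and threshold values are hypothetical, and, as noted above, real thresholds would be set and revisited by physicians, patients, ethicists, payers, and other stakeholders.

```python
def triage_decision(risk_score: float,
                    low_threshold: float = 0.05,
                    high_threshold: float = 0.40) -> str:
    """Map an AI-estimated risk score to an action using two illustrative thresholds.

    Scores below the low threshold are treated as clearly low risk, scores at or
    above the high threshold trigger the action (e.g., screening), and scores in
    the grey area in between are deferred to a human clinician.
    """
    if risk_score < low_threshold:
        return "no action (clearly low risk)"
    if risk_score >= high_threshold:
        return "recommend screening (clearly high risk)"
    return "defer to clinician (grey area)"


# Example: three patients with different model-estimated risks.
for score in (0.02, 0.20, 0.55):
    print(f"risk={score:.2f} -> {triage_decision(score)}")
```

Lowering the high threshold corresponds to conditions in which the benefits of detection outweigh the risks of false positives; raising it corresponds to conditions in which false positives are costly.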
Decisions about how to use AI should take into account not only the clinical circumstances but also the algorithm’s accuracy. However, knowing an algorithm’s accuracy is generally not enough information to determine whether it is worth using. Hendrix gave two examples to illustrate this. In the first, an algorithm is very accurate; it catches a few more cases than the clinician, but it largely duplicates the work of the clinician. In the second, the algorithm catches fewer cases than the clinician and misses many of the cases that the clinician identifies, but nearly all of the cases that it does identify are ones that the clinician would have missed. AI recommendations may be most useful when they contradict a clinician, Hendrix said, which creates a challenge for how to communicate an AI recommendation. As clinicians gain experience receiving information from AI, particularly assessments that contradict their own, they may be able to use this feedback to better calibrate their own predictions. It is exciting to think of the ways in which AI can be used to help clinicians continually learn and adapt, Hendrix said.
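To make the point about complementarity concrete, consider a hypothetical numerical sketch; the case counts below are invented for illustration and are not data from the presentation.

```python
# Hypothetical sets of correctly detected cases out of 100 patients (IDs 0-99).
clinician_detected = set(range(0, 70))   # the clinician catches cases 0-69
ai_duplicative = set(range(0, 72))       # very accurate, but mostly overlaps the clinician
ai_complementary = set(range(60, 90))    # catches fewer cases overall, but many are new

def added_detections(ai_detected: set, clinician: set) -> int:
    """Count cases the AI catches that the clinician would have missed."""
    return len(ai_detected - clinician)

print("Duplicative AI adds", added_detections(ai_duplicative, clinician_detected), "cases")      # 2
print("Complementary AI adds", added_detections(ai_complementary, clinician_detected), "cases")  # 20
```

The second, less accurate system contributes far more new detections, which echoes Hendrix’s point that an AI recommendation may be most useful precisely when it contradicts the clinician.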
The effective use of AI by health care professionals requires discernment, and an individual’s ability to distinguish accurate from inaccurate AI-provided advice may vary. Furthermore, clinicians vary in what they want from AI and how they want to interact with it. Hendrix’s research on primary care providers and radiologists found that primary care providers are overwhelmingly concerned with sensitivity and are focused on the evidence base for the algorithm. Around 75 percent are willing to allow AI to make decisions about very low-risk cases without radiologist supervision. Radiologists, on the other hand, have a greater concern for limiting false positives while boosting detection and are more focused on how AI integrates into the workflow. The starkest difference was that none of the radiologists were willing to let AI make decisions without supervision.
Patients also vary in their comfort level with AI. A Pew Research Center survey (Tyson et al., 2023) found that 60 percent of Americans would be uncomfortable with their health care provider relying on AI, although this varied by such characteristics as gender, race, age, and familiarity with AI (Figure 2-3). The same survey found that a majority of Americans (65 percent) want AI to be used in skin cancer screening and that a large majority (79 percent) do not want to use an AI chatbot to support their mental health (Tyson et al., 2023). Hendrix said he believes that to make sure AI can be a force to improve equity, clinicians will need to be skilled in talking to their patients about the benefits and risks of using AI.
From an economic perspective, the hope is not that AI will replace clinicians, Hendrix said, but instead that the technology will help make health care more affordable and more accessible. However, integrating AI into the health care system would require clinicians to adopt a different skillset. The qualities most important for clinicians in the age of AI, he said, would be the following:
Uncertainty is pervasive in medicine, said Alex John London, the director of the Center for Ethics and Policy at Carnegie Mellon University, and how this uncertainty is dealt with is an ethical issue. Advancing the moral mission of medicine entails a duty to learn how to reduce medical uncertainty and to reduce unwarranted variation in practice. There is an ethical duty to make health systems more effective (improving patient outcomes), efficient (good stewardship of scarce resources), and equitable (treating all as moral equals), London added.
Medicine relies on a division of labor and expertise, he continued. Experts produce information on which others, both experts and non-experts, rely. The reliability and accuracy of this information affect the ability of each of these stakeholders to discharge their moral obligations. London said that consumers of information must trust the information producers; such trust is critical to the willingness to rely on and act on that information. While trust builds up the scientific ecosystem, hype and fear can degrade it.
London emphasized that when an ecosystem is inflated by hype or divided by fear, it can become challenging to develop systems capable of performing tasks that will make health systems more effective, efficient, or equitable. Hype and fear can make it difficult to deploy systems or to review and revise systems after deployment. In such a scenario, London said, three potential risks exist: underuse of an effective system, overuse of an ineffective system, and misuse of a system that is only effective for a different task.
London said that the scientific ecosystem, in the context of AI and medicine, is “really unhealthy at the moment.” Efforts to communicate and educate stakeholders about AI often “wind up perpetuating outsized expectations.” For example, a paper by Obermeyer et al. (2019) compared expert systems to an “ideal medical student” and machine learning (ML) to a “doctor progressing through residency.” The paper implied that AI could simply take in information and then apply it in new situations. A better way of thinking about AI, London said, is to think of it as a “family of technical techniques” for learning, in the same way that the toolkit for radiological imaging includes different types of imaging and technologies. There are different ML techniques, each is suited for a different type of task, and each has its own strengths and limitations. It is important to educate and train stakeholders to understand these differences and uses and to get away from the idea that clinicians will be “displaced by a kind of synthetic human in a box.”
An unhealthy scientific ecosystem creates challenges across the life cycle of AI development and deployment, London said, specifying that one of the key challenges with AI in medicine is ensuring that the capability of the system matches a task that is clinically appropriate and that will meet a need of patients or health systems. London offered two examples of AI-based technologies with capabilities that did not end up being useful in a clinical or research setting. First was the IBM Watson system, which was initially designed to answer questions on the game show Jeopardy. There were outsized expectations of Watson, and one of its developers cautioned that Watson was engineered to predict correct answers for trivia; it was not “an all-purpose answer box ready to take on the commercial world.” MD Anderson Cancer Center partnered with IBM Watson to create an advisory tool for oncologists, London said, but the project was scrapped in 2016 after $62 million had been spent. Recently, Watson Health was “sold for parts,” with a private equity firm buying some pieces for more than $1 billion. London emphasized that this was not a “mom and pop” enterprise that failed due to a lack of resources; there were 7,000 employees involved in Watson Health at its peak. It was described by the media as a “total failure that they needed to just cut their losses and move on.” This is an example, he said, of a major corporation trying to make AI work in medicine that failed in part because of a mismatch between the capabilities of
the system and the task it was assigned. London’s second example was more recent: in 2022, Meta created a system, dubbed “Galactica,” to synthesize the scientific literature. The program lasted only about 3 days after its release, partly because it produced racist language and partly because it fabricated evidence and citations. This is another example of a mismatch between the capabilities of a system and the job it was given.
Even if an AI system’s capabilities are well matched with its tasks, London said, the data in the system must also be fit for purpose. Many health care data are collected for purposes other than research; for example, electronic medical record data reflect billing, administrative, and clinical purposes. Researchers may need data that are more granular, collected more frequently, or that cover a broader set of variables. There is a need, he said, for better awareness of the value and limitations of the available data relative to the data that are needed. With this awareness in hand, stakeholders can evaluate the prospects for implementing AI in particular spaces.
The development, testing, and deployment of new AI technologies are similar to those of any other innovation, London said. Just as most drugs in development do not end up working, most new AI systems will likely not end up working. However, the system for validating AI systems is relatively immature compared with the robust drug approval system. There are few prospective clinical trials of AI interventions, and the performance measures used to validate systems are often not clearly linked to meaningful clinical outcomes. A study of AI-based medical devices approved by the U.S. Food and Drug Administration found that none of the high-risk devices were evaluated by prospective studies and that a majority of approved devices did not include a publicly reported multi-site assessment in their evaluation (i.e., the evidence base for the device was derived from a single institution) (Wu et al., 2021).
In closing, London shared four discussion takeaways on the current landscape and potential of AI-based technologies:
London concluded that there is room for improvement in our evaluation of AI-based technologies and that this improvement would require a broader education among stakeholders about the strengths and limitations of AI systems.
Following the presentations, Lomis moderated a question-and-answer session among the panelists.
As AI moves into the clinical space, Lomis said, the role of the clinician will shift from a “steward of knowledge” toward a curator or translator of knowledge. This shift has already occurred to some extent with the explosion of information available to patients, she added. Whelan agreed that the integration of AI is not an entirely new phenomenon but instead a step on the continuum of how knowledge is accessed and used. AI can be a tool for managing knowledge and helping to improve the care of all patients. The issue, she cautioned, is whether the data that feed into AI systems are inclusive of the entire patient population. Lomis responded that this was unlikely, noting that bias already underlies existing datasets and cautioning that AI may run the risk of amplifying that bias. Whelan added that since algorithms are created by people, they often reflect the biases of their creators; having diverse teams generate algorithms may be one way to mitigate such risks, assuming that different groups are represented on the team. London underscored how the incompleteness of representation in datasets reflects unequal access to health care, which is a difficult problem to overcome. At the same time, there is a push to use AI to achieve a more equitable health system, yet AI systems are built on inequitable data. It is a “chicken and egg problem,” he said. In addition to inequitable access, London said, even some of the sensors used to collect data are systematically biased. For example, pulse oximeters, which measure peripheral oxygen saturation, work better on lighter skin (Mantri and Jokerst, 2022). Bias “goes very, very deep in medicine,” he said. London asked participants, “If clinicians do not even have the technologies to accurately collect data on different patient populations, can AI systems ever be unbiased?”
The most successful clinicians are those who have been taught how to deal with variability and tolerate uncertainty, Carole Tucker, associate dean of research at the University of Texas Medical Branch, said. AI may add to the uncertainty that clinicians must deal with, for example, when an AI recommendation contradicts the assessment of the clinician. Hendrix said there is a deep-seated desire to be certain, particularly when making decisions that affect people’s health, and that part of the role of clinicians is to manage expectations and communicate clearly with patients about their level of certainty. He added that rather than acting as authoritative experts,
clinicians need to share the process of decision making and do the best they can with the information they have. Hendrix further commented that one role for clinicians will be detecting when AI systems are working well and when they are not; this requires good clinical judgment. However, if AI systems are taking on some of the clinical decision making, this reduces the opportunities for clinicians to develop their clinical judgment. One solution, he said, might be to take AI “off autopilot” in order to create intentional opportunities for developing and retaining clinical judgment. Sanjay Desai, the chief academic officer at the American Medical Association, made an analogy to driving, saying that he would not trust his teenage children in a self-driving car because they have not yet developed the judgment for when they should take over the wheel. However, once the technology improves, the self-driving car would likely be far safer than a car piloted by a human driver. Likewise, the technologies in health care are still immature, Desai said, encouraging clinicians and AI to work together rather than letting one take the lead.
Lomis noted that the competencies suggested for future health care professionals who will be working with AI differ from the competencies historically considered important for health care practice; for example, speakers identified numeracy skills, comfort with data, and the ability to accept input from other sources. At the same time, speakers talked about how AI may free up clinicians so they can focus on the interpersonal side of health care. Lomis commented that strong data skills and strong interpersonal skills are not typically found in the same person, and she wondered whether this has implications for the recruitment of learners. Whelan responded that the most important attributes for learners are a growth mindset and the ability to work in a team of people who have different competencies, interests, and skills. The team of health care professionals will expand, she said, to include people who have not traditionally been part of the team (e.g., engineers). Whelan added that health care professionals will need to learn what other professionals bring to the table, be comfortable working with others, and be able to speak the same language. This will require giving students and practitioners many opportunities for interprofessional education and for interfacing with other team members and technologies.
Can AI be plugged into our existing system, Lomis asked, or is there a need to rethink structures and processes? Hendrix answered with a
metaphor: “You can’t take a Rolls-Royce engine and put it in a Toyota and expect the Toyota to drive better.” In other words, Hendrix explained, simply providing better information to clinicians without making it interpretable or actionable is likely to be a burden rather than a benefit. Clear workflows need to be developed for the integration of AI; when new technologies are introduced, it will be important to consider how they will affect decision making and action, and who specifically will be affected.