The ways in which artificial intelligence (AI) technologies are used will influence the future impact of AI data centers on electricity use and computing resource demands. Benjamin Lee, University of Pennsylvania, moderated a session examining the trajectory of AI, taking a holistic view of considerations related to data processing, training, and inference.
The panel’s two speakers, Eric Xing of Carnegie Mellon University and Prakhar Mehrotra of Blackstone, briefly described their respective backgrounds before launching into a discussion moderated by Lee. Mehrotra’s current work focuses on bringing AI capabilities to the more than 200 companies in Blackstone’s portfolio; previously, he worked in applied AI at Walmart, Uber, and Twitter. Xing’s work spans fundamental AI research and AI applications. He spoke about his work to improve the embodied and counterfactual reasoning of multi-agent models, with the goal of creating AI systems capable of complex inference and reasoning beyond language. Such capabilities will be needed to advance the application of AI to fields such as biology, physics, and social behavior, where features such as biological structures, the physics of liquids, or the nuances of gaming behavior are not captured in words. He also highlighted the importance of working to reduce the energy costs of running AI data centers but noted that energy use is often secondary to the drive to rapidly produce and improve AI models.
The panelists discussed the capabilities, opportunities, and costs of large language models (LLMs);1 factors that could affect future AI use cases, such as latency, hallucinations, and energy needs; and considerations around open-source models and data sharing. As context for the discussion, Mehrotra noted that LLMs have evolved several times, from feature engineering to neural networks to transformer architectures, to arrive at their current status as a fundamental building block for many AI tools. Because today’s LLMs are pre-trained and broadly accessible, they create a level playing field for using AI without requiring the technical expertise needed to understand how they work or to build a model from scratch. Xing added that the architecture and design of LLMs have been largely stable for the past several years. Advancements have made the models much larger and more powerful, but there are still limitations around tokenization, embedding, attention, training loss, and algorithm design. In addition, because their inferencing capabilities depend on memorization and autoregression, he noted that the returns relative to cost will diminish as sources of textual data are exhausted.
Looking forward, Mehrotra suggested that the next generation of AI technologies is likely to bring LLMs that are smaller, more domain-specific, and more accurate, potentially fueling paradigm shifts across many fields. Lee wondered if specialized, domain-specific LLMs might be more energy efficient than large, general-purpose models, and Mehrotra replied that the answer is unknown. Businesses will need domain-specific LLMs, but there is not yet a cost-effective pathway to create and continually train them. General-purpose LLMs are still in their early stages and will need at least a few more years to provide adequate accuracy, and while a consumer using an LLM for personal or entertainment purposes can accept wrong answers, a business cannot. “The only metric that matters in the enterprise setting is how good your results are,” Mehrotra stated.
Lee asked what additional costs might be associated with domain-specific training for LLMs. Xing replied that domain-specific models have very different economics than large-scale LLMs with their trillions of parameters, which can be challenging to fine-tune. Smaller models, with only billions of parameters, are easier to fine-tune and may not need to rely on existing LLMs. For example, specialized models can regain the performance of LLMs with an architecture that is continuously incrementable and growable, enabling additional components to be harmonized for more capabilities.
___________________
1 LLMs are AI programs that can understand and generate text. ChatGPT is one example.
Ayse Coskun, Boston University, prompted panelists to delve deeper into the pros and cons of large, general models versus customizing domain-specific, smaller models that can be installed locally. Mehrotra said that both types of models have value and speculated that eventually a hybrid system will develop in which consumers use general LLMs to handle requests and work in multiple modalities, while business, research, and other domain-specific users will want smaller, custom models that can be fine-tuned and learn continually. In fact, edge computing could operate on much smaller models that fit on a smartphone, creating a latency advantage. In this scenario, latency and privacy could become the deciding factors in determining which model sizes are optimal.
Xing agreed that there is not one “best” size for an AI model and added that sometimes a system’s optimal size is defined by the hardware environment. A size of 70 billion parameters is often used because such a model can fit on a single graphics processing unit (GPU), but using multiple GPUs removes that constraint. He also highlighted a variety of other factors that currently pose a barrier to progress in developing custom or domain-specific models, including a limited workforce with the skills needed to fine-tune models; a lack of industrial standards for data format, interfacing, or transfer; and a lack of coordination across silos, which makes it challenging to create standards, general practices, and training curricula. In response to a question from Ataliba Miguel, TotalEnergies, Lee noted that fine-tuning today is too idiosyncratic to lend itself to a universal approach and that more workforce training is needed. Mehrotra agreed and also pushed for fine-tuning standards.
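As a rough illustration of the hardware constraint Xing alluded to, the back-of-envelope sketch below (an editorial aside, not a calculation presented by the panel) estimates the memory needed just to store a 70-billion-parameter model’s weights at several numeric precisions. The 80 GiB figure for a single accelerator is an assumption about typical current GPUs, and real deployments also need memory for activations and key-value caches.

```python
# Rough memory footprint for holding an LLM's weights alone (ignores the KV cache,
# activations, and framework overhead, which add substantially more in practice).
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gib(n_params: float, precision: str) -> float:
    """Approximate GiB needed just to store the model weights."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30

for precision in BYTES_PER_PARAM:
    print(f"70B @ {precision}: {weight_memory_gib(70e9, precision):,.0f} GiB")
# fp16 weights alone (~130 GiB) exceed a single 80 GiB accelerator, while 4-bit
# quantization (~33 GiB) fits comfortably, which is why model size, numeric
# precision, and GPU memory are usually negotiated together.
```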
Jonathan Herz, Environmental and Energy Study Institute, asked if the situation faced with AI systems today could be similar to the way computer operating systems have historically been built on layers of energy-hungry building blocks, resulting in decreasing efficiency over time. Mehrotra replied that LLMs are building blocks of multi-agent systems that can accomplish multiple tasks that today are completed by software. Agent-driven decisions will increase energy costs because of the many steps involved and the potentially large number of users, but the tasks themselves will become easier and more efficient. Xing added that today’s AI models are very different from classic software engineering, which has clear definitions and cost-effectiveness measures that eventually reach a point of diminishing returns. Such limitations are not yet apparent for AI, which he speculated will not see a big emphasis on cost efficiency until it is commercially viable and scientifically valuable. He expressed his belief that experimentation with AI will lead to surprising functions and discoveries beyond what we can envision today. Lee agreed that one reason AI systems are not experiencing “software bloat”2 is that their boundaries and capabilities are still expanding. However, he reiterated that there are many uncertainties around the costs of training LLMs in order to discover their capabilities and create smaller, specialized models for effective domain-specific inferencing.
Xing added that another opportunity is to use LLMs as foundations for image, audio, and other modalities by optimizing the interfacing tokenization processes, that is, how inputs are encoded into smaller units. By focusing on interface optimization rather than building new models from scratch for each modality, multimodal LLMs can be more cost-effective. Lee asked how this approach would affect the training costs, computational load, and electricity use of AI systems. Xing replied that computational load inefficiencies could be improved by learning from cognitive science findings and applying attention and tokenization differently depending on what information is being received. The transformers and attention arrays of current LLMs cannot do this; instead, they homogeneously apply attention to all signals. Heterogeneous attention mechanisms, along with data quantization and compression schemes, could be exploited to understand the signal more efficiently, Xing suggested. He added that training loss is another area that can be improved by looking at human learning patterns. He speculated that in the next few years, new LLM architectures and learning mechanisms will reduce costs without compromising performance, enabling the development of non-language models and multimodal LLMs. Human brain–inspired architectures are still nascent but have enormous potential to be more energy efficient and adaptive: current LLMs must activate every parameter for every task, whereas brain-inspired models could use only the necessary parameters, making the storage and retrieval of knowledge more efficient.
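To make the contrast concrete, the minimal NumPy sketch below (an illustrative aside, not code discussed by the speakers) implements standard scaled dot-product attention, which weighs every token against every other token, alongside a crude local-window mask as a stand-in for the kind of selective, heterogeneous attention Xing described. The sequence length, embedding size, and window width are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; `mask` marks which positions a query may attend to."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # suppress disallowed positions
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, d = 8, 16                        # sequence length and embedding size (arbitrary)
q = k = v = rng.normal(size=(T, d))

dense = attention(q, k, v)          # homogeneous: every token attends to every token
window = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :]) <= 2
local = attention(q, k, v, mask=window)  # selective: each token attends only to nearby tokens
```

In this toy form the masked scores are still computed and then discarded; the efficiency gains Xing pointed to come from architectures and kernels that avoid computing or storing unneeded attention in the first place.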
Mehrotra cautioned that adding video as a modality is incredibly complicated because it is three-dimensional, has multiple frames, and requires temporal attention. There might not be an existing transformer architecture to support those capabilities. However, a major strength of today’s LLMs is their high degree of flexibility, with low generalization error rates for low-complexity tasks. Human cognition is far more complicated, but machines have internal patterns that can be automated to mitigate the new demands on the grid as every sector looks to train and fine-tune LLMs. To bend or flatten this demand, he suggested that businesses will either revert to older machine learning paradigms that have improved accuracy, which requires time and resources, or seek architecture and chip advances that create efficiencies.
___________________
2 Software bloat is a process where new versions of a computer program become successively slower, use more memory or processing power, or have more hardware requirements than previous versions as new features are added.
Valerie Taylor, Argonne National Laboratory, asked about the potential to use LLMs to train AI-powered scientific assistants in different disciplines. Xing replied that foundational LLMs are inadequate for science because they only process language—and biological data, such as gene expression or protein structures, are not adequately captured in human language. New models with dedicated architectures, bandwidths, and inferences are needed to understand scientific data. Nevertheless, Xing suggested that general LLMs can be used for some areas of scientific inquiry as researchers learn how to best phrase a query, and those LLMs could then engage with or evolve into more useful multimodal models. In the meantime, many AI tools for scientific research have been created, and although they are still experimental, they have been useful and are attracting more attention. “I see that as a huge inspiration and acknowledgement and confirmation of the value of AI in science, but it’s just a starting point,” Xing stated. “The next few years will see a big uptick.”
Mehrotra added that AI is already useful in the literature review phase of research, boosting productivity and suggesting new ideas as scientists formulate research questions and design experiments. For hypothesis-building, AI tools can also help narrow the search space and validate hypotheses. In the experimental and validation phases, however, fine-tuning and customized models will be needed, and Mehrotra said that the technology in this space is not yet mature.
Panelists explored how demands related to AI training differ from those stemming from inference, along with factors that may affect inference demand, scale, and efficiency in the future. Lee stated that LLMs will continue to be trained, specialized, and fine-tuned, but it is unclear how AI data centers’ costs and energy usage will change once they move from the high fixed costs incurred during training to the variable costs and energy demands of inference as LLMs become more widely used. Mehrotra replied that inference is in its earliest phases, and there are currently only a few examples of inference workloads, such as ChatGPT, Llama, and Gemini. Their key energy metric is not training but concurrent users. If that number grows to the millions or tens of millions, much more equipment and replication will be needed in order to keep latency3 low. Latency is also affected by how many steps an AI model needs to answer a question. Mehrotra noted that latencies are often in the milliseconds, and those longer than 1–2 seconds were once considered unacceptable, but experience has shown that in the case of AI, users are more tolerant of latency, being willing to wait 25–30 seconds for a result, and that allowing for more latency could enable alternative inferencing designs. In inferencing, there is not one single metric of performance, so multiple factors, such as chip type or GPU supply, could be adjusted to affect latency. However, Mehrotra said that uncertainties about the number of concurrent users, the steps needed to answer a question, and the acceptable latency range complicate the task of creating inference optimizations and efficiencies.
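As a purely illustrative way to see why concurrency drives equipment needs, the sketch below applies Little’s law (requests in flight equal arrival rate times latency) to some assumed numbers. The user count, request rate, generation latency, and per-replica batch capacity are hypothetical values chosen for the example, not figures cited by Mehrotra.

```python
import math

def replicas_needed(concurrent_users: int, requests_per_user_per_min: float,
                    latency_s: float, slots_per_replica: int) -> int:
    """Estimate serving replicas via Little's law: in-flight = arrival rate x latency."""
    arrival_rate = concurrent_users * requests_per_user_per_min / 60.0  # requests per second
    in_flight = arrival_rate * latency_s                                # requests in progress
    return math.ceil(in_flight / slots_per_replica)

# 10 million concurrent users, one request per user per minute, 5 seconds of generation,
# and ~64 sequences batched per GPU replica (all assumed values):
print(replicas_needed(10_000_000, 1, 5, 64))   # ~13,021 replicas
```

The same arithmetic hints at the trade-off Mehrotra raised: the more latency users will tolerate, the more freedom operators have to batch requests and choose cheaper serving designs, which is one reason the acceptable latency range matters for efficiency.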
Prompted by a question from Trisha Ray, Atlantic Council, regarding users’ willingness to accept latency, Xing said that it is too early to know what latency levels will be acceptable to users and this is likely to change as AI—and the public’s expectations—rapidly evolve. Mehrotra added that when AI is used to answer questions or perform calculations that would be infeasible for humans to solve without AI tools, people are likely to be willing to accept higher latency. Xing speculated that the acceptable latency for chatbot applications may be different from that of physical-world applications of AI, such as home robots or autonomous cars. He said that such applications will be federated but distributed, with different latency and self-learning demands that blur the boundaries between training and inference.
Lee agreed that both latency and throughput4 connect to fundamental issues of computing performance and capabilities and that these are currently in flux, making it challenging to predict future AI use and impacts. In addition, it is unclear what total bandwidth AI systems might require, since this will be heavily influenced by how models are used. If humans are inputting prompts, the total bandwidth will be constrained by the speed at which humans can type, for example, whereas the bandwidth would be on a different order of magnitude if AI agents are inputting prompts.
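A small, entirely hypothetical calculation can illustrate the gap Lee described; the typing speed, token rate, and characters-per-token figures below are assumptions chosen only to show the order-of-magnitude difference.

```python
# Illustrative comparison of prompt "bandwidth" from a human typist versus an automated agent.
# Every rate below is an assumption chosen for the comparison, not a measured figure.
human_wpm = 40                                                  # typical typing speed
chars_per_word = 5
human_chars_per_sec = human_wpm * chars_per_word / 60           # ~3.3 characters/second

agent_tokens_per_sec = 50                                       # one agent emitting prompts at machine speed
chars_per_token = 4                                             # rough average for English text
agent_chars_per_sec = agent_tokens_per_sec * chars_per_token    # 200 characters/second

print(f"human: {human_chars_per_sec:.1f} chars/s, "
      f"one agent: {agent_chars_per_sec} chars/s, "
      f"ratio: {agent_chars_per_sec / human_chars_per_sec:.0f}x")
# A single agent outpaces a human typist by roughly 60x under these assumptions,
# and fleets of agents multiply that difference further.
```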
___________________
3 Latency is the amount of time it takes for a system to provide a response to a user.
4 Throughput is the number of queries that a system can respond to per second.
Caitlin Grady, George Washington University, asked how the phenomenon of AI hallucinations might impact the acceptance and uptake of AI and the associated energy demands. Mehrotra replied that hallucinations increase energy costs because more prompts are needed to find the right answer. These extra steps do not seem to put people off, however, because AI has still proven itself useful and hallucinations are relatively easy to correct. He added that it would be useful to conduct research into how humans perceive AI, both when it is correct and when it is hallucinating. Xing added that AI hallucinations are often obvious to humans and are an important part of the AI learning process, but they also indicate that AI output should not be taken as truth. He said that an alternative technique, uncertainty quantification, could improve researchers’ understanding of errors and help navigate outcome uncertainties.
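Uncertainty quantification spans many techniques; as one minimal, illustrative example (not a method attributed to the panelists), the sketch below scores a generated answer by the average entropy of the model’s per-token probability distributions, with higher entropy suggesting lower confidence. The logits used here are synthetic.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def mean_token_entropy(step_logits):
    """Average predictive entropy (in nats) over the tokens of a generated answer.
    Higher values suggest the model was less certain of its own output."""
    entropies = []
    for logits in step_logits:
        p = softmax(logits)
        entropies.append(-(p * np.log(p + 1e-12)).sum())
    return float(np.mean(entropies))

rng = np.random.default_rng(1)
confident = [np.array([8.0, 0.5, 0.1, 0.0])] * 5     # sharply peaked next-token distributions
uncertain = [rng.normal(size=4) for _ in range(5)]   # flat, noisy next-token distributions
print(mean_token_entropy(confident), mean_token_entropy(uncertain))
# Flagging answers whose mean entropy exceeds a chosen threshold is one crude way to
# surface "the model may be hallucinating here" instead of treating its output as truth.
```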
Xing noted that inference costs have gone down dramatically, much more so than training costs, because inferencing is more flexible and amorphous. Lee added that it is not yet clear how the costs of developing and running AI systems might be passed on to consumers. Xing said that currently, inference is used mostly for entertainment because its market is not fully established; however, he speculated that it will ultimately become more valuable as companies use it to augment and optimize their businesses. This type of use will lead to savings that companies would be willing to pay for and could offset the large costs of creating and running AI models. If AI is used to produce videos, for example, it saves studios money, similar to the way virtual meetings require bandwidth but reduce air travel.
Julie Bolthouse, Piedmont Environmental Council, asked if it was possible to resolve the disconnect between the available power supply and AI data centers’ demand. Xing expressed his belief that this issue will resolve itself as companies move away from the free, experimental phase and start monetizing AI products, changing their dynamic with energy providers. AI is excellent at complex math problems but still fails common sense tests, proving that LLMs need more training and perhaps more modalities, but Xing expressed confidence that they will soon achieve the capabilities and level of accuracy that will spur broader adoption. Mehrotra agreed that AI is in its early stages and much of the current energy demand is from training models, not inferencing, and significantly more training will be necessary to overcome challenges such as scaling and multimodal inputs and outputs. He opined that it remains to be seen whether multimodal models may one day achieve artificial general intelligence (AGI), a type of AI that matches or surpasses human cognition, but claims about the plausibility of AGI remain disputed and controversial.
Lee stated that this path is common for technological innovations, where the technology arrives before the “killer apps” are identified. The result is that AI and its energy costs are in a period of massive investment amid much uncertainty. Mehrotra speculated that in 5–10 years, AI will be fully integrated into everyone’s personal and professional lives as a productivity instrument but will still be training toward AGI. Xing agreed, adding that AI tools can drive paradigm shifts in business, science, research, and daily life, increasing productivity while freeing up more time for leisure activities.
Jason Hick, Los Alamos National Laboratory, asked panelists whether open-source AI models and more data sharing could encourage collaborations that include more stakeholders. Xing agreed, emphasizing that researchers need access to open-source LLMs and their training mechanisms to test and replicate them. However, today’s LLMs are owned by businesses, giving the academic community very little access to them except as pre-trained models. As a result, many academics lack a basic understanding of LLMs, which impedes their ability to improve the models’ energy efficiency, architecture, or training paradigms. “I’m not sure whether the industry really has a full incentive to prioritize that [openness], because there is a pressure to make revenues and to advance new versions, but the academics who have the incentives of doing that, unfortunately, are not given resources and opportunity and accessibility,” said Xing.
As the field moves toward fine-tuning on high-quality data for user-specific applications, Xing suggested that creating open-source LLMs would provide researchers with the resources and opportunities to make improvements. AI is expensive and fast-moving, and he posited that the academic community has a responsibility to advocate against closed LLMs and promote access to fully open models to create visibility and cohesion around shared goals. To that end, his team launched a fully open-source LLM for research purposes,5 a step he characterized as “just a start” toward what he believes is a necessary effort, outside of industry and for-profit LLMs, to systematically share LLMs and foundation model training. Mehrotra added that AI began as an open-source product; while most models are currently closed, he speculated that at some point this will shift and models will become open source again.
___________________
5 See LLM360, n.d., “LLM360 Enables Community-Owned AI Through Open-Source Large Model Research and Development,” https://www.llm360.ai, accessed December 15, 2024.