What is AI inference?
AI inference is the process by which a trained AI model produces an answer from data. What is commonly called “AI” is really the result of successful AI inference: the final step—the “aha” moment—in a long and complex machine learning process.
Training artificial intelligence (AI) models with sufficient data can help improve AI inference accuracy and speed.
For example, an AI model being trained on data about animals—from their differences and similarities to typical health and behavior—needs a large data set to make connections and identify patterns.
After successful training, the model can make inferences such as identifying a breed of dog, recognizing a cat’s meow, or even issuing a warning when a horse is spooked. Even though it has never encountered these animals outside of an abstract data set, the extensive data it was trained on allows it to make inferences in a new environment in real time.
Our own human brain makes connections like this too. We can read about different animals from books, movies, and online resources. We can see pictures, watch videos, and listen to what these animals sound like. When we go to the zoo, we are able to make an inference (“That’s a buffalo!”). Even if we have never been to the zoo, we can identify the animal because of the research we have done. The same goes for AI models during AI inference.
Why is AI inference important?
AI inference is the operational phase of AI, where the model is able to apply what it’s learned from training to real-world situations. AI’s ability to identify patterns and reach conclusions sets it apart from other technologies. Its ability to infer can help with practical day-to-day tasks or extremely complicated computer programming.
AI inference use cases
Today, businesses can use AI inference in a variety of everyday use cases. These are a few examples:
- Healthcare: AI inference can help healthcare professionals compare patient history with current data and trace patterns and anomalies faster than humans can. This could be an outlier on a brain scan or an extra “thump” in a heartbeat. This can help catch signs of threats to patient health much earlier and much faster.
- Finance: After being trained on large data sets of banking and credit information, AI inference can identify errors or unusual data in real time to catch fraud early. This can optimize customer service resources, protect customer privacy, and improve brand reputation.
- Automotive: As AI enters the world of cars, autonomous vehicles are changing the way we drive. AI inference can help a vehicle navigate the most efficient route from point A to point B or brake when it approaches a stop sign, all to improve the comfort and safety of those in the car.
Many other industries are applying AI inference in creative ways, too, from fast food drive-throughs to veterinary clinics to hotel concierges. Businesses are finding ways to make this technology work to their advantage: improving accuracy, saving time and money, and maintaining their competitive edge.
What is AI training?
AI training is the process of using data to teach the model how to make connections and identify patterns. Training is the process of teaching a model, whereas inference is the AI model in action.
Most AI training occurs in the beginning stages of model building. Once trained, the model can make connections with data it has never encountered before. Training an AI model with a larger data set means it can learn more connections and make more accurate inferences. If the model is struggling to make accurate inferences after training, fine-tuning can add knowledge and improve accuracy.
Training and AI inference are how AI is able to mimic human capabilities such as drawing conclusions based on evidence and reasoning.
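As a minimal sketch of the difference, the example below uses scikit-learn purely for illustration; the features and labels are toy placeholders, not real training data. The fit() call is training, and the predict() call on unseen data is inference.

```python
from sklearn.ensemble import RandomForestClassifier

# Training: the model learns patterns from labeled examples.
# These features and labels are toy placeholders for illustration only.
training_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
training_labels = ["dog", "dog", "cat", "cat"]

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(training_features, training_labels)  # training phase

# Inference: the trained model answers a question about data it has never seen.
new_animal = [[0.85, 0.15]]
print(model.predict(new_animal))  # e.g. ['dog']
```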
What are different types of AI inference?
Different kinds of AI inference can support different use cases.
- Batch inference: Batch inference gets its name from how it receives and processes data: in large groups. Instead of processing inference in real time, this method processes data in waves, sometimes hourly or even daily, depending on the amount of data and the efficiency of the AI model. These inferences can also be called “offline inferences” or “static inferences.”
- Online inference: Online or “dynamic” inference delivers a response in real time. These inferences require hardware and software that can reduce latency and support high-speed predictions. Online inference is helpful at the edge, meaning AI is doing its work where the data is located. This could be on a phone, in a car, or at a remote office with limited connectivity.
OpenAI’s ChatGPT is a good example of online inference: it requires a lot of upfront operational support in order to deliver a quick and accurate response. (A short code sketch contrasting batch and online inference follows this list.)
- Streaming inference: Streaming inference describes an AI system that is not necessarily used to communicate with humans. Instead of prompts and requests, the model receives a constant flow of data in order to make predictions and update its internal database. Streaming inference can monitor changes, maintain regularity, or predict an issue before it arises.
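As a minimal sketch of how batch and online inference differ, consider the example below. The model and function names are hypothetical stand-ins, not any specific product’s API: the point is only that batch inference scores a whole group of records in one pass, while online inference answers each request as it arrives.

```python
class DummyModel:
    """Stand-in for a trained model; a real deployment would load trained weights."""
    def predict(self, record):
        return "anomaly" if record > 0.9 else "normal"

def batch_inference(model, records):
    """Batch (offline/static) inference: score a large group of records at once,
    for example on an hourly or daily schedule."""
    return [model.predict(r) for r in records]

def online_inference(model, record):
    """Online (dynamic) inference: answer a single request as soon as it arrives,
    keeping latency as low as possible."""
    return model.predict(record)

model = DummyModel()
print(batch_inference(model, [0.2, 0.95, 0.4]))  # processed as one wave
print(online_inference(model, 0.97))             # processed immediately
```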
What is an AI inference server?
An AI inference server is the software that helps an AI model make the jump from training to operation. It runs the trained machine learning model so it can apply what it has learned to new data and generate inferences.
For efficient results, your AI inference server and AI model need to be compatible. Here are a few examples of inference servers and the models they work with best:
- Multimodal inference server: This type of inference server is able to support several models at once. This means it can receive data in the form of code, images, or text and process all of these different inferences on a single server. A multimodal inference server uses GPU and CPU memory more efficiently to support more than one model. This helps streamline hardware, makes it easier to scale, and optimizes costs.
- Single-model inference server: This inference server only supports one model, rather than several. The AI inference process is specialized to communicate with a model trained on a specific use case. It may only be able to process data in the form of text or only in the form of code. Its specialized nature allows it to be incredibly efficient, which can help with real-time decision making or resource constraints.
AI inference challenges
The biggest challenges when running AI inference are complexity, resources, and cost.
- Complexity: It is easier to teach a model to execute simple tasks, like generating a picture or explaining a return policy to a customer. As we lean on models for more complex work—like catching financial fraud or identifying medical anomalies—they require more data during training and more resources to support that data.
- Resources: More complex models require specialized hardware and software to support the vast amount of data processing that takes place when a model is generating inferences. A key component of these resources is central processing unit (CPU) memory. The CPU is often referred to as the hub or control center of a computer. When a model is preparing to use what it knows (its training data) to generate an answer, it must refer back to data held in CPU memory.
- Cost: All of these puzzle pieces that make AI inference possible are not cheap. Whether your goal is to scale or to transition to the latest AI-supported hardware, the resources it takes to get the full picture can be extensive. As model complexity increases and hardware continues to evolve, costs can increase sharply and make it tough for organizations to keep up with AI innovation.
vLLM—an inference server that speeds up the output of generative AI applications—is a solution for meeting these challenges.
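To show what this can look like in practice, here is a minimal sketch of batch (offline) inference with vLLM’s Python API. The model name is only an example, and the prompts and sampling settings are placeholders to adapt to your own workload.

```python
from vllm import LLM, SamplingParams

# Load a model into the vLLM inference engine.
# "facebook/opt-125m" is only an example; substitute the model you actually serve.
llm = LLM(model="facebook/opt-125m")

# Sampling parameters control how the model generates its answer.
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# Inference: the trained model produces answers for prompts it has never seen.
prompts = [
    "What is AI inference?",
    "Name three everyday uses of machine learning.",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

vLLM can also be run as a server that exposes an OpenAI-compatible API, so the same engine can back online, real-time inference as well as batch jobs.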
How Red Hat can help
Red Hat AI is a portfolio of products and services that can help your enterprise at any stage of the AI journey, whether you’re at the very beginning or ready to scale across the hybrid cloud. It can support both generative and predictive AI efforts for your unique enterprise use cases.
Red Hat AI can accelerate time to market and decrease the resource and financial barriers to AI platforms. It offers efficient tuning of small, fit-for-purpose models with the flexibility to deploy wherever your data resides.
Red Hat AI is powered by open source technologies and a partner ecosystem that focuses on performance, stability, and GPU support across various infrastructures.