SLMs vs LLMs: What are small language models?


A small language model (SLM) is a smaller version of a large language model (LLM) that has more specialized knowledge, is faster to customize, and is more efficient to run.

SLMs are trained to have domain-specific knowledge, unlike LLMs which have broad general knowledge. Due to their smaller size, SLMs require fewer computational resources for training and deployment, reducing infrastructure costs and enabling faster fine-tuning. The lightweight nature of SLMs makes them ideal for edge devices and mobile applications.

SLMs vs LLMs

SLMs and LLMs are both types of artificial intelligence (AI) systems that are trained to interpret human language, including programming languages. The key differences between LLMs and SLMs are usually the size of the data sets they’re trained on, the different processes used to train them on those data sets, and the cost/benefit of getting started for different use cases.

As their names suggest, both LLMs and SLMs are trained on data sets consisting of language, which distinguishes them from models trained on images (e.g., DALL·E) or videos (e.g., Sora). A few examples of language-based data sets include webpage text, developer code, emails, and manuals.

One of the most well-known applications of both SLMs and LLMs is generative AI (gen AI), which can generate (hence the name) original content in response to many different, unpredictable queries. LLMs in particular have become well known among the general public thanks to the GPT-4 foundation model and ChatGPT, a conversational chatbot trained on massive data sets, with trillions of parameters, to respond to a wide range of human queries. Though gen AI is popular, there are also non-generative applications of LLMs and SLMs, like predictive AI.

Top considerations for building a production-ready AI/ML environment

The scope of GPT-4/ChatGPT illustrates one common difference between LLMs and SLMs: the data sets they’re trained on.

LLMs are usually intended to emulate human intelligence at a very broad level, and thus are trained on a wide range of large data sets. In the case of GPT-4/ChatGPT, that includes the entire public internet up to a certain cutoff date. This is how ChatGPT gained recognition for interpreting and responding to such a wide range of queries from general users. However, it is also why the model sometimes produces incorrect responses, colloquially referred to as “hallucinations”: it lacks the fine-tuning and domain-specific training needed to accurately respond to every industry-specific or niche query.

SLMs, on the other hand, are typically trained on smaller data sets tailored to specific industry domains (i.e., areas of expertise). For example, a healthcare provider could use an SLM-powered chatbot trained on medical data sets to inject domain-specific knowledge into a user’s non-expert query about their health, enriching the quality of both the question and the response. In this case, the SLM-powered chatbot doesn’t need to be trained on the entire internet, including every blog post, fictional novel, or poem ever written, because that content is irrelevant to the healthcare use case.
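To make that enrichment step concrete, here is a minimal, purely illustrative sketch in Python of how an application might wrap a patient’s plain-language question in domain context before handing it to a domain-tuned SLM. The instructions and the example question are hypothetical, and the function only builds the prompt; the actual model call would happen in your inference layer.

```python
# Hypothetical sketch of the "enrichment" step described above: wrap a
# patient's plain-language question in clinical context before it is
# sent to the domain-tuned SLM for inference.

DOMAIN_CONTEXT = (
    "You are a clinical information assistant for a healthcare provider. "
    "Use established medical terminology, and flag anything that should "
    "be escalated to a licensed clinician."
)

def build_enriched_prompt(user_question: str) -> str:
    """Combine domain instructions with the user's non-expert question."""
    return f"{DOMAIN_CONTEXT}\n\nPatient question: {user_question}\nAnswer:"

# The enriched prompt is what actually gets passed to the SLM.
print(build_enriched_prompt("Why does my knee hurt after running?"))
```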

In short, SLMs typically excel in specific domains, but struggle compared to LLMs when it comes to general knowledge and overall contextual understanding.

LoRA vs QLoRA explained


Training any model for a business use case, whether LLM or SLM, is a resource-intensive process. However, training LLMs is especially resource intensive. In the case of GPT-4, a total of 25,000 NVIDIA A100 GPUs ran simultaneously and continuously for 90-100 days. Again, GPT-4 represents the largest end of the LLM spectrum. Other LLMs like Granite didn’t require as many resources. Training an SLM still likely requires significant compute resources, but far fewer than an LLM requires.
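To put those figures in perspective, here is a rough back-of-the-envelope calculation based on the numbers above. The hourly GPU price is an illustrative assumption, not a quoted rate, and actual training costs depend heavily on hardware, negotiated pricing, and utilization.

```python
# Rough scale of the reported GPT-4 training run, using the figures
# cited above (25,000 NVIDIA A100 GPUs for roughly 90-100 days).
gpus = 25_000
days = 95                      # midpoint of the 90-100 day range
gpu_hours = gpus * days * 24   # about 57 million GPU-hours

# Illustrative assumption only: an on-demand cloud A100 at ~$2 per hour.
assumed_hourly_rate = 2.00
rough_cost = gpu_hours * assumed_hourly_rate

print(f"{gpu_hours:,} GPU-hours, on the order of ${rough_cost:,.0f}")
# 57,000,000 GPU-hours, on the order of $114,000,000
```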

 

Resource requirements for training vs inference

It’s also important to note the difference between model training and model inference. As discussed above, training is the first step in developing an AI model. Inference is the process a trained AI model follows to make predictions on new data. For example, when a user asks ChatGPT a question, the trained model generates a response to return to the user; that process of producing a prediction is inference.

Some pretrained LLMs, like the Granite family of models, can make inferences using the resources of a single high-power workstation (e.g., Granite models can fit on one V100-32GB GPU), although many require multiple parallel processing units to generate responses. Furthermore, the greater the number of concurrent users accessing an LLM, the slower the model runs inferences. SLMs, on the other hand, are usually designed to make inferences with the resources of a smartphone or other mobile device.
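For a concrete sense of what “running inference” looks like in code, here is a minimal sketch using the Hugging Face transformers library. The model identifier is a placeholder, not a reference to a specific model; substitute whichever SLM or LLM you are evaluating, keeping its memory footprint in mind.

```python
from transformers import pipeline

# Placeholder model ID; swap in the small model you are evaluating.
MODEL_ID = "your-org/your-slm-instruct"

# A text-generation pipeline wraps tokenization, the forward pass, and
# decoding, which together make up the inference step described above.
generator = pipeline("text-generation", model=MODEL_ID)

response = generator(
    "Summarize the difference between training and inference in one sentence.",
    max_new_tokens=80,
    do_sample=False,
)
print(response[0]["generated_text"])
```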

There are many different factors that can impact the success of inference at scale. Mostly, it depends on how efficiently and effectively the moving pieces of your serving stack, such as models, inference servers, and hardware accelerators, work together.

Specifically, inference servers that can support larger AI models (like LLMs) and their more complex inference capabilities are essential to scaling AI workloads for the enterprise.

These AI tools use resources more efficiently, helping you run inference at scale faster:

  • llm-d: LLM prompts can be complex and nonuniform. They typically require extensive computational resources and storage to process large amounts of data. llm-d, an open source AI framework, uses well-lit paths to help developers use techniques like distributed inference to support the increasing demands of sophisticated and larger reasoning models like LLMs.
  • Distributed inference: Distributed inference lets AI models process workloads more efficiently by dividing the labor of inference across a group of interconnected devices. Think of it as the software equivalent of the saying, “many hands make light work.”
  • vLLM: vLLM, which stands for virtual large language model, is a library of open source code maintained by the vLLM community. It helps LLMs perform calculations more efficiently and at scale, and it is helping organizations like LinkedIn, Roblox, and Amazon speed up their inference capabilities. A minimal usage sketch follows this list.
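As a taste of what vLLM looks like in practice, here is a minimal offline-inference sketch using its Python API. The model identifier is a placeholder, and production serving would more typically go through vLLM’s OpenAI-compatible server rather than this batch-style call.

```python
from vllm import LLM, SamplingParams

# Placeholder model ID; any model vLLM supports can go here.
MODEL_ID = "your-org/your-model-instruct"

# vLLM manages batching and the KV cache (PagedAttention) internally,
# which is where much of its inference efficiency comes from.
llm = LLM(model=MODEL_ID)
params = SamplingParams(temperature=0.2, max_tokens=128)

prompts = [
    "Explain distributed inference in one sentence.",
    "What is an inference server?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```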

Why you should care about inference  
 

There’s no single answer to the question of which model is better. Instead, it depends on your organization’s plans, resources, expertise, timetable, and other factors. It’s also important to decide whether your use case necessitates training a model from scratch or fine-tuning a pretrained model. Common considerations between LLMs and SLMs include:

Cost

In general, LLMs require far more resources to train, fine-tune, and run inferences. That said, training is a less frequent investment: computing resources are only needed while a model is being trained, which is an intermittent rather than continuous task. Running inferences, however, represents an ongoing cost, and that cost can grow as use of the model scales to more and more users. In most cases, this requires cloud computing resources at scale, a significant on-premises resource investment, or both.

SLMs are frequently evaluated for low-latency use cases, like edge computing. That’s because they can often run with just the resources available on a single mobile device without needing a constant, strong connection to more significant resources.

From the Red Hat blog: Tips for making LLMs less expensive 

Expertise

Many popular pretrained LLMs, like Granite, Llama, and GPT-4, offer a more “plug-and-play” option for getting started with AI. These are often preferable for organizations looking to begin experimenting with AI since they don’t need to be designed and trained from scratch by data scientists. SLMs, on the other hand, typically require specialized expertise in both data science and industry knowledge domains to accurately fine-tune on niche data sets.

Security

One potential risk of LLMs is the exposure of sensitive data through application programming interfaces (APIs). Specifically, fine-tuning an LLM on your organization’s data requires careful attention to compliance and company policy. SLMs may present a lower risk of data leakage because they offer a greater degree of control: they can more readily be run in environments the organization manages rather than relying on external APIs.

As businesses integrate SLMs into their workflows, it’s important to be aware of the limitations they present.

Bias

Because SLMs are trained on smaller data sets, the biases that inevitably occur are easier to audit and mitigate than in LLMs. However, as with language models of any size, training data can still introduce biases, such as the underrepresentation or misrepresentation of certain groups and ideas, or factual inaccuracies. Language models can also inherit biases related to dialect, geographical location, and grammar.

Teams should pay extra attention to the quality of training data in order to limit biased outputs.

Narrow scope of knowledge

SLMs have a smaller pool of information to pull from as they generate responses. This makes them excellent for specific tasks, but less suitable for tasks that require a wide scope of general knowledge. 

Teams might consider creating a collection of purpose-built SLMs to use alongside an LLM (or LLMs). This solution becomes especially interesting if teams are able to pair models with existing applications, creating an interconnected workflow of multiple language models working in tandem.
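One way to picture that kind of interconnected workflow is a simple router that sends each request to a purpose-built SLM when the topic matches its domain and falls back to a general-purpose LLM otherwise. The sketch below is purely illustrative: the model names and the keyword-based routing rule are assumptions, and a production system would more likely use a classifier or embedding similarity.

```python
# Illustrative routing sketch: a keyword match chooses a domain SLM, and
# anything else falls back to a general-purpose LLM. Model names are
# hypothetical placeholders.
DOMAIN_SLMS = {
    "billing": "your-org/billing-slm",
    "clinical": "your-org/clinical-slm",
}
FALLBACK_LLM = "your-org/general-llm"

DOMAIN_KEYWORDS = {
    "billing": ["invoice", "refund", "payment"],
    "clinical": ["symptom", "diagnosis", "medication"],
}

def route(query: str) -> str:
    """Return the model that should handle this query."""
    lowered = query.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return DOMAIN_SLMS[domain]
    return FALLBACK_LLM

print(route("Where can I see my last invoice?"))   # your-org/billing-slm
print(route("Write a haiku about open source."))   # your-org/general-llm
```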

The adaptability of SLMs makes them beneficial for a variety of use cases. 

Chatbots 

Use an SLM to train a chatbot on specialized materials. For example, a customer service chatbot might be trained with company-specific knowledge so it can answer questions and direct users to information. 

Agentic AI 

Integrate SLMs into an agentic AI workflow so they can complete tasks on behalf of a user. 

Generative AI 

SLMs can perform tasks such as generating new text, translating existing text, and summarizing copy. 

Explore gen AI use cases

Red Hat AI is a platform of products and services that can help your enterprise at any stage of the AI journey, whether you’re at the very beginning or ready to scale. It can support both generative and predictive AI efforts for your unique enterprise use cases.

With Red Hat AI, you have access to Red Hat® AI Inference Server to optimize model inference across the hybrid cloud for faster, cost-effective deployments. Powered by vLLM, the inference server maximizes GPU utilization and enables faster response times.
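Because the inference server is powered by vLLM, a common way to call a deployment like this is through an OpenAI-compatible endpoint. The sketch below is an assumption-laden illustration, not official product documentation: the endpoint URL and model name are placeholders you would replace with the values from your own deployment.

```python
from openai import OpenAI

# Placeholder endpoint and model name for a vLLM-backed, OpenAI-compatible
# inference server; substitute the values from your own deployment.
client = OpenAI(base_url="http://your-inference-server:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="your-org/your-model-instruct",
    messages=[{"role": "user", "content": "What is a small language model?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```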

Learn more about Red Hat AI Inference Server

Red Hat AI Inference Server includes the Red Hat AI repository, a collection of third-party validated and optimized models that allows model flexibility and encourages cross-team consistency. With access to the third-party model repository, enterprises can accelerate time to market and decrease financial barriers to AI success. 

Learn more about validated models by Red Hat AI
