What is parameter-efficient fine-tuning (PEFT)?

Large language models (LLMs) require significant computational resources and money to operate. Parameter-efficient fine-tuning (PEFT) is a set of techniques that adjusts only a small portion of the parameters within an LLM to save resources. 

PEFT makes LLM customization more accessible while creating outputs that are comparable to those of a traditionally fine-tuned model. 

Fine-tuning and PEFT are both LLM alignment techniques. Both adjust and inform an LLM with your own data so it produces the output you want. You can think of PEFT as an evolution of traditional fine-tuning.

Traditional fine-tuning makes adjustments to an LLM by further training the entire model. This requires intensive computational resources, data, and time. 

By comparison, PEFT modifies only a small portion of the parameters within a model, making it generally more accessible for organizations without extensive resources. 

PEFT makes it possible to train large models faster and on smaller hardware. 

Specifically, benefits of PEFT include:

  • Faster training: Because fewer parameters are updated, PEFT allows for quicker experimentation and iteration.
  • Resource-efficient: PEFT uses much less GPU memory than traditional fine-tuning and can run on consumer-grade hardware. This means you can often fine-tune an LLM on a laptop rather than needing a dedicated server.
  • Ability to overcome catastrophic forgetting: Catastrophic forgetting happens when new training data causes a model to lose knowledge it has already learned. PEFT helps models avoid catastrophic forgetting because it updates only a few parameters rather than the whole model.
  • Portable: PEFT produces small, manageable adapter files that are easy to deploy across platforms. This makes the model easier to update and improve in an operational environment.
  • Sustainable: PEFT aligns with eco-friendly operational goals by using fewer computational resources.
  • Accessible: Teams and organizations with fewer computational resources can fine-tune models and still achieve a desirable result.

LLMs are composed of multiple neural network layers. Think of these layers as a type of flow chart, starting with an input layer and ending with an output layer. Sandwiched between these 2 layers are many other layers, each playing a role in processing data as it moves through the neural network.

If you want to adjust the way a language model processes information, you change the parameters. 

What are parameters in an LLM?

Parameters (sometimes called weights) shape an LLM’s understanding of language. 

Think of parameters like adjustable gears within a machine. Each parameter has a specific numerical value; shifting that value affects the model's ability to interpret and generate language. 

An LLM can contain billions (even hundreds of billions) of parameters. The more parameters a model has, the more complex the tasks it can perform. 

However, as the number of parameters in a model increases, so does the need for hardware resources. Organizations may not have the means to invest in this hardware, which is why tuning techniques like PEFT are so important. 
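
For a sense of scale, here is a minimal sketch, assuming PyTorch and the Hugging Face transformers library are installed, that counts the parameters in a small pretrained model (the model ID is just an example):

```python
from transformers import AutoModelForCausalLM

# Load a small pretrained model; any Hugging Face model ID works here.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Each parameter tensor contributes numel() individual weights.
total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # GPT-2 small: roughly 124 million
```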

To increase model efficiency, learn how to eliminate unnecessary parameters while maintaining accuracy.

Fine-tuning parameters, efficiently

PEFT strategically modifies only a small number of parameters while preserving most of the pretrained model’s structure. Some examples of ways to make these adjustments include:

Freezing model layers: During training, calculations flow through all the layers of a neural network, and every unfrozen parameter must be updated. By freezing some of those layers, you skip the parameter updates for them, cutting down on the processing power and memory needed. 

Adding adapters: Think of adapters like an expansion pack for a board game. Adapters are added on top of the layers within the pretrained model and trained to learn domain- or application-specific information. In this scenario, the original model doesn't change but gains new capabilities. 
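
To make both ideas concrete, here is a minimal PyTorch sketch. The stacked linear layers stand in for a pretrained model, and the BottleneckAdapter module illustrates the general idea rather than code from any particular library:

```python
import torch.nn as nn

# A stand-in for a pretrained model: a few stacked layers.
model = nn.Sequential(
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 768),
)

# Freezing model layers: turn off gradient updates so the original
# parameters are left untouched during backpropagation.
for param in model.parameters():
    param.requires_grad = False

class BottleneckAdapter(nn.Module):
    """A small trainable module added on top of the frozen layers."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, x):
        # Residual connection: the frozen model's output passes through
        # unchanged, plus a small learned correction.
        return x + self.up(self.act(self.down(x)))

adapter = BottleneckAdapter(hidden_size=768)
frozen = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in adapter.parameters())
print(f"Frozen: {frozen:,}  Trainable: {trainable:,}")
```

Only the adapter's parameters receive gradient updates, so training cost scales with the adapter's size rather than the base model's.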

There are several methods used to perform PEFT, including:

  • LoRA (low-rank adaptation)
  • QLoRA (quantized low-rank adaptation)
  • Prefix tuning
  • Prompt tuning
  • P-tuning
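
To illustrate the first of these, here is a minimal LoRA sketch using the Hugging Face peft library; the base model and the LoRA settings (rank, alpha, target modules) are example values, not recommendations:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA freezes the base weights and trains small low-rank matrices
# alongside them; r controls the rank (and size) of those matrices.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's combined attention projection
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, config)
# Typically reports well under 1% of parameters as trainable.
peft_model.print_trainable_parameters()
```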

Learn about LoRA vs QLoRA

A leading tool in this space is vLLM. vLLM is a memory-efficient inference server and engine designed to improve the speed and efficiency of serving large language models in a hybrid cloud setting. vLLM's support for PEFT, specifically for serving multiple LoRA adapters, provides a massive efficiency boost by allowing 1 base model to remain loaded in GPU memory. 

Using vLLM to serve PEFT-tuned models allows 1 base model to serve multiple fine-tuned versions simultaneously. In other words, PEFT creates small adapter files, and vLLM optimizes the serving of those files by sharing and distributing memory resources, like the key-value (KV) cache, from a single underlying model. 
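
Here is a minimal sketch of what that looks like with vLLM's Python API, assuming vLLM is installed; the base model ID and adapter paths are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model stays loaded in GPU memory...
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
params = SamplingParams(max_tokens=128)

# ...while individual requests target different LoRA adapters.
# LoRARequest takes an adapter name, a unique integer ID, and a path.
support = LoRARequest("support-adapter", 1, "/adapters/support")
legal = LoRARequest("legal-adapter", 2, "/adapters/legal")

print(llm.generate(["How do I reset my password?"], params, lora_request=support))
print(llm.generate(["Summarize this contract clause."], params, lora_request=legal))
```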

Learn more about vLLM

Fine-tuning is a way to communicate intent to an LLM so the model can tailor its output to fit your goals.

Consider this: An LLM might be able to write an email in the style of Shakespeare, but it doesn’t know anything about the details of the products your company provides.

To train the model with your unique information, you can use fine-tuning. 

Fine-tuning is the process of training a pretrained model further with a more tailored data set so it can effectively perform unique tasks. This additional training data modifies the model’s parameters and creates a new version that replaces the original model.
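
For contrast, here is a minimal sketch of traditional full fine-tuning with the Hugging Face transformers Trainer; the small model and public dataset stand in for a real domain-specific corpus:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")  # every weight stays trainable

# A small slice of a public dataset stands in for your own data.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.filter(lambda ex: len(ex["text"].strip()) > 0)
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="full-ft", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates all of the model's parameters
```

Every weight receives gradient updates here, which is what drives the hardware costs described below; a PEFT run would wrap the model first and train only a small fraction of those weights.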

Fine-tuning is critical to personalizing an LLM for a domain-specific use case. However, traditional fine-tuning comes at a cost. 

Why is fine-tuning expensive?

Several factors contribute to the cost of fine-tuning an LLM, such as:

  • GPU requirements: Fine-tuning demands a lot of processing power. Graphics processing units (GPUs) are expensive to purchase and operate, and they need to run for extended periods of time during the fine-tuning process. Power consumption and cooling can also be costly.
  • Data requirements: Data sets needed to fine-tune an LLM with new information must be high quality and properly labeled. Acquiring, building, and pre-processing this data can be expensive and time-consuming. 

LLM alignment refers to the process of training and personalizing a language model to produce the outputs that you want.

When deciding between different LLM alignment techniques, consider the following factors:

  • Data dependency: How much data is needed? Do you have access to the data required for this technique to work?
  • Accuracy: How much does this technique impact the accuracy of the model after tuning?
  • Complexity for users: How easy is it to use?

Compared to traditional fine-tuning, PEFT requires less data, achieves comparably high accuracy, and is more user-friendly. 

Another LLM alignment option to explore is retrieval-augmented generation (RAG). RAG provides a means to supplement the data that exists within an LLM with external knowledge sources of your choosing, such as data repositories, collections of text, and pre-existing documentation. RAG has a high data dependency, but it offers high accuracy and is less complex to use than fine-tuning. 

Read about RAG vs. fine-tuning.

Parameter-efficient fine-tuning is 1 of several alignment techniques supported on Red Hat® OpenShift® AI.

OpenShift AI is a flexible, scalable MLOps platform with tools to build, deploy, and manage AI-enabled applications. OpenShift AI supports the full lifecycle of AI/ML experiments and models, on-premises and in the public cloud.

Learn more about Red Hat OpenShift AI

Red Hat® AI is a portfolio of products and services that can help your enterprise at any stage of the AI journey, whether you're at the very beginning or ready to scale across the hybrid cloud. It can support both generative and predictive AI efforts for your unique enterprise use cases.

Red Hat AI is powered by open source technologies and a partner ecosystem that focuses on performance, stability, and GPU support across various infrastructures. It offers efficient tuning of small, fit-for-purpose models with the flexibility to deploy wherever your data resides.
