What is parameter efficient fine-tuning (PEFT)?
Large language models (LLMs) require significant computational resources and money to operate. Parameter-efficient fine-tuning (PEFT) is a set of techniques that adjusts only a small portion of an LLM's parameters to save resources.
PEFT makes LLM customization more accessible while producing outputs comparable to those of a traditionally fine-tuned model.
Traditional fine-tuning vs PEFT
Fine-tuning and PEFT are both LLM alignment techniques. They adjust and inform an LLM with the data you choose so it produces the output you want. You can think of PEFT as an evolution of traditional fine-tuning.
Traditional fine-tuning makes adjustments to an LLM by further training the entire model. This requires intensive computational resources, data, and time.
Comparatively, PEFT only modifies a small portion of parameters within a model, making it generally more accessible for organizations without extensive resources.
What are the benefits of PEFT?
PEFT makes it possible to train large models faster and on smaller hardware.
Specifically, benefits of PEFT include:
- Faster training speed: When fewer parameters are updated, PEFT allows for quicker experimentation and iteration.
- Resource-efficient: PEFT uses much less GPU memory than traditional fine-tuning and can run on consumer-grade hardware. This means you can fine-tune an LLM on a laptop rather than needing a dedicated server.
- Ability to overcome catastrophic forgetting: Catastrophic forgetting happens when the model forgets the knowledge it’s already learned when provided with new training data. PEFT helps models avoid catastrophic forgetting because it only updates a few parameters rather than the whole model.
- Portable: Models tuned with PEFT are smaller, more manageable, and easier to deploy across platforms. This makes the model easier to update and improve in an operational environment.
- Sustainable: PEFT aligns with eco-friendly operational goals by using fewer computational resources.
- Accessible: Teams and organizations with fewer computational resources can fine-tune models and still achieve a desirable result.
How does PEFT work?
LLMs are composed of multiple neural network layers. Think of these layers as a type of flow chart, starting with an input layer and ending with an output layer. Sandwiched between these 2 layers are many other layers, each playing a role in processing data as it moves through the neural network.
If you want to adjust the way a language model processes information, you change the parameters.
What are parameters in an LLM?
Parameters (sometimes called weights) shape an LLM’s understanding of language.
Think of each parameter as an adjustable gear within a machine. Each parameter has a specific numerical value, and shifting that value affects the model's ability to interpret and generate language.
An LLM can contain billions (even hundreds of billions) of parameters. The more parameters a model has, the more complex the tasks it can perform.
However, as the number of parameters in a model increases, so does the need for hardware resources. Organizations may not have the means to invest in these hardware requirements, which is why tuning techniques like PEFT are so important.
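To make this concrete, the toy PyTorch sketch below (a stand-in, not a real LLM) shows that parameters are simply tensors of numbers, and counts how many of them a small model holds:

```python
import torch.nn as nn

# A toy stand-in for a model: real LLMs stack many transformer layers,
# but each layer is still just a collection of numeric weights like these.
toy_model = nn.Sequential(
    nn.Linear(512, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
)

# Every parameter is a tensor of adjustable numbers (weights).
total_params = sum(p.numel() for p in toy_model.parameters())
print(f"Total parameters: {total_params:,}")  # about 2.1 million for this toy model

# A production LLM holds billions of such values, which is why updating
# all of them during traditional fine-tuning demands so much hardware.
```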
Fine-tuning parameters, efficiently
PEFT strategically modifies only a small number of parameters while preserving most of the pretrained model’s structure. Some examples of ways to make these adjustments include:
Freezing model layers: During training, every layer of a neural network would normally have its parameters updated. By freezing some of those layers, you skip those updates and cut down on the processing power and memory needed for fine-tuning (a brief sketch of layer freezing appears after these examples).
Adding adapters: Think of adapters like an expansion pack for a board game. Adapters are added on top of the layers within the pre-trained model and trained to learn domain- or application-specific information. In this scenario, the original model doesn’t change, but gains new capabilities.
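As a rough illustration of layer freezing, here is a minimal PyTorch sketch. The base model (gpt2) and the choice to unfreeze only the final transformer block are assumptions made for the example, not a recommendation:

```python
from transformers import AutoModelForCausalLM

# Small model used purely for illustration.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze every parameter so it receives no gradient updates during training...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the final transformer block, so fine-tuning
# touches a small fraction of the model.
for param in model.transformer.h[-1].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} of {total:,} parameters ({100 * trainable / total:.1f}%)")
```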
There are several methods used to perform PEFT, including:
- LoRA (low-rank adaptation)
- QLoRA (quantized low-rank adaptation)
- Prefix tuning
- Prompt tuning
- P-tuning
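As a rough sketch of the adapter-style approach, the example below uses the open source Hugging Face peft library to attach LoRA adapters to a pretrained model. The base model and the LoRA hyperparameters shown are illustrative assumptions, not recommended settings:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Hypothetical base model, chosen only to keep the example small.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA injects small trainable matrices alongside the attention projections
# while the original pretrained weights stay frozen.
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically reports well under 1% trainable
```

After training, only the small adapter weights need to be saved and shared, which is part of what makes PEFT-tuned models so portable.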
What is fine-tuning?
Fine-tuning is a way to communicate intent to an LLM so the model can tailor its output to fit your goals.
Consider this: An LLM might be able to write an email in the style of Shakespeare, but it doesn’t know anything about the details of the products your company provides.
To train the model with your unique information, you can use fine-tuning.
Fine-tuning is the process of training a pretrained model further with a more tailored data set so it can effectively perform unique tasks. This additional training data modifies the model’s parameters and creates a new version that replaces the original model.
Fine-tuning is critical to personalizing an LLM for a domain-specific use case. However, traditional fine-tuning comes at a cost.
Why is fine-tuning expensive?
Several factors contribute to the cost of fine-tuning an LLM, such as:
- GPU requirements: Fine-tuning demands a lot of processing power. Graphics processing units (GPUs) are expensive to purchase and operate, and they need to be running for extended periods of time during the fine-tuning process. Power consumption and cooling can also be costly.
- Data requirements: Data sets needed to fine-tune an LLM with new information must be high quality and properly labeled. Acquiring, building, and pre-processing this data can be expensive and time-consuming.
Which LLM alignment technique is right for me?
LLM alignment refers to the process of training and personalizing a language model to produce the outputs that you want.
When deciding between different LLM alignment techniques, consider the following factors:
- Data dependency: How much data is needed? Do you have access to the data required for this technique to work?
- Accuracy: How much does this technique impact the accuracy of the model after tuning?
- Complexity for users: How easy is it to use?
Compared to traditional fine-tuning, PEFT requires less data, can achieve comparable accuracy, and is more user-friendly.
Other LLM alignment options to explore include:
- Retrieval-augmented generation (RAG): RAG provides a means to supplement the data that exists within an LLM with external knowledge sources of your choosing—such as data repositories, collections of text, and pre-existing documentation.
- RAG has a high data dependency, but has high accuracy rates and is less complex to use than fine-tuning. Read about RAG vs. fine-tuning.
- InstructLab: Created by IBM and Red Hat, the InstructLab community project lets anyone in an organization contribute knowledge and skills that all get built together into a language model.
- InstructLab has a low data dependency because it uses synthetic data to supplement human-generated data. Its accuracy is comparable to fine-tuning, and complexity for users is very low.
How Red Hat can help
Parameter-efficient fine-tuning is 1 of several alignment techniques supported on Red Hat® OpenShift® AI.
OpenShift AI is a flexible, scalable MLOps platform with tools to build, deploy, and manage AI-enabled applications. OpenShift AI supports the full lifecycle of AI/ML experiments and models, on premises and in the public cloud.
Learn more about Red Hat OpenShift AI
Red Hat® AI is a portfolio of products and services that can help your enterprise at any stage of the AI journey, whether you’re at the very beginning or ready to scale across the hybrid cloud. It can support both generative and predictive AI efforts for your unique enterprise use cases.
Red Hat AI is powered by open source technologies and a partner ecosystem that focuses on performance, stability, and GPU support across various infrastructures. It offers efficient tuning of small, fit-for-purpose models with the flexibility to deploy wherever your data resides.
Creating cost-effective specialized AI solutions with LoRA adapters on Red Hat OpenShift AI
Low-rank adaptation (LoRA) is a fine-tuning technique that promotes efficiency and saves resources.