What if you could get similar results from your large language model (LLM) with 75% less GPU memory? In my previous article, we discussed the benefits of smaller LLMs and some of the techniques for shrinking them. In this article, we'll put this to the test by comparing the results of the smaller and larger versions of the same LLM.
As you'll recall, quantization is one of the techniques for reducing the size of an LLM. Quantization achieves this by representing the LLM parameters (e.g. weights) in lower precision formats: from 32-bit floating point (FP32) to 8-bit integer (INT8) or INT4. The tradeoff is a potential loss of accuracy, so the goal is to achieve these savings while keeping the reduction in prediction quality acceptably small.
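To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization. This is illustrative only; the quantizers used by real inference stacks typically use more sophisticated block-wise schemes, and none of the values below come from the benchmark in this article.

```python
import torch

def quantize_int8(weights: torch.Tensor):
    """Symmetric per-tensor quantization of FP32 weights to INT8."""
    scale = weights.abs().max() / 127                # map the largest magnitude to 127
    q = torch.clamp((weights / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover approximate FP32 values from the INT8 representation."""
    return q.float() * scale

w = torch.randn(4096, 4096)                          # a dummy FP32 weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.element_size(), "bytes vs", q.element_size(), "byte per parameter")  # 4 vs 1
print("mean quantization error:", (w - w_hat).abs().mean().item())           # small
```

The 4x reduction in bytes per parameter is where the memory savings come from; the small reconstruction error is where the potential accuracy loss comes from.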
Are quantized LLMs accurate enough?
All the benefits of smaller LLMs are moot if the results are not accurate enough to be useful. There are a number of benchmarks available that measure and compare model accuracy, but for the sake of simplicity, let's manually inspect the quality of responses from non-quantized and quantized LLMs.
Setup
The test environment is an AWS g6.12xlarge instance (4 NVIDIA L4 Tensor Core GPUs, 96GB GPU memory, 48 vCPUs, 192GB CPU memory). We'll be using the Mistral 7B Instruct LLM at FP32, INT8 and INT4 precision. The model is served with Ollama, and system performance is monitored with the nvtop utility.
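For reference, here is a minimal sketch of how each prompt can be sent to the different model versions through Ollama's local REST API. The model tags below are illustrative placeholders; the exact tags for each precision (including whether an FP32 build is available) may differ from what was used for these tests.

```python
import requests

# Illustrative tags only; substitute the tags for the builds you actually pulled.
MODELS = ["mistral:7b-instruct-fp16", "mistral:7b-instruct-q8_0", "mistral:7b-instruct-q4_0"]
PROMPT = "In 20 words or less, describe the theory of relativity"

for model in MODELS:
    # Ollama serves a REST endpoint on localhost:11434 by default.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=300,
    )
    print(model, "->", resp.json()["response"])
```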
Let’s get started with a set of prompts focusing on different areas of knowledge.
Prompt #1: In 20 words or less, describe the theory of relativity
| FP32 | INT8 | INT4 |
| --- | --- | --- |
| Groundbreaking work by Albert Einstein; connected space, time, gravity, & energy, challenging traditional science views. Consists of special & general relativity theories. | The theory of relativity describes the relationship between space and time, including the effects of gravity on both, and sets the foundational principles for modern physics. | The theory of relativity describes the laws of physics in a frame of reference where space and time are interconnected; it involves the famous equation E=mc^2. |
Observation: For generic factual information, the responses are accurate and similar for FP32, INT8 and INT4.
Prompt #2: Write a Python function that calculates the factorial of a number
FP32, INT8 and INT4 all returned the same code:

```python
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)
```
Observation: For coding, the responses are accurate and identical for FP32, INT8 and INT4.
Prompt #3: Write a Python function that sorts numbers 8, 1, 9, 2, 5, 4 in ascending order
FP32, INT8 and INT4 all returned the same code:

```python
def sort_numbers(numbers):
    sorted_numbers = sorted(numbers)
    return sorted_numbers

numbers = [8, 1, 9, 2, 5, 4]
sorted_numbers = sort_numbers(numbers)
print(sorted_numbers)
```
Observation: For coding, the responses are accurate and identical for FP32, INT8 and INT4.
Prompt #4: In 10 words or less, describe which country won the most total medals in the Olympics 2024
| FP32 | INT8 | INT4 |
| --- | --- | --- |
| Not specified; as of the time of this response, the Olympic Games 2024 have not yet taken place. The country that won the most total medals and gold medals will be determined during the competition itself. | I can't predict future events, but as of now, no Olympics have taken place in the year 2024. So it is not possible for me to answer this question at the moment. | Not specified yet as the Olympic Games 2024 have not taken place. |
Observation: The LLM acknowledges the lack of information, as it was trained before the event took place. The INT4 response (12 words) is the most concise, while FP32 is the most verbose (36 words).
Prompt #5: In 10 words or less, describe OPC/UA protocol and where it is used
| FP32 | INT8 | INT4 |
| --- | --- | --- |
| OPC UA (open platform communications unified architecture) is a standardized communication protocol for industrial automation systems; it is used in manufacturing, energy management, building automation, and other related industries. | OPC UA (open platform communications unified architecture) is a standardized industrial communication protocol for interoperability among devices and systems, widely used in manufacturing automation and IoT. | OPC UA (open platform communication unified architecture) is a cross-platform data exchange standard for industrial automation; it is used in manufacturing industries for communication between devices and systems. |
Observation: For a topic related to a specific industry segment (e.g. IoT), the responses are accurate and quite similar for FP32, INT8 and INT4.
Prompt #6: Summarize this in a simple sentence:
“The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.”
| FP32 | INT8 | INT4 |
| --- | --- | --- |
| The new transformer model based solely on attention mechanisms outperforms existing machine translation models in quality, parallelizability, and training time while also generalizing well to other tasks like English constituency parsing. | The paper introduces a new deep learning model called transformer for machine translation tasks, which outperforms existing models in quality while being more parallelizable and requiring less time to train, achieving state-of-the-art BLEU scores on WMT 2014 English-to-German and English-to-French translations. The model also demonstrates generalization to other tasks like English constituency parsing. | The paper introduces a new neural network architecture called the transformer that outperforms existing sequence transduction models, particularly in machine translation tasks, while being more efficient and parallelizable. |
Observation: For the task of summarizing the abstract of the "Attention Is All You Need" paper, the responses are accurate and quite similar. INT8 produced the longest response, at two sentences versus one for FP32 and INT4.
Prompt #7: A farmer has 17 chickens and cows in total. If the total number of legs is 44, how many cows are there, given that each chicken has 2 legs and each cow has 4 legs, and there are no other animals present?
| FP32 | INT8 | INT4 |
| --- | --- | --- |
| To solve this problem, we need to find the number of chickens and cows from the given information about their total number of legs. Chickens have 2 legs each, so if there are C chickens in total, then the number of chicken legs is: C * 2 = 17 (total number of chicken legs) C = 8.5 [......Response truncated] In this case, we have found a possible solution with C = 9 chickens and cows = 6. It's important to note that there could be another valid solution by rounding down the number of chickens instead. However, assuming no other animals are present, our current answer seems plausible since the total number of animals (chickens + cows) is 15 in both cases, which is within the range of possible solutions based on the total number of legs provided. | There are 8 cows, since each chicken has 2 legs (17 * 2 = 34) and each cow has 4 legs, so the total number of legs for all cows is 8 * 4 = 32. Therefore, the total number of legs minus the number of chicken legs should equal the number of cow legs, which gives us (44 - 34) = 10. Since each cow has 4 legs, there are 10 / 4 = 8 cows in total. | There are 6 cows since each cow has 4 legs and the total number of legs for all the animals (chickens and cows) is 44. To find out how many chickens there are: 1. We know that there are 6 cows with a total of 24 legs (since each cow has 4 legs). 2. This leaves us with 18 legs for the chickens (44 - 24 = 18). 3. Since each chicken has 2 legs, there must be 9 chickens (18 / 2 = 9). So, there are 6 cows and 9 chickens in total. |
Note: The correct answer is 12 chickens and 5 cows (12*2 + 5*4 = 44).
Observation: All versions of the LLM fail at this arithmetic problem. The main difference is length: the INT8 (71 words) and INT4 (91 words) answers are far more concise than FP32 (272 words).
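For reference, the correct answer can be verified with a few lines of Python:

```python
# chickens + cows = 17 and 2*chickens + 4*cows = 44
# Substituting chickens = 17 - cows gives 2*(17 - cows) + 4*cows = 44,
# so 2*cows = 10, i.e. cows = 5 and chickens = 12.
for cows in range(18):
    chickens = 17 - cows
    if 2 * chickens + 4 * cows == 44:
        print(f"{chickens} chickens, {cows} cows")   # prints: 12 chickens, 5 cows
```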
Savings with a quantized LLM
We can see from the above examples that quantized LLMs are reasonably accurate in their predictions. Now let's compare GPU resource consumption during inference across the different versions of the same LLM.
GPU utilization during inference drops slightly as we move from FP32 to INT4 (28% to 24%), but GPU memory consumption drops dramatically (37% to 8%).
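A rough back-of-envelope calculation of weight storage alone shows why the memory savings are so large. This ignores the KV cache, activations and runtime overhead, which is why the measured numbers above are somewhat higher:

```python
# Approximate weight storage for a 7B-parameter model at different precisions.
params = 7e9
for name, bytes_per_param in [("FP32", 4), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.1f} GB")
# FP32: ~28.0 GB, INT8: ~7.0 GB, INT4: ~3.5 GB
```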
What about enterprise use cases?
For an enterprise looking to benefit from generative AI, an LLM needs to be benchmarked on its specific use cases rather than on encyclopedic knowledge or mathematical skills. These use cases could include tasks like a customer service chatbot or text summarization focused on a specific domain or company data.
New hire orientation for an LLM
An off-the-shelf LLM is like a new college graduate: both have good general knowledge and natural language capabilities. When a new college graduate joins a company, they need to be trained in company (and industry) specific knowledge and tasks before they can be productive. For example, when someone hears the word "cloud", they will most likely think of other related words (white, dark, sky, rain, storm…). Context is a key factor in determining the priority of these related words. For an LLM, this priority is encoded in its parameters (e.g. weights).
For an IT enterprise, the term "cloud" will carry different weight than for an agriculture business. For an employee (or LLM) in an IT company, the word "cloud" will have higher weights for related words like computing, AWS and Azure, while for an employee (or LLM) in an agricultural business, the word "cloud" will have higher weights for related words like rain, precipitation and storm.
On-the-job training for a new employee is similar to fine-tuning an LLM. In both cases, the goal is to get better at specific tasks within a given domain. For an employee, this could mean in-house training or shadowing experienced colleagues. For an LLM, it means fine-tuning the model on a specific set of data or tasks, adjusting the weights to improve performance.
Note: To be fair, it's worth mentioning that it's possible to add context or specific knowledge at query time through retrieval augmented generation (RAG). But providing context with every query is cumbersome and adds latency, which doesn't suit fast-response applications like a support chatbot. This is why fine-tuning an LLM with company-specific knowledge or tasks becomes a necessity for many enterprise use cases.
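To illustrate the per-query overhead, here is a minimal RAG sketch. The `retrieve` function is hypothetical and stands in for a vector-store lookup; the Ollama endpoint is the same local REST API used earlier.

```python
import requests

def answer_with_rag(question: str, retrieve) -> str:
    """Minimal RAG sketch: fetch relevant passages and prepend them to every prompt.

    `retrieve` is a hypothetical callable returning the top matching passages
    for the question; every request pays the cost of this lookup plus a longer prompt.
    """
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": prompt, "stream": False},
        timeout=300,
    )
    return resp.json()["response"]
```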
There are several approaches to fine-tuning an LLM, including reinforcement learning from human feedback (RLHF), knowledge distillation with student/teacher models, and parameter-efficient fine-tuning with LoRA. How can an enterprise fine-tune an off-the-shelf LLM without incurring the massive costs (data science skills, infrastructure resources) involved in retraining a model?
Fine-tuning an LLM for everyday users
InstructLab is an open source project that allows everyday users (as opposed to data scientists) to add new knowledge and skills to AI models using a simple question-and-answer format. This can be used by enterprises that want to customize an LLM with company-specific knowledge or tasks. Let's put it into practice with sample data from this repository.
We'll assume this is an insurance company customer service chatbot being asked about information that it doesn't already have (in this case, the fictional flux capacitor from the movie "Back to the Future"). Once again, we'll use the Mistral 7B Instruct model we used earlier.
Prompt (pre fine-tuning): How much does the flux capacitor cost for DMC-12
Response: The cost of a flux capacitor for a DeLorean DMC-12, as depicted in the Back to the Future movie series, is not specified and is generally considered fictional. In reality, similar time travel technology is not currently available or priced.
Observation: The LLM accurately points out the lack of information for this fictional item.
Now let's fine-tune this LLM by adding the missing knowledge (i.e., that the flux capacitor costs $10,000,000). In reality, this could be any proprietary data an enterprise wants to add, e.g. a product catalog, actuarial table or medical insurance coverage rates.
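As a rough illustration of the question-and-answer format, here is a sketch of what the seed data might look like, written out from Python. The field names are simplified assumptions, not the exact InstructLab qna.yaml schema; see the InstructLab taxonomy documentation for the authoritative layout.

```python
import yaml  # PyYAML

# Simplified, illustrative seed data in a question-and-answer style.
# Field names are assumptions for this sketch, not the exact InstructLab schema.
seed = {
    "created_by": "your-github-id",
    "seed_examples": [
        {
            "question": "How much does the flux capacitor cost for the DMC-12?",
            "answer": "The flux capacitor for the DMC-12 costs $10,000,000.",
        },
    ],
}

with open("qna.yaml", "w") as f:
    yaml.safe_dump(seed, f, sort_keys=False)
```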
Now let's see the results after fine-tuning the LLM.
Prompt (post fine-tuning): How much does the flux capacitor cost for DMC-12
Response: The cost of a flux capacitor is $10,000,000.
Observation: The LLM accurately returns the information provided during fine-tuning. Enterprises can shape the behavior of an LLM by tuning it with their proprietary data or tailoring its responses to specific tasks.
If you're interested in learning more about how to use InstructLab to add new knowledge and skills to an LLM of your choice, check out the interactive step-by-step InstructLab demo here.
About the author
Ishu Verma is a Technical Evangelist at Red Hat focused on emerging technologies like edge computing, IoT and AI/ML. He and fellow open source hackers work on building solutions with next-gen open source technologies. Before joining Red Hat in 2015, Verma worked at Intel on IoT Gateways and building end-to-end IoT solutions with partners. He has been a speaker and panelist at IoT World Congress, DevConf, Embedded Linux Forum, Red Hat Summit and other on-site and virtual forums. He lives in the Valley of the Sun, Arizona.