In the first part of this article, you learned some key concepts, tested InstructLab, and successfully chatted with the out-of-the-box model. In this article, I'll show you how to infuse your own knowledge into the model, using a sample dataset of Brazilian soccer team data to train it.
Preparing your system for training
If you were running or chatting with your model, be sure to stop both the chat and the server instances.
If you are running Red Hat Enterprise Linux AI (RHEL AI), elevate yourself to root:
sudo su -
Now, download the model that we will be using as the teacher (the question generator, which leverages its own knowledge to create diverse questions):
ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release latest
If you are using the upstream InstructLab, first you'll need a free account on Hugging Face, which is a repository for many AI resources. Create an account here. Then, create a read-only token here. Take note of the token somewhere, because you won’t be able to retrieve it again.
Now, download the Mistral 7B GGUF teacher/critic model that we will be using in SDG. Be sure to add your Hugging Face token:
ilab model download --hf-token <your-hf-token> --repository TheBloke/Mistral-7B-Instruct-v0.2-GGUF --filename mistral-7b-instruct-v0.2.Q4_K_M.gguf
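To double-check that a download succeeded, you can list the models InstructLab knows about. This is just a sanity check; the exact output depends on your InstructLab version:

ilab model list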
Training
Synthetic data generation (SDG)
After ilab taxonomy diff happily took the qna.yaml that you crafted in the first part of this tutorial, and you've made sure that all your GitHub ducks are in a row, the next step is to generate the synthetic data. SDG uses the teacher model to generate new examples of possible questions and answers about the new data that you are providing to the model.
Each SDG run leaves several artifacts in the $ILAB_PATH/datasets/ directory, and these artifacts are sizable. Unless you plan to reuse some previously generated data, delete the files from the directory (only the files, not the directory!) before starting a new SDG run, to avoid confusion and save space.
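As a minimal sketch (assuming $ILAB_PATH is set the same way it is used throughout this article), you can inspect and clean the directory like this:

ls -lh $ILAB_PATH/datasets/
rm -f $ILAB_PATH/datasets/*   # removes files only; any subdirectories are skipped (with a warning)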
Now run the following command, prefixed by the time command, so you can see how long it takes:
time ilab data generate
What happens if you run ilab data generate and it fails with some error? How do you figure out what happened? In this case, run the command again with an extra parameter: ilab data generate --enable-serving-output. This will show you why SDG failed.
If you find a CUDA error in the messages, the data generation failed because your machine doesn't meet the RHEL AI requirements. If you couldn't find a matching hardware profile when you ran ilab config init, this is why. You need several capable GPUs (or you can run it with config zero, without GPUs) for SDG to be successful. You can verify your system capabilities by running nvidia-smi and checking that your GPUs are listed and match the RHEL AI requirements.
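For a quick look at just the GPU models and their memory, this sketch works on systems with the NVIDIA drivers installed:

nvidia-smi --query-gpu=name,memory.total --format=csv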
If your machine does meet the requirements, you might have left an ilab model serve running. Stop it and try again.
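If you are not sure whether a stray server is still around, this sketch finds and stops it (assuming the process command line contains "ilab model serve"; stopping it with Ctrl-C in its own terminal works just as well):

pgrep -af "ilab model serve"
pkill -f "ilab model serve"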
On a g6e.12xlarge instance with 4x NVIDIA L40S GPUs, synthetic data generation takes around 3-6 minutes. On a consumer-grade system, it takes about an hour.
Train
A word of caution here. If you don't have any GPUs (NVIDIA, AMD, Apple MacBook, or Intel Gaudi), you will need at least 32 GB of RAM plus 64 GB of swap. While testing these procedures on consumer-grade systems, I found that you can happily chat and do the SDG phase on systems with 8 or 16 GB of RAM. But the training phase demands a lot on the hardware side.
While 64 GB of swap can be eye-popping, I got training to run on an X1 Carbon with 32 GB of RAM and 64 GB of swap without any I/O thrashing. The kernel will efficiently put the "cold" memory pages into the swap device and reclaim them on demand. A training session using my example data on this kind of hardware is estimated to take around 48 hours. My recommendation? Run the SDG/training on hardware with GPUs (rent one from your favorite cloud provider), copy the trained model (more on this later), and then test it on your lower-spec system.
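If you do go the no-GPU route and need the extra swap, here is a minimal sketch for adding a 64 GB swap file (assuming you have the disk space; on some filesystems you may need dd instead of fallocate):

sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # confirm the new swap space is active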
During my first attempt to train the model, my test system was shut down by our internal platform timer (which prevents runaway instances). After 4 hours of running, it hadn't finished, even with plenty of GPUs and hundreds of gigabytes of RAM. But I saw on Grant's YouTube video that he was running similar training on his MacBook in a reasonable time. Was it cooled to superconductor temperatures and extremely overclocked?
After some investigation, I found that RHEL AI does the thorough training, a multi-phase process that changes and repacks the whole model, while the upstream InstructLab runs a "test grade" training using low-rank adaptation (LoRA), which basically layers adapters over the base model. So, how do we do the quick training?
Since we're working on RHEL AI, we will be doing epoch-based training. Think of each epoch as a complete pass through the entire training dataset. More epochs generally mean the model learns your data more accurately, but they also require more time to train, and after a certain point you run into diminishing returns. This being the case, start with 10 epochs. Why 10? It should be enough for testing, while staying within a reasonable time frame for training.
Putting it all together
Now that you have a fresh SDG, everything's in place for the training phase. On a RHEL AI machine, run:
time ilab model train --data-path $ILAB_PATH/datasets/knowledge_train_msgs_[TIMESTAMP].jsonl --device=cuda --num-epochs 10
If you are running a consumer-grade system using upstream InstructLab, run:
time ilab model train --data-path $ILAB_PATH/datasets/knowledge_train_msgs_[TIMESTAMP].jsonl --pipeline=full --effective-batch-size=64 --max-batch-len=5000
Caution: do not reuse (via bash history) commands from previous training runs - you might end up reusing old synthetic data instead of your latest. I know this because I was getting stale data, only to discover that I was training using old SDGs.
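A quick sketch to confirm you are pointing at the newest dataset file (assuming the default file naming shown above):

ls -t $ILAB_PATH/datasets/knowledge_train_msgs_*.jsonl | head -n 1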
This phase should take around 15 minutes on a system with four GPUs, if the amount of training data (and the underlying hardware) is more or less the same as mine. On a Mac laptop it should take around 3 hours. On a system without GPUs, it will take around 2 days.
Testing it
After a successful training phase, you are now ready to test your trained model with your own data!
First, serve the model:
ilab model serve --model-path $ILAB_PATH/checkpoints/hf_format/samples_NNNN
Remember when I told you about the rounds of epoch training to find the most suitable result? At $ILAB_PATH/checkpoints/hf_format/ you will find each epoch run; InstructLab keeps a copy of each iteration. Start with the samples_NNNN directory that has the highest number in hf_format.
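A sketch to list the checkpoints you have, highest numbers last (assuming the samples_NNNN naming described above):

ls -d $ILAB_PATH/checkpoints/hf_format/samples_* | sort -V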
Wait for the server to start (you'll see an INFO: Application startup complete. message), and only then chat with your model:
ilab model chat --model $ILAB_PATH/checkpoints/hf_format/samples_NNNN
Caution: Do not append a trailing / to the samples directory; ilab model chat doesn't like it.
Moving forward
Is it performing the way you planned? Awesome! What if it is not? Try one of the lower-numbered epochs; maybe one of the other iterations got your data right.
Run your tests. Test it, abuse it, see the shortcomings (you’ll spot these very quickly) and shuffle through the epoch rounds. If you are still not satisfied, apply some anthropology to find out what in your data could be confusing your model and retrain it until you are happy with the results.
To restart everything:
- Go back to your main knowledge Markdown document and update it
- Take note of the commit ID and update it in qna.yaml
- If you changed a block in the Markdown file that is part of a questions & answers block in qna.yaml, be sure to reflect the changes in the Q&As too
- Wipe out the existing SDG and trained models:
rm -rf $ILAB_PATH/checkpoints/hf_format/*
rm -rf $ILAB_PATH/datasets/*
- Run a new SDG
- Run a new training (note the new SDG timestamp)
- Test again
- Rinse and repeat if needed
It's an iterative process. Yes, it's a journey. It took me several days, 11:30 PM and 6 AM commits, dreaming of solutions (and writing them down), and, of course, the daunting pressure of a deadline.
You may find that an ilab model chat session leans towards a certain set of answers over the course of the session. This is by design. If you want a fresh start, end the current session (exit) and start another ilab model chat session. There is no need to restart the whole model on the server side.
Serving it elsewhere
The machine that you used to train your model might be an expensive one. You don’t necessarily need that level of system to serve your model.
My suggestion is to serve your model with a more modest system. Here's how to do that.
Pack your golden model:
tar -cvf ~/trained-granite-model.tar $ILAB_PATH/checkpoints/hf_format/samples_NNNN
- Go to your leaner machine and scp the model to it (see the sketch after these steps)
- Run an ilab config init and select the proper configuration
- cd to / and extract your tar file (so it can retain the original directory tree)
- Serve the model, limiting its usage and requirements (and you're welcome for this):
ilab model serve --model-path $ILAB_PATH/checkpoints/hf_format/samples_NNNN --backend=vllm -- --swap-space=1 --tensor-parallel-size=1
Open another session for the chat client (after the server-side is up and running):
ilab model chat --model $ILAB_PATH/checkpoints/hf_format/samples_NNNN
- Try and test it!
Note that the model doesn't have to retain the original directory name; you can save it somewhere else or give it a different name.
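As a sketch of the copy-and-unpack steps (the user and hostname are placeholders, and I'm assuming the tar file was created with the command above, so the paths inside it mirror $ILAB_PATH; you may need root if that path lives somewhere privileged):

scp ~/trained-granite-model.tar user@leaner-machine:~/
# then, on the leaner machine:
cd /
tar -xvf ~/trained-granite-model.tar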
Some final tips
- I found that disk speed is a key factor, as lots of data gets moved around
- The RHEL AI machine is based on bootc containers, with a tiny root partition. If you need extra instrumentation and packages, do the following, in this order:
  - Run subscription-manager register
  - Run bootc usroverlay to create a temporary space where you can add the new binary data
  - Then run your yum/dnf commands
  - Note: The installed packages don't persist across reboots
  - RHEL AI 1.2 has a known issue where running yum/dnf presents a curl error; check the resolution in this article
- You might want to use the tmux tool when doing something that takes time over an unreliable link (see the short sketch after this list)
- Remember the tripod: resources x time x scope
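A minimal tmux sketch, in case you haven't used it before:

tmux new -s ilab        # start a named session, then run your long ilab commands inside it
# detach with Ctrl-b, then d; reattach later with:
tmux attach -t ilab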
I'm not a data scientist, a developer, or even fluent in Python. Nope: with a tech degree in chemistry, a graduate degree in administration, and my whole IT background as a sysadmin, I was able to develop my own AI model, tune it with my own data, and control every aspect of training it.
And this is exactly what the InstructLab project aims to do: increase the accessibility, efficiency, and democratization of AI. This would have been unthinkable a few years ago, as it would have required very specialized personnel to achieve such an objective.
Seize the moment. Be bold. Try it. Control it. Make it yours.
About the author
Rodrigo is a tenured professional with a distinguished track record of success and experience in several industries, especially high performance and mission critical environments in FSI. A negotiator at heart, throughout his 20+ year career he has leveraged his deep technical background and strong soft skills to deliver exceptional results for his clients and organizations, often ending in long-standing relationships as a trusted advisor. Currently, Rodrigo is deep diving into AI technology.