In the first part of this article, you learned some key concepts, tested InstructLab, and successfully chatted with the out-of-the-box model. In this article, I'll show you how to infuse your own knowledge into the model, using a sample dataset of Brazilian soccer team data to train it.
Preparing your system for training
If you were running or chatting with your model, be sure to stop both the chat and the server instances.
If you are running Red Hat Enterprise Linux AI (RHEL AI), elevate yourself to root:
sudo su -
Now, download the model that we will be using as the teacher (the question generator, which leverages its own knowledge to create diverse questions):
ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release latest
If you are using the upstream InstructLab, first you'll need a free account on Hugging Face, which is a repository for many AI resources. Create an account here. Then, create a read-only token here. Take note of the token somewhere, because you won’t be able to retrieve it again.
Now, download the Mistral 7B GGUF teacher/critic model that we will be using in SDG. Be sure to add your Hugging Face token:
ilab model download --hf-token <your-hf-token> --repository TheBloke/Mistral-7B-Instruct-v0.2-GGUF --filename mistral-7b-instruct-v0.2.Q4_K_M.gguf
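To double-check that a download succeeded, you can list the models InstructLab knows about. This is just a sanity check; the exact output depends on your InstructLab version:

ilab model list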
Training
Synthetic data generation (SDG)
After ilab taxonomy diff happily took the qna.yaml that you crafted in the first part of this tutorial, and you've made sure that all your GitHub ducks are in a row, the next step is to generate the synthetic data. SDG uses the teacher model to generate new examples of possible questions and answers about the new data that you are providing to the model.
Each SDG run leaves several artifacts in the $ILAB_PATH/datasets/ directory, and these artifacts are sizable. Unless you plan to reuse some previously generated data, delete the files from the directory (only the files, not the directory!) before starting a new SDG run, to avoid confusion and save space.
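As a minimal sketch (assuming $ILAB_PATH is set the same way it is used throughout this article), you can inspect and clean the directory like this:

ls -lh $ILAB_PATH/datasets/
rm -f $ILAB_PATH/datasets/*   # removes files only; any subdirectories are skipped (with a warning)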
Now run the following command, prefixed by the time command, so you can see how long it takes:
time ilab data generate
What happens if you run ilab data generate and it fails with some error? How do you figure out what happened? In this case, run the command again with an extra parameter: ilab data generate --enable-serving-output. This will show you why SDG failed.
If you find a CUDA error in the messages, the data generation failed because your machine doesn't meet the RHEL AI requirements. If you couldn't find a matching hardware profile when you ran ilab config init, this is why. You need several capable GPUs (or you can run it with config zero, without GPUs) for SDG to be successful. You can verify your system capabilities by running nvidia-smi and checking that your GPUs are listed and match the RHEL AI requirements.
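For a quick look at just the GPU models and their memory, this sketch works on systems with the NVIDIA drivers installed:

nvidia-smi --query-gpu=name,memory.total --format=csv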
If your machine does meet the requirements, you might have left an ilab model serve running. Stop it and try again.
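If you are not sure whether a stray server is still around, this sketch finds and stops it (assuming the process command line contains "ilab model serve"; stopping it with Ctrl-C in its own terminal works just as well):

pgrep -af "ilab model serve"
pkill -f "ilab model serve"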
On a g6e.12xlarge instance with 4x NVIDIA L40S GPUs, synthetic data generation takes around 3-6 minutes. On a consumer-grade system, it takes about an hour.
Train
A word of caution here. If you don't have any GPUs (NVIDIA, AMD, Apple MacBook, or Intel Gaudi), you will need at least 32 GB of RAM plus 64 GB of swap. While testing these procedures on consumer-grade systems, I found that you can happily chat and do the SDG phase on systems with 8 or 16 GB of RAM. But the training phase demands a lot on the hardware side.
While 64 GB of swap can be eye-popping, I got training to run on an X1 Carbon with 32 GB of RAM and 64 GB of swap without any I/O thrashing. The kernel will efficiently put the "cold" memory pages into the swap device and reclaim them on demand. A training session using my example data on this kind of hardware is estimated to take around 48 hours. My recommendation? Run the SDG/training on hardware with GPUs (rent one from your favorite cloud provider), copy the trained model (more on this later), and then test it on your lower-spec system.
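If you do go the no-GPU route and need the extra swap, here is a minimal sketch for adding a 64 GB swap file (assuming you have the disk space; on some filesystems you may need dd instead of fallocate):

sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show   # confirm the new swap space is active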
During my first attempt to train the model, my test system was shut down by our internal platform timer (which prevents runaway instances). After 4 hours of running, it hadn't finished, even with plenty of GPUs and hundreds of gigabytes of RAM. But I saw on Grant's YouTube video that he was running similar training on his MacBook in a reasonable time. Was it cooled to superconductor temperatures and extremely overclocked?
After some investigation, I found that RHEL AI does the thorough training, a multi-phase process that changes and repacks the whole model, while the upstream InstructLab runs a "test grade" training using low-rank adaptation (LoRA), which basically layers adapters over the base model. So, how do we do the quick training?
Since we're working on RHEL AI, we will be doing epoch-based training. Think of each epoch as a complete pass through the entire training dataset. More epochs generally mean the model learns your data more accurately, but they also require more time to train, and after a certain point you run into diminishing returns. This being the case, start with 10 epochs. Why 10? It should be enough for testing, while staying within a reasonable time frame for training.
Putting it all together
Now that you have a fresh SDG, everything's in place for the training phase. On a RHEL AI machine, run:
time ilab model train --data-path $ILAB_PATH/datasets/knowledge_train_msgs_[TIMESTAMP].jsonl --device=cuda --num-epochs 10
If you are running a consumer-grade system using upstream InstructLab, run:
time ilab model train --data-path $ILAB_PATH/datasets/knowledge_train_msgs_[TIMESTAMP].jsonl --pipeline=full --effective-batch-size=64 --max-batch-len=5000
Caution: do not reuse (via bash history) commands from previous training runs - you might end up reusing old synthetic data instead of your latest. I know this because I was getting stale data, only to discover that I was training using old SDGs.
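A quick sketch to confirm you are pointing at the newest dataset file (assuming the default file naming shown above):

ls -t $ILAB_PATH/datasets/knowledge_train_msgs_*.jsonl | head -n 1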
This phase should take around 15 minutes on a system with four GPUs, if the amount of training data (and the underlying hardware) is more or less the same as mine. On a Mac laptop it should take around 3 hours. On a system without GPUs, it will take around 2 days.
Testing it
After a successful training phase, you are now ready to test your trained model with your own data!
First, serve the model:
ilab model serve --model-path $ILAB_PATH/checkpoints/hf_format/samples_NNNN
Remember when I told you about the rounds of epoch training to find the most suitable result? At $ILAB_PATH/checkpoints/hf_format/ you will find each epoch run; InstructLab keeps a copy of each iteration. Start with the samples_NNNN directory that has the highest number in hf_format.
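A sketch to list the checkpoints you have, highest numbers last (assuming the samples_NNNN naming described above):

ls -d $ILAB_PATH/checkpoints/hf_format/samples_* | sort -V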
Wait for the server to start (you'll see an INFO: Application startup complete. message), and only then chat with your model:
ilab model chat --model $ILAB_PATH/checkpoints/hf_format/samples_NNNN
Caution: Do not append a trailing / to the samples directory; ilab model chat doesn't like it.
Moving forward
Is it performing the way you planned? Awesome! What if it is not? Try one of the lower-numbered epochs; maybe one of the other iterations got your data right.
Run your tests. Test it, abuse it, see the shortcomings (you’ll spot these very quickly) and shuffle through the epoch rounds. If you are still not satisfied, apply some anthropology to find out what in your data could be confusing your model and retrain it until you are happy with the results.
To restart everything:
- Go back to your main knowledge Markdown document and update it
- Take note of the commit ID and update it in qna.yaml
- If you changed a block in the Markdown file that is part of a questions & answers block in qna.yaml, be sure to reflect the changes in the Q&As too
- Wipe out the existing SDG and trained models:
rm -rf $ILAB_PATH/checkpoints/hf_format/*
rm -rf $ILAB_PATH/datasets/*
- Run a new SDG
- Run a new training (note the new SDG timestamp)
- Test again
- Rinse and repeat if needed
It's an iterative process. Yes, it's a journey. It took me several days, 11:30 PM and 6 AM commits, dreaming of solutions (and writing them down), and, of course, the daunting pressure of a deadline.
You may find that an ilab model chat session leans towards a certain set of answers over the course of the session. This is by design. If you want a fresh start, end the current session (exit) and start another ilab model chat session. There is no need to restart the whole model on the server side.
Serving it elsewhere
The machine that you used to train your model might be an expensive one. You don’t necessarily need that level of system to serve your model.
My suggestion is to serve your model with a more modest system. Here's how to do that.
Pack your golden model:
tar -cvf ~/trained-granite-model.tar $ILAB_PATH/checkpoints/hf_format/samples_NNNN
- Go to your leaner machine and scp the model to it (see the sketch after these steps)
- Run an ilab config init and select the proper configuration
- cd to / and extract your tar file (so it can retain the original directory tree)
- Serve the model, limiting its usage and requirements (and you're welcome for this):
ilab model serve --model-path $ILAB_PATH/checkpoints/hf_format/samples_NNNN --backend=vllm -- --swap-space=1 --tensor-parallel-size=1
Open another session for the chat client (after the server-side is up and running):
ilab model chat --model $ILAB_PATH/checkpoints/hf_format/samples_NNNN
- Try and test it!
Note that the model doesn't have to retain the original directory name; you can save it somewhere else or give it a different name.
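As a sketch of the copy-and-unpack steps (the user and hostname are placeholders, and I'm assuming the tar file was created with the command above, so the paths inside it mirror $ILAB_PATH; you may need root if that path lives somewhere privileged):

scp ~/trained-granite-model.tar user@leaner-machine:~/
# then, on the leaner machine:
cd /
tar -xvf ~/trained-granite-model.tar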
Some final tips
- I found that disk speed is a key factor, as lots of data gets moved around
- The RHEL AI machine is based on bootc containers, with a tiny root partition. If you need extra instrumentation and packages, do the following, in this order:
  - Run subscription-manager register
  - Run bootc usroverlay to create a temporary space where you can add the new binary data
  - Then run your yum/dnf commands
  - Note: The installed packages don't persist across reboots
  - RHEL AI 1.2 has a known issue where running yum/dnf presents a curl error; check the resolution in this article
- You might want to use the tmux tool when doing something that takes time over an unreliable link (see the short sketch after this list)
- Remember the tripod: resources x time x scope
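A minimal tmux sketch, in case you haven't used it before:

tmux new -s ilab        # start a named session, then run your long ilab commands inside it
# detach with Ctrl-b, then d; reattach later with:
tmux attach -t ilab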
I'm not a data scientist, a developer, or even fluent in Python. Nope: with a tech degree in chemistry, a graduate degree in administration, and my whole IT background as a sysadmin, I was able to develop my own AI model, tune it with my own data, and control every aspect of training it.
And this is exactly what the InstructLab project aims to do: increase the accessibility, efficiency, and democratization of AI. This would have been unthinkable a few years ago, as it would have required very specialized personnel to achieve such an objective.
Seize the moment. Be bold. Try it. Control it. Make it yours.
About the author
Rodrigo is a tenured professional with a distinguished track record of success and experience in several industries, especially high performance and mission critical environments in FSI. A negotiator at heart, throughout his 20+ year career he has leveraged his deep technical background and strong soft skills to deliver exceptional results for his clients and organizations, often ending in long-standing relationships as a trusted advisor. Currently, Rodrigo is deep diving into AI technology.