
After shifting my career from product security program manager to chief architect, it was time for me to dip my toes into artificial intelligence (AI). Until that moment, I was pretty much a prompt operator.

Why train my own models? Sometimes you have confidential, regulated or restricted information that can’t be uploaded to third-party platforms (where your data might end up training their model). Or you might want tighter control over various aspects of your model.

Think of a provider uploading your data into an external platform: there are lots of risks involved, like leaks, or your data ending up training someone else’s model. Or, in another scenario, someone else’s data interfering with your trained data. There’s a quote I heard from my manager that I just love: “Bring AI to your data instead of giving your data to AI.” Dang, that’s powerful.

I’m a strong believer that a collection of specialist, purpose-built models, rather than one big monolithic model, is the way forward, and it will help companies get the best out of AI technologies.

In this article, I will use InstructLab and Red Hat Enterprise Linux (RHEL) AI to train the IBM Granite 7B large language model (LLM). This base model provides a balance of accuracy and size, making it a good place for us to start.

Installing InstructLab

Since some systems run as root (like RHEL AI 1.1) and others don’t, let’s set an environment variable so these instructions work in all scenarios:

echo export ILAB_PATH=$HOME/.local/share/instructlab >> ~/.bash_profile
echo export ILAB_PATH=$HOME/.local/share/instructlab >> ~/.bashrc
source ~/.bash_profile
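
To confirm that the variable is set before moving on (your home directory will differ):

echo $ILAB_PATH
/home/<user>/.local/share/instructlab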

If you’re using RHEL AI, congratulations! InstructLab is already installed. Skip this section and move to the “Set up” section. Otherwise, keep reading.

Installing InstructLab is straightforward, and all it needs is Python. But there’s one thing to note: as of this writing, the latest InstructLab (version 0.20) requires at least Python version 3.10 and not greater than 3.12, meaning Python 3.12.1 is out of scope. If you use an unsupported version, you will end up with an old InstructLab release (0.17). Verify that you have a suitable version by running python --version. My suggestion: use Python 3.11.
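
For example, on a suitable system (the exact patch version will vary):

python --version
Python 3.11.9

Then, install InstructLab using the Python package installer (pip):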

cd ~
python -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
pip install instructlab

This will kick off a significant download, around 3-4 GB of data.

Set up

If you are running RHEL AI, ensure that you are elevated to root:

sudo su -

Check your InstructLab version:

ilab --version
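
You should see something like this (your exact version will differ):

ilab, version 0.20.1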

Now, initialize InstructLab configuration:

ilab config init

RHEL AI should automatically detect the existing hardware. If you are asked questions, stick to the default answers and choose the profile that most closely matches your GPU configuration. If you don’t have a GPU, choose the option with zero GPUs.
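
If you want to review the generated configuration, recent InstructLab versions can print it back to you (the output is long, so this is just a spot check):

ilab config show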

Setting up models in RHEL AI

Now, if you are running RHEL AI, you can take advantage of the Red Hat registry to fetch the models. If you aren’t, move to the next section.

Access the Red Hat Customer Portal and, if you don’t have a registry service account, create one.

Click the new account and take note of the username and the token. Save these somewhere, as you won’t be able to see them again. Then log in to the Red Hat registry:

podman login registry.redhat.io

Log in with the generated username, using the token as the password.
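
The exchange looks roughly like this (the service account username format shown is illustrative):

Username: 12345678|myserviceaccount
Password:
Login Succeeded!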

Download the Granite 7B-lab starter model:

ilab model download --repository docker://registry.redhat.io/rhelai1/granite-7b-redhat-lab --release latest

Verify that you have successfully downloaded the Granite model:

ilab model list
+-----------------------------------+---------------------+---------+
| Model Name                        | Last Modified       | Size    |
+-----------------------------------+---------------------+---------+
| models/granite-7b-redhat-lab      | 2024-11-01 18:35:22 | 12.6 GB |
+-----------------------------------+---------------------+---------+

Setting up models if you don't have RHEL AI

Fetch the base model from Hugging Face:

ilab model download --repository instructlab/granite-7b-lab-GGUF --filename granite-7b-lab-Q4_K_M.gguf

Verify that you have successfully downloaded the Granite model:

ilab model list
+------------------------------+---------------------+--------+
| Model Name                   | Last Modified       | Size   |
+------------------------------+---------------------+--------+
| granite-7b-lab-Q4_K_M.gguf   | 2024-07-10 19:01:41 | 3.8 GB |
+------------------------------+---------------------+--------+

Trying it

At this point, you are able to chat with your model. You will need two different sessions: one for serving the model and one for chatting with it.

Begin by starting the model server:

ilab model serve --model-path $HOME/.cache/instructlab/models/<model name>

Replace <model name> with the name of the model returned by ilab model list.
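
For example, if you downloaded the GGUF model from Hugging Face, the command would be (adjust it to match your listing):

ilab model serve --model-path $HOME/.cache/instructlab/models/granite-7b-lab-Q4_K_M.gguf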

Note that it will take a while to start up for the first time. Only continue after you see the following message:

INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
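
The server exposes an OpenAI-compatible API, so you can also sanity-check it with curl from another terminal (assuming the default address shown above):

curl http://127.0.0.1:8000/v1/models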

At this point, open a second session for the chat interface. If you’re using RHEL AI, elevate to root:

sudo su -

If you are using upstream InstructLab, start venv:

source venv/bin/activate

Now start the chat session:

ilab model chat
╭─────────────────────────────── system ───────────────────────────────╮
│ Welcome to InstructLab Chat w/ GRANITE-7B-STARTER (type /h for help) │
╰───────────────────────────────────────────────────────────────────────╯
>>>                                                          [S][default]

Congratulations! Poke around, explore and find out the strengths and weaknesses of the model.
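
For example, ask it something from its base training (your output will vary, since at this point the model only has its base knowledge):

>>> Tell me about Brazilian soccer clubs.

Type /h at the prompt to list the available chat commands.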

Now let’s prepare our knowledge for training.

Preparing your knowledge

The data requirements for InstructLab are spartan. All you need is your knowledge structured in Markdown format, plus a YAML file containing seed questions and instructions for fetching the Markdown files. Let’s go through each of these.

In this example, we will be teaching the model some facts about Corinthians (my soccer team) and Palmeiras (my team’s nemesis). For the sake of fun, there’s some bias in the knowledge file leaning towards Corinthians.

Main knowledge file

This is where you put the bulk of the information that will train your model. The Markdown knowledge file is what really influences model learning.

A word of advice. If you are going to train the model using demo-grade quick training (minutes) instead of enterprise-grade multi-stage training (days)—particularly if you're comparing things (in my case, comparing two soccer teams)—you should try to make sure you have some symmetry in your data. For example, if you're mentioning three key players for one team, you should also provide three players for the other team. Are you mentioning one team's stadium? Provide the other team's stadium as well. If you don't do this, the model will tend to lean towards the dataset with more information, giving it a stronger signal all around.

This is also true if you only provide facts about a single team—every other piece of related data, including the built-in valid knowledge in the model, will tend to lean towards the data that you have provided. As an example, if I provide data only about S.C. Corinthians? Dang, the Corinthians facts will end up in data about Palmeiras, Santos, Flamengo, Chelsea, Bayern Munich, Boca Juniors, etc. This is a known model trait. You can learn more about traits and hallucinations in this article.

You'll also discover that random questions unrelated to soccer will be strongly biased towards the data that you have loaded into the model. This is by design. Remember: you are building a purpose-built and specialist model, so everything in that model will gravitate around the data that you provided.

You should also steer clear of anything that might confuse your model. For example, besides São Paulo the city, there’s also São Paulo the soccer team. That caused all kinds of shenanigans in my model, and I had to remove the São Paulo (the soccer team) facts to keep the model from tripping on them. You can see my content journey in the number of commits here.

A full, multi-stage, enterprise-grade training would probably address all those shortcomings, but that can take a long time and has associated costs.

A word of caution here: Don’t make your initial data set for training very large—start with baby steps and improve it incrementally. The larger your initial data set is, the more time you'll have to invest in the training phase—which will also make troubleshooting more difficult. This is what my training data looks like.

Finally, don’t forget to upload the latest version of the knowledge file to GitHub (or somewhere else). After committing it, update the commit ID in qna.yaml to match the current version of the Markdown file.
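
If you keep the knowledge file in a git repository, grabbing that commit ID is straightforward (the file name and commit message here are illustrative):

git add knowledge.md
git commit -m "Update soccer knowledge"
git rev-parse HEAD

Paste the hash printed by git rev-parse into the commit field of qna.yaml.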

qna.yaml

This file provides the seed questions and tells InstructLab where to fetch the main Markdown knowledge. It must be formatted using QNA version 3, which has very specific requirements (a sketch follows the list below):

  • It must have exactly 5 different contexts
  • Each context must have exactly 3 question/answer pairs
  • Each line is limited to 120 characters; it is fine to wrap long content onto a new line
  • Do not leave trailing spaces at the end of lines
  • Avoid diacritics (like á, ç, ñ, etc.). Learn more here
  • Keep symmetry between questions and answers, and use the same wording in the Q&A as in the context
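
To make the structure concrete, here is a heavily abbreviated sketch of what a version 3 qna.yaml looks like. Every value here is an illustrative placeholder; a real file needs all 5 contexts with 3 question/answer pairs each:

version: 3
domain: sports
created_by: <your username>
seed_examples:
  - context: |
      Sport Club Corinthians Paulista is a Brazilian sports club
      based in Sao Paulo.
    questions_and_answers:
      - question: Where is Corinthians based?
        answer: Corinthians is based in Sao Paulo, Brazil.
      # ...two more question/answer pairs here...
  # ...four more contexts here...
document_outline: Facts about Corinthians and Palmeiras
document:
  repo: https://github.com/<user>/<repository>
  commit: <commit ID of the latest Markdown file version>
  patterns:
    - "*.md"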

qna.yaml is the only file that you provide directly to InstructLab. InstructLab will then:

  • Parse the file
  • Look up the content declared in the document clause
  • Fetch the documents specified in the patterns section
    • From the repository specified in the repo section
    • At the commit specified in the commit section

I also found that the Context section must mimic a fragment from the main Markdown knowledge file.

Red Hat has helpfully created a UI to generate the QNA YAML file! Check it out here.

If you want to do this using the command line interface, not a problem: get a blank file here and a sample one here.

Before you get started, wipe out any existing QNA files. These will only steal time in your training phase. Run:

find $ILAB_PATH/taxonomy -type f -name qna.yaml -delete

Now let’s get the content that we will be using to train the model. First, create the directory where we will load the data:

mkdir -p $ILAB_PATH/taxonomy/knowledge/sports

Download the QNA file that will serve as the foundation for this training:

curl -o $ILAB_PATH/taxonomy/knowledge/sports/qna.yaml \
https://raw.githubusercontent.com/rfrht/instructlab-demo/refs/heads/main/qna-soccer-sp.yaml

Verify that the file syntax is sound by running:

ilab taxonomy diff

If your syntax is valid, the result will be:

knowledge/sports/qna.yaml
Taxonomy in $ILAB_PATH/taxonomy is valid :)

In the next article, we will be finally training the model with Corinthians and Palmeiras facts. Obviously, leaning towards my team. Stay tuned!

