When people talk about artificial intelligence (AI), they’re usually talking about the combination of a chat bot, which handles input and output, and a large language model (LLM), which provides the data the chat bot uses to form sentences. AI without an LLM isn’t very useful, which is why much of the conversation around the legalities and ethics of AI is concerned with what’s being used to build the “knowledge” used by generative AI (gen AI). How can you be sure that the data a gen AI uses to formulate its answers is reliable, trustworthy, and unencumbered by copyright? The best way to either audit or specialize the knowledge base of AI is to use open source, and that’s what the InstructLab project makes possible.
What is InstructLab?
InstructLab is an open source AI project that promotes universal modeling with open contribution. Its stated goal is to enable anyone to shape gen AI, whether you need an open source LLM due to concerns over intellectual property and copyright, privacy, reliability, subject matter expertise, accessibility or anything else. Designing a complete LLM is a big task, so the best way to build an open LLM is to build it in the open. Because InstructLab is open source, you can contribute to it and help make open source language models the best choice for gen AI. Here are three ways you can get started with InstructLab today.
Share your expertise
AI uses probability to construct its responses, and it bases each answer on the factual information it was modeled on. The collection of facts used by AI is part of an LLM. For InstructLab to be the best basis of AI-powered content, it must provide an exhaustive LLM, and building one requires a data bank of reliable content. In InstructLab terminology, this data bank is called a taxonomy, and it has two primary categories: skill and knowledge.
A skill in InstructLab is performative. When you create a skill for InstructLab, you teach it how to do something specific, like rearranging words in a sentence while maintaining the same meaning, finding two words that rhyme or converting a string to camel case.
Knowledge is a collection of facts, with citation of a reliable source. When you create knowledge for a language model, you provide the model data it can use to answer direct questions.
Both skill and knowledge are stored as YAML (a recursive acronym for “YAML Ain’t Markup Language”), a minimalist file format consisting of key and value pairs (a “mapping”) and lists (a “sequence”). Here’s a simple example of knowledge expressed in YAML:
---
version: 2
created_by: tux
domain: flowers
seed_examples:
  - answer: 'A carnation is a herbaceous perennial plant.'
    question: 'What kind of plant is a carnation?'
  - answer: 'Dianthus caryophyllus'
    question: 'What is the scientific name for a carnation?'
task_description: 'teach a language model about carnations'
document:
  repo: https://github.com/juliadenham/Summit_knowledge
  commit: 195fc4d83a40d8a1b60062e66e06cfc0bc9c8d35
  patterns:
    - dianthus_caryophyllus.md
Here’s a simple example of a skill expressed as YAML:
---
version: 2
task_description: 'Teach the model how to rhyme.'
created_by: juliadenham
seed_examples:
  - question: What are 5 words that rhyme with horn?
    answer: warn, torn, born, thorn, and corn.
  - question: What are 5 words that rhyme with cat?
    answer: bat, gnat, rat, vat, and mat.
  - question: What are 5 words that rhyme with poor?
    answer: door, shore, core, bore, and tore.
  - question: What are 5 words that rhyme with bank?
    answer: tank, rank, prank, sank, and drank.
  - question: What are 5 words that rhyme with bake?
    answer: wake, lake, steak, make, and quake.
Compare the YAML examples of knowledge and skill. Knowledge contains verifiable data on a specific topic. A skill contains examples of a specific task.
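The comparison can be sketched in Python. This is a minimal illustration that works on a mapping already parsed from a qna.yaml file. It assumes, based only on the two examples above, that a knowledge file carries a document mapping and a skill file does not. The function name classify_qna and the rule itself are hypothetical and are not part of InstructLab’s tooling or schema.

```python
def classify_qna(data):
    """Guess whether a parsed qna.yaml mapping is 'knowledge' or 'skill'.

    Assumption: this mirrors only the two examples in this article, where
    the knowledge file has a 'document' mapping and the skill file does not.
    """
    examples = data.get("seed_examples", [])
    if not examples:
        raise ValueError("seed_examples must contain at least one entry")
    # Every seed example in both samples pairs a question with an answer.
    for index, example in enumerate(examples):
        missing = {"question", "answer"} - set(example)
        if missing:
            raise ValueError(f"seed example {index} is missing: {sorted(missing)}")
    return "knowledge" if "document" in data else "skill"


# Trimmed-down versions of the two YAML samples, written as Python dicts.
knowledge = {
    "version": 2,
    "domain": "flowers",
    "seed_examples": [
        {"question": "What is the scientific name for a carnation?",
         "answer": "Dianthus caryophyllus"},
    ],
    "document": {"patterns": ["dianthus_caryophyllus.md"]},
}
skill = {
    "version": 2,
    "seed_examples": [
        {"question": "What are 5 words that rhyme with cat?",
         "answer": "bat, gnat, rat, vat, and mat."},
    ],
}

print(classify_qna(knowledge))
print(classify_qna(skill))
```

The sketch uses plain dictionaries so that it runs without a YAML parser. In practice you would load the file with a YAML library first.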
After reading the contribution guide, you can create a qna.yaml file of your own, and submit it to InstructLab for inclusion in the LLM. You may have to revise your work to ensure it can be processed and integrated into the project, and getting familiar with tools like yamllint is useful, but with just a little effort, you can make a meaningful contribution to open source AI.
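To give a feel for the kind of problem a linter catches before you submit, here is a rough Python sketch. It is not a replacement for yamllint and does not use it. The function precheck_yaml is a hypothetical helper that looks for only two things: tab characters used for indentation, which YAML does not allow, and trailing whitespace.

```python
def precheck_yaml(text):
    """Return (line_number, message) pairs for two easy-to-spot problems.

    Assumption: this is a hand-rolled illustration, far narrower than a
    real linter such as yamllint.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # The indentation is whatever precedes the first non-blank character.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append((number, "tab used for indentation"))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
    return problems


sample = "version: 2\nseed_examples:\n\t- question: What rhymes with cat? \n"
for number, message in precheck_yaml(sample):
    print(f"line {number}: {message}")
```

A clean file returns an empty list. Anything else is worth fixing before you run a real linter.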
Run an AI locally with the ilab command
Setting up an AI is a fairly complex and manual process, but with InstructLab it’s easier than you might expect. You need to be familiar with Python tools like virtual environments and pip, and you must be comfortable in a terminal environment such as Bash. You also must have CUDA (or a similar parallel computing framework) set up on your system, and plenty of drive space (the LLM is 5 GB, and growing).
Follow the install guide on the InstructLab repository, then interact with the InstructLab model, and report any bugs or feature requests you come across.
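Before you start the install, you can check the prerequisites mentioned above with a short Python script. This is a sketch under stated assumptions: it treats the presence of nvcc on your PATH as a rough stand-in for “CUDA is set up”, it uses the 5 GB figure quoted above as the disk space threshold, and the function name check_prerequisites is hypothetical. The install guide remains the authority on what is actually required.

```python
import shutil
import sys

REQUIRED_GB = 5  # model size quoted in this article; expect it to grow


def check_prerequisites(path="."):
    """Report on the prerequisites named in this article for a directory."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "pip_on_path": shutil.which("pip") is not None,
        # Assumption: nvcc on PATH is only a heuristic for a working CUDA setup.
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "enough_disk": free_gb >= REQUIRED_GB,
        "free_gb": round(free_gb, 1),
    }


for name, value in check_prerequisites().items():
    print(f"{name}: {value}")
```

If you use a parallel computing framework other than CUDA, replace the nvcc check with whatever command your framework provides.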
Contribute code
At the moment, the InstructLab project consists of 12 repositories. There’s the command-line interface ilab, a Python library for synthetic data generation, design documents, taxonomy files, the JSON schema for the taxonomy YAML, and more. If you’re a programmer, you might find issues or feature requests among the open bug reports that you could help resolve.
For your first contribution, it often makes sense to solve a minor issue in anticipation that you’ll use the bulk of your time understanding the development team’s process. Bugs requiring only a simple fix are tagged with good first issue, so use is:open is:issue label:"good first issue" as a filter when looking for a good entry point. There’s also a guide for first-time contributors that explains in detail how to set up your dev environment and, just as importantly, how to test your new code before requesting a merge.
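If you want to bookmark that filter, you can turn it into a link with a few lines of Python. This is a minimal sketch: the repository name instructlab/instructlab is an assumption (the article doesn’t say which of the project’s repositories to search), and issue_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

GOOD_FIRST_ISSUES = 'is:open is:issue label:"good first issue"'


def issue_search_url(repo, query=GOOD_FIRST_ISSUES):
    """Build a GitHub issue search URL for an 'owner/name' repository."""
    return f"https://github.com/{repo}/issues?{urlencode({'q': query})}"


# Assumption: "instructlab/instructlab" is used here purely for illustration.
print(issue_search_url("instructlab/instructlab"))
```

Pass a different owner/name string to search any of the project’s other repositories.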
Open source AI is within reach, and as with any form of open source it stands to place the control and terms of AI into the hands of users. If you deal in a specialized domain, general AI may not have the knowledge or skill required to be useful to your users. If you deal with sensitive data, then general AI may not even have access to the information your users need. With InstructLab, you can help build a universal and open LLM, or even build your own. Whatever your goal, get started with InstructLab today!
About the author
Seth Kenlon is a Linux geek, open source enthusiast, free culture advocate, and tabletop gamer. Between gigs in the film industry and the tech industry (not necessarily exclusive of one another), he likes to design games and hack on code (also not necessarily exclusive of one another).