How to use YAML with InstructLab

October 7, 2024 | Seth Kenlon | 4-minute read

Training a large language model (LLM) probably sounds like a specialized and highly technical task. Until recently, that was true, but the InstructLab project has been putting in the work required to make it easier for anyone to train an LLM. That means you can contribute to the development of artificial intelligence (AI), either because your organization needs domain-specific knowledge for its AI solution, or just because you want to help improve open source AI. All you need to know is how to type text in a simple format called YAML, and this article is going to teach you exactly how to do that.

Open a text editor

An LLM establishes probable responses to questions by analyzing existing content about a specific topic. The easiest way to contribute to an LLM is to contribute knowledge about a topic, in the form of questions and answers. All you need for that is a text editor.

A text editor is like a simplified word processor. There are lots of them out there, including Notepad++, Notepadqq, Pulsar, and VSCodium, so just choose the one that works for you.

Download the template

The InstructLab project provides a template file for knowledge content so you don’t have to start from a blank file. To download the template, open github.com/instructlab/taxonomy in a web browser, click the green Code button near the top right corner, and select Download ZIP.


Once the download finishes, find the ZIP archive in your Downloads folder and unzip it. This produces a new folder called taxonomy (GitHub typically appends the branch name, so it may appear as taxonomy-main).

Write simplified YAML

In the docs folder within the taxonomy directory, open the file named template_qna.yaml in your text editor. This file contains a blank question-and-answer session you can use as a template for the knowledge you want to train the model on.

YAML is designed to be simple, and the subset of YAML you need for this is simpler still. Mostly, YAML is a collection of labels (each called a “key”) and descriptions (each called a “value”); a set of labels with their descriptions is called a “mapping.” A lot of data on the internet is structured this way. When you go to your favorite online store, you probably shop by clicking on a label (the name of an item) and then you read its description. When you write a report for work or school, you probably write a subheading, and then you write a paragraph explaining more about that subheading. InstructLab’s use of YAML is based on the exact same concept.
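As a minimal sketch of the label-and-description idea, here is how an online store entry might look in YAML. The labels item and description, and their values, are made up for illustration and are not part of the InstructLab template.

```yaml
# Each line is a label, a colon, a space, and then the description.
item: Coffee mug
description: A ceramic mug with a handle.
```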

Here’s an abbreviated sample of the blank YAML template:

version: 3
domain: <The knowledge domain>
created_by: <Your name>
seed_examples:
  - context: |
      <Context from the document associated with this set of sample q&a pairs.>
    questions_and_answers:
      - question: |
          <A relevant question used for synthetic data generation.>
        answer: |
          <The desired response for the question.>

The data at the top of the document establishes the knowledge domain you’re writing about, and who you are. If you’re contributing to the InstructLab project, then you must use your GitHub user name as the description of the created_by label. If you’re contributing to a private LLM, then you can use your name or whatever description the project manager has requested.
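For instance, the top of a file about the works of Edgar Rice Burroughs might look something like the sketch below. The domain value is an illustrative assumption, and the created_by placeholder stands in for your own GitHub username or whatever your project manager has requested.

```yaml
version: 3
domain: Edgar Rice Burroughs
created_by: <Your GitHub username>
```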

The seed_examples label is the main heading for the knowledge section you’re about to create. It doesn’t take a description of its own, because it contains yet more labels.

The context label is essentially a subheading, and it requires a statement from you that describes the kind of conversation that might lead to the questions and answers you’re about to enter. For example, to add a question and answer session about some aspect of the ancient Ptolemaic empire, you might describe its context as “The kings and queens of the Ptolemaic empire.” To enter questions and answers about the works of Edgar Rice Burroughs, you might write “The literature of Edgar Rice Burroughs.” Just imagine you’re writing a report for school. It’s the same logic.

Indentation is important

YAML is a sequence of label after label, so it relies on indentation to represent the flow of logic. In a word processor, a heading is often displayed as large and bold text compared to the text in a paragraph. Instead of using font size and style, YAML uses indentation.

When you write a description in InstructLab’s YAML file, you write it on the line under the label, and you add two spaces to the level of indentation. This is how the template is structured, so it’s a pretty easy pattern to fall into.
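The pattern can be sketched as follows, with the indentation spelled out in comments. The context text and the placeholders are examples only. Note that the vertical bar after a label tells YAML that the description continues on the indented lines below it, and that YAML indentation must be typed with spaces, not tabs.

```yaml
# Each description sits on the line below its label, 2 spaces deeper:
#   "context" starts 4 characters in, so its text starts 6 characters in.
#   "question" and "answer" start 8 characters in, so their text starts 10 in.
seed_examples:
  - context: |
      The kings and queens of the Ptolemaic empire.
    questions_and_answers:
      - question: |
          <Your question.>
        answer: |
          <Your answer.>
```

If a description ends up at the same indentation as its label, or further left, a YAML parser no longer treats it as belonging to that label.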

Questions and answers

Next is the actual question and answer section. Under each question heading, you write exactly one question that you might anticipate in a conversation about your chosen topic. Under each answer heading, you write a simple answer to that question.

It’s best to keep both the questions and answers short and concise, because that ensures that they’re modular and distinct. Don’t try to sneak two questions into one, especially when the answer to one question has no bearing on the answer to the second. It’s misleading to ask “Did Edgar Rice Burroughs write the Tarzan book and movie?” as one question, because Edgar Rice Burroughs wrote the Tarzan books whether or not he wrote a Tarzan screenplay.
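One way to split that misleading question is to give each part its own question and answer pair, as in this sketch. The wording of the questions is an assumption for illustration, and the second answer is left as a placeholder to be filled in from your own research.

```yaml
seed_examples:
  - context: |
      The literature of Edgar Rice Burroughs.
    questions_and_answers:
      - question: |
          Did Edgar Rice Burroughs write the Tarzan books?
        answer: |
          Yes, Edgar Rice Burroughs wrote the Tarzan books.
      - question: |
          Did Edgar Rice Burroughs write a Tarzan screenplay?
        answer: |
          <Your researched answer about the screenplay.>
```

Each pair starts with a dash before question, while its answer label lines up directly beneath question with no dash of its own.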

Write a distinct question and a focused answer so that the LLM can use your knowledge to extrapolate correct data. Here’s an example:

version: 3
domain: Ptolemaic empire
created_by: Tux
seed_examples:
  - context: |
      Discussion of Cleopatra.
    questions_and_answers:
      - question: |
          How many Ptolemaic queens were named Cleopatra?
        answer: |
          There were 7 Ptolemaic queens named Cleopatra.

YAML for InstructLab

YAML is a way of writing text so that it has predictable structure, which makes it easy for computers to process. Follow the InstructLab template, add your knowledge to the LLM of your choice, and help improve AI. If you need reinforcement for what you’ve learned from this article, check out this video introduction on how to get started!

About the author

Seth Kenlon

Linux geek

Seth Kenlon is a Linux geek, open source enthusiast, free culture advocate, and tabletop gamer. Between gigs in the film industry and the tech industry (not necessarily exclusive of one another), he likes to design games and hack on code (also not necessarily exclusive of one another).
