While many AI model developers publicly release research papers describing their training approaches, we’ll focus on one model in particular: IBM’s Granite, where IBM has gone a step further and released its specific training data. So, if you would like specifics on what the Granite family of large language models (LLMs) is trained on, this article provides a detailed breakdown of the datasets used in the initial training phase of IBM’s popular granite.13b.v1 model, the original Granite model from which other model variants were fine-tuned to target downstream tasks.

What are the IBM Granite models?

As we begin to see the impact of AI on our lives and organizations, principles such as trust are as important for AI/ML models as they are for our software. Thus, IBM Research built and trained the Granite family of models with transparency and released them under an Apache 2.0 license for broad, unencumbered commercial use. “The Granite family of models provides enterprise users with some of the most robust and transparent insights into the underlying training data, important for efficiently refining model behavior for specific use cases and domains, and for protecting enterprises from risk from any unlicensed content in the training data”, as reported by The Forrester Wave™: AI Foundation Models For Language Q2 2024.

What data was used to train the Granite models?

Granite.13b.v1 was trained on a massive dataset consisting of 1 trillion tokens derived from 14 distinct datasets across various domains. Thanks to this transparency in training data, we’re able to detail the data sources used to teach the model to handle sentiment classification, named entity recognition, question answering and summarization. These are considered enterprise-safe data sources, and Granite models are among the most transparent according to Stanford University’s Foundation Model Transparency Index 2024. Let’s break these down into several categories.

Academia and science

  • arXiv: This dataset includes over 1.8 million scientific pre-prints
  • DeepMind Mathematics: This dataset contains pairs of mathematical questions and their corresponding answers
  • PubMed Central: This dataset comprises biomedical and life sciences research papers

Legal and financial

  • Free Law: This dataset encompasses public-domain legal opinions from both US federal and state courts
  • SEC Filings: This dataset contains 10-K/Q filings from the US Securities and Exchange Commission (SEC) spanning from 1934 to 2022
  • United States Patent and Trademark Office: This dataset includes US patents granted between 1975 and May 2023, excluding design patents

Code and technology

  • GitHub Clean: This dataset features code from CodeParrot in various programming languages
  • Hacker News: This dataset comprises news articles focused on computer science and entrepreneurship, collected between 2007 and 2018

General web and literature

  • Common Crawl: This dataset is an open repository of web crawl data
  • OpenWebText: This is an open source version of OpenAI's WebText corpus containing web pages up to 2019
  • Project Gutenberg (PG-19): This dataset includes free e-books, primarily older works with expired US copyrights

Other

  • Stack Exchange: This dataset features anonymized user-contributed content from the Stack Exchange network, a collection of websites focused on questions and answers
  • Webhose: This dataset includes unstructured web content transformed into machine-readable data feeds, acquired by IBM
  • Wikimedia: This dataset contains extracted plain text from pages and articles across eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary)

The Granite 13b model is the base model from which all other variants of Granite were fine-tuned for specific tasks. However, version 2 of the 13b model, granite.13b.v2, underwent additional pretraining on 1.5 trillion new tokens that were deemed usable after passing through the data processing pipeline shown below. Adding these to version 1's 1 trillion tokens brings the total to 2.5 trillion tokens used in the training of version 2. Version 2 still contains the same 14 datasets as version 1, plus 6 new datasets.

Figure: A funnel demonstrating the filtering of extracted data, beginning with 28.7 terabytes of raw data and finishing with 2.5 trillion tokens usable for training.
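To give a feel for how such a funnel narrows raw data down to usable tokens, here is a minimal illustrative sketch in Python. This is not IBM's actual pipeline (which involves many more stages, such as license and hate/abuse/profanity filtering); it simply shows two common stages, exact deduplication by hashing and a basic length-based quality filter:

```python
import hashlib

def exact_dedup(docs):
    """Drop byte-identical duplicate documents by hashing each one."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def quality_filter(docs, min_words=5):
    """Keep only documents with at least min_words whitespace-separated words."""
    return [d for d in docs if len(d.split()) >= min_words]

def run_pipeline(docs):
    """Chain the stages: dedup first, then filter for quality."""
    return quality_filter(exact_dedup(docs))

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # exact duplicate, dropped
    "too short",                                     # fails the length filter
]
print(len(run_pipeline(corpus)))  # 1 document survives both stages
```

Real pipelines add fuzzy deduplication, language identification and content-safety filters on top of stages like these, which is why 28.7 terabytes of raw data shrinks to a much smaller set of training tokens.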

V2 additional pre-training data:

  • Earnings Call Transcripts: This dataset includes transcripts from the quarterly earnings calls companies hold with investors
  • EDGAR Filings: Annual reports from all publicly traded companies in the US, spanning more than 25 years
  • FDIC: The data is from the annual submissions of the Federal Deposit Insurance Corporation (FDIC)
  • Finance Text Books: A corpus from University of Minnesota's Open Textbook Library, including all textbooks tagged as finance
  • Financial Research Papers: Publicly available financial research paper corpus
  • IBM Documentation: IBM redbooks and product documents

As with any form of software, having trust and confidence in our workloads is critical to enterprise readiness. As AI is another tool being used to enhance our applications and streamline business processes, we should treat it as such and work to apply the same open source principles and transparency that have been tested over the years to AI itself.

Red Hat’s history as a leader in the open source community has led to RHEL AI, a supported platform for training and deploying Granite models for enterprise applications. However, as this industry continues to advance, we should strive for openness as a whole, from research papers detailing architecture advancements, to permissive licensing that encourages widespread adoption, and finally the transparency behind the training data itself. What history has demonstrated is that when work and collaboration are done in the open, everybody benefits.

Learn more about RHEL AI


About the authors

Legare Kerrison is an intern on the developer advocacy team, focusing on providing developers with resources for Red Hat products, with an emphasis on Podman and InstructLab.


Cedric Clyburn (@cedricclyburn), Senior Developer Advocate at Red Hat, is an enthusiastic software technologist with a background in Kubernetes, DevOps, and container tools. He has experience speaking at and organizing conferences including DevNexus, WeAreDevelopers, The Linux Foundation, KCD NYC, and more. Cedric loves all things open source, and works to make developers' lives easier! Based out of New York.

