Congratulations to Docling for climbing the charts to be one of GitHub’s top repos of the month!
Organizations have lots of data, from their intellectual property and knowledge to marketing, sales and customer data to operations policies and procedures and much more. The challenge isn’t collecting the data or generating documents; it’s extracting meaningful insights from it. Actionable insights can lead to better customer service, faster time to customer value, smoother operations and so on. Generative AI (gen AI) promises to bridge this gap, turning the mountain of organizational data into strategic insights.
The promises of gen AI cannot be realized, however, when the data is trapped in formats not directly consumable by a large language model (LLM). Data locked in a PDF file or a proprietary document format like .docx turns this challenge into a blocker for aligning the model to the organization’s needs. To solve it, organizations need to preprocess existing documents into formats that can be consumed by fine-tuning pipelines or used for retrieval-augmented generation (RAG).
This is not about naive OCR or text extraction. An important part of this preprocessing stage is that the data must be extracted with context- and element-aware techniques. For example, if a table spans multiple pages, it must be extracted as a single table; if a page lays text out in multiple columns or mixes elements like images and tables, each of those elements must be extracted consistently while preserving the context it came from.
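To make the element-aware requirement concrete, here is a minimal, hypothetical sketch (plain Python, not Docling's internals) of the multi-page table case: a page-by-page extractor would emit two fragments, while a context-aware step recognizes the repeated header and merges them into one logical table.

```python
# Illustrative only: merge consecutive per-page table fragments that share
# a header row, so a table spanning pages comes out as a single table.

def merge_spanning_tables(page_tables):
    """Each table is a list of rows; the first row is its header."""
    merged = []
    for table in page_tables:
        header, *rows = table
        if merged and merged[-1][0] == header:
            merged[-1].extend(rows)          # continuation of previous table
        else:
            merged.append([header] + rows)   # genuinely new table
    return merged

page1 = [["product", "revenue"], ["widgets", "10"], ["gadgets", "12"]]
page2 = [["product", "revenue"], ["gizmos", "7"]]   # same table, next page

tables = merge_spanning_tables([page1, page2])
assert len(tables) == 1                 # one logical table, not two
assert tables[0][-1] == ["gizmos", "7"]
```

A real pipeline would of course use layout signals rather than an exact header match, but the point stands: extraction has to reason about elements across page boundaries.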
Throughout the years, many open source tools have tried to solve one or even a few aspects of this challenge, but none tackle the whole problem. This fragmented approach has forced organizations to use complex pipelines, interconnect disjointed tools and process the same documents multiple times using different tools. The results are inconsistent, computationally expensive and complex to maintain, and their output quality varies.
We faced these same challenges when creating document ingestion pipelines for processing documents to fine-tune a model with InstructLab. That is when we discovered and adopted Docling, a project developed at IBM Research, which proved transformative for our workflow. Since IBM released Docling as an open source project in Fall 2024, the tool has risen to prominence in the NLP field, validating our early adoption. We are thrilled to witness its growing impact in the open source gen AI ecosystem.
Docling at a glance
Docling is an upstream open source project and tool for parsing documents, from .pdf and .docx to .pptx, HTML and more, and converting them into formats like Markdown or JSON, making it easier to prepare content for gen AI applications. It supports advanced PDF processing, including optical character recognition (OCR) for scanned documents, and integrates with tools like LlamaIndex and LangChain for RAG and question-answering tasks.
Source: https://github.com/DS4SD/docling?tab=readme-ov-file
Try out Docling for yourself
Simply install Docling from your package manager, e.g. pip:
$ pip install docling
Once installed, you can do the document conversion programmatically in your Python module or use the Docling CLI. In Python:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "### Docling Technical Report[...]"
Alternatively, from the command line:

$ docling https://arxiv.org/pdf/2206.01062
You can find other ways to configure this data conversion, along with further details, in the project repository.
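The Markdown produced by the conversion above is convenient for the RAG use cases mentioned earlier. As a toy illustration (plain Python, not Docling or LangChain APIs; the sample text is invented), a retrieval pipeline can split Markdown into chunks along headings so each chunk stays within one section of the original document:

```python
# Illustrative only: split Markdown into per-section chunks by heading,
# so retrieval chunks respect the document's structure.

def chunk_by_heading(markdown: str):
    """Return a list of chunks, each starting at a Markdown heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if current:
                chunks.append("\n".join(current).strip())
            current = [line]          # new section begins at each heading
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Intro\nDocling converts documents.\n## Usage\nInstall with pip."
chunks = chunk_by_heading(doc)
assert chunks == ["# Intro\nDocling converts documents.",
                  "## Usage\nInstall with pip."]
```

In practice you would hand this job to the LlamaIndex or LangChain integrations, which offer much richer splitting strategies; the sketch just shows why structured Markdown output beats a flat text dump.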
Docling in action today
Docling is already being used by the InstructLab community for users to submit public datasets to the InstructLab taxonomy. Users can go to https://ui.instructlab.ai/ and submit their datasets for inclusion in future versions of the open source InstructLab granite-lab community model. This granite-lab model is based on the open source Granite-7b base model, which has been customized with InstructLab to add new knowledge and skills submitted by community members; the upcoming granite-lab release is being customized the same way on the open source Granite-3.0-8b base model. This service uses Docling to convert users' documents into the data format needed for the InstructLab taxonomy, simplifying the process of contributing new knowledge documents. These same tools can also be used to customize a user’s private model instances.
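For context, knowledge contributions to the InstructLab taxonomy take the form of a qna.yaml file that points at Markdown source documents, which is where Docling's output fits in. The fragment below is abbreviated and illustrative (all values are invented; consult the taxonomy repository for the authoritative schema):

```yaml
# Abbreviated, illustrative qna.yaml fragment; values are invented.
domain: example_domain
created_by: your_github_handle
seed_examples:
  - context: |
      Text excerpted from the Markdown that Docling produced.
    questions_and_answers:
      - question: What does Docling do?
        answer: It converts documents into formats like Markdown or JSON.
document:
  repo: https://github.com/your-org/your-knowledge-docs
  commit: <commit-sha>
  patterns:
    - "*.md"
```

Docling automates the step of turning a contributor's PDFs or Office documents into the Markdown files this format references.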
Red Hat and Docling
Red Hat introduced Red Hat Enterprise Linux AI (RHEL AI) and InstructLab earlier this year to bring these same capabilities to enterprise customers. RHEL AI is an enterprise-focused foundation model platform that integrates open source gen AI capabilities and is optimized for deployment across hybrid cloud environments. It combines IBM’s open source Granite LLMs with Red Hat’s InstructLab tools, enabling domain experts (not just data scientists) to fine-tune models with industry-specific knowledge, aligning them with unique organizational needs and data. Those models can then be deployed and managed across a hybrid cloud environment that spans from enterprise data centers, to public clouds and edge environments.
We embraced IBM Research’s decision to open source Docling, and their continued commitment to innovating in this field, making gen AI more accessible and achievable. In upcoming RHEL AI releases, we intend to include Docling as a supported feature, giving customers a simpler way to ingest their own private enterprise data in various formats and present that data to InstructLab synthetic data generation and phase training in RHEL AI to customize and tune their own model instances.
We are excited by the innovation and user benefits that Docling brings to the open source community and look forward to bringing these capabilities to RHEL AI users. The speed at which open source projects are making gen AI more accessible is nothing short of amazing, and we’re pleased to both support these upstream communities and bring this innovation to our customers as enterprise-ready capabilities.
About the author
William is a Product Manager in Red Hat's AI Business Unit and is a seasoned professional and inventor at the forefront of artificial intelligence. With expertise spanning high-performance computing, enterprise platforms, data science, and machine learning, William has a track record of introducing cutting-edge technologies across diverse markets. He now leverages this comprehensive background to drive innovative solutions in generative AI, addressing complex customer challenges in this emerging field. Beyond his professional role, William volunteers as a mentor to social entrepreneurs, guiding them in developing responsible AI-enabled products and services. He is also an active participant in the Cloud Native Computing Foundation (CNCF) community, contributing to the advancement of cloud native technologies.