With the release of RHEL AI 1.3, we’re excited to introduce context-aware chunking powered by the Docling project, a significant enhancement that expands the capabilities of taxonomy contributions, pushes the limits of synthetic data generation, and offers enhanced document support (PDF and Markdown). Through our collaboration with IBM Research and adoption of Docling, RHEL AI 1.3 now features a new data ingestion pipeline.
This update enables seamless integration of PDF documents, marking a shift from the previous Markdown-only support in qna.yaml. It also introduces a new chunking strategy, context-aware chunking, which allows better representation of different document elements.
Fig 1: Docling integration in RHEL AI brings an improved chunking strategy and enhanced document support
Fig 2: Zoomed-in view of the data ingestion pipeline using Docling
What’s new?
PDF support
Contributors can now reference PDF documents directly in taxonomy submissions alongside Markdown files. This update eliminates the need to manually convert PDFs to Markdown, streamlining the contribution process.
With PDF support, end users can bring their personal or enterprise documents directly into model customization for their use cases. In RHEL AI 1.4, more document types such as Word, PPTX, DOCX, and HTML will be supported, allowing users to cover a broader range of use cases.
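As a rough illustration of what "PDF alongside Markdown" means for a contributor, the sketch below sorts a list of candidate document paths into those the 1.3 ingestion pipeline would accept and those that would still need conversion. The function name `partition_documents` and the `SUPPORTED_SUFFIXES` set are hypothetical helpers written for this example, not part of RHEL AI or InstructLab; the set simply mirrors the two formats named above.

```python
from pathlib import PurePosixPath

# Assumption: only PDF and Markdown are ingested in RHEL AI 1.3,
# as stated in the text. This set is illustrative, not an official list.
SUPPORTED_SUFFIXES = {".pdf", ".md"}


def partition_documents(paths):
    """Split paths into (supported, unsupported) by file extension."""
    supported, unsupported = [], []
    for path in paths:
        # Compare extensions case-insensitively so "REPORT.PDF" is accepted.
        suffix = PurePosixPath(path).suffix.lower()
        if suffix in SUPPORTED_SUFFIXES:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


if __name__ == "__main__":
    ok, rejected = partition_documents(
        ["guide.pdf", "notes.md", "slides.pptx", "REPORT.PDF"]
    )
    print("supported:", ok)
    print("needs conversion:", rejected)
```

Extending the set would be the only change needed in this sketch once additional formats become available.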
Docling adoption/context-aware chunking
With the adoption of Docling, the #1 open source document parser on GitHub, we are introducing a new context-aware chunking capability. It intelligently recognizes and processes different document elements, from text and tables to figures, lists, and columns. This means more accurate extraction and a better understanding of your documents' structure and meaning. This is an improvement over the naive chunking used in RHEL AI 1.2.
We've also enhanced our synthetic data generation (SDG) pipeline to leverage these new capabilities. We are continuing our collaboration with IBM Research to push the boundaries of context-aware document processing even further in future releases.
Docling parses PDFs and converts them into structured, context-aware chunks. The tool accurately represents critical semantic elements, including text, tables, and images, and enhances contextual understanding for better synthetic data generation.
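To make the difference between naive and context-aware chunking concrete, here is a minimal sketch using only the Python standard library. It is not Docling's implementation: the `Element` type, both function names, and the character budget are assumptions made for illustration. The naive version cuts a flat string every `max_chars` characters, while the context-aware version packs whole document elements into chunks and never cuts inside a table or paragraph.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # hypothetical labels such as "text", "table", "list", "figure"
    text: str


def naive_chunks(elements, max_chars):
    """Fixed-size chunking: join everything, then cut every max_chars."""
    blob = "\n".join(e.text for e in elements)
    return [blob[i:i + max_chars] for i in range(0, len(blob), max_chars)]


def context_aware_chunks(elements, max_chars):
    """Pack whole elements into chunks; never split inside an element.

    The budget counts element text only. An element longer than
    max_chars is kept intact in a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for element in elements:
        length = len(element.text)
        # Start a new chunk if adding this element would exceed the budget.
        if current and size + length > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(element)
        size += length
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = [
        Element("text", "Intro paragraph."),
        Element("table", "| a | b |\n| 1 | 2 |"),
        Element("text", "Closing."),
    ]
    print(naive_chunks(doc, 25))  # the table is cut across two chunks
    for chunk in context_aware_chunks(doc, 25):
        print([e.kind for e in chunk])  # each element stays whole
```

With the sample input, the naive splitter breaks the table in the middle of a row, which is the kind of fragment that tends to hurt downstream question and answer generation.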
Why this matters
The addition of PDF support overcomes the limitations of Markdown-only workflows, enabling contributors to include richer, more detailed documents in their submissions. Docling’s robust chunking capabilities ensure that PDFs are no longer a barrier to streamlined knowledge integration, making taxonomy contributions faster, easier, and more effective.
Naive chunking strategies often result in poor outputs for synthetic data generation and, in turn, for the fine-tuning of language models. Context-aware chunking can reduce hallucinations involving complex document structures. This can facilitate seamless integration across various departments within an organization, each handling complex document representations. Another capability we are working on is hierarchical context-aware chunking, which captures additional metadata such as headings and captions for better context.
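The hierarchical idea can be sketched as follows: each chunk carries the path of headings it sits under, so a downstream generator knows which section a passage came from. This is an illustrative stand-in, not the upcoming RHEL AI feature. The `Chunk` type, the `(kind, level, text)` input format, and the function name are all assumptions, and captions are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    headings: list = field(default_factory=list)  # heading path, outermost first


def chunk_with_headings(items):
    """Attach the enclosing heading path to every body item.

    items: iterable of (kind, level, text), where kind is "heading" or
    "body". level is the heading depth (1 = top level) and is ignored
    for body items.
    """
    path, chunks = [], []
    for kind, level, text in items:
        if kind == "heading":
            # Drop headings at the same or deeper level, then push this one.
            path = path[:level - 1] + [text]
        else:
            chunks.append(Chunk(text=text, headings=list(path)))
    return chunks


if __name__ == "__main__":
    outline = [
        ("heading", 1, "Install"),
        ("body", 0, "Run the installer."),
        ("heading", 2, "Linux"),
        ("body", 0, "Use dnf."),
        ("heading", 1, "Usage"),
        ("body", 0, "Start the service."),
    ]
    for chunk in chunk_with_headings(outline):
        print(" > ".join(chunk.headings), "|", chunk.text)
```

Storing the heading path as metadata rather than prepending it to the text keeps the chunk body unchanged while still letting a prompt template include the section context when needed.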
About the authors
Aditi is a Technical Product Manager at Red Hat, working on InstructLab’s synthetic data generation capabilities. She is passionate about leveraging generative AI to create seamless, impactful end user experiences.
Aakanksha Duggal is a Senior Data Scientist at Red Hat, leading synthetic data generation efforts on InstructLab. Her work focuses on advancing scalable and impactful technologies in the field of AI.