Issue #16 February 2006

Introduction to DocBook XML

by Paul W. Frields


Introduction

For any open source project with a significant user population, documentation is vital to success. Documentation helps users quickly understand how the project's code works, comprehend how it is applicable to their needs, and put it to use. Some open source projects, typically the larger and more popular ones, are fortunate to have one or more documentation writers and editors. The Fedora™ Project, for example, has the dedicated Fedora Documentation Project. Other projects may not have such dedicated resources, and programmers have to spend part of their time on documentation. Without this documentation, there tends to be an extremely high barrier to entry for users, or even experienced developers, depending on the nature and complexity of a project. Most authors and developers therefore want to know:

  • how to efficiently write documentation
  • how to make this documentation available in a variety of formats
  • how to exchange document information with other code or content

This is the first in a series of articles that discuss DocBook XML, a popular and highly vaunted markup language for producing single-source documentation. The series also touches on the use of tools to query your documents for content information. These are all subjects in which I've become involved, or maybe entangled, as part of my work with the Fedora Documentation Project. Hopefully you'll find this information useful and interesting, whether you are new to documentation or to XML in general. Perhaps it will even be a stepping stone for you to join our project!

Markup

Many people author content without knowing anything about the markup they are using. Most users of proprietary word processors fit into this category. You can think of markup as a kind of formatting which surrounds the text content, and provides instructions on how to interpret it. Markup is not a new idea; Standard Generalized Markup Language, or SGML, has been an ISO standard since 1986 and was invented to ensure data interchangeability for decades. Extensible Markup Language, or XML, is a younger subset of SGML, simpler in some ways and, therefore, far more popular.

XML, for example, declares that the angle brackets "< >" should be used to delimit, or set off, markup, whereas SGML allows you to customize this feature (among many others). On the whole, though, these markup languages don't define many particular markup semantics, but rather the grammar of how to create a specific markup language. Other markup languages, such as Hypertext Markup Language (HTML), declare and define specific semantic language constructs, such as BODY, P, and META elements.

Programmers can write applications such as web browsers directly against these language standards to format or render the content and present it to the user. Formatting or content may be further altered by the application of stylesheets.

DocBook and DTDs

DocBook is a markup language originally created using SGML and designed for technical documentation. It has been "ported" to XML, however, and can be used for almost any kind of documentation. Most people refer to DocBook as a Document Type Definition (DTD).

A DTD describes the specific markup that is allowed in an XML document that uses that DTD. This makes DocBook more analogous to HTML, because it declares the semantics of markup, such as permissible element names.

By identifying an XML file as subscribing to this DTD, you label it as a DocBook file to parsers. Strictly speaking, just about anything that can understand and retrieve information from XML is a parser, including you! A good DTD is helpful because it makes a file easy to understand for both people and automatic parsers when reading the raw XML. Take the following snippet, for instance:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<article lang="en">
  <title>Docs Project Completes Work, Hits Road</title>
  <articleinfo>
    <author>
      <firstname>Karsten</firstname>
      <surname>Wade</surname>
    </author>
  </articleinfo>
  <section id="sn-intro">
    <title>Introduction</title>
    <para>
      This week, the Fedora Documentation Project
      completed all documentation for Fedora Core, and
      subsequently everyone went to Tahiti for a week.
      For more information, or to have tasty beverages
      delivered to our bungalow, refer to <ulink
        url="http://fedoraproject.org/wiki/Docs/Tahiti"/>.
    </para>
  </section>
</article>

If you've ever used the "View Source" option in your Web browser, this type of markup grammar probably looks somewhat familiar. The angle brackets "< >" surround tags, and those tags are used to mark, or delimit, the beginning and end of elements in the XML document. Some tags are self-closing, meaning the element is defined within the tag itself. You can think of these elements as nodes in a hierarchical tree of information. If the tree is properly organized, and open elements do not overlap, the document is called well-formed.
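As a rough illustration of well-formedness, the following Python sketch leans on the standard library's xml.etree.ElementTree parser, which refuses any input that is not well-formed. The helper name is_well_formed and the two sample strings are invented for this example; the parser checks only that the elements nest properly, not whether they are legal DocBook.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Elements nest cleanly: <title> closes before <para> opens.
nested = "<section><title>Intro</title><para>Hello.</para></section>"

# Elements overlap: <para> is still open when </title> appears.
overlapping = "<section><title>Intro<para>Hello.</title></para></section>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```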

Although you may not have seen these particular elements before, such as article and ulink, it is not difficult to guess what objects they represent in the context of this document. The strict organization makes XML ideal for storage, retrieval, and exchange of information between applications. It also makes XML easy to parse manually, which is helpful when you are trying to read XML which has an unknown DTD.
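To make the point about retrieval concrete, here is a minimal sketch that pulls a few facts out of a condensed copy of the article shown earlier. It assumes Python's standard xml.etree.ElementTree module; the variable names are made up for the example, and the paragraph text has been shortened.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the sample article from the listing above.
ARTICLE = """<article lang="en">
  <title>Docs Project Completes Work, Hits Road</title>
  <articleinfo>
    <author>
      <firstname>Karsten</firstname>
      <surname>Wade</surname>
    </author>
  </articleinfo>
  <section id="sn-intro">
    <title>Introduction</title>
    <para>Refer to <ulink url="http://fedoraproject.org/wiki/Docs/Tahiti"/>.</para>
  </section>
</article>"""

root = ET.fromstring(ARTICLE)

# Paths are relative to the root element, mirroring the tree structure.
article_title = root.findtext("title")
author = " ".join(
    root.findtext(f"articleinfo/author/{part}") for part in ("firstname", "surname")
)
# iter() walks the whole tree, so nested elements are found at any depth.
section_ids = [section.get("id") for section in root.iter("section")]
links = [ulink.get("url") for ulink in root.iter("ulink")]

print(article_title)
print(author)
print(section_ids)
print(links)
```

Because the element names describe the content rather than its appearance, the queries read almost like plain English.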

The DTD sets out all the rules for marking up this particular kind of XML document. Note the DOCTYPE declaration at the top of the document, which indicates the DTD this document uses, and what element it uses for its root, or topmost, element. There are many guides available online to teach you how to read and write DTDs, such as the XML Guide, available at http://xmlwriter.net/resources/xml_guide.shtml. When you use a DTD, you also make it possible for XML tools to do a lot of "heavy lifting" on your behalf, such as validation. Validation is an important part of XML work; a document is valid when it follows the rules set out in the DTD for its DOCTYPE.
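The sketch below shows one way a program might read the DOCTYPE declaration and compare the root element it names with the root element actually present. It assumes Python's standard xml.parsers.expat module; the function name read_doctype and the sample document are invented for illustration. Note that this is not validation: expat does not fetch or apply the DTD, so a full check still requires a validating parser.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<article lang="en">
  <title>Example</title>
  <para>Some text.</para>
</article>"""


def read_doctype(xml_text: str) -> dict:
    """Collect the DOCTYPE details and the name of the actual root element."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["doctype_root"] = name
        found["public_id"] = public_id
        found["system_id"] = system_id

    def on_start_element(name, attributes):
        # Only the first start tag seen is the root element.
        found.setdefault("actual_root", name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start_element
    parser.Parse(xml_text, True)
    return found


info = read_doctype(DOCUMENT)
print(info["public_id"])
print(info["doctype_root"] == info["actual_root"])
```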

The DTD for DocBook is fully described in DocBook: The Definitive Guide (TDG), available online at http://www.docbook.org/ in a variety of formats, or at a computer bookstore near you. This book contains everything you might want to know about using the DocBook DTD to create all kinds of documentation. I keep an HTML copy on my laptop hard disk for reference when writing. The latest version of the book online covers the DocBook 5.0 schema, which is still in beta release. If you are looking for documentation for an earlier schema such as 4.4 or 4.2, you can find the DocBook source at http://cvs.sourceforge.net/viewcvs.py/docbook/defguide/.

XML Editors

You can choose from a huge variety of XML-savvy editors regardless of your environment preferences. If you are a graphically oriented, point-and-click user, you could try:

  • Conglomerate or MlView, dedicated GNOME-based XML editors
  • Kate or KXML, KDE-based XML editors
  • jEdit or Eclipse, complete Java or gcj-based development environments for basically everything including XML

All of the above are available for download, many through Red Hat Network for Red Hat® Enterprise Linux® systems, or yum for Fedora Core systems.

This is by no means an exhaustive list, and there are a number of non-free XML editors, in either the "beer" or "speech" sense, available on the open market as well. Of course, if you prefer a standard console interface, you can also use Vim, jed, or any of a number of other text editors. Most XML editors share a palette of great features that help you ensure the consistency and validity of your XML files.

One of the easiest ways to edit XML is using Emacs with PSGML. Emacs uses modes to address different types of files, and PSGML adds a full-featured Emacs mode for SGML and XML. This is the preferred tool at the Fedora Documentation Project, and once you have six or seven basic key combinations memorized, it is an incredibly efficient and powerful one.

Emacs can automatically display tag choices; organize, tag, re-tag, and compile your documents; rearrange whole element structures with a few keystrokes; and of course validate your document. The remainder of the article references Emacs keyboard commands, but the concepts are transferable to any editor you choose.

Creating DocBook XML Content

First ensure you have the necessary files installed. In Fedora Core, run the following commands:

su -c 'yum groupinstall "Authoring and Publishing"'
su -c 'yum install emacs psgml'

To create a DocBook XML file, simply add an XML declaration and a DOCTYPE declaration at the top of your document:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
 "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

If you are interested in the meaning of these declarations, refer to Chapter 2 of TDG. You can choose any element in the DocBook DTD as the root element for the DOCTYPE declaration, but most people choose article, or book for very long works. The root element has nothing to do with the number of files you use; you may break your content into files practically without limitation.
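If you start new documents often, a small helper can write the declarations for you. The following is only a sketch: the function name docbook_skeleton is hypothetical, and the public and system identifiers are copied from the first listing in this article. The result is well-formed but not yet valid, because the DTD expects real content after the title.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Identifiers copied from the first listing in this article.
PUBLIC_ID = "-//OASIS//DTD DocBook XML V4.2//EN"
SYSTEM_ID = "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"


def docbook_skeleton(root: str = "article", title: str = "Untitled") -> str:
    """Return a starting point whose DOCTYPE names the chosen root element."""
    return (
        '<?xml version="1.0" standalone="no"?>\n'
        f'<!DOCTYPE {root} PUBLIC "{PUBLIC_ID}"\n'
        f' "{SYSTEM_ID}">\n'
        f"<{root}>\n"
        f"  <title>{escape(title)}</title>\n"
        f"</{root}>\n"
    )


book_text = docbook_skeleton(root="book", title="A Very Long Work")
print(book_text)

# The name in the DOCTYPE and the root element must agree.
book_root = ET.fromstring(book_text)
print(book_root.tag)
```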

After you add your declarations, in Emacs issue the keystrokes Ctrl-C, Ctrl-P, which parses the document prolog. (Note that many of the mode-specific commands in Emacs start with Ctrl-C.) In this case, parsing the prolog loads the DocBook XML V4.2 DTD named in the DOCTYPE declaration. Once you do this, you can use Ctrl-C, < to open the tag for a new element. If you hit the Tab key, Emacs opens a list of all valid tags, based on the actual DTD. You can hit Tab repeatedly to browse the list. If you type a partial completion for the tag and then hit Tab, Emacs finishes the tag name for you. When you hit Enter, the tag is inserted in your document.

To close a tag, hit Ctrl-C, /. Some elements have required content, so if you try to close a tag prematurely, Emacs warns you in the status bar with the message:

Can't end element here

There are many Emacs tutorials available online that detail all the powerful key commands at your disposal. For a very brief rundown of some of them, refer to the corresponding section of the Fedora Documentation Project's Documentation Guide.

When you are just starting out with DocBook, it is difficult to know which element is most useful for specific content. While learning DocBook, I found it helpful to keep a copy of TDG open on my desktop, where I would often browse the element list. Each element listed in TDG has a dedicated page listing its valid content, attributes, and parentage, and often some example usage. After a few paragraphs of writing, the tags quickly become second nature.

Converting DocBook Documents

For all DocBook's flexibility and ease of creation, most users are not interested in visually parsing the XML source. They would rather read the content in a comfortable format such as HTML or PDF; DocBook is only concerned with the proper categorizing of content, and not with the presentation of that content. The DocBook DTD doesn't declare, for example, how large headings should be, which text should be italicized, or what page margins should be. These decisions are left up to other processing engines.

You can invoke some of the standard processors using the helper command xmlto, which is a wrapper for common XML processing programs. The xmlto command converts documents into a number of different formats, including HTML and XHTML (either as one long file or separate "chunks"), PostScript, man pages, and others.

To convert your DocBook document to a series of HTML files, for example, use the following command:

xmlto html my-docbook-file.xml

If you use the format html-nochunks instead of html, the output is created as a single monolithic HTML page.
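If you convert documents from a script, it can help to build the xmlto command line in one place. The sketch below assumes Python's standard subprocess and shutil modules; the function names xmlto_command and convert are hypothetical, and the -o option is used to choose an output directory. The conversion itself only runs if xmlto is installed.

```python
import shutil
import subprocess
from typing import List, Optional


def xmlto_command(
    source: str, output_format: str = "html", output_dir: Optional[str] = None
) -> List[str]:
    """Build the argument list for an xmlto invocation."""
    command = ["xmlto"]
    if output_dir is not None:
        command += ["-o", output_dir]
    command += [output_format, source]
    return command


def convert(
    source: str, output_format: str = "html", output_dir: Optional[str] = None
) -> bool:
    """Run xmlto if it is installed; return True when it exits successfully."""
    if shutil.which("xmlto") is None:
        return False
    result = subprocess.run(
        xmlto_command(source, output_format, output_dir), check=False
    )
    return result.returncode == 0


# Show the commands without running them.
print(" ".join(xmlto_command("my-docbook-file.xml")))
print(" ".join(xmlto_command("my-docbook-file.xml", "html-nochunks", "out")))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.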

Unfortunately, xmlto currently produces unattractive PDF files, but there are alternatives available. The docbook2* script wrappers may be useful in this regard. These helper scripts use openjade to convert DocBook source into other formats. If you have the ghostscript package installed, try these commands to generate a more readable PDF output:

docbook2ps my-docbook-file.xml
ps2pdf my-docbook-file.ps

Both xmlto and openjade are processors that rely on stylesheet implementations to transform DocBook source into other formats. A stylesheet is a way of changing the presentation of content. You may already know about the Cascading Style Sheets (CSS) standard, which is used by Web authors and designers to present HTML content in attractive ways. Similarly, the openjade system uses a powerful stylesheet language called Document Style Semantics and Specification Language (DSSSL) to present DocBook in different formats such as HTML and PostScript. DSSSL was originally designed for SGML, though, and is not as widely used today. Instead, many XML authors and programmers use the Extensible Stylesheet Language (XSL) to do the same work for XML files, including DocBook XML.
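The idea of a stylesheet as a set of presentation rules kept apart from the content can be imitated in a few lines of Python. To be clear, this toy is neither XSL nor DSSSL, and it is not how xmlto or openjade work internally; the rule tables and the render function are invented purely to show the same content presented two ways.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SOURCE = (
    "<section><title>Introduction</title>"
    "<para>DocBook describes <emphasis>structure</emphasis>, "
    "not appearance.</para></section>"
)

# Two toy "stylesheets": each maps DocBook element names to HTML tags.
PLAIN_RULES = {"section": "div", "title": "h2", "para": "p", "emphasis": "em"}
BOLD_RULES = {"section": "div", "title": "h1", "para": "p", "emphasis": "strong"}


def render(element: ET.Element, rules: dict) -> str:
    """Recursively rewrite an element using the given rule table."""
    tag = rules.get(element.tag, "span")  # unknown elements fall back to <span>
    parts = [f"<{tag}>", escape(element.text or "")]
    for child in element:
        parts.append(render(child, rules))
        parts.append(escape(child.tail or ""))  # text that follows the child
    parts.append(f"</{tag}>")
    return "".join(parts)


content = ET.fromstring(SOURCE)
plain_html = render(content, PLAIN_RULES)
bold_html = render(content, BOLD_RULES)
print(plain_html)
print(bold_html)
```

The source document never changes; only the rule table does, which is the essence of single-source publishing.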

Next Installment

The next article in this series will present some ways you can use XSL stylesheets to exchange information to and from XML files, including DocBook, or transform them into entirely different structures. See you then!

About the author

Paul W. Frields is an engineer with a background in digital forensics and investigation who has taught Linux to hundreds of technical and law enforcement professionals. He spends part of his spare time working on odds and ends for the Fedora Project, especially documentation. The other part is devoted to his wife and children, and his part-time work as a professional musician.