Issue #17 March 2006

Introduction to DocBook XML, part 2: XSLT

by Paul W. Frields


Introduction

Last month I introduced you to DocBook XML, a wonderful way to write documentation for software projects or just about any purpose. These articles, for instance, are written in DocBook XML. On the Fedora™ Documentation Project, we are using DocBook XML to produce release notes, tutorials and guides for Fedora users and administrators. Because DocBook has everything to do with content and very little to do with presentation, the author can use any of a number of tools, and concentrate on writing rather than formatting. Because DocBook XML uses standard XML technologies, there are plenty of ways to stylize and present the information contained in a document.

This month we're going to look at the Extensible Stylesheet Language (XSL), and in particular XSL Transformations (XSLT), which can be used to format or transform information in XML files. The Fedora Documentation Project, for example, uses XSLT to capture, process and convert XML information contained in our documentation files. I'll present several working examples of XSLT. To follow along with this article and run them, you should have the "Authoring and Publishing" package group installed on your Red Hat Enterprise Linux or Fedora Core system. Use the appropriate software management utility for your platform. For Fedora Core use the following command:

su -c 'yum groupinstall "Authoring and Publishing"'

I will assume you've read last month's article, but nothing beyond that. I will steer clear of an exhaustive examination of XMLish jargon so as not to frighten anyone away. Just keep in mind that many of the concepts in this article have complex underpinnings, which you'll want to investigate if you want to peek "under the hood."

XSL Stylesheets

Of course there is a plethora of books available, a few published in electronic form on the Internet, that discuss XML and XSL. Although this article can't possibly cover all the details of this powerful and flexible technology, it can at least present some of the most rudimentary concepts. If you're just getting started with XSL, I highly recommend that you download and keep a copy of a good "cheat sheet." One of the best compact ones I've found for XSL is a tutorial called XSL Concepts and Practical Use written by Paul Grosso and Norman Walsh, which you can find at http://nwalsh.com/docs/tutorials/xsl/. That tutorial covers not just XSLT but also XSL Formatting Objects (XSL-FO), which this article does not discuss.

One way to use XSL is to simply write a stylesheet, which is itself an XML document. XSL is used frequently in DocBook XML processing tasks such as converting the XML source into another format. The xmlto command, for example, which we looked at briefly in Part One of this series, uses XSL stylesheets to create HTML pages. The easiest way to start learning a little about XSL is to see a stylesheet in action.

Example 1. authors.xsl

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="no"
    version="1.0" encoding="UTF-8" />

  <xsl:template name="people" match="/">
    <!-- Visit every author and editor, wherever they appear -->
    <xsl:for-each select="//author|//editor">
      <xsl:element name="person">
        <xsl:attribute name="fullname">
          <xsl:value-of select="firstname"/>
          <xsl:text> </xsl:text>
          <!-- Add the middle name or initial only when one is present -->
          <xsl:if test="othername != ''">
            <xsl:value-of select="othername"/>
            <xsl:text> </xsl:text>
          </xsl:if>
          <xsl:value-of select="surname"/>
        </xsl:attribute>
      </xsl:element>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Let's walk through the stylesheet to see what it does. Keep in mind that stylesheets are usually written against particular document types, but they make no hard and fast requirements of the input XML document. In fact, they will often work against an invalid document, that is, one that breaks the rules of its DTD. The input must still be well-formed, though, or the processor cannot parse it at all.

The topmost line is the common XML declaration, and following that is the xsl:stylesheet element. This element is interesting because it declares a namespace that is used in this document to identify elements. Essentially, you could interpret this element as stating "This is an XSL stylesheet, by which 'XSL' means it adheres to the version 1.0 standards set out in the document at [this URL]." Namespaces are often used in XML to bring in elements from a vocabulary that the containing document type does not itself define. Using namespaces will be a suitable topic for a separate article, so we'll save that for another time.

The next nested level in this document includes the following elements, which are not the only possible elements, but all that is required for our purposes:

  1. The xsl:output element, as you might expect, sets out rules for the output of this stylesheet. In this case, the output will be a UTF-8 encoded XML document, indented for readability and beginning with the standard XML declaration.

  2. The xsl:template element has the name people, and indicates which nodes in the original XML document are to be processed, and how.

A node is an atomic unit of valid XML. It might be an element, or an attribute, or a text string. Nodes are arranged in a tree in XML documents. The xsl:template matches and processes nodes based on its attributes and content. The match rule identifies specific nodes or groups of nodes, whether they are related in the tree or completely disparate. The matching syntax used is known as XPath, and incorporates a very flexible pattern-based system for locating elements. You can find a fuller explanation of XPath at the aforementioned XSL tutorial, but here are a few simple examples:

/

matches the root (top-level) node

//name

matches an element of type name anywhere in the document

foo/*/bar

matches any bar element that has a grandparent foo element in the current node (context)

book[@title="Infinite Jest"]

matches a book element in the current node (context) that has a title attribute with a value of Infinite Jest
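
To see expressions like these at work, here is a minimal sketch of a stylesheet that only counts what each one selects. It assumes a DocBook-style input in which author elements sit two levels below a top-level article element (for example inside articleinfo); that layout is an assumption for illustration, so adjust the element names to suit your own document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- "/" makes the root node the context for everything below -->
  <xsl:template match="/">
    <!-- //author: author elements anywhere in the document -->
    <xsl:text>authors anywhere: </xsl:text>
    <xsl:value-of select="count(//author)"/>
    <xsl:text>&#10;</xsl:text>

    <!-- article/*/author: author elements whose grandparent is an
         article element that is a child of the context node -->
    <xsl:text>authors two levels below article: </xsl:text>
    <xsl:value-of select="count(article/*/author)"/>
    <xsl:text>&#10;</xsl:text>

    <!-- [@role='mi']: keep only othername elements with that attribute value -->
    <xsl:text>middle initials: </xsl:text>
    <xsl:value-of select="count(//othername[@role='mi'])"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Because the output method is text, the result is just three labeled counts, which makes it easy to experiment with one expression at a time.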

Our template will match only the root of the document, which you should note is not the same as matching every element in the document. (That rule would be match="*".) The current node or context when the template is invoked, therefore, is the root of the document. For any author or editor element found, regardless of its location in the content tree (or infoset), the stylesheet will write a person element. That element will have a fullname attribute consisting of the following:

  1. the text of the source element's firstname child element, followed by a space

  2. if the source element has a non-empty othername child element, the text of that element, followed by a space

  3. the text of the source element's surname child element

If our input XML document contains a node like the following example:

<author>

  <surname>Public</surname>
  <firstname>John</firstname>
  <othername role="mi">Q.</othername>
</author>

then the document resulting from a transformation using the above stylesheet will contain the following element:

<person fullname="John Q. Public"/>

Notice that the output element has no content of its own, only an attribute, so it is called an empty element. Rather than using both an opening and a closing tag, it uses a single self-closing tag, in which the final angle bracket is preceded by a slash. Being empty does not change the element's intrinsic value as a node in the infoset.
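
One caveat: if the input contains more than one author or editor, the stylesheet in Example 1 writes several person elements side by side with nothing enclosing them, and a well-formed XML document needs exactly one top-level element. The following variation is a minimal sketch of one way around that. The people wrapper element is a hypothetical name chosen for illustration and is not part of the original example. It also uses literal result elements in place of xsl:element, which is equivalent here and a little shorter.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <xsl:template match="/">
    <!-- "people" is a hypothetical wrapper name, not part of Example 1 -->
    <people>
      <xsl:for-each select="//author|//editor">
        <person>
          <xsl:attribute name="fullname">
            <xsl:value-of select="firstname"/>
            <xsl:text> </xsl:text>
            <xsl:if test="othername != ''">
              <xsl:value-of select="othername"/>
              <xsl:text> </xsl:text>
            </xsl:if>
            <xsl:value-of select="surname"/>
          </xsl:attribute>
        </person>
      </xsl:for-each>
    </people>
  </xsl:template>
</xsl:stylesheet>
```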

Using xsltproc

The libxslt library contains the priceless xsltproc utility, which, among other functions, allows you to process XML documents with your XSL stylesheets. If you want to see xsltproc in action, copy and paste the XML file from last month's article into your favorite editor, and save it as original.xml. Similarly copy and paste the XSL stylesheet above to authors.xsl. Then run the following command:

xsltproc authors.xsl original.xml

The results should be XML output with new elements. This tiny, simple example shows you the power of XML for data interchange. XSLT provides instructions that allow you to move data easily between different XML document types, as this example demonstrates. Of course, this output could have been designed for a specific DTD, and notated accordingly. Alter the declaration of the xsl:output element slightly:

<xsl:output method="xml" indent="yes" 
  omit-xml-declaration="no" version="1.0" encoding="UTF-8" 
  doctype-public="-//Bogus//DTD RHM Example XML V0.01//EN"
  doctype-system="people.dtd" />

Now regenerate the output using the same xsltproc command, and note the difference in the output XML. If you had a DTD matching the public identifier and located at the system identifier specified (people.dtd), you could validate the resulting file for consistency with the DTD. Notice that the root element named in the DOCTYPE is derived from the first top-level element of the output. This feature allows you to easily extract part of an infoset into a separate document that uses the same DTD. Say, for instance, we wanted to extract our document's revision history to a separate file for some auditing purpose. We could write a very short XSLT stylesheet for this purpose:

Example 2. revhist.xsl


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  version="1.0">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="no" 
    version="1.0" encoding="UTF-8"
    doctype-public="-//OASIS//DTD DocBook XML V4.2//EN" 
    doctype-system="http://www.docbook.org/xml/4.2/docbookx.dtd"/>
  <xsl:template match="/">
    <xsl:for-each select="//revhistory">
      <xsl:copy-of select="."/>
    </xsl:for-each>

  </xsl:template>
</xsl:stylesheet>

Using XSLT to populate other documents

XSLT is not limited to outputting XML. XSLT 1.0 also defines html and text output methods, so a stylesheet can write practically any text-based format. Stylesheets that produce elaborate formats are usually harder to read and understand, so we'll steer clear of them for now. You can easily imagine, however, using XML to populate a different kind of text file, such as a configuration file, which uses some sort of regular formatting.
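
As a small illustration of the idea, the sketch below writes one key = value line per author element found in the input. The key naming scheme (author.1.surname and so on) and the comment header line are invented for this example and do not correspond to any real configuration format.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <!-- method="text" writes only text, with no markup or XML declaration -->
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <xsl:text># generated file, do not edit by hand&#10;</xsl:text>
    <xsl:for-each select="//author">
      <!-- position() numbers the authors 1, 2, 3 ... in document order -->
      <xsl:text>author.</xsl:text>
      <xsl:value-of select="position()"/>
      <xsl:text>.surname = </xsl:text>
      <xsl:value-of select="surname"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Wrapping every piece of literal text in xsl:text keeps stray whitespace from the stylesheet out of the output, which matters more for text files than it does for XML.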

The Fedora Documentation Project (FDP) is in the final stages of preparing a packaging process for official documentation that draws content directly from some of our XML source. Let's look at a small portion of XSLT from one of our source files:

Example 3. spec.xsl

 
<!-- Transform rpm-info.xml into a SPEC File -->
<xsl:stylesheet version="1.0" xml:space="preserve"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output encoding="UTF-8" indent="no" method="text"
    omit-xml-declaration="no" standalone="no" version="1.0"/>

  <xsl:param name="lang" select="'en'" />
  <xsl:param name="docbase" select="'example-tutorial'" />

  <xsl:template match="/"># Fedora Documentation Specfile
%define	docbase	<xsl:value-of select="$docbase"/>
%define doclang <xsl:value-of select="$lang"/>
%{!?fdpdir:%define localbuild 1}
%{!?fdpdir:%define fdpdir %{_datadir}/fedora/doc}

Summary:	Fedora Documentation: %{docbase}-%{doclang}
Name:	        fedora-doc-%{docbase}
Version:	<xsl:value-of select="/rpm-info/changelog/revision[@role = 'doc'][1]/@number"/>

Release:	<xsl:value-of select="/rpm-info/changelog/revision[@role = 'rpm'][1]/@number"/>
...

You can find the current version of the entire XSLT in our CVS store at http://cvs.fedora.redhat.com/viewcvs/docs-common/packaging/spec.xsl?root=docs. Like much other XSLT, this stylesheet expects a certain kind of XML document for input. In this case it's an rpm-info document, whose DTD you can also find in CVS, at http://cvs.fedora.redhat.com/viewcvs/docs-common/packaging/rpm-info.dtd?root=docs.
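
To make the two XPath expressions in Example 3 easier to follow, here is a hypothetical input that is consistent with them. It shows only the elements and attributes those expressions read, so treat it as an assumption about the shape of the data. The real rpm-info DTD may well require more. The version and release values are taken from Example 4, and the older 0.14 revision is invented to show what the [1] predicate does.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical input: only the parts Example 3 reads are shown.
     The real rpm-info DTD may require more elements and attributes. -->
<rpm-info>
  <changelog>
    <!-- [@role = 'doc'][1] picks the first doc revision in document order -->
    <revision role="doc" number="0.14.1"/>
    <revision role="rpm" number="1"/>
    <!-- An older doc revision, invented here; the [1] predicate skips it -->
    <revision role="doc" number="0.14"/>
  </changelog>
</rpm-info>
```

The first predicate narrows the revision elements to those with the wanted role, and the second keeps only the first of them, so the newest entry has to come first in the changelog for this to work.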

Our packaging passes a couple of parameters to this stylesheet, including docbase and lang. These parameter values are used to populate the resulting specfile, which for the English (en_US) version of example-tutorial, might have a preamble that looks like this:

Example 4. Results of transformation via spec.xsl

 

# Fedora Documentation Specfile 
%define	docbase	example-tutorial 
%define doclang en_US
%{!?fdpdir:%define localbuild 1}
%{!?fdpdir:%define fdpdir %{_datadir}/fedora/doc} 
Summary:	Fedora Documentation: example-tutorial-en_US
Name:	        fedora-doc-example-tutorial 
Version:	0.14.1 
Release:	1 
...

You can feed xsltproc parameter values at the command line using the --param or --stringparam options; --param treats the value as an XPath expression, while --stringparam passes it as a plain string. It's tempting to think the parameter declarations set explicitly in the stylesheet above would simply override any lang and docbase parameters received at invocation time. Fortunately, this is not the case: the select attribute of a top-level xsl:param supplies only a default, which is used when the caller passes no value for that parameter. These declarations therefore ensure that default values are always set. This is helpful when you need consistently acceptable output, but it can also give a visible indication in the output that no value was received from the caller. For example, you could set the default value for the parameter to FIXME.
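
Here is a minimal sketch of that fallback technique, reduced to a single parameter. The file name param-demo.xsl used below is hypothetical; the stylesheet ignores the content of its input, so any well-formed XML file will do.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Default used only when the caller supplies no value for "lang" -->
  <xsl:param name="lang" select="'FIXME'"/>

  <xsl:template match="/">
    <xsl:text>%define doclang </xsl:text>
    <xsl:value-of select="$lang"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Running xsltproc --stringparam lang en_US param-demo.xsl original.xml should write the supplied value into the output line, while running xsltproc param-demo.xsl original.xml with no parameter should leave FIXME there as a visible reminder.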

Conclusion

Hopefully you've seen the potential for XSLT to transform not just your documents, but the way you use the information they provide. Although we've focused specifically on applying XSLT to DocBook XML source, it is equally powerful when used with any XML data store. This is why XML has become ubiquitous throughout business information enterprises: it ensures that your data is always accessible and never arbitrarily confined, regardless of the applications using it. Your data (or in the case of DocBook, documentation) can be leveraged to work in conjunction with your software, and vice versa.

About the author

Paul W. Frields is an engineer with a background in digital forensics and investigation who has taught Linux to hundreds of technical and law enforcement professionals. He spends part of his spare time working on odds and ends for the Fedora Project, especially documentation. The other part is devoted to his wife and children, and his part-time work as a professional musician.