Wednesday, 30 July 2014

Generating PDFs with dblatex on Fedora

Recently, I experimented with converting DocBook to PDF using dblatex, and this post summarizes my experience of installing and running dblatex on Fedora 20. But first, I want to consider the motivation: why should anyone be interested in dblatex in the first place?

If you write your documentation in DocBook and you want to produce PDF output of a professional standard, only a limited number of options are available to you. For example:
  • FO processors (open source or commercial)
  • HTML-to-PDF converters (open source or commercial)
  • DocBook-to-LaTeX converters (open source)

FO Processors 

In the FO (Formatting Objects) approach, DocBook is converted to an intermediate format, an XML FO file, which is then converted to PDF. This approach was intended to be the canonical tool chain for publishing DocBook as PDF (and is fully supported by the standard DocBook XSL toolkit). Unfortunately, the FO approach seems to have run into difficulties in recent years. The FO processor is a complex tool (for example, see Apache FOP), requiring expert knowledge to fix bugs, add new features, or even just to tweak an FO template. So much so that people are starting to turn away from FO and look for alternatives.
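For reference, the two-step FO tool chain can be sketched roughly as follows. This is a minimal sketch, not a tested recipe: the stylesheet location in DOCBOOK_XSL is an assumption (it varies between distributions), and the helper names fo_name and docbook_to_pdf_via_fo are made up for illustration. It assumes xsltproc and Apache FOP are installed.

```shell
#!/bin/sh
# Sketch of the DocBook -> FO -> PDF tool chain.
# ASSUMPTION: location of the DocBook XSL stylesheets; adjust for your system.
DOCBOOK_XSL="${DOCBOOK_XSL:-/usr/share/sgml/docbook/xsl-stylesheets}"

# Hypothetical helper: derive the intermediate file name, MyBook.xml -> MyBook.fo
fo_name() {
    printf '%s\n' "${1%.xml}.fo"
}

# Hypothetical helper: run both steps of the conversion.
docbook_to_pdf_via_fo() {
    src=$1
    fo=$(fo_name "$src")
    # Step 1: DocBook XML -> XSL-FO, using the fo/ stylesheet from DocBook XSL.
    xsltproc -o "$fo" "$DOCBOOK_XSL/fo/docbook.xsl" "$src" &&
    # Step 2: XSL-FO -> PDF, using Apache FOP.
    fop -fo "$fo" -pdf "${fo%.fo}.pdf"
}
```

With these definitions loaded, calling docbook_to_pdf_via_fo MyBook.xml would leave MyBook.fo and MyBook.pdf next to the source file.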

HTML-to-PDF Converters

The HTML-to-PDF converters offer a much simpler approach. These are available in both open source (for example, wkhtmltopdf) and commercial (for example, Prince XML) varieties. In this case, you convert DocBook to html-single format and then use the HTML-to-PDF converter to produce the final PDF output. The output looks a bit like what you would get if you printed an HTML page from a browser. In fact, this is no coincidence, because the wkhtmltopdf tool is based on WebKit, which is the engine that the Safari browser uses to render and print HTML pages. This approach is simple, but offers limited options for fine-tuning the page layout.
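As a rough illustration of the HTML-to-PDF route, here is a minimal sketch using xsltproc and wkhtmltopdf. The stylesheet location in DOCBOOK_XSL is an assumption (it varies between distributions), and the helper names html_name, pdf_name and docbook_to_pdf_via_html are made up for this example.

```shell
#!/bin/sh
# Sketch of the DocBook -> single HTML page -> PDF route.
# ASSUMPTION: location of the DocBook XSL stylesheets; adjust for your system.
DOCBOOK_XSL="${DOCBOOK_XSL:-/usr/share/sgml/docbook/xsl-stylesheets}"

# Hypothetical helpers: derive output names, MyBook.xml -> MyBook.html / MyBook.pdf
html_name() {
    printf '%s\n' "${1%.xml}.html"
}
pdf_name() {
    printf '%s\n' "${1%.xml}.pdf"
}

# Hypothetical helper: run both steps of the conversion.
docbook_to_pdf_via_html() {
    src=$1
    html=$(html_name "$src")
    # Step 1: DocBook XML -> one HTML page (the html/docbook.xsl stylesheet).
    xsltproc -o "$html" "$DOCBOOK_XSL/html/docbook.xsl" "$src" &&
    # Step 2: HTML -> PDF, rendered by wkhtmltopdf.
    wkhtmltopdf "$html" "$(pdf_name "$src")"
}
```

Any page layout tweaks in this approach would have to go into the CSS applied to the HTML, which is why the fine-tuning options are limited.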


DocBook-to-LaTeX Converters

Finally, there is the DocBook-to-LaTeX approach. LaTeX occupies something of a publishing niche, since it is mainly used for publishing in academia. But it has some notable strengths:
  1. LaTeX has strong page layout capabilities. The requisite options are not necessarily easy to tweak, but they are there if you need them.
  2. LaTeX has strong localisation support, through the Babel package. Right-to-left languages, such as Hebrew and Arabic, are also supported, for example by the ArabTeX package.
  3. LaTeX has a long history and a large user base, which suggests that this tool is not going to disappear anytime soon.
And it is also noteworthy (although not really relevant to documentation in the software industry) that LaTeX is superlatively good at typesetting mathematical equations. Really, nothing else comes close.

Install and Run dblatex

It is straightforward enough to install and run dblatex, but there are a few gotchas to watch out for. If you are lucky enough to have a Fedora system, installing dblatex is easy with yum:

sudo yum install dblatex

Now, if you have a straightforward DocBook book file, MyBook.xml, you can convert it to PDF with this simple command:

dblatex MyBook.xml

The output you get is a PDF file, MyBook.pdf (the intermediate files are automatically deleted at the end of the processing). However, converting my sample book (the Camel Development Guide) was a bit more complicated, because our documents use olink elements for cross-referencing between books. If you are familiar with the DocBook XSL toolkit, you will know that this means you need to generate link database files and a site layout file (site.xml). Assuming you already have these files (they can be generated using DocBook XSL), you need to specify the following properties to the dblatex command: target.database.document, which should be set to the absolute location of the site.xml file; and current.docid, which is the value of your book element's xml:id attribute. So, for the Camel Development Guide, I needed to enter the following dblatex command:

dblatex -P target.database.document=`pwd`/site.xml -P current.docid=CamelDev camel_dev/camel_dev.xml

(Note the use of `pwd` to get the absolute path to site.xml. This should not really be necessary; it's a minor bug in the dblatex tool.) So, what happened after running this command? Well, dblatex converted the 449-page document without a hitch, which is a good start. On the other hand, what you get initially is the 'standard' LaTeX document look-and-feel (if you have ever seen the output of a typical LaTeX document, you'll know what I mean). So, the next question that presents itself is: how can I customize the layout and styling so that it has a more professional look-and-feel? I don't know the answer to that question yet, but it might well be the subject of a later blog post.
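To make the olink steps described above repeatable, they could be wrapped in a small script along the following lines. Treat this as a sketch under stated assumptions: the stylesheet location in DOCBOOK_XSL, the target.db file name, and the helper names abs_path, collect_targets and run_dblatex are all made up for illustration. The collect.xref.targets and targets.filename parameters come from DocBook XSL, and the site.xml file still has to be written by hand to point at each book's target database.

```shell
#!/bin/sh
# Sketch: collect olink targets with DocBook XSL, then run dblatex with an
# absolute path to site.xml (working around the relative-path problem).
# ASSUMPTION: location of the DocBook XSL stylesheets; adjust for your system.
DOCBOOK_XSL="${DOCBOOK_XSL:-/usr/share/sgml/docbook/xsl-stylesheets}"

# Hypothetical helper: turn a file path into an absolute path.
abs_path() {
    dir=$(cd "$(dirname "$1")" && pwd) || return 1
    printf '%s/%s\n' "$dir" "$(basename "$1")"
}

# Hypothetical helper: write the link database for one book next to the book
# file. With collect.xref.targets=only, no HTML output is generated.
# ASSUMPTION: the database file is called target.db.
collect_targets() {
    book=$1
    xsltproc \
        --stringparam collect.xref.targets only \
        --stringparam targets.filename "$(dirname "$book")/target.db" \
        "$DOCBOOK_XSL/html/docbook.xsl" "$book"
}

# Hypothetical helper: run dblatex with the two olink properties.
# Arguments: site layout file, document ID, book file.
run_dblatex() {
    site=$1
    docid=$2
    book=$3
    dblatex -P "target.database.document=$(abs_path "$site")" \
            -P "current.docid=$docid" \
            "$book"
}
```

For the Camel Development Guide, this would be used as collect_targets camel_dev/camel_dev.xml followed by run_dblatex site.xml CamelDev camel_dev/camel_dev.xml.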
