Wednesday, 30 July 2014

Generating PDFs with dblatex on Fedora

Recently, I experimented with converting DocBook to PDF using dblatex and this post summarizes my experience of installing and running dblatex on Fedora 20. But first I want to consider the motivation for doing this: why should anyone be interested in dblatex in the first place?

If you write your documentation in DocBook and you want to produce PDF output of a professional standard, there are a limited number of options available to you. For example:
  • FO processors (open source or commercial)
  • HTML-to-PDF converters (open source or commercial)
  • DocBook-to-LaTeX converters (open source)

FO Processors 

In the FO (Formatting Objects) approach, DocBook is converted to an intermediate format, an XML FO file, which is then converted to PDF. This approach was intended to be the canonical tool chain for publishing DocBook as PDF (and is fully supported by the standard DocBook XSL toolkit). Unfortunately, the FO approach seems to have run into difficulties in recent years. An FO processor is a complex tool (for example, see Apache FOP), requiring expert knowledge to fix bugs, add new features, or even just to tweak an FO template. So much so that people are starting to turn away from FO and look for alternatives.
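For reference, the classic FO tool chain boils down to just two commands, something like the following (the stylesheet path is an assumption typical of a Fedora install, and the file names are mine):

xsltproc --output MyBook.fo \
    /usr/share/sgml/docbook/xsl-stylesheets/fo/docbook.xsl MyBook.xml
fop -fo MyBook.fo -pdf MyBook.pdf

The first command applies the DocBook XSL FO stylesheet to produce the intermediate FO file, and the second renders it to PDF with Apache FOP.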

HTML-to-PDF Converters

The HTML-to-PDF converters offer a much simpler approach. These are available both in open source (for example, wkhtmltopdf) and commercial (for example, Prince XML) varieties. In this case, you convert DocBook to html-single format and then use the HTML-to-PDF converter to produce the final PDF output. The output looks a bit like what you would get if you print an HTML page from a browser. In fact, this is no coincidence, because the wkhtmltopdf tool is based on WebKit, which is the engine that the Safari browser uses to print HTML pages. This approach is simple, but offers limited options for fine-tuning the page layout.
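A rough sketch of the wkhtmltopdf tool chain (again, the stylesheet path is an assumption that varies by distribution):

xsltproc --output MyBook.html \
    /usr/share/sgml/docbook/xsl-stylesheets/html/docbook.xsl MyBook.xml
wkhtmltopdf MyBook.html MyBook.pdf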

DocBook-to-LaTeX

Finally, there is the DocBook-to-LaTeX approach. LaTeX occupies something of a niche, since it is mainly used for publishing in academia. But it has some notable strengths:
  1. LaTeX has strong page layout capabilities. The requisite options are not necessarily easy to tweak, but they are there if you need them.
  2. LaTeX has strong localisation support, through the Babel package. In particular, right-to-left languages, such as Hebrew and Arabic, are supported through the ArabTeX package.
  3. LaTeX has a long history and a large user base, which suggests that this tool is not going to disappear anytime soon.
And it is also noteworthy (although not really relevant to documentation in the software industry) that LaTeX is superlatively good at typesetting mathematical equations. Really, nothing else comes close.
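For example, one of Maxwell's equations takes only a few lines of LaTeX source to typeset beautifully:

\begin{equation}
  \nabla \times \mathbf{B} = \mu_0 \mathbf{J}
      + \mu_0 \epsilon_0 \frac{\partial \mathbf{E}}{\partial t}
\end{equation}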

Install and Run dblatex

It is straightforward enough to install and run dblatex, but there are a few gotchas to watch out for. If you are lucky enough to have a Fedora system, installing dblatex is easy with yum:

sudo yum install dblatex

Now, if you have a straightforward DocBook book file, MyBook.xml, you can convert it to PDF with this simple command:

dblatex MyBook.xml

The output you get is a PDF file, MyBook.pdf (the intermediate files are automatically deleted at the end of the processing). However, converting my sample book (the Camel Development Guide) was a bit more complicated, because our documents use olink elements for cross-referencing between books. If you are familiar with the DocBook XSL toolkit, you will know that this means you need to generate link database files and a site layout file (site.xml). Assuming you already have these files (they can be generated using DocBook XSL), you need to specify the following properties to the dblatex command:
  • target.database.document — set this to the absolute location of the site.xml file.
  • current.docid — set this to the value of your book element's xml:id attribute.
So, for the Camel Development Guide, I needed to enter the following dblatex command:

dblatex -P target.database.document=`pwd`/site.xml -P current.docid=CamelDev camel_dev/camel_dev.xml

(Note the use of `pwd` to get the absolute path to site.xml. This should not really be necessary; it's a minor bug in the dblatex tool.) So, what happened after running this command? Well, dblatex converted the 449-page document without a hitch, which is a good start. On the other hand, what you get initially is the 'standard' LaTeX document look-and-feel (if you have ever seen the output of a typical LaTeX document, you'll know what I mean). So, the next question that presents itself is: how can I customize the layout and styling so that it has a more professional look-and-feel? I don't know the answer to that question yet, but it might well be the subject of a later blog post.
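One promising avenue, judging by the dblatex documentation, is the -s (or --texstyle) option, which lets you supply a custom LaTeX style file. A minimal sketch, assuming a hypothetical style file called mystyle.sty:

dblatex -s mystyle.sty -P target.database.document=`pwd`/site.xml \
    -P current.docid=CamelDev camel_dev/camel_dev.xml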

Friday, 6 June 2014

GitCCMS Workflow for Updating Content Specs

In Pressgang, a content spec is an outline (like a table of contents) that assembles topics into a complete document. You can think of it as being analogous to a DITA map. As mentioned in a previous post, the GitCCMS utility can generate the content spec for you, if you enter the following command:

$ gitccms cs

GitCCMS puts the generated content spec file into the location cspec/<BookName>/contentspec.spec. You can then upload the new content spec using the standard Pressgang utility, csprocessor, as follows:

$ cd cspec/<BookName>
$ csprocessor create contentspec.spec
...
Content Specification ID: 22781
Revision: 664539
...

If the content specification is successfully created, the new content spec ID is logged to the console (in this case, 22781).

But what if you make some changes to the structure of the book in your git repository and you need to update the content spec? How do you go about that using GitCCMS? First of all, you need to store the new content spec ID in your .git-ccms configuration file. Using your favourite text editor, open the .git-ccms file and add a cspec attribute to the relevant book element. For example, if you had just created the content spec for the book, esb_migration.xml, edit the configuration file as follows:

<context>
    <books>
        ...
        <book file="esb_migration/esb_migration.xml" cspec="22781"/>
        ...

And don't forget to commit that change:

git add .git-ccms
git commit -m "Added cspec ID for esb_migration book"

Now you are ready to regenerate the content spec using GitCCMS, as follows:

$ gitccms cs

If you take a peek at the regenerated content spec file, cspec/<BookName>/contentspec.spec, you can see that the ID is now included:

ID = 22781
# Book_Info.xml content
Title = Migration Guide
Subtitle = Migrating to Red Hat JBoss Fuse 6.1
...

You can now upload this revised content spec to Pressgang using the csprocessor push command, as follows:

$ cd cspec/<BookName>
$ csprocessor push contentspec.spec
...
Content Specification ID: 22781
Revision: 664546
...

Wednesday, 21 May 2014

Integrate Git with Pressgang CCMS using GitCCMS

As explained in the previous post, we recently developed a Python script called GitCCMS, which functions as a bridge between the git revision control system and the Pressgang CCMS server (Red Hat's internal content management system). At the moment, the script is alpha quality and is a one-way bridge: you can push content from git to Pressgang CCMS, but there is no support (yet) for pulling content back.

If you want to try it out, the GitCCMS script is available online at GitHub:

https://github.com/fbolton/GitCCMS

The script is useful only if you have access to a Pressgang CCMS server.

Features


GitCCMS has the following features:
  • Supports DocBook 5 and (almost) DocBook 4.
  • If the source doc is DocBook 5, it gets converted to DocBook 4 format on the fly and uploaded as DocBook 4 (this is to optimize compatibility with existing content on Pressgang, which is mostly in DocBook 4 format).
  • Uploads topics and images.
  • Can generate content specs automatically.
  • Supports conditional text.
  • Supports XML entities.
  • Supports olinks (if you create a link between books, the link will be converted to pure text before it is uploaded to Pressgang, so it does not rely on Pressgang to interpret the olinks).

Configuring GitCCMS


The first thing you need to do is to create a configuration file for GitCCMS in your git repository. Assuming that you have a repository, MyGitRepo, the configuration file must be called .git-ccms and must be stored in the root directory of the repository. Here is an example of a .git-ccms file:


<?xml version="1.0" encoding="UTF-8"?>
<context>
    <books>
        <book file="BookA/BookA.xml"/>
        <book file="BookB/BookB.xml"/>
    </books>
    <entities file="Library.ent"/>
    <ignoredirs>
        <dir name="camel"/>
        <dir name="amq"/>
    </ignoredirs>
    <topicelements>
        <element tag="simplesect"/>
        <element tag="section"/>
        <element tag="info"/>
    </topicelements>
    <profiles>
        <profile name="default">
            <project name="Fuse">
                <category name="Assigned Writer">
                    <tag name="johndoe"/>
                    <tag name="janedoe"/>
                </category>
                <category name="Product">
                    <tag name="JBoss Fuse"/>
                    <tag name="JBoss A-MQ"/>
                </category>
                <category name="Release">
                    <tag name="6.1"/>
                </category>
            </project>
            <conditions>
              <condition match="jbossfuse"/>
              <condition match="comment"/>
            </conditions>
        </profile>
    </profiles>
</context>
 
Note the following points about this GitCCMS configuration:
  • The books element is used to list all of the book files in your git repository.
  • The entities element specifies a file containing the definitions of all the XML entities appearing in your books.
  • The ignoredirs element is useful if you are using git submodules. The files under the specified directories will not be considered when it comes to creating or updating topics and images (but they are considered for the purpose of compiling tables of cross-references).
  • The topicelements element is used to specify which DocBook element types are mapped to topics in the Pressgang CCMS. It is probably best to set this element exactly as shown above. It is not really very flexible at the moment.
  • The conditions tag is used to specify the conditions that are enabled when you are using conditional text (this is a standard DocBook feature; see the example after this list).
  • The project/category/tag elements are used to specify what tags are assigned to newly created topics.
  • Some of the configuration settings are specified inside a profile element. In the future, this should allow you to switch configurations easily. But at the moment, only one profile should be used and it must be called default.
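For example, with the jbossfuse condition enabled as shown above, a standard DocBook profiling attribute like the following would be included in the output (this snippet is mine, not taken from the repository):

<para condition="jbossfuse">
  This paragraph appears only in the JBoss Fuse edition of the book.
</para>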

Using GitCCMS


Assuming that you have configured GitCCMS as described above, you are now ready to start using GitCCMS. If you have not done so already, you can add the gitccms script to your PATH. For example, on a Linux or UNIX platform:

export PATH=<GitCCMSInstall>/bin:$PATH

You can push your DocBook source from the git repository up to the Pressgang server by entering the following command (which must be executed from the top-level directory of your git repository):

gitccms push

By default, this command pushes content to Red Hat's development server (used for testing only). Alternatively, you can specify the Pressgang hostname explicitly using the --host option. If the script completes successfully, you should see that a new hidden file, .git-ccms-commit, is created. This file records the SHA of the last commit that was synchronized with Pressgang. GitCCMS uses this SHA to figure out what has changed between synchronizations.
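Incidentally, this file makes it easy to check what is pending. Assuming it contains a single commit SHA (which is my reading of the script), you can do something like:

cat .git-ccms-commit
git log --oneline `cat .git-ccms-commit`..HEAD

The second command lists the commits that have not yet been pushed to Pressgang.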

Generating content specs


After uploading your documentation to the Pressgang server, you can generate content specs for all of your books by entering the following command:

gitccms cs

You can find the generated specs under the cspec/ directory. You can upload the content specs to Pressgang CCMS using the standard Pressgang csprocessor client (not part of GitCCMS).

Monday, 19 May 2014

Migrating to Pressgang CCMS

When FuseSource was acquired by Red Hat, back in September 2012, we brought with us half a dozen git repos containing our documentation source in DocBook 5 format. We soon discovered that Red Hat had quite a different documentation infrastructure and tool chain from the homegrown tools we used at FuseSource. In particular, a Content Management System (CMS) called Pressgang CCMS (a topic-based CMS) was shaping up to be the future of documentation at Red Hat.

Initially, migrating to Pressgang CCMS posed some challenges that we were not in a position to solve right away. So, we put the migration off for a while. Now that we have released JBoss Fuse 6.1, we have the time to tackle it at last. Since we first looked at Pressgang CCMS a year and a half ago, a lot of things have changed. In particular, it is now much easier to import documentation into Pressgang using its built-in import tool. Support for DocBook 5, XML entities, and conditional text has also been added, which makes migration a lot easier.

But we found it hard to face saying 'goodbye' to git revision control. Git offers us a number of features not available in Pressgang CCMS. Conversely, Pressgang CCMS also provides features not available in git.

Advantages of using Git

Here are some of the advantages that git offers us when managing documentation:
  • Branching
  • Ability to detect and merge conflicting changes made by different
    writers to the same topics.
  • Commits [that is, the capability to save a set of related changes in
    multiple topics in one atomic operation]
  • Simplified workflow for integrating and merging upstream Community docs,
    when those docs are in DocBook format.
  • Cherry-picking bug fixes across branches
  • Ability to grep/sed/find/apply any script to the entire doc set in a
    single operation, when needed.
  • Ability to commit changes when working offline.

Advantages of using Pressgang

On the other hand, Pressgang has some pretty nice features too:
  • Ability to share topics across all Red Hat products
  • Querying for metadata on topic re-use
  • Automatic book builds and continuous book building
  • Integration with Zanata translation software and localization processes
  • Centralized quality checks [spelling, link-checking, compliance with style guide]

Having it all

The solution to this dilemma, evidently, is to have your cake and eat it too. What if we had a tool that created a bridge between git and Pressgang CCMS, so that our documentation source could be moved freely back and forth between them? We have been working on such a tool over the past few months: a Python script called GitCCMS. That will be the subject of the next blog post.

Tuesday, 30 April 2013

Documentation and Version Control

In this post I'm going to take a look at the version control requirements for storing and archiving documentation. It's worth considering what those requirements are, because they are not identical to the version control requirements for developing application code. Documentation requires many, but not all, of the features offered by classic developer-oriented revision control systems. On the other hand, many commercial Content Management Systems (CMS) do not offer the kind of flexibility that is required to maintain a professional documentation repository.

Here are the features I consider essential for a documentation-friendly revision control system:
  • Resolving collisions
  • Atomic commits
  • Revert/undo commits
  • Diffs between commits
  • Branching
  • Sub-projects
And here are some features I consider nice-to-have:
  • Merging branches
  • Cherry picking
And, finally, a non-requirement:
  • Tagging

Resolving Collisions

If there is more than one person on your docs team, it is reasonable to suppose that, sooner or later, you are going to collaborate on a document. In this case, it would be supremely annoying if you both happen to submit changes to the same file at the same time, and the documentation system simply overwrites one of the versions. The documentation system must therefore have some means of detecting and resolving such collisions. This is especially important if writers want to work offline (and they usually do), because collisions are then more likely to occur.

Revision control systems usually tackle this problem in one of two ways: either by locking the file (so that one writer gets exclusive rights to update the file for as long as the lock is held) or by merging (where the revision control system requires you to merge the changes previously made by other users before you can upload your own changes).
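Git takes the merging approach. A typical collision looks something like this (the file name is hypothetical):

git pull
# CONFLICT (content): Merge conflict in installation_guide.xml
# Edit the file to resolve the conflict markers, then conclude the merge:
git add installation_guide.xml
git commit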

Atomic Commits

An atomic commit means that you can submit multiple changes to multiple files in a single commit step. This has several advantages:
  • The commit log is much less cluttered, because you can group several changes into a single commit entry. 
  • It helps to keep the docs in a consistent state. For example, if you change the ID of a link target, this might break one or more cross-references, and it might take multiple subsequent changes to multiple files to fix all of the broken links. If you can roll all of these changes into a single commit, you can ensure that the cross-references remain unbroken, both before and after the commit (see the sketch below).
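For example, in Git, the ID change and all of the consequent link fixes can be rolled into one commit (the file names are hypothetical):

git add overview.xml installing.xml configuring.xml
git commit -m "Changed ID 'intro' to 'install-intro' and fixed cross-references"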

Revert/Undo Commits

From time to time we all make mistakes, so the capability to undo a commit is a welcome feature. Strictly speaking, Git does not allow you to undo a commit, but it enables you to commit the inverse of a previous commit, which amounts to the same thing.
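In Git, the relevant command is git revert, which creates a new commit applying the inverse of the named commit (the SHA here is hypothetical):

git revert 4f2a9c1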

Diffs between Commits

Diffs between commits are every bit as useful for technical writers as they are for developers. They enable you to keep track of changes made by your collaborators; and they enable you to keep track of your own changes. In fact, the ability to make diffs between commits is one of the major reasons for keeping a version history in the first place.
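In Git, for example (the branch and directory names are hypothetical):

# Compare the docs between two release branches
git diff 1.2..1.3 -- camel_dev/
# Review your own uncommitted changes
git diff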

Branching

There are various ways you can put branches to good use in a revision control system. The most important application of branching, from a documentation perspective, is for tracking past versions of the documentation.

In a documentation repository, it is natural to create a separate branch for each release. So, for example, you might have branches for versions 1.0, 1.1, 1.2, 1.3, 2.0, and 2.1 of your documentation. Anytime you need to go back, say, to fix an error or to add a release note to an earlier version, all you need to do is to check out the relevant branch, make the updates, and re-publish the docs from that branch. Moreover, sometimes fixes or enhancements made to an earlier version can also be applied to the current version (or vice versa) and it is particularly nice, if you have the option of cherry-picking the updates between branches.

This is a basic requirement if you intend to do any maintenance at all on older documentation versions (and given that your customers are likely to have support agreements for the older products, it seems pretty much inevitable).
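In Git, the maintenance workflow sketched above looks something like this (the branch name and commit message are hypothetical):

git checkout 1.3
# Fix the error or add the release note, then:
git commit -am "Added a patch note to the 1.3 release notes"
# Re-publish the docs from the tip of the 1.3 branch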


Sub-Projects

In a complex product, it is likely that you will need to use sub-projects at some point (that is, a mechanism that enables you to combine several repositories into a single repository). This can become necessary if a product consists of multiple sub-products, so that the corresponding library is created by composing several sub-libraries.

The kinds of mechanism you can use to implement sub-projects include svn:external references in SVN and submodules in Git.

Although Git is, in most respects, a wonderful revision control system, its implementation of submodules does suffer from a couple of drawbacks:

  • You cannot mount an arbitrary sub-directory of the external sub-project in the parent project (as you can in SVN), only the root directory.
  • Whenever you update the parent directory (for example, by doing a git pull), Git does not automatically update the submodules. This is probably the correct policy for working with application code, where you need to be conservative about updating dependencies, in case you break the code. But in the case of documentation, you normally want the submodules to point at the latest commit in the referenced branch. It is a nuisance to have to constantly update the submodules manually and then check those updates into the parent project (though see the sketch after this list).
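That said, in recent versions of Git (1.8.2 or later, if I remember correctly) the pain can at least be reduced with the --remote option (the submodule paths here are hypothetical):

# Move every submodule to the tip of its tracked branch
git submodule update --remote
# The new submodule commits still have to be checked in to the parent project
git add camel amq
git commit -m "Bumped submodules to the latest upstream commits"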

The fundamental reason why sub-projects are needed is because sub-products naturally evolve at different rates, and you normally need to pick specific versions of the sub-products to assemble a complex product. Using a sub-project mechanism enables you to mix and match sub-product versions at will. (You might think it is also possible to lump all of the sub-products into a single repository, but this has the serious limitation that you can only work with a single combination of product versions. If you also need to release another product that uses a different combination of sub-product versions, this approach becomes completely unworkable.)


Merging Branches

I hesitated before putting merging branches into the nice-to-have category. You might prefer to categorise it as must-have, and I won't argue with you. But if you don't have a merge capability, I think you can mostly work around it, in the context of documentation. The most important use of branches in documentation is for tracking different versions of a library, and these kinds of branches would normally never need to be merged.

Of course, just because the capability to merge branches is not an absolute necessity does not mean that you need no merge capability at all. You certainly need to be able to merge commits in order to resolve conflicts between different writers working on the same branch.

Cherry-Picking

Cherry-picking is the ability to apply the same commit to more than one branch. In Git, for example, the procedure is incredibly easy. You just make the changes to one branch; commit them; then check out another branch and apply the commit to this branch as well (in my Git UI, I can right-click on a commit and select Cherry Pick to apply it to the currently checked out branch).
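For the record, the command-line equivalent is just as easy (the branch name and SHA are hypothetical):

git checkout 1.2
git cherry-pick 8d3e51f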

Tagging

Contrary to what you might think, tagging is not really necessary in a documentation repository.

For years, my team-mates and I dutifully tagged the repository every time a documentation library was released. In the early years we did it in SVN, and now we do it in Git. But recently I realised that we never used any of these revision tags, not even once.

This is because, in practice, you are only ever interested in accessing the tip of a branch, not the tagged commit. For example, if I create a branch for a 1.3 release, the tip of this branch will always have the version of the docs that I want to use for re-publishing the 1.3 docs. If I correct some errors in the docs, update the release notes with some patch information, and so on, this will always be available at the tip of the branch. The tag that might have been created the first time the library was released is of absolutely no interest: it references an earlier commit in the branch, which is now out of date.

Wednesday, 22 August 2012

The Security Token Service

With the release of Fuse ESB Enterprise 7.0.1, the Web Services Security Guide (for Apache CXF) has been expanded to cover the Security Token Service (STS).

A full implementation of the STS was recently added to the Apache CXF codebase and this implementation has a highly modular and customisable architecture, as you can see from the following architecture overview:

[Figure: STS architecture overview]

For example, the token Issue operation can be customised by plugging in a SAMLTokenProvider or an SCTProvider (secure conversation token provider); and the token Validate operation can be customised by plugging in one of the token validators, SAMLTokenValidator, UsernameTokenValidator, X509TokenValidator, or SCTTokenValidator.
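For illustration only, here is roughly what plugging a SAMLTokenProvider into the Issue operation looks like in Spring configuration (a sketch based on the CXF STS class names, not a complete working configuration):

<bean id="samlTokenProvider"
      class="org.apache.cxf.sts.token.provider.SAMLTokenProvider"/>

<bean id="issueOperation"
      class="org.apache.cxf.sts.operation.TokenIssueOperation">
    <property name="tokenProviders">
        <list>
            <ref bean="samlTokenProvider"/>
        </list>
    </property>
</bean>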

The STS implementation has a number of special features, including:

  • Support for embedding Claims data in issued tokens.
  • Support for the AppliesTo policy (which enables you to centralise token issuing requirements).
  • Support for security realms.

These are all described in the new doc, in The Security Token Service chapter.

Tuesday, 17 July 2012

New FAB Videos

Recently, I have worked on producing a couple of videos that explain Fuse Application Bundles (FABs). A FAB is basically a new way of deploying applications into an OSGi container that can make your life a whole lot easier. This technology has been developed by my engineering colleagues at FuseSource and is open sourced at GitHub.

If you have ever built and deployed OSGi bundles using Maven, you might have experienced the frustration of adding a whole lot of package dependencies into the Maven bundle plugin. You have already specified all of your dependencies as Maven dependencies, and here you are doing it all over again! Is it really necessary? Well, if you are using FABs, it's not. The key idea of FABs is to leverage the existing Maven dependency metadata and use that metadata to figure out the requisite OSGi package dependencies.
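For comparison, here is the kind of duplication a conventional OSGi build forces on you in the maven-bundle-plugin configuration (the package names are purely illustrative):

<plugin>
    <groupId>org.apache.felix</groupId>
    <artifactId>maven-bundle-plugin</artifactId>
    <configuration>
        <instructions>
            <!-- Re-stating dependencies that the POM already declares -->
            <Import-Package>
                org.apache.camel*,
                org.apache.activemq*,
                *
            </Import-Package>
        </instructions>
    </configuration>
</plugin>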

The first video explains this basic concept and also explains the difference between shared and non-shared dependencies in a FAB project:

[Video: Fuse Application Bundles, shared and non-shared dependencies]

As we started to use FABs in practical applications, it soon became clear how important it is to distinguish between dependencies already provided by the container and other artifacts. Recently, our engineering team has done a lot of work to make FABs smarter, so that they can recognise provided dependencies automatically.

The second video shows a practical example of how to prepare a Maven project for FAB deployment and explains the importance of setting the dependency's <scope> tag correctly:

[Video: preparing a Maven project for FAB deployment]
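As a rough sketch of the idea (my own example, not taken from the video): if I understand the convention correctly, marking a dependency with provided scope tells the FAB resolver that the container already supplies the artifact, so it is treated as shared rather than bundled into your application:

<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-core</artifactId>
    <!-- provided scope: the container is expected to supply this artifact -->
    <scope>provided</scope>
</dependency>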