Wednesday, 30 July 2014

Generating PDFs with dblatex on Fedora

Recently, I experimented with converting DocBook to PDF using dblatex and this post summarizes my experience of installing and running dblatex on Fedora 20. But first I want to consider the motivation for doing this: why should anyone be interested in dblatex in the first place?

If you write your documentation in DocBook and you want to produce PDF output of a professional standard, there are a limited number of options available to you. For example:
  • FO processors (open source or commercial)
  • HTML-to-PDF converters (open source or commercial)
  • DocBook-to-LaTeX converters (open source)

FO Processors 

In the FO (Formatting Objects) approach, DocBook is converted to an intermediate format, an XML FO file, which is then converted to PDF. This approach was intended to be the canonical tool chain for publishing DocBook as PDF (and is fully supported by the standard DocBook XSL toolkit). Unfortunately, the FO approach seems to have run into difficulties in recent years. An FO processor is a complex tool (for example, see Apache FOP), requiring expert knowledge to fix bugs, add new features, or even just to tweak an FO template. So much so that people are starting to turn away from FO and look for alternatives.
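For reference, the classic FO tool chain boils down to just two commands, something like the following (the stylesheet path is an assumption typical of a Fedora install, and the file names are mine):

xsltproc --output MyBook.fo \
    /usr/share/sgml/docbook/xsl-stylesheets/fo/docbook.xsl MyBook.xml
fop -fo MyBook.fo -pdf MyBook.pdf

The first command applies the DocBook XSL FO stylesheet to produce the intermediate FO file, and the second renders it to PDF with Apache FOP.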

HTML-to-PDF Converters

The HTML-to-PDF converters offer a much simpler approach. These are available both in open source (for example, wkhtmltopdf) and commercial (for example, Prince XML) varieties. In this case, you convert DocBook to html-single format and then use the HTML-to-PDF converter to produce the final PDF output. The output looks a bit like what you would get if you print an HTML page from a browser. In fact, this is no coincidence, because the wkhtmltopdf tool is based on WebKit, which is the engine that the Safari browser uses to print HTML pages. This approach is simple, but offers limited options for fine-tuning the page layout.
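A rough sketch of the wkhtmltopdf tool chain (again, the stylesheet path is an assumption that varies by distribution):

xsltproc --output MyBook.html \
    /usr/share/sgml/docbook/xsl-stylesheets/html/docbook.xsl MyBook.xml
wkhtmltopdf MyBook.html MyBook.pdf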

DocBook-to-LaTeX

Finally, there is the DocBook-to-LaTeX approach. LaTeX occupies something of a niche, since it is mainly used for publishing in academia. But it has some notable strengths:
  1. LaTeX has strong page layout capabilities. The requisite options are not necessarily easy to tweak, but they are there if you need them.
  2. LaTeX has strong localisation support, through the Babel package. In particular, right-to-left languages, such as Hebrew and Arabic, are supported through the ArabTeX package.
  3. LaTeX has a long history and a large user base, which suggests that this tool is not going to disappear anytime soon.
And it is also noteworthy (although not really relevant to documentation in the software industry) that LaTeX is superlatively good at typesetting mathematical equations. Really, nothing else comes close.
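For example, one of Maxwell's equations takes only a few lines of LaTeX source to typeset beautifully:

\begin{equation}
  \nabla \times \mathbf{B} = \mu_0 \mathbf{J}
      + \mu_0 \epsilon_0 \frac{\partial \mathbf{E}}{\partial t}
\end{equation}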

Install and Run dblatex

It is straightforward enough to install and run dblatex, but there are a few gotchas to watch out for. If you are lucky enough to have a Fedora system, installing dblatex is easy with yum:

sudo yum install dblatex

Now, if you have a straightforward DocBook book file, MyBook.xml, you can convert it to PDF with this simple command:

dblatex MyBook.xml

The output you get is a PDF file, MyBook.pdf (the intermediate files are automatically deleted at the end of the processing). However, converting my sample book (the Camel Development Guide) was a bit more complicated, because our documents use olink elements for cross-referencing between books. If you are familiar with the DocBook XSL toolkit, you will know that this means you need to generate link database files and a site layout file (site.xml). Assuming you already have these files (they can be generated using DocBook XSL), you need to specify the following properties to the dblatex command:
  • target.database.document — set this to the absolute location of the site.xml file.
  • current.docid — set this to the value of your book element's xml:id attribute.
So, for the Camel Development Guide, I needed to enter the following dblatex command:

dblatex -P target.database.document=`pwd`/site.xml -P current.docid=CamelDev camel_dev/camel_dev.xml

(Note the use of `pwd` to get the absolute path to site.xml. This should not really be necessary; it's a minor bug in the dblatex tool.) So, what happened after running this command? Well, dblatex converted the 449-page document without a hitch, which is a good start. On the other hand, what you get initially is the 'standard' LaTeX document look-and-feel (if you have ever seen the output of a typical LaTeX document, you'll know what I mean). So, the next question that presents itself is: how can I customize the layout and styling so that it has a more professional look-and-feel? I don't know the answer to that question yet, but it might well be the subject of a later blog post.
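One promising avenue, judging by the dblatex documentation, is the -s (or --texstyle) option, which lets you supply a custom LaTeX style file. A minimal sketch, assuming a hypothetical style file called mystyle.sty:

dblatex -s mystyle.sty -P target.database.document=`pwd`/site.xml \
    -P current.docid=CamelDev camel_dev/camel_dev.xml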

Friday, 6 June 2014

GitCCMS Workflow for Updating Content Specs

In Pressgang, a content spec is an outline (like a table of contents) that assembles topics into a complete document. You can think of it as being analogous to a DITA map. As mentioned in a previous post, the GitCCMS utility can generate the content spec for you, if you enter the following command:

$ gitccms cs

GitCCMS puts the generated content spec file into the location cspec/<BookName>/contentspec.spec. You can then upload the new content spec using the standard Pressgang utility, csprocessor, as follows:

$ cd cspec/<BookName>
$ csprocessor create contentspec.spec
...
Content Specification ID: 22781
Revision: 664539
...

If the content specification is successfully created, the new content spec ID is logged to the console (in this case, 22781).

But what if you make some changes to the structure of the book in your git repository and you need to update the content spec? How do you go about that using GitCCMS? First of all, you need to store the new content spec ID in your .git-ccms configuration file. Using your favourite text editor, open the .git-ccms file and add a cspec attribute to the relevant book element. For example, if you had just created the content spec for the book, esb_migration.xml, edit the configuration file as follows:

<context>
    <books>
        ...
        <book file="esb_migration/esb_migration.xml" cspec="22781"/>
        ...

And don't forget to commit that change:

git add .git-ccms
git commit -m "Added cspec ID for esb_migration book"

Now you are ready to regenerate the content spec using GitCCMS, as follows:

$ gitccms cs

If you take a peek at the regenerated content spec file, cspec/<BookName>/contentspec.spec, you can see that the ID is now included:

ID = 22781
# Book_Info.xml content
Title = Migration Guide
Subtitle = Migrating to Red Hat JBoss Fuse 6.1
...

You can now upload this revised content spec to Pressgang using the csprocessor push command, as follows:

$ cd cspec/<BookName>
$ csprocessor push contentspec.spec
...
Content Specification ID: 22781
Revision: 664546
...

Wednesday, 21 May 2014

Integrate Git with Pressgang CCMS using GitCCMS

As explained in the previous post, we recently developed a Python script called GitCCMS, which functions as a bridge between the git revision control system and the Pressgang CCMS server (Red Hat's internal content management system). At the moment, the script is alpha quality and is a one-way bridge: you can push content from git to Pressgang CCMS, but there is no support (yet) for pulling content back.

If you want to try it out, the GitCCMS script is available online at GitHub:

https://github.com/fbolton/GitCCMS

The script is useful only if you have access to a Pressgang CCMS server.

Features


GitCCMS has the following features:
  • Supports DocBook 5 and (almost) DocBook 4.
  • If the source doc is DocBook 5, it gets converted to DocBook 4 format on the fly and uploaded as DocBook 4 (this is to optimize compatibility with existing content on Pressgang, which is mostly in DocBook 4 format).
  • Uploads topics and images.
  • Can generate content specs automatically.
  • Supports conditional text.
  • Supports XML entities.
  • Supports olinks (if you create a link between books, the link will be converted to pure text before it is uploaded to Pressgang, so it does not rely on Pressgang to interpret the olinks).

Configuring GitCCMS


The first thing you need to do is to create a configuration file for GitCCMS in your git repository. Assuming that you have a repository, MyGitRepo, the configuration file must be called .git-ccms and must be stored in the root directory of the repository. Here is an example of a .git-ccms file:


<?xml version="1.0" encoding="UTF-8"?>
<context>
    <books>
        <book file="BookA/BookA.xml"/>
        <book file="BookB/BookB.xml"/>
    </books>
    <entities file="Library.ent"/>
    <ignoredirs>
        <dir name="camel"/>
        <dir name="amq"/>
    </ignoredirs>
    <topicelements>
        <element tag="simplesect"/>
        <element tag="section"/>
        <element tag="info"/>
    </topicelements>
    <profiles>
        <profile name="default">
            <project name="Fuse">
                <category name="Assigned Writer">
                    <tag name="johndoe"/>
                    <tag name="janedoe"/>
                </category>
                <category name="Product">
                    <tag name="JBoss Fuse"/>
                    <tag name="JBoss A-MQ"/>
                </category>
                <category name="Release">
                    <tag name="6.1"/>
                </category>
            </project>
            <conditions>
              <condition match="jbossfuse"/>
              <condition match="comment"/>
            </conditions>
        </profile>
    </profiles>
</context>
 
Note the following points about this GitCCMS configuration:
  • The books element is used to list all of the book files in your git repository.
  • The entities element specifies a file containing the definitions of all the XML entities appearing in your books.
  • The ignoredirs element is useful if you are using git submodules. The files under the specified directories will not be considered when it comes to creating or updating topics and images (but they are considered for the purpose of compiling tables of cross-references).
  • The topicelements element is used to specify which DocBook element types are mapped to topics in the Pressgang CCMS. It is probably best to set this element exactly as shown above. It is not really very flexible at the moment.
  • The conditions tag is used to specify the conditions that are enabled when you are using conditional text (this is a standard DocBook feature; see the example after this list).
  • The project/category/tag elements are used to specify what tags are assigned to newly created topics.
  • Some of the configuration settings are specified inside a profile element. In the future, this should allow you to switch configurations easily. But at the moment, only one profile should be used and it must be called default.
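For example, with the jbossfuse condition enabled as shown above, a standard DocBook profiling attribute like the following would be included in the output (this snippet is mine, not taken from the repository):

<para condition="jbossfuse">
  This paragraph appears only in the JBoss Fuse edition of the book.
</para>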

Using GitCCMS


Assuming that you have configured GitCCMS as described above, you are now ready to start using GitCCMS. If you have not done so already, you can add the gitccms script to your PATH. For example, on a Linux or UNIX platform:

export PATH=<GitCCMSInstall>/bin:$PATH

You can push your DocBook source from the git repository up to the Pressgang server by entering the following command (which must be executed from the top-level directory of your git repository):

gitccms push

By default, this command pushes content to Red Hat's development server (used for testing only). Alternatively, you can specify the Pressgang hostname explicitly using the --host option. If the script completes successfully, you should see that a new hidden file, .git-ccms-commit, is created. This file records the SHA of the last commit that was synchronized with Pressgang. GitCCMS uses this SHA to figure out what has changed between synchronizations.
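Incidentally, this file makes it easy to check what is pending. Assuming it contains a single commit SHA (which is my reading of the script), you can do something like:

cat .git-ccms-commit
git log --oneline `cat .git-ccms-commit`..HEAD

The second command lists the commits that have not yet been pushed to Pressgang.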

Generating content specs


After uploading your documentation to the Pressgang server, you can generate content specs for all of your books by entering the following command:

gitccms cs

You can find the generated specs under the cspec/ directory. You can upload the content specs to Pressgang CCMS using the standard Pressgang csprocessor client (not part of GitCCMS).

Monday, 19 May 2014

Migrating to Pressgang CCMS

When FuseSource was acquired by Red Hat, back in September 2012, we brought with us half a dozen git repos containing our documentation source in DocBook 5 format. We soon discovered that Red Hat had quite a different documentation infrastructure and tool chain from the homegrown tools we used at FuseSource. In particular, a Content Management System (CMS) called Pressgang CCMS (a topic-based CMS) was shaping up to be the future of documentation at Red Hat.

Initially, migrating to Pressgang CCMS posed some challenges that we were not in a position to solve right away. So, we put the migration off for a while. Now that we have released JBoss Fuse 6.1, we have the time to tackle it at last. Since we first looked at Pressgang CCMS a year and a half ago, a lot of things have changed. In particular, it is now much easier to import documentation into Pressgang using its built-in import tool. Support for DocBook 5, XML entities, and conditional text has also been added, which makes migration a lot easier.

But we found it hard to face saying 'goodbye' to git revision control. Git offers us a number of features not available in Pressgang CCMS. Conversely, Pressgang CCMS also provides features not available in git.

Advantages of using Git

Here are some of the advantages that git offers us when managing documentation:
  • Branching
  • Ability to detect and merge conflicting changes made by different
    writers to the same topics.
  • Commits [that is, the capability to save a set of related changes in
    multiple topics in one atomic operation]
  • Simplified workflow for integrating and merging upstream Community docs,
    when those docs are in DocBook format.
  • Cherry-picking bug fixes across branches
  • Ability to grep/sed/find/apply any script to the entire doc set in a
    single operation, when needed.
  • Ability to commit changes when working offline.

Advantages of using Pressgang

On the other hand, Pressgang has some pretty nice features too:
  • Ability to share topics across all Red Hat products
  • Querying for metadata on topic re-use
  • Automatic book builds and continuous book building
  • Integration with Zanata translation software and localization processes
  • Centralized quality checks [spelling, link-checking, compliance with style guide]

Having it all

The solution to this dilemma, evidently, is to have your cake and eat it too. What if we had a tool that created a bridge between git and Pressgang CCMS, so that our documentation source could be moved freely back and forth between them? We have been working on such a tool over the past few months: a Python script called GitCCMS. That will be the subject of the next blog post.

Tuesday, 30 April 2013

Documentation and Version Control

In this post I'm going to take a look at the version control requirements for storing and archiving documentation. It's worth considering what those requirements are, because they are not identical to the version control requirements for developing application code. Documentation requires many, but not all, of the features offered by classic developer-oriented revision control systems. On the other hand, many commercial Content Management Systems (CMS) do not offer the kind of flexibility that is required to maintain a professional documentation repository.

Here are the features I consider essential for a documentation-friendly revision control system:
  • Resolving collisions
  • Atomic commits
  • Revert/undo commits
  • Diffs between commits
  • Branching
  • Sub-projects
And here are some features I consider nice-to-have:
  • Merging branches
  • Cherry picking
And, finally, a non-requirement:
  • Tagging

Resolving Collisions

If there is more than one person on your docs team, it is reasonable to suppose that, sooner or later, you are going to collaborate on a document. In this case, it would be supremely annoying if you both happen to submit changes to the same file at the same time, and the documentation system simply overwrites one of the versions. The documentation system must therefore have some means of detecting and resolving such collisions. This is especially important if writers want to work offline (and they usually do), because collisions are then more likely to occur.

Revision control systems usually tackle this problem in one of two ways: either by locking the file (so that one writer gets exclusive rights to update the file for as long as the lock is held) or by merging (where the revision control system requires you to merge the changes previously made by other users before you can upload your own changes).
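Git takes the merging approach. A typical collision looks something like this (the file name is hypothetical):

git pull
# CONFLICT (content): Merge conflict in installation_guide.xml
# Edit the file to resolve the conflict markers, then conclude the merge:
git add installation_guide.xml
git commit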

Atomic Commits

An atomic commit means that you can submit multiple changes to multiple files in a single commit step. This has several advantages:
  • The commit log is much less cluttered, because you can group several changes into a single commit entry. 
  • It helps to keep the docs in a consistent state. For example, if you change the ID of a link target, this might break one or more cross-references, and it might take multiple subsequent changes to multiple files to fix all of the broken links. If you can roll all of these changes into a single commit, you can ensure that the cross-references remain unbroken, both before and after the commit (see the sketch below).
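For example, in Git, the ID change and all of the consequent link fixes can be rolled into one commit (the file names are hypothetical):

git add overview.xml installing.xml configuring.xml
git commit -m "Changed ID 'intro' to 'install-intro' and fixed cross-references"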

Revert/Undo Commits

From time to time we all make mistakes, so the capability to undo a commit is a welcome feature. Strictly speaking, Git does not allow you to undo a commit, but it enables you to commit the inverse of a previous commit, which amounts to the same thing.
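In Git, the relevant command is git revert, which creates a new commit applying the inverse of the named commit (the SHA here is hypothetical):

git revert 4f2a9c1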

Diffs between Commits

Diffs between commits are every bit as useful for technical writers as they are for developers. They enable you to keep track of changes made by your collaborators; and they enable you to keep track of your own changes. In fact, the ability to make diffs between commits is one of the major reasons for keeping a version history in the first place.
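In Git, for example (the branch and directory names are hypothetical):

# Compare the docs between two release branches
git diff 1.2..1.3 -- camel_dev/
# Review your own uncommitted changes
git diff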

Branching

There are various ways you can put branches to good use in a revision control system. The most important application of branching, from a documentation perspective, is for tracking past versions of the documentation.

In a documentation repository, it is natural to create a separate branch for each release. So, for example, you might have branches for versions 1.0, 1.1, 1.2, 1.3, 2.0, and 2.1 of your documentation. Anytime you need to go back, say, to fix an error or to add a release note to an earlier version, all you need to do is to check out the relevant branch, make the updates, and re-publish the docs from that branch. Moreover, sometimes fixes or enhancements made to an earlier version can also be applied to the current version (or vice versa) and it is particularly nice, if you have the option of cherry-picking the updates between branches.

This is a basic requirement if you intend to do any maintenance at all on older documentation versions (and given that your customers are likely to have support agreements for the older products, it seems pretty much inevitable).
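In Git, the maintenance workflow sketched above looks something like this (the branch name and commit message are hypothetical):

git checkout 1.3
# Fix the error or add the release note, then:
git commit -am "Added a patch note to the 1.3 release notes"
# Re-publish the docs from the tip of the 1.3 branch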


Sub-Projects

In a complex product, it is likely that you will need to use sub-projects at some point (that is, a mechanism that enables you to combine several repositories into a single repository). This can become necessary if a product consists of multiple sub-products, so that the corresponding library is created by composing several sub-libraries.

The kinds of mechanism you can use to implement sub-projects include svn:external references in SVN and submodules in Git.

Although Git is, in most respects, a wonderful revision control system, its implementation of submodules does suffer from a couple of drawbacks:

  • You cannot mount an arbitrary sub-directory of the external sub-project in the parent project (as you can in SVN), only the root directory.
  • Whenever you update the parent directory (for example, by doing a git pull), Git does not automatically update the submodules. This is probably the correct policy for working with application code, where you need to be conservative about updating dependencies, in case you break the code. But in the case of documentation, you normally want the submodules to point at the latest commit in the referenced branch. It is a nuisance to have to constantly update the submodules manually and then check those updates into the parent project (though see the sketch after this list).
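That said, in recent versions of Git (1.8.2 or later, if I remember correctly) the pain can at least be reduced with the --remote option (the submodule paths here are hypothetical):

# Move every submodule to the tip of its tracked branch
git submodule update --remote
# The new submodule commits still have to be checked in to the parent project
git add camel amq
git commit -m "Bumped submodules to the latest upstream commits"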

The fundamental reason why sub-projects are needed is because sub-products naturally evolve at different rates, and you normally need to pick specific versions of the sub-products to assemble a complex product. Using a sub-project mechanism enables you to mix and match sub-product versions at will. (You might think it is also possible to lump all of the sub-products into a single repository, but this has the serious limitation that you can only work with a single combination of product versions. If you also need to release another product that uses a different combination of sub-product versions, this approach becomes completely unworkable.)


Merging Branches

I hesitated before putting merging branches into the nice-to-have category. You might prefer to categorise it as must-have, and I won't argue with you. But if you don't have a merge capability, I think you can mostly work around it, in the context of documentation. The most important use of branches in documentation is for tracking different versions of a library, and these kinds of branches would normally never need to be merged.

Of course, just because the capability to merge branches is not an absolute necessity does not mean that you need no merge capability at all. You certainly need to be able to merge commits in order to resolve conflicts between different writers working on the same branch.

Cherry-Picking

Cherry-picking is the ability to apply the same commit to more than one branch. In Git, for example, the procedure is incredibly easy. You just make the changes to one branch; commit them; then check out another branch and apply the commit to this branch as well (in my Git UI, I can right-click on a commit and select Cherry Pick to apply it to the currently checked out branch).
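For the record, the command-line equivalent is just as easy (the branch name and SHA are hypothetical):

git checkout 1.2
git cherry-pick 8d3e51f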

Tagging

Contrary to what you might think, tagging is not really necessary in a documentation repository.

For years, my team-mates and I dutifully tagged the repository every time a documentation library was released. In the early years we did it in SVN, and now we do it in Git. But recently I realised that we never used any of these revision tags, not even once.

This is because, in practice, you are only ever interested in accessing the tip of a branch, not the tagged commit. For example, if I create a branch for a 1.3 release, the tip of this branch will always have the version of the docs that I want to use for re-publishing the 1.3 docs. If I correct some errors in the docs, update the release notes with some patch information, and so on, this will always be available at the tip of the branch. The tag that might have been created the first time the library was released is of absolutely no interest: it references an earlier commit in the branch, which is now out of date.

Wednesday, 22 August 2012

The Security Token Service

With the release of Fuse ESB Enterprise 7.0.1, the Web Services Security Guide (for Apache CXF) has been expanded to cover the Security Token Service (STS).

A full implementation of the STS was recently added to the Apache CXF codebase and this implementation has a highly modular and customisable architecture, as you can see from the following architecture overview:

[Figure: STS architecture overview]

For example, the token Issue operation can be customised by plugging in a SAMLTokenProvider or an SCTProvider (secure conversation token provider); and the token Validate operation can be customised by plugging in one of the token validators, SAMLTokenValidator, UsernameTokenValidator, X509TokenValidator, or SCTTokenValidator.
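For illustration only, here is roughly what plugging a SAMLTokenProvider into the Issue operation looks like in Spring configuration (a sketch based on the CXF STS class names, not a complete working configuration):

<bean id="samlTokenProvider"
      class="org.apache.cxf.sts.token.provider.SAMLTokenProvider"/>

<bean id="issueOperation"
      class="org.apache.cxf.sts.operation.TokenIssueOperation">
    <property name="tokenProviders">
        <list>
            <ref bean="samlTokenProvider"/>
        </list>
    </property>
</bean>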

The STS implementation has a number of special features, including:

  • Support for embedding Claims data in issued tokens.
  • Support for the AppliesTo policy (which enables you to centralise token issuing requirements).
  • Support for security realms.

These are all described in the new doc, in The Security Token Service chapter.

Tuesday, 17 July 2012

New FAB Videos

Recently, I have worked on producing a couple of videos that explain Fuse Application Bundles (FABs). A FAB is basically a new way of deploying applications into an OSGi container that can make your life a whole lot easier. This technology has been developed by my engineering colleagues at FuseSource and is open sourced at GitHub.

If you have ever built and deployed OSGi bundles using Maven, you might have experienced the frustration of adding a whole lot of package dependencies into the Maven bundle plugin. You have already specified all of your dependencies as Maven dependencies, and here you are doing it all over again! Is it really necessary? Well, if you are using FABs, it's not. The key idea of FABs is to leverage the existing Maven dependency metadata and use that metadata to figure out the requisite OSGi package dependencies.
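For comparison, here is the kind of duplication a conventional OSGi build forces on you in the maven-bundle-plugin configuration (the package names are purely illustrative):

<plugin>
    <groupId>org.apache.felix</groupId>
    <artifactId>maven-bundle-plugin</artifactId>
    <configuration>
        <instructions>
            <!-- Re-stating dependencies that the POM already declares -->
            <Import-Package>
                org.apache.camel*,
                org.apache.activemq*,
                *
            </Import-Package>
        </instructions>
    </configuration>
</plugin>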

The first video explains this basic concept and also explains the difference between shared and non-shared dependencies in a FAB project:

[Video: Fuse Application Bundles, shared and non-shared dependencies]

As we started to use FABs in practical applications, it soon became clear how important it is to distinguish between dependencies already provided by the container and other artifacts. Recently, our engineering team has done a lot of work to make FABs smarter, so that they can recognise provided dependencies automatically.

The second video shows a practical example of how to prepare a Maven project for FAB deployment and explains the importance of setting the dependency's <scope> tag correctly:

[Video: preparing a Maven project for FAB deployment]
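As a rough sketch of the idea (my own example, not taken from the video): if I understand the convention correctly, marking a dependency with provided scope tells the FAB resolver that the container already supplies the artifact, so it is treated as shared rather than bundled into your application:

<dependency>
    <groupId>org.apache.camel</groupId>
    <artifactId>camel-core</artifactId>
    <!-- provided scope: the container is expected to supply this artifact -->
    <scope>provided</scope>
</dependency>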