J. R. Boynton

Coaxing structure and text from Word

Web publishing, done efficiently, requires automation. Automation requires structure. XML and even html require structure. Ideally, creating valid structure is the responsibility of the same person who creates the text. The trick is to design a system that efficiently gives them immediate, accurate feedback about structure – much as they get spell checking in Word now.

Microsoft Word makes it very difficult to get structured content. Word is really designed so a user can just format each bit of text with bold or indents or as a list. There’s no way to prevent a user from doing this, even though you can use styles.

This essay examines various approaches to getting content out of Word in a way that will be useful for web publishing.

The Goal

The minimum goal is output of html structure tags without junk. We want “<p>”, not “<p><font size=2 face=arial color=black>”.

Slightly better would be “<p class=text>”. Even better is the switch to xml-based markup: “<text>”.
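As an illustration of the minimum goal, here is a small Python sketch that reduces junk-laden markup to bare structure tags. The JUNK_TAGS list and the strip_junk function are hypothetical names invented for this example, and the regular expression assumes attribute values never contain a ">" character, so treat it as a sketch rather than a real html cleaner.

```python
import re

# Presentational tags to drop entirely (a hypothetical list for illustration).
JUNK_TAGS = {"font", "span"}

# Matches an opening or closing tag, capturing the slash and the tag name.
# Assumption: attribute values never contain ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def strip_junk(html: str) -> str:
    """Drop presentational tags and remove all attributes from the rest."""

    def clean(match: re.Match) -> str:
        slash, name = match.group(1), match.group(2).lower()
        if name in JUNK_TAGS:
            return ""  # the tag disappears, its text content stays
        return f"<{slash}{name}>"  # keep the tag, discard its attributes

    return TAG_RE.sub(clean, html)


print(strip_junk("<p><font size=2 face=arial color=black>Hello</font></p>"))
# prints: <p>Hello</p>
```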

Word's Save As

For the record, Word’s Save As html was filled with junk. The latest scheme was to include all the formatting usually saved in .rtf files, but hide it in angle brackets (“<formatting junk goes here>”). This let a user save an “html” file and still convert it straight back to a .doc file. Neither of these schemes is helpful for automation.

Internal or External Conversion

Word is highly programmable using Visual Basic for Applications and the Word document object model. The first problem with the document object model is that it’s very slow.

More fundamentally, the conversion should really take place outside of Word. The reason is that you want to be absolutely sure that the conversion is from the most recent change to the document. If the document is open and you write out an html or xml file, you could easily make another change and forget to write it out again.

It is much safer to do the conversion outside of Word. The Word source document should be stored in the content management system, and if the software notices that the time-stamp on the document is more recent than the last conversion, it should insist on converting the document again. It’s very bad to let documents get out of sync.
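The time-stamp check described above might look something like the following Python sketch. The function name needs_conversion and the idea of keeping the converted file beside the source are assumptions made for illustration; a real content management system would more likely record the time of the last conversion in its own database.

```python
import os


def needs_conversion(source_path: str, output_path: str) -> bool:
    """Return True when the Word source is newer than its converted output.

    Assumption: the converted file lives on the same file system as the
    source, so the two modification times are comparable.
    """
    if not os.path.exists(output_path):
        return True  # never converted
    return os.path.getmtime(source_path) > os.path.getmtime(output_path)
```

A CMS could call this before any approval or publishing step and refuse to proceed until the conversion has been run again.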

Doing the conversion outside of Word gives you the potential for a robust, cross-platform solution. It’s just not worth trying to run Word from a CMS application on a Unix system.

Internal or External structure validation

Again, the Word document object model is very slow, and VBA is proprietary. If structure validation is as simple as saving in Word and clicking a link on a web-based form to run the validation, that seems good enough to me.

Visual or Semantic conversion

The biggest question is what to convert. Lists in Word aren’t specific styles. They are just a bullet and an indent applied to whatever style the paragraph already had. You could convert a paragraph with a bullet and an indent to an html list tag. If the next paragraph is still indented but doesn’t have a bullet, you could assume it is part of the list, but not a new item.
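To make the visual approach concrete, here is a rough Python sketch of the guessing rule just described. The Paragraph class is a hypothetical stand-in for whatever the converter extracts from the Word file, and the sketch handles only a single, un-nested level of bulleted list.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Paragraph:
    # Hypothetical model of one Word paragraph, for illustration only.
    text: str
    indented: bool = False
    bulleted: bool = False


def guess_lists(paragraphs):
    """Guess html list structure from bullets and indents (one level only)."""
    out = []
    in_list = False
    for p in paragraphs:
        text = escape(p.text)
        if p.indented and p.bulleted:
            # Bullet plus indent: start a new item, opening the list if needed.
            out.append("</li>" if in_list else "<ul>")
            in_list = True
            out.append(f"<li><p>{text}</p>")
        elif p.indented and in_list:
            # Indented but no bullet: assume it continues the current item.
            out.append(f"<p>{text}</p>")
        else:
            # Anything else ends the list, if one is open.
            if in_list:
                out.extend(["</li>", "</ul>"])
                in_list = False
            out.append(f"<p>{text}</p>")
    if in_list:
        out.extend(["</li>", "</ul>"])
    return "\n".join(out)
```

Even this simple version has to decide what an indented, unbulleted paragraph outside a list means, which is the kind of guess the semantic approach avoids.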

The other approach is to demand users choose a list style in Word. One way is to have styles like “ul_level_1” and “ul_level_2”, for bulleted list items and nested list items. This is un-Word-like, and also un-html-like. Another way to do list styles in Word is to follow the html approach, in which there is a beginning and end tag to a list, and the list items between are indented appropriately, according to the nesting of the begin/end tags. This is definitely a more scalable solution, but it requires programming that isn’t built into Word.

Perhaps the ideal thing would be if Word allowed the template to determine whether to act as a purely visual formatting system, or to enforce the use of styles and structure. Microsoft has enough programmers to do this. Many users might find it more helpful than a paper clip animation. It would require Microsoft to fix the list functionality.

Word finally allows a template to prevent a user from picking a style that isn't in the template. One small step for structure!

Content Creator Cooperation

The good news is that any large-scale publishing system is going to require that users learn some new procedures, and that they cooperate with the system. At the very least, users need to know where to save or email files. Some large-scale publishing systems (such as Vignette) have extremely primitive interfaces – just html forms that you are supposed to paste text into, or even edit text within.

If users must learn new procedures, and they must cooperate, then you can assume cooperation, and not feel apologetic for imposing new procedures. In most cases, users are getting paid to do their work, and are willing to cooperate, anyway, especially when it’s clear that the slightly more complicated procedures are saving them lots of time and aggravation.

So my conclusion is that we should aim for semantic conversion, and not try to guess at visual formatting.

The ambiguity here is still lists. The good news about lists in Word is that nested lists are so difficult to create and maintain, and so easy to break, that they aren’t widely used. It comes down to the basic problem with Word: you can’t actually tell it what to do, the way the begin and end of list tags tell html what to do. Word guesses what you want to do, and does what it guessed. So to work with lists in Word, you always have to guess how to get Word to guess correctly what you want to do. Word makes it easier to do the simplest possible sorts of lists, and at the same time, quite intentionally, makes it very difficult to do anything more than the simplest thing.

You might use a VBA add-in to handle lists in an html-ish way, though perhaps not allowing nested lists. You could insist that users insert hidden paragraphs of style ul_begin, ul_end, etc., and that any paragraph they want to give a bullet or number to should use the li style. The VBA add-in can automate the formatting, and verify that the styles are ordered properly.
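The ordering check is simple enough to sketch. The version below is written in Python rather than VBA, purely to show the logic; the function name check_list_styles is made up for this example, it takes the sequence of paragraph style names in document order, and it assumes nested lists are not allowed.

```python
def check_list_styles(styles):
    """Return a list of problems with ul_begin / li / ul_end ordering.

    Assumption: lists may not be nested, so ul_begin inside an open
    list is reported as an error.
    """
    errors = []
    in_list = False
    for number, style in enumerate(styles, start=1):
        if style == "ul_begin":
            if in_list:
                errors.append(f"paragraph {number}: ul_begin inside an open list")
            in_list = True
        elif style == "ul_end":
            if not in_list:
                errors.append(f"paragraph {number}: ul_end without ul_begin")
            in_list = False
        elif style == "li" and not in_list:
            errors.append(f"paragraph {number}: li outside a list")
    if in_list:
        errors.append("end of document: list never closed with ul_end")
    return errors


print(check_list_styles(["p", "ul_begin", "li", "li", "ul_end", "p"]))
# prints: []
```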

Otherwise, users would pick styles for most paragraphs, but lists would be guessed at, based on indenting and Word’s embedded bullets or numbers. Potentially, you could complain if they had indented and bulleted a paragraph of the wrong style.

Converting structure

The most common approach is to convert selected Word styles to styles desired in the publishing system according to a mapping maintained with the conversion software.
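A mapping of this kind can be as small as a lookup table. In the Python sketch below, STYLE_MAP and convert_with_map are hypothetical names, the input is assumed to be a list of (style name, text) pairs already extracted from the document, and unmapped styles are rejected rather than guessed at.

```python
from html import escape

# Hypothetical mapping from Word style names to output tags.
STYLE_MAP = {
    "Heading 1": "h1",
    "Heading 2": "h2",
    "Body Text": "p",
}


def convert_with_map(paragraphs, style_map=STYLE_MAP):
    """Convert (style, text) pairs to tagged lines using the mapping."""
    lines = []
    for style, text in paragraphs:
        if style not in style_map:
            raise ValueError(f"no mapping for Word style {style!r}")
        tag = style_map[style]
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)
```

The cost of this approach is that the table has to be maintained alongside the converter every time the template changes.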

Another approach is to convert any paragraph or character style in the Word document to a tag of the same name. You could define styles such as h1 or p in Word, and convert them to tags directly.

To me, the second approach seems preferable. You could create a Word template with styles that match an xml dtd. The conversion software doesn’t need to know anything about the desired stylenames; it simply creates the tagged document. Then you can use any xml validator. If the procedure for this is to save the document, and click once in a CMS gui, this strikes me as being efficient and sufficiently user-friendly.
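Here is a minimal Python sketch of the second approach, in which the converter knows nothing about the stylenames and simply turns each one into a tag. The function name styles_to_xml, the root tag "document" and the (style, text) input format are assumptions made for illustration. The sketch only guarantees well-formed output; checking it against the dtd would be a separate step, since Python's standard xml.etree module does not validate against a dtd.

```python
import re
import xml.etree.ElementTree as ET

# A conservative test for names usable as xml tags (ASCII only, no colons).
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*$")


def styles_to_xml(paragraphs, root_tag="document"):
    """Turn (style, text) pairs into xml, using each style name as the tag."""
    root = ET.Element(root_tag)
    for style, text in paragraphs:
        if not NAME_RE.match(style):
            raise ValueError(f"style {style!r} is not a usable tag name")
        ET.SubElement(root, style).text = text
    return ET.tostring(root, encoding="unicode")


print(styles_to_xml([("h1", "Title"), ("p", "A & B")]))
# prints: <document><h1>Title</h1><p>A &amp; B</p></document>
```

A style name with a space in it, such as Word's built-in "Heading 1", is rejected here, which is one reason the template's styles should be named after the tags in the dtd.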

Character Entity conversions

If you copy text out of Word and paste it into an html form, you can get operating system-specific characters such as em-dashes or typographic quotes. The conversion process should replace these platform-specific codes with the standard character entities defined by the W3C.
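The replacement step could be sketched in Python as follows. The ENTITIES table covers only a handful of common characters and the function names are invented for this example. The sketch assumes that markup characters such as "&" and "<" are escaped elsewhere, and that raw bytes pasted from Windows are in the cp1252 code page.

```python
# Named html entities for punctuation that Word commonly inserts.
ENTITIES = {
    "\u2014": "&mdash;",   # em dash
    "\u2013": "&ndash;",   # en dash
    "\u2018": "&lsquo;",   # left single quote
    "\u2019": "&rsquo;",   # right single quote
    "\u201c": "&ldquo;",   # left double quote
    "\u201d": "&rdquo;",   # right double quote
    "\u2026": "&hellip;",  # ellipsis
}


def to_entities(text: str) -> str:
    """Replace non-ASCII characters with named or numeric entities."""
    out = []
    for ch in text:
        if ch in ENTITIES:
            out.append(ENTITIES[ch])
        elif ord(ch) > 127:
            out.append(f"&#{ord(ch)};")  # numeric reference as a fallback
        else:
            out.append(ch)
    return "".join(out)


def from_windows_bytes(raw: bytes) -> str:
    """Decode Windows cp1252 bytes, then convert to entities."""
    return to_entities(raw.decode("cp1252"))


print(from_windows_bytes(b"\x93Hi\x94 \x97 ok"))
# prints: &ldquo;Hi&rdquo; &mdash; ok
```

If the output is xml rather than html, the numeric form is the safer choice for every character, because xml predefines only five named entities (amp, lt, gt, apos and quot).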

Recommendation

I suggest making a Word template with styles to match an xml dtd. The template can have the structure of the document laid out, so users can mostly just start typing in existing paragraphs. I would consider a VBA add-in to manage lists in the same way as html. Save the documents in Word's xml format. Use a converter that will convert any stylename in the Word xml to an xml-like tag in the output. Run the converter and an xml validator from the CMS gui. Don't allow the document to be approved until it validates.


Copyright © 1998-2011 J. R. Boynton