The New Publishing

An Information Choreographer's Manifesto

A revised draft is in the works.

Abstract

From the perspective of one who scripts the flow of information -- through creation, maintenance, pre-publishing, dispatch to the web server, and delivery to the consumer -- the author claims we are in the midst of a revolution in publishing as profound as the printing press: there is no longer a single distribution medium (a book on paper); distribution may take many forms, many of them beyond the control of the publisher. A normative value inherent in information -- access -- insists that websites be accessible to lowest-common-denominator devices: a cellphone, a PDA, Lynx on a library computer, a fancy laptop trapped in a bandwidth-impaired Prague hotel room. This makes it tremendously advantageous to prepare three versions of a website. Maintaining three versions of a website, in turn, requires careful scripting, design, and maintenance procedures. A new publishing infrastructure is described. Additional analyses cover why web publishing is so difficult, why the "players'" backgrounds work against effective web publishing, and recommendations for the future of HTML.

The information choreographer's manifesto

"Manifesto" is a strong word, of course. Let's just say the point of the word is to point out the opportunity for radical re-thinking of the publishing infrastructure used in the web. Also, the demands of publishing on the internet really will open information up to people who are not so rich that they can buy new hardware every year or two.

Meanwhile "information choreographer" is certainly meant tongue-in-cheek, but again, has a serious purpose: my business is not about the structure of information nearly as much as its flow. This carefully choreographed flow -- from content creation, through maintenance, automatically generating multiple versions of the content, the addition of user interface(s), various browser-specific enhancements and hacks, and ultimately personalization -- makes the difference between a flexible, easily maintainable site, and one that requires a galley-full of HTML slaves working long hours in the faint hope of getting the content, the markup, the user interface, and the site structure all working together in sync.

In this article I hope to persuade readers that careful information choreography can make web publishing (and publishing in general) easy. Further, I hope to persuade readers that one version of a website is inadequate. I will illustrate a publishing infrastructure that makes publishing easy, and discuss why web publishing often seems difficult.

Finally, I'll use this analysis to make some prescriptive judgements about the future of HTML and the web.

The Revolution

It isn't just that everyone can afford a "printing press" (access to a web server). It's also that the publisher no longer controls the output. Yes, we can spec fonts and colors, but it will be to little avail if the user has a sixteen-color monitor and Lynx. Perhaps a more subtle version of this point needs to be made: even in a pdf file, brilliantly subtle use of color may be ridiculously unreadable when printed on a black-and-white laser printer.

There's an even more profound change here, in the nature of automation. The question is: how many units must be produced in order to gain the value of automation? In the world of physical products, the answer may be very high. In the information publishing world, the answer may well be that one should automate the publishing of even a single version of a single file. The benefit comes from maintaining the file over time. This is a new dimension to publishing: information is no longer "one-off" but gets re-used.

Carrying the automation case another step forward, information cries out to be tailored to consumers' needs. And yet it is too expensive to maintain different versions of essentially the same source material. So we will use XML to maintain different output versions within the same source, and we will use scripts to generate tailored versions of the information.

The Revolution in Publishing

1) no control of the medium -- the display, processing power, and bandwidth

2) the normative value of "access" to information -- no matter what the user's device is

3) the resultant need for multiple user-interfaces to the same content

4) the never-ending publishing model -- maintaining content over time

5) the advantages of tailoring the content for different users

No Control of the Medium

Note that with desktop publishing, the cost of creating camera-ready copy fell, and control moved closer to the designer. But distribution still required a printing press, and access was still a matter of geography and moving paper around (and paying for the transportation). The economics of information was still tied to the economics of printing and distributing the atoms.

The web removes the need for the printing press, and removes the need to transport atoms. The cost of reproducing information becomes approximately zero. Rather than buying our atoms as a delivery mechanism, consumers now provide their own devices.

And therein lies the rub.

The web also removes the publisher's control over the presentation. We no longer get to choose the paper and exact colors. No matter what tags we put in the source, users may still look at our sites through sixteen shades of gray on the four-line display of a cellphone (or someday a toaster-oven).

The (Normative) Value: "Access"

Information is power. The network decentralizes information. Ultimately, we are in the business of reshaping the use of power on the planet.

In the meantime, though, if we are running a transaction site, our users may want to make their transactions (buy and sell stocks, for example) from any hardware/software/bandwidth situation they might find themselves in.

Similarly, if I am supporting my site with advertising, I want all the "hits" I can get. It's a bad idea, then, to make my information inaccessible to anyone who wants it.

And if I am using my website to distribute either marketing or technical support information to existing or potential clients, it would be silly for me to demand that they have some arbitrary hardware/software/bandwidth before they can get the information I need to give them in order to make a sale or keep an existing customer satisfied.

Before web "designers" there was a phrase: "the customer is always right"....

The User-Interface Versions

We need/want three versions of the user-interface. First, a low-bandwidth, very plain version. Second, a still-HTML version that looks pretty. Third, a whiz-bang version with all the bells and whistles we can throw down a wide pipe. We want the cutting-edge version, because we want to take advantage of extra value. We need a pretty, but basically simple version for the masses of people without fast connections and the latest hardware and software. And we need the very basic version for people with antique desktop computers, or cutting-edge PDAs, or who are on the road in bandwidth-poor areas (no matter how fancy a system they have).

We might also want to use the same source files to generate a book and pdf files in both color and black & white.

The User-tailored Versions

This started, let's say, with online help. We had been shipping manuals. Then we needed to provide online help, but it had to be slightly different.

We might now want to generate distinct versions of a website for experts and novices. These different versions of the content need to be maintained in a single set of source files, or we just won't be able to afford to keep it up.

Maintaining two separate versions of information doesn't just double the work: you maintain each version, and then you have to verify that you are maintaining them consistently.

In other words, you lose if you don't automate the generation of the different content versions.

Never-ending Publishing

Time was you would finish a document in Pagemaker, send it off to the printer, and that would be the end.

It's different now. Documents now become "assets" that are worth maintaining over time. Even press releases are sent out, then touted on the website, then archived on the website. What happens when you launch a new "look and feel" for the website? You must republish the archives with the new user-interface! It never ends.

So now we need a publishing infrastructure that makes maintaining the content easy, and completely separates the content from the user-interface....

The Platform

1) One version of a website is not enough!

A little history here.... There was a time when there were many web browsers. Several companies were making browsers based on Mosaic, and had significant market share. Lynx was still in fairly widespread use. Netscape's strategy of developing new releases very quickly and adding non-standard tags and buggy features pretty well made it impossible for other browser makers to compete. No one else could move that fast. Rendering HTML according to the spec was not good enough; you had to reproduce Netscape's bugs and hacks. Netscape almost had a monopoly, until Microsoft entered the fray.

For the last year or so we've been in the strange position of being able to design websites that only work with the 3.x and 4.x versions of the Netscape and Microsoft browsers. This has been difficult enough, because of the incompatibly buggy nature of those releases.

But this is about to change. One of these days, a lot of people are going to reach the internet via their Palm Pilots and other PDAs. Others will reach it with their cell phones. Opera is a worthy browser that may yet gain some market share.

Beyond this, there are bandwidth and hardware issues. I can make a website that will look very nice and have all the latest whiz-bang features, but it won't work for someone with a slow modem or a PDA. Similarly, I can make a very plain HTML site that will work for low-bandwidth connections, slow processors, monochrome monitors, etc., but why should people at their fancy desktop machines be limited in their use or enjoyment of my site because I want to give access to people with PDAs?

No, the only real solution will be to have three different user interfaces for websites. We'll be able to make one whiz-bang version with Java and Flash and Dynamic HTML, one version that still looks good for the masses with less bandwidth or not-the-newest browsers, and one version that works for low-bandwidth or small screens or slow hardware.

There is a "value" behind this: "access". The internet is about access. If you design your website in a way that I can't access it, you will lose my business if you are a transaction site, and you will lose my hit-count if you are expecting revenue from advertising.

And because you want me to be able to buy your goods or read your information through a cellphone when I'm on the road, or over a slow connection from a hotel in Prague, you will create different user interfaces to the same information.

And because you want to give me (your customer) access in all possible circumstances, your site will also be accessible to people who don't have the money for the fanciest hardware and high-speed connections. This means that children of poor countries around the world will benefit from the internet revolution, not just the individuals who can spend $2K/year on new hardware.

Up until now, I've only been focused on the user-interface/look-and-feel issues of the website. Another set of versions may be required because you are able to identify different target audiences with different content needs, or different ways that the information might be used.

A simple example is that of online help. You might want to allow people to indicate their level of expertise with the material; then deliver content tailored to their needs. Further, you might design your publishing system so that the context-sensitive help and the "manual" come from the same source -- so you don't have to maintain two different copies of what is essentially the same information.

2) Publishing should be as easy as typing a memo.

In the previous section I propose the need (and opportunity) for multiple versions of the content of a website.

And we all know that, generally, web publishing is so difficult for people now that they can't even make one version of their site with high quality (and time for a social life and sufficient sleep).

And here I am saying that publishing multiple versions of a website should be as easy as typing a memo in Microsoft Word....

Yes. Yes, yes, yes.

This is my business. I know it can be done. There are commercial packages that approach this, even now.

How?

As we've said all along: separate the "content" from the "user-interface". Create and maintain the content in the easiest possible form, then automate the integration with the user interface.

We've all done this to some extent with frames. The content file can be very plain; it doesn't need any navigation.

Frames pose lots of problems, and the user-interface is not just the navigation. You also have background colors and font hacks, etc. But frames let you go a long way in this direction (at a high cost).

I'm saying automate the process of combining the content with the user interface.

More than that, I'm saying that if you do that once, you can do it three times at very little marginal cost. You still have to design three versions of your site, but you only maintain one version of your content, and generate the three published versions automatically.
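
As a rough illustration, here is the kind of glue script I have in mind -- a minimal sketch in Python, where the directory names, the three template names, and the <!-- CONTENT --> marker are all invented for the example:

    #!/usr/bin/env python
    # Sketch: wrap each plain content file in three user-interface templates.
    # Assumes content/*.html holds bare content, and each template contains
    # the marker <!-- CONTENT --> where the content should be dropped in.
    import glob, os

    TEMPLATES = ("plain", "pretty", "whizbang")

    for template_name in TEMPLATES:
        with open("templates/%s.html" % template_name) as f:
            template = f.read()
        for source in glob.glob("content/*.html"):
            with open(source) as f:
                content = f.read()
            out_dir = os.path.join("publish", template_name)
            os.makedirs(out_dir, exist_ok=True)
            with open(os.path.join(out_dir, os.path.basename(source)), "w") as f:
                f.write(template.replace("<!-- CONTENT -->", content))

The second and third versions cost one more template and one more pass through the loop -- that is the "very little marginal cost".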

This can be done "on-the-fly", if you like. There's a pretty high cost to doing that for otherwise static content, but it can be done. (It's just that I would think that information management technology would allow us to generate the multiple versions ahead of time, and not get lost with the number of files, etc.)

There are two cases where rendering multiple versions on-the-fly loses: when the content is different between versions, and when the user interface is thought of as more than the navigation.

Let's say that I really do tag my source files so that some paragraphs are "expert-only" and some are "novice-only" (or similar distinctions).

It would be silly to parse each content file character by character on-the-fly, when this could all be done in advance. The cost of that parsing is significantly higher than simply wrapping the navigation around a file that isn't parsed at all.
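
To make that concrete: suppose paragraphs carry a made-up audience attribute in the XML source. A pre-processing pass like the following sketch (Python, regex-based for brevity; a real system would use a proper XML parser) builds each version ahead of time:

    import re

    SOURCE = """<p>Click Save to store your changes.</p>
    <p audience="expert">Or call the save routine directly from a script.</p>
    <p audience="novice">A dialog box will confirm that the file was saved.</p>"""

    def build_version(source, audience):
        # Keep untagged paragraphs; drop paragraphs tagged for a different audience.
        def keep(match):
            return match.group(0) if match.group(1) == audience else ""
        return re.sub(r'<p audience="(\w+)">.*?</p>\s*', keep, source, flags=re.DOTALL)

    expert_version = build_version(SOURCE, "expert")
    novice_version = build_version(SOURCE, "novice")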

Similarly, if I want to create three different user-interfaces to my content, they might well contain different characters. If one version uses stylesheets for font information, it's advantageous not to use the (hack-ish) font tag (that must be inserted inside of every table cell, and doesn't affect the size of text within heading tags). Beyond that, maintaining font tags in the source files is very difficult and expensive: it's too easy to forget one and they obscure the actual content of the source.

Maintenance is easier if the font tag hacks are added in the publishing phase -- but you wouldn't do that on-the-fly.

A similar example is maintaining the image height and width attributes. It's silly and error-prone to maintain these attributes by hand. But it would also be silly to parse files on-the-fly, check the actual pixel height and width of each image, and insert that into the tags.
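
A sketch of what that publish-time pass might look like, assuming the Pillow imaging library is available and that the img tags are simple enough for a regex to handle:

    import re
    from PIL import Image  # the Pillow imaging library, assumed to be installed

    def add_dimensions(html, image_root="."):
        # Rewrite <img> tags that lack width/height, using the real pixel sizes.
        def fix(match):
            tag, src = match.group(0), match.group(1)
            if "width=" in tag.lower():
                return tag  # already has dimensions; leave it alone
            width, height = Image.open("%s/%s" % (image_root, src)).size
            return tag[:-1] + ' width="%d" height="%d">' % (width, height)
        return re.sub(r'<img[^>]*\bsrc="([^"]+)"[^>]*>', fix, html, flags=re.IGNORECASE)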

3) Focus on Ease of Maintenance

The way you get quality of content and execution is to make the execution so easy that you have time and attention available to focus on the quality.

The way to make execution easy is to make the ongoing maintenance of the site easy: easy to succeed, in contrast to many web publishing environments where it is easy to fail; easy to look like an idiot.

Publishing infrastructure....

The entire publishing system should be designed with the focus on ease-of-maintenance, and automation of each discrete step in a long series of processes.

There are phases: create; maintain (including source control and separate work areas); generate versions of the content; merge the content with one or more user-interfaces, drawing on the site database; apply required browser-specific hacks; and apply additional personalization and/or direct the user to the appropriate version of the content.

Creating content

Creating (text) content can be done by writers using word processing software, with carefully designed templates. The templates and macros make it easy to convert from the word processor format to ascii XML. The writers must be willing to use the templates and use the styles appropriately, but if they cooperate, creating content and converting it to XML source will be easy and painless.

Maintaining Content

Obviously we look forward to the day when word processors will process XML, so the content can be maintained there as well. Until then, it is fairly easy to make small changes to XML/HTML source files that are stripped of the user interface: page layout, navigation, and font hacks.

Source Control

Storing the source in ascii files makes it easy and cheap to use source control. RCS and such are not pretty, but they are free, and anyone who is expected to work with HTML tags can surely write down the four or five commands they need to use.
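
For the record, here is roughly the index-card's worth of RCS a content maintainer needs (the file name is just an example; a real setup would also agree on a shared RCS directory and locking conventions):

    co -l index.html      # check out and lock the file for editing
    ci -u index.html      # check the edit back in; keep a read-only working copy
    rcsdiff index.html    # show what has changed since the last check-in
    rlog index.html       # show who changed what, and when
    rcs -u index.html     # break someone else's lock (rarely, and politely)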

Actually, inadequate source control is one of the problems of web publishing. Most of the people doing this really don't understand the need, or the process.

You must not only keep the source files in some source control system, you must give each maintainer a separate work area. And you must be careful about versions. Even as you move forward with major revisions, you will still need to make some changes to the current version of your website.

Let's say one more thing about source control. The needs of web content publishers are different from the needs of programmers. This is really an empirical question: programmers working on a few large source files need to be able to let several people have access to the same file at once. As long as they are working on different parts of the (large) file, it is very easy to merge their changes together again. And programmers have skills at this kind of thing.

Meanwhile, people working with HTML need to be sure that if one person is editing a file, no one else can edit it. So "file-locking" is very important to source control for web publishing, even as it is important for programmers not to be locked out of files.

Versions of the content

By using XML tags, you can mark text in a way that lets you maintain one set of source files, but (automatically) generate multiple versions of the content.

A simple example, again, is help: context-sensitive help, by definition, would not include the steps required to get to the current screen. A manual that is procedural would include those steps. You can either maintain two different files, or you can mark the content with XML so that the different versions get the right text from a single source file.

Other versions: split documents into short files with large fonts that can be read online. Also have a version of the entire document in one file for people to print, or just scroll through. (A single file allows users to search using the browser's Find feature.) You could also create a pdf file for people to download and print -- potentially a better quality printout than HTML.
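
The all-in-one-file version, for instance, is nothing more exotic than concatenating the per-section files in reading order -- something like this sketch, where sections.txt (a made-up name) lists the file names in order:

    # Sketch: build the single-file, printable version from the split sections.
    # sections.txt lists one section file name per line, in reading order.
    with open("sections.txt") as f:
        section_files = [line.strip() for line in f if line.strip()]

    with open("publish/full-document.html", "w") as out:
        for name in section_files:
            with open("content/%s" % name) as section:
                out.write(section.read())
            out.write("\n<hr>\n")  # simple divider between sections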

Maintainable source

The newness of the technology and the inexperience of the practitioners has led to a number of ghastly practices. While any individual might express a preference for HTML tags in upper or lower case, the empirical results have been understood for at least 70 years: words in all upper case are harder to read. There is less differentiation between the shapes of the characters. It is easier to see errors if HTML tags are in lower case. Similarly, it doesn't take a bear of much brain at all to notice that typing tags in all upper case requires more effort, and more attention. Effort and attention are too scarce to squander on whims that lead to lower quality. The same applies to quoting attribute values where not required. It is easier by far to quote attributes only when the quote is required than to remember to close the quote of every attribute.

Another horrendous practice is indenting the HTML source to show the hierarchy. Such a "neat" technique; such a waste of screen space. Before you know it there is more white space at the start of the line than text on the line that you can see. Give us a break!

You really can make a significant difference in the maintainability of your HTML source by using whitespace carefully -- both vertical and horizontal.

And then there are comments. Use them liberally to make nested tables understandable. Use them quite carefully to make automation easier. Use them to identify discrete sections of your document.

If you use source control, you never have to have those stupid comments: "Last updated by XKX on 99/99" -- that never seem to get modified (because people have better things to do with their time than waste it).

We are dealing with information technology. You would think people would start learning.... You should really figure out how to do search and replace across many files at once. And if you identify chunks of text (perhaps with comments like <!-- begin this -->...<!-- end this -->) you can make that kind of search and replace a lot easier than trying to type all the characters you want to get rid of, and then all of the characters you want to replace them with.
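
Here is roughly what that buys you, sketched in Python; the <!-- begin this -->...<!-- end this --> markers are the ones suggested above, and the file locations are invented:

    import glob, re

    # The new text to drop in between the markers, in every file that has them.
    NEW_CHUNK = "<p>The replacement text for this section.</p>"
    pattern = re.compile(r'(<!-- begin this -->).*?(<!-- end this -->)', re.DOTALL)

    for path in glob.glob("content/*.html"):
        with open(path) as f:
            text = f.read()
        updated = pattern.sub(lambda m: m.group(1) + NEW_CHUNK + m.group(2), text)
        if updated != text:
            with open(path, "w") as f:
                f.write(updated)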

Applying the user-interface(s)

The same script that adds page layout, global and parallel navigation elements to the source files might reach into each table cell to add font information, and might insert the right image height and width attributes. If you use some server tool with long URLs containing more information than just which file to deliver, you might add the extra URL information at this phase, as well.
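
For example, the font-tag hack can be applied mechanically at publish time, so no one ever maintains it by hand -- a rough sketch, with the face and size values as placeholders:

    import re

    def add_font_hacks(html, face="Verdana, Arial", size="2"):
        # Open a <font> tag just inside every table cell and close it at the end.
        html = re.sub(r'(<td\b[^>]*>)',
                      r'\1<font face="%s" size="%s">' % (face, size),
                      html, flags=re.IGNORECASE)
        html = re.sub(r'(</td>)', r'</font>\1', html, flags=re.IGNORECASE)
        return html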

The site database

Speaking of parallel navigation, where does the information about the order of files live? Is it distributed between the files themselves, or is it in a database? The right answer is that the order of files should be in a database.

The same database can then store meta information. This way you can maintain meta information (for indexing, etc.) in a central location, with the control that is appropriate to database information. At publishing time, you can insert meta information into the files, and you can build a complete table of contents/sitemap from the database, instead of updating it by hand.
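
The "database" does not have to be anything grand. In this sketch it is just a flat file, site.csv (a made-up name, with order, file, title, and description columns), from which the sitemap and per-page meta information are generated:

    import csv

    # site.csv columns: order, file, title, description (names invented here)
    with open("site.csv") as f:
        rows = sorted(csv.DictReader(f), key=lambda r: int(r["order"]))

    # Build the sitemap from the database instead of updating it by hand.
    links = ['<li><a href="%s">%s</a></li>' % (r["file"], r["title"]) for r in rows]
    with open("publish/sitemap.html", "w") as f:
        f.write("<ul>\n%s\n</ul>\n" % "\n".join(links))

    # The same rows drive per-page meta tags and prev/next navigation.
    for i, r in enumerate(rows):
        meta = '<meta name="description" content="%s">' % r["description"]
        prev_page = rows[i - 1]["file"] if i > 0 else None
        next_page = rows[i + 1]["file"] if i + 1 < len(rows) else None
        # ...insert meta, prev_page, next_page into the published page here...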

Apply browser-specific hacks

The sad fact is that the variation between versions of browsers (especially with Javascript) is such that it is just better to deliver exactly the code appropriate for the end-user's browser. This also includes testing for whether the user has cookies, javascript, java, etc. turned on or off.

This is a great use for on-the-fly content tools. Have some part of the content dependent on browser characteristics. Give the Javascript hack that works for the particular browser, etc.
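
A minimal CGI-style sketch of the idea -- the user-agent tests are deliberately crude and the script file names are invented:

    #!/usr/bin/env python
    # CGI sketch: hand each browser the JavaScript that actually works for it.
    import os

    user_agent = os.environ.get("HTTP_USER_AGENT", "")

    if "MSIE" in user_agent:
        script = "/js/rollover-msie.js"        # invented file names
    elif "Mozilla/4" in user_agent:
        script = "/js/rollover-netscape4.js"
    else:
        script = "/js/rollover-plain.js"       # no-frills fallback

    print("Content-Type: text/html\n")
    print('<script src="%s"></script>' % script)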

Apply personalization/direct to version

Yes, you can insert "Welcome, Joe Blow" on every page of your site. Maybe Joe will be impressed. ("Golly, how did they know it was me?") More usefully, you can allow Joe to tailor what information appears. And this same layer can make sure Joe is getting to the most appropriate version of the content. So Joe can see the whiz-bang version at work, but the plain version from his PDA. In either case, he always gets the "expert" version of the content, and his personalized information set (pulled from databases).
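
A sketch of that last layer, again CGI-style; the cookie names and the publish/<ui>/<level>/ directory layout are invented for the illustration:

    #!/usr/bin/env python
    # CGI sketch: route the visitor to the version of a page that suits them.
    import os
    from http.cookies import SimpleCookie

    cookies = SimpleCookie(os.environ.get("HTTP_COOKIE", ""))
    ui = cookies["ui"].value if "ui" in cookies else "pretty"           # plain / pretty / whizbang
    level = cookies["level"].value if "level" in cookies else "novice"  # novice / expert
    page = "index.html"  # in real life, derived (carefully) from the request

    print("Content-Type: text/html\n")
    with open("publish/%s/%s/%s" % (ui, level, page)) as f:
        print(f.read())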

"Productizing the Choreography

It's a long way from the text someone typed in Word to the particular version of the site that someone sees. But it can all be scripted. Content maintainers maintain the source, and everything else is added on by scripts.

A low-tech approach combines free source control software, managing content in files (not a database), managing data about the content in a simple database, and scripting with Perl or some other easily accessible language.

The sad truth is that the low-tech approach is very difficult to "productize": it's just assembling a set of proven, robust components in a sensible way.

To make a product out of this -- to attract more customers and venture capital -- you have to think on a much bigger scale. You pretty much have to store the information in a database of your own. That means you have to write your own source control system -- a no-brainer, except for "diffing". But then you need easy access to the information hidden in your big content database, so you need your own editing tools. And there are issues about how you merge the content with the templates, and even how you create the templates.

So products in this field must be large endeavors: Net Objects Fusion; the database system CNET uses.

In exchange for that extra "productizability", they are typically more difficult to work with, are built from scratch on less robust foundations, and are usually given graphical user interfaces that may or may not have been designed by extraordinarily talented UI engineers/designers.

And still you have to script websites in order to get the parallel navigation. And you have to use their brand-new, written-from-scratch scripting languages. (As if Javascript wasn't already enough of a nightmare....)

My point here is that one shouldn't confuse product and technology. A good source control product won't help a whit if people don't cooperate -- play by the rules.

Why is Web Publishing so Hard?

The first, easiest answer is that it is hard because people aren't doing what it takes to make it easy. Most websites have no automation of the publishing process, and the source files they maintain are the same as the files that are live on their site.

Beyond that, let's say that the "players" have no experience in this radically different and new publishing environment.

Who are the "players"? Look around the web companies: former desktop publishers, database folks, programmers, people who couldn't make it in the CD-ROM business, and decision-makers with no experience in the new environment.

Former Desktop Publishers

These are the people who first started using "wysiwyg" tools, then switched to BBEdit. Their business was always tweaking objects in Pagemaker until it was just right. If the website is small enough, they can still tweak every object, but maintenance becomes very difficult. These were "mouse-y" people, with no sense of automation, and no sense of re-using content (besides an occasional graphic).

If a website is large, or needs multiple versions, or changes frequently, the tweaky people will work maniacal hours. You see this at websites all over.

Database Folks

For some reason, they think their experience with small numbers that can be summarized mathematically applies to publishing large volumes of text. SQL databases are lousy at text content, and outrageously expensive to use in terms of cost per delivered page. Serving content out of databases also adds significantly to the delivery time, and makes the publishing system that much more fragile. What's more reliable (or cheaper): serving flat HTML files from a disk or connecting to some program that queries a database to return information?

Fundamentally, they have confused a product category -- SQL databases -- with a technology -- managing units of information. A directory of discrete files is a fine way of managing large quantities of text, especially with the easy access and control provided by off-the-shelf editing and source control software.

Then again, transaction-oriented websites clearly need databases. It's just that the transaction dog wags the tail of the web publishing group: the transaction dog is where the money and big hardware requirements are. The poor publishing guys are stuck with systems designed for the convenience of transaction processors, and maintenance becomes nightmarish.

Programmers

"#include". That's about what programmers have to offer web publishing. Or you get schemes of combining program code with HTML tags -- designed by programmers without regard for the maintenance nightmares they create for the web publishing group. Microsoft's Active Server Pages are an example of a programmer-oriented technology that mercilessly mixes program code (server-side VBScript) with HTML to create an unmaintainably twisted tangle.

Perl programmers have about the same idea -- that HTML isn't very important compared to the script, and can just be interspersed with Perl. Then some poor wretch with little or no Perl knowledge is stuck going through the Perl source trying to figure out how to change some nested table structure....

Even Kiva, with its potential for greater separation of program code and HTML, can be used in such a way that the HTML source is a bear to maintain. Perhaps the key question is whether the layers of templates and code used with Kiva are more independent, or are creating more dependencies. My guess is that it could be done well with Kiva, but programmers have no sense of the maintenance needs of the HTML, and so -- without firm guidance -- will revert to practices that make maintenance more difficult, rather than less.

Decision Makers

We all went through a horrible period when desktop publishing emerged: managers with no experience in publishing were suddenly riding an explosive wave of small-scale publishing. The result was an incredible array of "bozo" documents with mutating fonts, faces, sizes, and colors -- painfully ugly and unreadable.

Well, someone, somewhere is burning off some bad karma, because it's happening all over again, only more extreme. Now, all of a sudden, managers whose primary qualification was (perhaps) the inability to actually do anything, are making decisions about interaction models and information design for the web?!

They have no experience with web publishing, and have no ability to draw distinctions about what is important or unimportant, what works or doesn't. (Note that this is also true for most of the people designing websites.) Further, the presentations upon which they base their approval decisions focus on "the look and feel" -- static visual effects -- rather than the actual usefulness of the proposed site.

And the result is what we see on the web today.

You might work with some amazingly sharp people who can create excellent websites, but the decisions are made by people who don't even know how clueless they are.

CD-ROM wannabes; wish-they-could-have-beens

Forgive me for expressing this harshly. We all are seduced by multimedia: the glitz, the motion, the video, the interaction. Most people working on the web hate HTML. They don't want to do cross-platform information delivery. They want to make sexy CD-ROMs. (I want to make sexy CD-ROMs.) They don't want to have bandwidth constraints, and they don't want there to be a time lag between a user's mouse-click and the display of the next screen (with full-motion video).

But there are two harsh points to make about this. First, obviously, it is the web.

Second, obviously, they weren't good enough to make CD-ROMs. You have to understand media, and you have to understand automation. You can't even begin to tweak every chunk of information (like you could in Pagemaker). Video and animation are "like drinking from a fire-hose". And you have to be good. And the people working on the web -- empirically -- weren't.

So we have all these wannabes with their CD-envy running around making disastrously bad websites.

Worse, where a bad CD-ROM at least does no damage, a bad website wastes bandwidth, and negatively affects the experience of other Internet users who have never been to the offending website. It's like a car maker with no constraints on how much pollution it releases into the air....

A Retraction

Please understand, my point is not that any of the individuals above are incapable of doing quality work.

My point is that the web is new and very different, so people are confused about what is important. In the long run, they will be able to see the obvious and will -- gradually, gradually -- learn to draw the important distinctions required for making good decisions.

Further Praise for Pre-processing

Let me simply make two lists: what is scarce, and what is plentiful. It should be obvious that moving as much processing as possible away from the live web server is essential.

What's scarce:

skills
attention
support
peak server cycles
QA resources
talent
experience

What's plentiful:

disk space
pre-processor cycles

Every processor cycle you can off-load from the web server has value.

Everything you can do to simplify the delivery has great value. Just think of the things that can go wrong: not much with an Apache server chunking out flat HTML files from the disk drive.

But put more layers in the mix -- server-side processing, database connections, etc. -- and you make your system more fragile, not just because of the number of processes that must be working correctly to deliver an HTML page, but also because of the greater fragility of these other processes. The talent and resources required to maintain those additional resources are very expensive, and very scarce. One small mis-alignment in any of the systems, and the whole business can come crashing down. Bad.

On the other hand, skills and attention are in shorter supply than server cycles. So my one caveat is that one should minimize the number of different technologies used for web publishing within an organization. Even if there is a more efficient technology for any one application, it may well be worthwhile to use the same tools all across the board. You can't expect good results when the person maintaining a project must master, and constantly switch between, four or five different page rendering technologies (server-side includes, HTML pages with Perl embedded, CGI applications, Java servlets, and a pre-processor, for example).

The future of HTML

From my perspective, I need for word processors to be XML-friendly (for real, not for marketing hype).

I need it to be easy to go from XML structure with simple HTML markup, to various versions of output. In addition to HTML, I want to produce books and pdf files out of this system. Even for forms, I may want to have scripts read my XML source and generate both HTML and Java versions of forms.

I'd say the key to useful XML is that I should be able to use multiple DTDs (with distinct sets of tags) within the same file. So my XML parser should skip over any tags not in the DTD I'm testing against.
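
This is not how a validating parser works today, but the behavior I am asking for is easy to sketch: keep the tags you know, drop the ones you don't, and leave the text alone. A crude, regex-based illustration (the tag set is invented):

    import re

    KNOWN_TAGS = {"p", "h1", "h2", "ul", "li", "a", "em", "strong"}  # whatever set I care about

    def strip_unknown_tags(source):
        # Leave known tags alone; drop unknown tags but keep the text inside them.
        def check(match):
            return match.group(0) if match.group(1).lower() in KNOWN_TAGS else ""
        return re.sub(r'</?([A-Za-z][\w:.-]*)\b[^>]*>', check, source)

    mixed = '<p>See the <audience level="expert">API notes</audience> for details.</p>'
    print(strip_unknown_tags(mixed))   # -> <p>See the API notes for details.</p>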

I need for HTML to be easy to maintain, stable over time, compatible across vendors, platforms, and versions.

But most of all I want a system where only people responsible for the user-interface spend much time looking at HTML source.

So my manifesto makes a couple of points. First, let's get smart about the publishing infrastructure, realizing we must be able to create multiple versions of our content.

Second, let's stabilize HTML as a key output medium.