J. R. Boynton

URLs are Abstract

Many people, even in the web business, imagine that a URL points directly at a web page. For example, they take http://www.somewhere.com/some_dir/some_file.html as a straightforward indication of where the content of "some_file.html" lives. Implicitly, they assume that "some_file.html" actually is an HTML file.

In fact, there can be so many layers of abstraction between a URL and a piece of information that it makes your head spin.

This matters because the value of websites, and of the entire web, is enhanced by maintaining URLs over time. Changing URLs reduces the value of the web, even where management or marketing geniuses think they have perfect reasons to change their website. The abstract nature of URLs means that you can define your URLs once and keep them, even if you change your software somewhere down the road. It also means software should avoid imposing URLs that lock customers in. (Ah, there's a difference between what's good for the corporation and what's good for the world.)

And if you do want to change, you can always replace the old URLs with pointers, typically redirects, to the new URLs.
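
As a rough illustration of such pointers, here is a minimal Python sketch of a redirect table. The paths in REDIRECTS and the respond function are made up for the example; a real site would usually do this in the web server configuration rather than in application code.

```python
# Hypothetical table of retired URL paths and their replacements.
REDIRECTS = {
    "/old/about.html": "/company/about.html",
    "/products.html": "/catalog/",
}

def respond(path):
    """Return (status, headers) for a requested path."""
    if path in REDIRECTS:
        # 301 tells browsers and search engines the move is permanent.
        return 301, {"Location": REDIRECTS[path]}
    # Anything else is assumed to be served normally.
    return 200, {}
```

The old URL keeps working for everyone who bookmarked or linked to it, which is the whole point.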

Here's a sampling of the sorts of abstraction that may come between the URL and the content.

First, the domain name is converted to an IP address. This usually happens by way of DNS, which is how domain names are normally mapped to IP addresses across the internet.

On the other hand, most computers, desktops included, have a file the operating system normally checks before it asks for a DNS lookup. On Unix-like systems this is /etc/hosts, where you can map any hostname to any IP address. For example, you could save yourself from advertising servers by mapping their hostnames to some other IP address. Or in your internal environment, you could give hostnames to IP addresses on your LAN.
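
To make the lookup order concrete, here is a small Python sketch that parses hosts-file text and consults it before falling back to the system resolver. The sample entries (ads.example.com, wiki.lan, the addresses) are invented for illustration, and the real operating system does all of this for you; the sketch only imitates the idea.

```python
import socket

# Invented hosts-file content; a real one lives at /etc/hosts on Unix-like systems.
SAMPLE_HOSTS = """\
# address      hostname(s)
127.0.0.1      localhost
0.0.0.0        ads.example.com   # send an ad server nowhere useful
192.168.1.20   wiki.lan wiki
"""

def parse_hosts(text):
    """Build a hostname -> address table from hosts-file text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # The first entry for a name wins, as in a real hosts file.
            table.setdefault(name.lower(), address)
    return table

def resolve(hostname, hosts_table):
    """Check the local table first, then ask the system resolver (usually DNS)."""
    local = hosts_table.get(hostname.lower())
    if local is not None:
        return local
    return socket.gethostbyname(hostname)
```

With this table, resolve("ads.example.com", parse_hosts(SAMPLE_HOSTS)) never touches the network: the local entry answers first.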

Once a request reaches an IP address, the device there is usually a router, not a web server. The router probably forwards the request to a load balancer, which then picks an actual web server.
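
How a load balancer picks a server varies a lot. One of the simplest policies is round-robin, sketched below in Python. The backend addresses and the RoundRobinBalancer class are assumptions made up for the example, not a description of any particular product.

```python
import itertools

# Hypothetical pool of web servers sitting behind one public IP address.
BACKENDS = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]

class RoundRobinBalancer:
    """Hand out backends in turn, wrapping around at the end of the list."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        # Each call returns the next backend in the rotation.
        return next(self._cycle)
```

Two requests for the same URL, seconds apart, may well be answered by two different machines.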

The opposite arrangement is also common: one web server responds to many IP addresses. Web servers also support "virtual hosts", where many hostnames and domains are mapped to the same IP address, and the web server uses the hostname in the request to decide which content to deliver.

Once the correct virtual host is selected, there is typically a "document root" directory. For the Apache web server, this is usually called "htdocs", and there is usually one at /usr/local/apache/htdocs. But if the server has many virtual hosts, the document root directories for the virtual hosts can be anywhere.
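
The virtual host lookup amounts to a table keyed on the hostname the browser sends. Here is a minimal Python sketch; blog.somewhere.com and /home/blog/public are hypothetical, and the function ignores details such as IPv6 literals in the Host header.

```python
# Fallback document root, taken from the Apache example in the text.
DEFAULT_ROOT = "/usr/local/apache/htdocs"

# Hypothetical virtual hosts sharing one IP address.
VIRTUAL_HOSTS = {
    "www.somewhere.com": "/usr/local/apache/htdocs",
    "blog.somewhere.com": "/home/blog/public",
}

def document_root(host_header):
    """Choose a document root from the Host header of a request."""
    # Strip an optional ":port" suffix and compare case-insensitively.
    host = host_header.split(":", 1)[0].strip().lower()
    return VIRTUAL_HOSTS.get(host, DEFAULT_ROOT)
```

The same path, requested from the same IP address, can therefore lead to entirely different directories depending on the hostname.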

Even then, the paths the server answers to don't have to live under the htdocs directory. For example, in /path/file.html, "path" doesn't actually have to be a subdirectory of htdocs. It can be mapped to another location in the web server configuration file, or "path" could be a symbolic link to some other directory in the file system.
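
A configuration-file mapping of this kind boils down to a prefix table. The Python sketch below shows the idea; the alias target /data/shared/pages/ is invented, and real servers handle many more cases (symbolic links, access rules, encodings) than this does.

```python
import posixpath

DOCUMENT_ROOT = "/usr/local/apache/htdocs"

# Hypothetical alias: URL paths under /path/ live outside the document root.
ALIASES = {"/path/": "/data/shared/pages/"}

def url_to_file(url_path):
    """Translate a URL path to a file system path, honoring aliases."""
    # Normalize first so "../" segments cannot climb out of the mapped tree.
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    for prefix, target in ALIASES.items():
        if clean.startswith(prefix):
            return target + clean[len(prefix):]
    return DOCUMENT_ROOT + clean
```

Here /path/file.html is served from /data/shared/pages/file.html, while /index.html still comes from under htdocs.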

Web servers typically have a built-in mechanism for translating a requested URL to a file system path. The translation can be very simple or very complex.

Web servers typically allow for plug-ins that can give web teams even more flexibility in defining the translation between the URL and the actual resource. The resource doesn't have to be a file, either. You could have your plug-in translate /articles/grassroots.html to /cgi-bin/articles.cgi?cid=251. The whole world would think there is a file called grassroots.html in the articles subdirectory when, actually, a CGI program delivers articles by content id.
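
The translation in that example could look something like the following Python sketch. The ARTICLE_IDS table and the rewrite function are assumptions for illustration; only the two URLs come from the example above, and an actual plug-in would be written against the web server's own interface.

```python
import re

# Hypothetical lookup from the name in the public URL to a content id.
ARTICLE_IDS = {"grassroots": 251}

# Public article URLs look like /articles/<name>.html.
ARTICLE_URL = re.compile(r"^/articles/([a-z0-9_-]+)\.html$")

def rewrite(url_path):
    """Map a public article URL to the CGI request that really serves it."""
    match = ARTICLE_URL.match(url_path)
    if match:
        cid = ARTICLE_IDS.get(match.group(1))
        if cid is not None:
            return f"/cgi-bin/articles.cgi?cid={cid}"
    # Unknown articles and all other paths pass through unchanged.
    return url_path
```

If the CGI program is later replaced by something else, only this mapping changes; the public URL stays put.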

And if you don't like plug-ins, you can get the same effect with server-side includes. You could actually have a file at /articles/grassroots.html, but the file could contain nothing except a server-side include of /cgi-bin/articles.cgi?cid=251.
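
To show what the server does with such a file, here is a toy Python imitation of include processing. It recognizes the Apache-style include directive; expand_includes and fake_fetch are hypothetical names, and fake_fetch merely stands in for actually running the CGI program.

```python
import re

# Matches the Apache-style directive <!--#include virtual="..." -->.
INCLUDE = re.compile(r'<!--#include\s+virtual="([^"]+)"\s*-->')

def expand_includes(page_text, fetch):
    """Replace each include directive with whatever fetch(target) returns."""
    return INCLUDE.sub(lambda m: fetch(m.group(1)), page_text)

def fake_fetch(target):
    # Stand-in for executing the CGI program; a real server would run it.
    return f"<p>article served by {target}</p>"

# The entire content of the hypothetical /articles/grassroots.html file.
page = '<!--#include virtual="/cgi-bin/articles.cgi?cid=251" -->'
html = expand_includes(page, fake_fetch)
```

The file on disk is one line long, yet the visitor receives a full article.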

By now it should be obvious that the URL need bear no relation whatsoever to the filename and directory where an article was first written, and no relation to the location where someone would maintain it.




Copyright © 1998-2011 J. R. Boynton