Wednesday, April 21, 2004

URLS! URLS! URLS!

by Bill Humphries

Looking around the web, you’ve run across plenty of URLs that look like:

/content.cgi?date=2000-02-21
/article.cgi?id=46&page=1

Server-side scripts generate the content of those pages. The content of a particular page is uniquely determined by the URL, just as if you had requested a page with the URL /content/2000-02-21.html or /article/46.1.html. These pages are different from server-generated pages created in response to a form, such as a shopping cart or an enrollment page. However, search engines will not index these content pages, because search engines ignore pages generated by CGI scripts as potential blind alleys.

A search engine would follow a URL like

/content/2000/02/21,

so some way of mapping a URL like /content/2000/02/21 to the script /content.cgi?date=2000-02-21 would be useful. Not only will search engines follow such a link, but the URL itself is easy to remember. A frequent visitor to the site would know how to reach the page for any day the site published content. When I changed the interface for viewing entries by topic in my WebLog from /meta.php3?meta=XML to /meta/XML, search engines such as Google started indexing, and I’m getting more visits referred by search engines.

The trick is to tell the outside world that your interface is one thing: /content/YYYY/MM/DD, but when you fetch the page, you’re accessing /content.cgi?date=YYYY-MM-DD. Web servers such as Apache and content management systems such as Userland’s Manila and the open source Zope support this abstraction.

The abstraction is also useful because a site’s infrastructure is rarely stable over time. When engineering replaces the Perl CGI scripts with Java Server Pages, and the URLs become /content.jsp?date=YYYY-MM-DD, your users’ bookmarked URLs break. When you use an abstraction, your users bookmark /content/YYYY/MM/DD, and when you change your back end, you update /content/YYYY/MM/DD to point at /content.jsp?date=YYYY-MM-DD without breaking bookmarks.
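
As a rough illustration of that back-end swap, here is a minimal sketch using the mod_rewrite directives described later in this article. It assumes the rules live in an .htaccess file at the document root and that the replacement script is named content.jsp, as in the paragraph above. Only the right-hand side of the rule changes, so bookmarked URLs keep working.

```apache
RewriteEngine On

# Before the migration, the public URL mapped to the Perl CGI script:
# RewriteRule ^content/([0-9]+)/([0-9]+)/([0-9]+)$ content.cgi?date=$1-$2-$3

# After the migration, the same public URL maps to the JSP instead.
RewriteRule ^content/([0-9]+)/([0-9]+)/([0-9]+)$ content.jsp?date=$1-$2-$3
```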

If you’re not publishing content dynamically, and have URIs like:

/content-YYYY-MM-DD.html,

you don’t have the indexing problem that dynamic content has. However, you may still want to adopt this type of URI for consistency with other sites. Remember that people coming to your site want to use an interface they are familiar with, and URIs are part of your interface.

Rewriting the URL in Apache
The Apache Web server is ubiquitous on both Unix and NT, and it has an optional component, mod_rewrite, that will rewrite URLs for you. It isn’t part of the standard install. Pair Networks, Dreamhost, and Hurricane Electric have it enabled on their servers. If you are running your own server, check with your systems administrator to see if it’s installed, or have her install it for you.

The mod_rewrite module works by examining each requested URL. If the requested URL matches one of the URL rewriting rules, that rule is triggered, and the request is handled by the rewritten URL.

If you’re not familiar with Apache, you’ll want to read up on how its configuration files work. The best place to run mod_rewrite from is your server’s httpd.conf file, but you can call it from a per-directory .htaccess file as well. If you don’t have control of your server’s configuration files, you’ll need to use .htaccess, but understand that there’s a performance hit because Apache has to read .htaccess every time a URL is requested.
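
For readers running their own server, the following is a sketch of the httpd.conf settings that typically need to be in place before rewrite rules in an .htaccess file will take effect. The module path and the /var/www/html document root are assumptions; adjust both to your install.

```apache
# Load mod_rewrite if it was built as a shared module.
# The path modules/mod_rewrite.so is an assumption and varies by install.
LoadModule rewrite_module modules/mod_rewrite.so

# /var/www/html is a hypothetical document root.
<Directory "/var/www/html">
    # mod_rewrite in .htaccess files needs symbolic-link following enabled.
    Options FollowSymLinks
    # Permit .htaccess files to use RewriteEngine and RewriteRule.
    AllowOverride FileInfo
</Directory>
```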

The Goal
The goal is to create a mod_rewrite ruleset that will turn a URI such as the one shown below:

/content/YYYY/MM/DD

into a parameterized version such as the one shown next, or into something similar, as long as it’s the right URI for your script.

/content.cgi?date=YYYY-MM-DD

The Plan
We start with the URI /content/YYYY/MM/DD and want to get to /content.cgi?date=YYYY-MM-DD. So we need to do a few things:

Recognize the URI
Extract /YYYY/MM/DD and turn it into YYYY-MM-DD
Write the final form of the URI /content.cgi?date=YYYY-MM-DD

Regular Expressions and RewriteRule
This transform will require two of the directives from mod_rewrite: RewriteEngine and RewriteRule. RewriteEngine is a directive that flips the rewrite switch on and off. It’s there to save administrators typing when they want or need to disable URL rewriting. RewriteRule uses a regular-expression parser that compares the URL or URI to a rule and fires if it matches.

If we’re setting the rule in the .htaccess file of the directory where it should fire, then we need the following (from here on, the examples use archives in place of content):

RewriteEngine On
RewriteRule ^archives/([0-9]+)/([0-9]+)/([0-9]+) archives.cgi?date=$1-$2-$3

What that rule did was first match on the string ‘archives’ followed by any three groups of one or more digits (the [0-9]+) separated by ‘/’s, and rewrote it as archives.cgi?date=YYYY-MM-DD. The parser keeps a back reference for each match string in parentheses, and we can substitute those back in using $1, $2, $3, etc.
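
The rule above accepts any run of digits and does not anchor the end of the path. If you would rather be stricter, a variant along these lines is one option; it is a sketch, not the rule used in this article. It assumes four-digit years and two-digit months and days, and adds the standard [L] flag so that no later rules run once this one matches.

```apache
RewriteEngine On
# Exactly four digits for the year and two each for the month and day.
# The trailing $ means extra path segments after the day do not match.
RewriteRule ^archives/([0-9]{4})/([0-9]{2})/([0-9]{2})$ archives.cgi?date=$1-$2-$3 [L]
```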

If your page has relative links, the links will resolve as relative to /archives/YYYY/MM/DD, not /archives. That means your relative links will break. You should use the base element in the head of the page to reanchor the page.

RewriteRule for Static Content
If you have a series of static HTML files at your document root:

/news-1999-12-31.html
/news-2000-01-01.html
/news-2000-01-02.html

...and want your readers to access them with URLs like /archives/1999/12/31, then you would need a rewrite rule at the document root, such as:

RewriteRule ^archives/([0-9]+)/([0-9]+)/([0-9]+)$ /news-$1-$2-$3.html
RewriteRule ^archives$ /index.html

If the news-YYYY-MM-DD.html files are in a folder called /archives, the rewrite rule should be:

RewriteRule ^archives/([0-9]+)/([0-9]+)/([0-9]+)$ /archives/news-$1-$2-$3.html

If you want to use an .htaccess file at the archive folder level, then the rule becomes:

RewriteRule ^([0-9]+)/([0-9]+)/([0-9]+)$ news-$1-$2-$3.html

Also, you may delete the second rewrite rule since you can use a DirectoryIndex rule instead.

DirectoryIndex index.html

Corner Cases
What if someone enters http://www.yoursite.com/archives instead of http://www.yoursite.com/archives/YYYY/MM/DD? mod_rewrite works through the rewrite rules in the order they are listed, and a rule that does not match is simply skipped, so we can add another rule to handle that case.

RewriteEngine On
RewriteRule ^archives/([0-9]+)/([0-9]+)/([0-9]+) archives.cgi?date=$1-$2-$3
RewriteRule ^archives$ index.html

In this case, the request is sent to an index page. But you could instead send it to a page that generates a search interface.
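
Another corner case is a trailing slash, as in /archives/ or /archives/2000/02/21/. One possible way to tolerate it, assuming /archives is not a real directory on disk, is to make the final slash optional in both patterns:

```apache
RewriteEngine On
# /? makes a trailing slash optional; $ anchors the end of the path.
RewriteRule ^archives/([0-9]+)/([0-9]+)/([0-9]+)/?$ archives.cgi?date=$1-$2-$3 [L]
RewriteRule ^archives/?$ index.html [L]
```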

What If My Server’s not Apache?
Unfortunately, IIS does not come with a rewrite mechanism, but you can write an ISAPI filter to do this for you.

If you are running the Manila content management system that comes with Userland’s Frontier, the options allow you to map a particular story in the system to a simple URL.

The Zope publishing system also supports mapping of paths into arguments for server scripts.

References
Good URLs are part of interface design. Jakob Nielsen discusses this in his Alertbox column: http://www.useit.com/alertbox/990321.html.

This article was inspired in part by Tim Berners-Lee’s observation that good URLs don’t change: http://www.w3.org/Provider/Style/URI.

Ralf Engelschall has many examples of mod_rewrite in ‘cookbook’ form at his site: http://www.engelschall.com/pw/apache/rewriteguide/.

I use these techniques to create a standard interface to my weblog.

Bill Humphries has been developing for the web since 1995. He runs the More Like This weblog covering XML, web publishing and whatever other esoteric items he likes. By day, he works for a company in Silicon Valley where he helps push the web onto wireless devices.

fr.: http://alistapart.com/articles/urls
