Information Technology Daily: How to Succeed With URLs

Wednesday, April 21, 2004

How to Succeed With URLs

by Till Quack
If you’re building or maintaining a dynamic website, you may have considered the problem of how to get rid of unfriendly URLs. You might also have read Bill Humphries’s ALA article on the topic, which presents one (very good) solution to this problem.

The main difference between Bill Humphries’s article and the solution I will present here is that I decided to do the actual URL transformations with a PHP script, whereas his solution uses regular expressions in an .htaccess file.

If you prefer working with PHP instead of using regular expressions, and if you want to integrate your solution with your dynamic PHP sites, this might be the right method for you.

Why worry about URLs?
Good URLs should have a form like /products/cars/bmw/z8/ or /articles/january.htm and not something like index.php?id=12. But the latter is the kind of URL most publishing systems generate. Are we stuck with bad URLs? No.

The idea is to create “virtual” URLs that look nice and can be indexed by bots (if you set your links this way also) – in fact, the URLs for your dynamic content can have any form you like, but at the same time static content (that might also be on your server) can be reached by its regular URL.

When I built my new site, I was looking for a way to keep my URLs friendly by following these steps:

1. A user enters a URL like www.mycars.com/cars/bmw/z8/
2. The code checks to see if the entered URL maps to an existing static HTML file
3. If yes, the file is loaded; if no, step 4 is executed
4. The URL string is used to check if there is dynamic content corresponding to the entered URL (e.g. in a database)
5. If yes, the article will be displayed
6. If no, an Error 404 or a custom error message will be displayed

A Collection of tools
This article will provide you with all the information necessary to implement this solution, but it’s more a collection of tools than a complete step-by-step guide to a finished solution. Before you start, make sure you have the following:

- mod_rewrite and .htaccess files
- PHP (and a basic understanding of PHP programming)
- a database like MySQL (optional)

The index takes it all
After browsing the web and checking some forums, I found the following solution to be the most powerful: All requests (with some important exceptions – see below) for the server will be redirected to a single PHP script, which will handle the requested URL and decide which content to load, if any.

This redirection is done using a file named .htaccess that contains the following commands:

RewriteEngine on
RewriteRule !\.(gif|jpg|png|css)$ /your_web_root/index.php

The first line switches the rewrite engine (mod_rewrite) on. The second line redirects all requests to a file index.php EXCEPT for requests for image files or CSS files.

(You will need to enter the path to your web-root directory instead of "your_web_root". Important: This is something like "/home/web/" rather than something like "http://www.mydomain.com.")

You can put the .htaccess file either in your root directory or in a sub-directory, but if you put the file in a sub-directory, only requests for files and directories "below" this particular directory will be affected.
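
To see which requests the rule leaves alone, the pattern can be tried out in PHP itself. This is only a sketch: is_sent_to_index() is a hypothetical helper that mirrors the expression from the RewriteRule, it is not something Apache or the original script provides.

```php
<?php
// Hypothetical helper: mirrors the pattern from the RewriteRule above.
// The "!" in the rule negates the match, so a path is sent to index.php
// exactly when it does NOT end in one of the listed extensions.
function is_sent_to_index(string $path): bool
{
    return preg_match('/\.(gif|jpg|png|css)$/', $path) !== 1;
}

// Example paths (made up for illustration).
$paths = ['/cars/bmw/z8/', '/images/logo.png', '/style.css', '/articles/january.htm'];
foreach ($paths as $path) {
    echo $path, ' => ', is_sent_to_index($path) ? 'index.php' : 'served directly', "\n";
}
```

Note that the match is case-sensitive as written, so a file named photo.JPG would still be handed to index.php.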

The magic inside index.php
Now that we’ve redirected all requests to index.php, we need to decide how to deal with them.

Have a look at the following PHP Code, explanations follow below.

//1. check to see if a "real" file exists..
//(the bare variables assume register_globals; on current PHP read them from $_SERVER, e.g. $_SERVER['REQUEST_URI'])
if(file_exists($DOCUMENT_ROOT.$REQUEST_URI)
    and ($SCRIPT_FILENAME!=$DOCUMENT_ROOT.$REQUEST_URI)
    and ($REQUEST_URI!="/")){
    $url=$REQUEST_URI;
    include($DOCUMENT_ROOT.$url);
    exit();
}

//2. if not, go ahead and check for dynamic content.
$url=strip_tags($REQUEST_URI);
$url_array=explode("/",$url);
array_shift($url_array); //the first one is empty anyway

if($url_array[0]===""){ //we got a request for the index
    include("includes/inc_index.php");
    exit();
}

//Look if anything in the Database matches the request
//This is an empty prototype. Insert your solution here.
if(check_db($url_array)==true){
    do_some_stuff();
    output_some_content();
    exit();

//3. nothing in DB either: Error 404!
}else{
    header("HTTP/1.1 404 Not Found");
    exit();
}
Step 1, lines 1-9: check to see if a “real” file exists:
First we want to see if an existing file matches the request. (This might be a static HTML file but also a PHP or CGI script.) If there is such a file, we just include it.

On line 3, we check to see if a corresponding file is in the directory tree using $DOCUMENT_ROOT and $REQUEST_URI. If a request is something like www.mycars.com/bmw/z8/, then $REQUEST_URI contains /bmw/z8/. $DOCUMENT_ROOT is a variable set by the server which contains your document root – the directory where your web files are located.

Line 4 is very important: We check to see if the request was not for the file index.php itself – if it were, and we just went ahead, it would lead to an endless loop!

On line 5, we check for another special case: a REQUEST_URI that contains a "/" only – that would also be a request for the actual index file. If you don’t do this check, it will lead to a PHP Error. (We will deal with this case later on.)

If a request passes all these checks, we load the file using include() and stop the execution of index.php using exit().

Step 2, lines 11-26: check for dynamic content:
First, we transform the $REQUEST_URI to an array which is easier to handle:

We use strip_tags() to remove HTML or JavaScript tags from the request string (basic hack protection), and then use explode() to split the $REQUEST_URI at the slashes ("/"). Finally, using array_shift(), we remove the first array entry because it’s always empty. (Why? Because $REQUEST_URI always starts with a "/".)

All the elements of the request string are now stored in $url_array. If the request was for www.mycars.com/bmw/z8/, then $url_array[0] contains "bmw" and $url_array[1] contains "z8". There is also a third entry, $url_array[2], which is empty – provided the user did not forget the trailing slash.

How you deal with this third entry depends on what you want to do; just do whatever fits your needs.
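
One way to handle the empty trailing entry is to drop empty elements altogether, so that a URL with and without the trailing slash gives the same array. The following is a minimal sketch of that option; split_request_path() is a hypothetical helper, not part of the script above, and the choice to cut off a query string is an assumption.

```php
<?php
// Hypothetical helper, not part of the original script: splits a request
// path into its non-empty elements, so "/bmw/z8/" and "/bmw/z8" give the
// same result and "/" gives an empty array.
function split_request_path(string $request_uri): array
{
    // Assumption: anything after "?" is a query string and is ignored here.
    $path = explode('?', $request_uri, 2)[0];
    // Remove tags, as the article does.
    $path = strip_tags($path);
    $parts = explode('/', $path);
    // Keep every element except empty strings; "0" stays a valid element.
    $parts = array_filter($parts, function ($part) {
        return $part !== '';
    });
    // Re-index from 0 so that $result[0] is the first path element.
    return array_values($parts);
}

print_r(split_request_path('/bmw/z8/'));
```

With this variant a request for the index page yields an empty array, so a plain empty() check is enough to detect it.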

What if the request contains no path elements at all, so that $url_array holds nothing but a single empty string? You may have realized that this corresponds to the case of the $REQUEST_URI containing only a slash ("/"), which I mentioned above.

This is the case when the request is for the index file (www.mycars.com or www.mycars.com/). My solution is to just include the content for the mainpage, but you could also load an entry from a database.

Any other request is now ready to use. At this point your creativity comes into play – now you can use the URL elements to load your dynamic content. You could, for example, check your database for content that matches the request string; this is sketched in pseudo code on lines 21-26.

Suppose you have a string like /articles/january.htm. In this case, $url_array[0] contains “articles” and $url_array[1] contains “january.htm”. If you store your articles in a table "articles" that includes a column "month", your code could lead to a query like this:

$month=str_replace(".htm","",$url_array[1]);
//removes .htm from the url
//both values come straight from the URL: escape or whitelist them before running the query
$query="SELECT * FROM $url_array[0] WHERE
month='$month'";

You could also transform the $url_array and call a script, much as Bill Humphries suggests in his article. (You need to call the script via the include() function.)
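
Because the table name and the month both come from the URL, it may be safer to check the table against a fixed list and pass the month as a bound parameter. The sketch below only builds the query text and its parameters; ALLOWED_TABLES and build_article_query() are hypothetical names, and the list of tables is an assumption.

```php
<?php
// Assumption: only these tables may be reached through a URL.
const ALLOWED_TABLES = ['articles'];

// Hypothetical helper: returns [sql, params] for a prepared statement,
// or null when the URL does not name an allowed table.
function build_article_query(array $url_array): ?array
{
    if (count($url_array) < 2 || !in_array($url_array[0], ALLOWED_TABLES, true)) {
        return null;
    }
    // Strip a trailing ".htm" only (the dot is escaped, "$" anchors the end).
    $month = preg_replace('/\.htm$/', '', $url_array[1]);
    // The table name is safe to concatenate because it passed the whitelist;
    // the month is left as a named placeholder.
    $sql = 'SELECT * FROM ' . $url_array[0] . ' WHERE month = :month';
    return [$sql, [':month' => $month]];
}

print_r(build_article_query(['articles', 'january.htm']));
```

With PDO, the two returned values could then be passed to prepare() and execute() respectively, so the month never becomes part of the SQL text.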

Step 3, lines 30-32: nothing found.
The last step deals with the case that we neither found a matching static file in step one, nor did we find dynamic content matching the request – that means that we have to output an Error 404. In PHP this is done using the header() function. (You can see the syntax to output the 404 above.)

Beware of hackers
One part of this procedure creates a few vulnerabilities. In step one, when you check for an existing file, you actually access the file system of your server.

Usually, requests from the web should have very limited rights, but this depends on how carefully your server is set up. If someone entered ../../../ or something like /.a_dangerous_script, this could allow them to access directories outside your web-root or execute scripts on your server. It’s usually not that easy, but be sure to check some of those possible vulnerabilities.
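
One possible guard against the ../ case is to resolve the requested path first and only accept it if the result still lies inside the document root. This is a sketch under that assumption; resolve_static_file() is a hypothetical helper and not part of the script above.

```php
<?php
// Hypothetical helper: resolves a requested path against the document root
// and returns the real file path only if it is a regular file inside that root.
function resolve_static_file(string $document_root, string $request_path): ?string
{
    $root = realpath($document_root);
    $file = realpath($document_root . $request_path);
    if ($root === false || $file === false || !is_file($file)) {
        return null; // the root or the file does not exist, or it is a directory
    }
    // realpath() has already resolved any "../", so a prefix check is meaningful.
    $prefix = rtrim($root, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
    if (strncmp($file, $prefix, strlen($prefix)) !== 0) {
        return null; // the resolved path escaped the document root
    }
    return $file;
}
```

In step one, the script could then include() the returned path only when the function does not return null.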

It’s a good idea to strip HTML and JavaScript tags (and maybe SQL fragments) from the request string; HTML and JavaScript tags can easily be removed using strip_tags(). Another wise thing to do is limit the length of the request string, which you could do with this code:

if(strlen($REQUEST_URI)>100){
    header("HTTP/1.1 404 Not Found"); exit;
}

If somebody enters a request string of more than 100 characters, a 404 is returned and the script execution is stopped. You can just add these (and other security related functions) at the beginning of the script.
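
The length check can be combined with other simple tests in one place. The sketch below shows one possible policy; is_acceptable_uri() is a hypothetical helper, and the allowed character set is an assumption that you would adjust to the URLs your own site uses.

```php
<?php
// Assumed policy, adjust to your own URLs: at most 100 characters,
// only letters, digits, "/", ".", "_" and "-", and no ".." sequence.
function is_acceptable_uri(string $uri, int $max_length = 100): bool
{
    if (strlen($uri) > $max_length) {
        return false; // too long
    }
    if (strpos($uri, '..') !== false) {
        return false; // possible directory traversal
    }
    // Must start with "/" and contain only whitelisted characters.
    return preg_match('#^/[A-Za-z0-9/._-]*$#D', $uri) === 1;
}

var_dump(is_acceptable_uri('/cars/bmw/z8/'));
```

A request that fails this test could be answered with the same 404 header as above before any file or database access takes place.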

How to deal with password protected directories and cgi-bin
After I had implemented the whole thing, I realized that there was another problem. I have some password-protected directories, e.g. for my access statistics. When you want to include a file in one of these directories, it won’t work because the PHP module runs as a different user, which cannot access this directory.

To solve this problem, you need to add some lines to your .htaccess file, one for each protected directory (in this example the directory /stats/):

RewriteEngine on
RewriteRule ^stats/.*$ - [L]
RewriteRule !\.(gif|jpg|png|css)$ /your_web_root/index.php

The new rule on the second line excludes all access for /stats/ from our redirection rule. The "-" means that nothing is done with the request, and the [L] stops execution of the .htaccess if the rule at this particular line was applied. The original rule on the third line is applied to all other requests.

I recommend the same solution for your cgi-bin directory or other directories where scripts that take GET queries reside.

fr.: http://alistapart.com/articles/succeed
