Using HTTP Headers to Optimize CGI
by Jay Lorenzo
Let's take a close look, from an HTTP perspective, at how an HTML document is returned from a server to a browser. The Web server receives a browser's request via HTTP for a particular document. After ascertaining that the document is available, it generates an HTTP response. The initial data in the response is referred to as the HTTP header block, which provides information used to process the current request. In this hypothetical case, the response includes an HTTP status code of 200, indicating a successful request. Following that is a section that describes the content being delivered, in this case containing "Content-type: text/html", which identifies the MIME content type of the file being sent to the browser. The header may also contain a timestamp of when the transaction occurred and the identity of the server software being used. The header is a necessary component of this transaction, as it tells the Web browser how to interpret and display the data being sent to it. The end of the header block is indicated by a blank line, which in turn is followed by the contents of the requested file.
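To make the layout of a response concrete, here is a minimal sketch, written in Python rather than Perl, that assembles a status line, a header block, a blank line, and a document body. The function name build_response and the server name ExampleServer/1.0 are assumptions made for illustration; a real server produces this output itself.

```python
from email.utils import formatdate

def build_response(body, content_type="text/html"):
    """Assemble a status line, a header block, a blank line, and the body."""
    lines = [
        "HTTP/1.0 200 OK",                    # status code 200: successful request
        f"Date: {formatdate(usegmt=True)}",   # timestamp of the transaction
        "Server: ExampleServer/1.0",          # hypothetical server identity
        f"Content-type: {content_type}",      # MIME type of the document
    ]
    # A blank line ends the header block; the requested document follows it.
    return "\r\n".join(lines) + "\r\n\r\n" + body

if __name__ == "__main__":
    response = build_response("<html><body>Hello</body></html>")
    header_block, _, document = response.partition("\r\n\r\n")
    print(header_block)
    print("--- document follows ---")
    print(document)
```

Splitting the response at the first blank line, as the usage lines do, is essentially what a browser does before it decides how to display the document.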
When you write Common Gateway Interface (CGI) programs, you can override all or part of the HTTP header that is normally returned by the server. This allows you to interact directly with the Web browser and provide the information it needs to process a given document or document part. To better understand what information can be contained in HTTP headers, check out this article from the W3C, which provides a comprehensive listing of the headers currently defined in HTTP. Some browsers also implement proprietary headers, such as the Refresh: (commonly used for client pull) and Set-Cookie: client-side directives used by Netscape Navigator and Microsoft Internet Explorer v2.0 and above.
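As a rough illustration of a CGI program supplying its own headers, the sketch below (in Python rather than Perl) formats a partial header block followed by a blank line and a body. The helper name cgi_reply and the 30-second Refresh: value are assumptions chosen for the example, not part of any server's API.

```python
import sys

def cgi_reply(headers, body=""):
    """Format the partial header block a CGI program hands to the server."""
    header_lines = [f"{name}: {value}" for name, value in headers]
    # The blank line tells the server (and the browser) where the headers stop.
    return "\n".join(header_lines) + "\n\n" + body

if __name__ == "__main__":
    sys.stdout.write(cgi_reply(
        [("Content-type", "text/html"),
         ("Refresh", "30")],  # client pull: reload this document in 30 seconds
        "<html><body>Status page</body></html>\n",
    ))
```

The server still adds the status line and its own headers around this output, so the program only has to state what differs from the default.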
Location, Location, and Location
The Location: header is used as a means for redirecting a browser from its original request to an alternate document, and is responsible for a process known as "server redirection." The original intent of this header was to allow a server to offer an alternate URL for a requested document, presumably because the original document had been moved, was superseded by another document, or was no longer available. Once the browser receives the redirected location, it then makes a request for the new URL without any additional intervention by the browser user. The Location: header is usually sent with no other header information, as the balance of the reply will not be used by the browser.
Web developers eventually realized that the Location: header was a great mechanism to return a static document at the conclusion of an interactive session. It is always a good design practice to provide feedback as to the status of a particular interactive session. When a visitor to your site completes a transaction, your CGI program can use the Location: header to redirect the browser to a document that provides a standard message for a response. This method is useful when you believe the contents of the reply message will change from time to time. By using redirection, it is no longer necessary to rewrite your script to output a different message, since the response document itself can be changed. Take a look at redirect1.html to view a short example in Perl that describes this type of interaction.
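The pattern described above can be sketched in a few lines. This version is in Python rather than Perl, and the URL http://your.server.com/thanks.html and the function names are hypothetical stand-ins; it is not the code behind redirect1.html.

```python
import sys

# Hypothetical static document holding the standard response message.
THANKS_URL = "http://your.server.com/thanks.html"

def redirect(url):
    """Return the complete CGI output for a server redirect."""
    # Location: is sent on its own; the blank line ends the header block.
    return f"Location: {url}\n\n"

def finish_transaction(form_data):
    """Process the visitor's input, then hand off to the static reply page."""
    # ... store or e-mail form_data here (omitted in this sketch) ...
    return redirect(THANKS_URL)

if __name__ == "__main__":
    sys.stdout.write(finish_transaction({"name": "visitor"}))
```

Changing the wording of the reply then means editing thanks.html, not the script.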
Location Tips and Tricks
The Location: header also can be used in CGI programs as an aid in optimizing server performance. When serving browser-specific documents, it is a common strategy to use a Perl script or other CGI program as a placeholder for the root document for a Web site. When a particular Web browser requests the root document, the script captures and decodes the HTTP_USER_AGENT variable, and based on its value, calls a subroutine that opens and outputs a file containing HTML optimized for that specific browser, then closes the file. A more efficient twist on this strategy is to eliminate the file output subroutine and replace it with a Location: header to redirect the browser to a file that contains data optimized for that browser. This will usually result in improved memory efficiency on heavily loaded servers, and is particularly beneficial for servers that cache currently requested documents, as there is no extra file system overhead required to return the redirected document.
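Here is one way the redirecting root document might look, sketched in Python rather than Perl. The substring table, the file names, and the host your.server.com are assumptions made for illustration; a production script would need a more careful reading of the user agent string.

```python
import os
import sys

BASE = "http://your.server.com"  # hypothetical host

# Hypothetical mapping from a user agent substring to an optimized document.
# MSIE is checked first because Internet Explorer also reports "Mozilla".
BROWSER_PAGES = [
    ("MSIE", "/index_msie.html"),
    ("Mozilla", "/index_netscape.html"),
]
DEFAULT_PAGE = "/index_generic.html"

def pick_page(user_agent):
    """Return the document path that matches the reported browser."""
    for token, page in BROWSER_PAGES:
        if token in user_agent:
            return page
    return DEFAULT_PAGE

def redirect_for(environ):
    """Build the Location: reply; no file is opened or copied by the script."""
    agent = environ.get("HTTP_USER_AGENT", "")
    return f"Location: {BASE}{pick_page(agent)}\n\n"

if __name__ == "__main__":
    sys.stdout.write(redirect_for(os.environ))
```

The script finishes as soon as it has printed one header line, and the server delivers the chosen file through its normal static path.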
Click on redirect2.html and depending on your browser's identity, or MIME content capability, you will be delivered a particular document through server redirection. The source for the associated CGI program can be found at redirect2_perl.html.
Note that in this example we also evaluate the HTTP_ACCEPT environment variable in conjunction with the more commonly used HTTP_USER_AGENT. The HTTP_ACCEPT variable can tell us what graphic MIME types the browser is capable of using. Because the number and complexity of browsers appearing in the Web marketplace are increasing, it eventually will become inefficient to base your content delivery solely on what is reported by HTTP_USER_AGENT, as there will be a good number of browsers that have capabilities unknown to you. This will become particularly important as the Web evolves into a database-generated model. Documents that are assembled from various sources "on the fly" take a hit on your server/database engine no matter what browser is used. Using HTTP_ACCEPT can help you optimize your content delivery to match the capabilities of an unknown browser.
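A minimal sketch of an HTTP_ACCEPT check follows, in Python rather than Perl. The page names are hypothetical, and the parsing is deliberately simple: it honors only MIME types the browser lists explicitly and ignores wildcards such as */* and any quality parameters.

```python
import os

def accepted_types(accept_header):
    """Split an HTTP_ACCEPT value into a set of lowercase MIME types."""
    types = set()
    for item in accept_header.split(","):
        mime = item.split(";")[0].strip().lower()  # drop parameters like ;q=0.5
        if mime:
            types.add(mime)
    return types

def pick_by_accept(accept_header):
    """Choose a document variant from the graphic types the browser lists."""
    types = accepted_types(accept_header)
    if "image/jpeg" in types:
        return "/index_jpeg.html"   # hypothetical JPEG-based page
    if "image/gif" in types:
        return "/index_gif.html"    # hypothetical GIF-only page
    return "/index_text.html"       # hypothetical text-only fallback

if __name__ == "__main__":
    print(pick_by_accept(os.environ.get("HTTP_ACCEPT", "")))
```

Because the decision rests on declared capabilities, the same script keeps working when a browser it has never heard of requests the page.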
As a side note on this subject, be aware that it is possible to bypass server-side CGI programs for this purpose, if your focus is on providing specialized content only for Netscape-compatible browsers. This can be achieved by embedding a Refresh: directive within your HTML document. This directive causes the browser to reload a specific document within a given time frame. You can create an HTML document that uses a <META> tag that sets the Refresh: time to 0, and points the refreshed URL to a Netscape-specific document. The syntax normally used for this strategy is to embed
<META HTTP-EQUIV="Refresh" CONTENT="0; URL=http://your.server.com/netscape_optimized_file.html">. This technique requires no CGI intervention, but provides functionality equivalent to server redirection. Non-Netscape browsers ignore the META tag and display the current document, while Netscape-compatible browsers read the tag and redirect themselves to Netscape-optimized content.
Cute Header Tricks
Another useful pair of HTTP headers to be aware of is the Expires: and Pragma: directives, which control how a browser caches a given document. These headers will become more relevant as the Web shifts towards database-driven models that serve documents dynamically.
The Expires: header indicates the expiration date that is intended for a given HTML document. Once that date is reached, a compliant Web browser will expire the document from its internal cache, resulting in a new request for that document. The Expires: date is expressed in terms of GMT (Greenwich Mean Time). Here is a fragment of a Perl script that outputs an Expires: header:
print "Content-type: text/plain\n";
print "Expires: Friday, 21-Feb-97 12:00:00 GMT\n";
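An expiration date can also be computed when the script runs instead of being hard-coded. The sketch below is in Python rather than Perl, and the function name and the seven-day lifetime are assumptions chosen for the example. Note that Python's email.utils.format_datetime writes the date in the four-digit-year form (for example "Fri, 21 Feb 1997 12:00:00 GMT"), which is the form the HTTP/1.0 specification prefers.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def expires_header(now, days_valid):
    """Return an Expires: header line dated days_valid days after now (UTC)."""
    expiry = now + timedelta(days=days_valid)
    # usegmt=True requires a UTC-aware datetime and emits "GMT" as the zone.
    return "Expires: " + format_datetime(expiry, usegmt=True)

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("Content-type: text/plain")
    print(expires_header(now, 7))  # hypothetical one-week lifetime
    print()                        # blank line ends the header block
    print("This document expires in one week.")
```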
At WRQ, we incorporate Expires: headers into our Technical Note Library. Our technical document team reviews these documents for accuracy on an ongoing basis, and publishes a new set of documentation on a regular schedule. By including an expiration mechanism in documents with respect to caching, we ensure that our customers use the most recent version of our technical documentation.
For environments where much HTML is being generated "on the fly," it is useful to include a Pragma: no-cache header within the generated output. This header has the ability to disable client caching for the particular document you are currently using (test your browser's compliance with Pragma:, and see the code, at pragma.html).
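A generated page that asks not to be cached might be produced as in the sketch below. It is written in Python rather than Perl, the function name dynamic_page is an assumption, and it is not the code behind pragma.html.

```python
from datetime import datetime, timezone

def dynamic_page(generated_at):
    """Return CGI output for a page that should be fetched fresh each time."""
    body = (
        "<html><body>"
        f"Generated at {generated_at:%H:%M:%S} GMT"
        "</body></html>\n"
    )
    headers = [
        "Content-type: text/html",
        "Pragma: no-cache",  # ask the client not to reuse a cached copy
    ]
    return "\n".join(headers) + "\n\n" + body

if __name__ == "__main__":
    print(dynamic_page(datetime.now(timezone.utc)), end="")
```

Reloading the page in a compliant browser should show a new timestamp each time, which makes the timestamp a convenient compliance check.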
As you may be aware, many browsers have differing states of compliance with respect to the HTTP specification, just as there are differing behaviors with respect to a browser's interpretation of HTML. Be sure to test your CGI programs against a wide variety of clients, just as you would normally do with HTML documents.
Parsing Through Headers
It is important to note that the CGI examples we have looked at generate only partial HTTP responses, and are functional solely because the server still intercepts the final output of our CGI program and packages additional information around the data (as we have previously pointed out, this information typically includes an HTTP status code, timestamp, and server identity). This mechanism is referred to as parsed HTTP headers. It also precludes most uses of real-time server output through CGI, as the CGI script typically runs to completion before the output is sent to the browser. Many Web servers can be configured to allow the use of non-parsed headers, wherein the server will not intercept the output of the CGI program, and the program can output in real time. As you may imagine, it is very important to generate complete HTTP headers when using this technique (see nph.html).
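With non-parsed headers, the script itself has to write everything the server would otherwise add. The sketch below, in Python rather than Perl, emits a status line, Date:, Server:, and Content-type: headers, then flushes each chunk of output as it is produced. The function name nph_stream and the fallback server name are assumptions, and it is not the code behind nph.html; how a server is told to treat a script as non-parsed varies, so check your server's documentation.

```python
import io
import os
import sys
from email.utils import formatdate

def nph_stream(out, chunks, server_software="ExampleServer/1.0"):
    """Write a complete HTTP response to out, flushing after every chunk."""
    out.write("HTTP/1.0 200 OK\r\n")                    # status line
    out.write(f"Date: {formatdate(usegmt=True)}\r\n")   # timestamp
    out.write(f"Server: {server_software}\r\n")         # server identity
    out.write("Content-type: text/plain\r\n")
    out.write("\r\n")                                   # end of header block
    out.flush()
    for chunk in chunks:
        out.write(chunk)
        out.flush()  # push each piece to the browser as soon as it exists

if __name__ == "__main__":
    # SERVER_SOFTWARE is a standard CGI variable; the fallback is hypothetical.
    software = os.environ.get("SERVER_SOFTWARE", "ExampleServer/1.0")
    nph_stream(sys.stdout, ["step 1 done\n", "step 2 done\n"], software)

    # The same function can be exercised against an in-memory buffer.
    buffer = io.StringIO()
    nph_stream(buffer, ["test\n"])
```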
I would encourage anyone who wants to learn more about HTTP to browse the specs at the W3C to get a better understanding of what can be accomplished through the manipulation of HTTP streams. As usual, feel free to send me your comments via e-mail to firstname.lastname@example.org.