Web Log Analysis: Who's Doing What, When?
by Glenn Fleishman
Reprinted from Web Developer® magazine, Vol. 2 No. 2 May/June 1996 © 1996
Analyzing your Web site's traffic can be a fascinating study of how users traverse your pages, but it can also lead to information overkill. For example, every month my company logs about five or six times more bytes recording information about the visits to our clients' Web sites than the Web sites themselves actually contain! Ideally, you'd want to preserve this information in a form that combines analysis with rapid access.
A commercial product may not be ideal for many sites. The software can range in price from a few hundred to several thousand dollars for "local" analysis, where you run the software on one of your own machines feeding information to and from relational databases. You could also expect to pay up to thousands of dollars a month for services like NetCount or Internet Profiles (I/Pro) that require you to transmit the log files regularly--even hourly--to a remote site, where the analysis is conducted and reports generated.
Determining if a commercial product would benefit your company depends in large part on how highly customized an analysis you need. If your needs are fairly general, there are many simple shareware programs available for analyzing data locally. If you want to analyze unique visits in a comprehensive way, it makes sense to migrate to a commercial solution or invest in some significant in-house development.
This issue's column will show you how to do quick-and-dirty analyses using your own resources and the code samples provided. You'll also learn how to decipher datestamps for user analysis, and how to think algorithmically about generating site information. In the end, you may decide to opt for a commercial solution, but you'll do so in an informed manner.
There are several kinds of log formats, but here we'll exclusively address the Common Log Format (CLF), which can be used by most Web servers. Some servers, such as the Open Market series, Netscape servers, and the Microsoft Internet Information Server, log information in slightly different manners, but they function in basically the same way.
In addition to the fields shown in the CLF (see "Common Log Format"), there are several other useful pieces of information that your visitors' browsers may reveal, as I discussed in last issue's column. These can help you determine what kinds of users are visiting the site:
- RefererURL: contents of the REFERER_URL client variable. This returns the URL of the page the browser was displaying just before the user came to your site.
- Client type: contents of the HTTP_USER_AGENT client variable. This includes the browser's name, version, and user platform.
- Cookie: contents of the HTTP_COOKIE variable. Cookies are persistent tokens defining a unique user that browsers and servers can pass back and forth across sessions. Cookies work only with certain browsers; Netscape Navigator supports them fully, and other browsers do to a lesser extent.
Intersé Corp. is one commercial vendor that adds proprietary extensions to the CLF to include these three variables. The Intersé Extended Log Format uses the following names to refer to these three fields: "referer," "browser," and "cookie".
These three fields then appear in the Intersé Extended Log Format following the byte count, as in this example:
spaghetti.west.edu - - [30/Feb/1996:06:09:53 -0800] "GET /film/reviews/D/dangerous.minds.horton.html HTTP/1.0" 200 3828 "http://search.yahoo.com/bin/search?p=dangerous+minds" "Mozilla/1.22 (Windows; I; 32bit)" "18.104.22.168.8445454454"
At my company, we've modified the CERN 3.0 server to support this format, which is useful for other purposes as well.
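To make the field layout concrete, here is a minimal Python sketch that splits one such line into named fields. The regular expression, the function name parse_line, and the dictionary keys are my own illustrative choices, not part of any server or vendor tool; the sketch assumes the three extra fields are double-quoted and appear in the order referer, browser, cookie, and it also accepts plain CLF lines that lack them.

```python
import re

# Common Log Format fields, optionally followed by the three quoted
# extension fields (referer, browser, cookie). The group names are
# illustrative, not an official schema.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<logname>\S+) '
    r'\[(?P<datestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<browser>[^"]*)" "(?P<cookie>[^"]*)")?'
    r'\s*$'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # A "-" byte count means no body was sent; treat it as zero.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    fields["status"] = int(fields["status"])
    return fields


if __name__ == "__main__":
    sample = (
        'spaghetti.west.edu - - [28/Feb/1996:06:09:53 -0800] '
        '"GET /film/reviews/index.html HTTP/1.0" 200 3828 '
        '"http://search.yahoo.com/bin/search?p=dangerous+minds" '
        '"Mozilla/1.22 (Windows; I; 32bit)" "abc123"'
    )
    print(parse_line(sample))
```

For a plain CLF line, the referer, browser, and cookie keys come back as None, so later steps can test for their presence.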
What's in a Visit?
If you've ever tried to buy or sell advertising on the Web, you've run up against the question of what constitutes a unique impression on an ad, or, more generally, what constitutes a unique visit.
There can be a lot of confusion about these semantics; I propose the following terms based on common Net usage:
- Visitor: a unique individual who can be tracked by registration or cookie.
- Visit: a unique trip to a Web site, defined by a period of time during which a visitor browses the site.
So if you're asked "how many people really visit your Web site?" you can answer with a certain amount of confidence in terms of unique visits per day.
Sites that require or allow registration of users can use the LOGNAME variable in any log format to track unique visits in time (using datestamps) by known visitor (using the LOGNAME). If your site doesn't require visitor registration, there are two main options for tracking unique visits:
- By cookie. If the server assigns cookies, 90 percent or more of return visitors will have a cookie that can be logged (based on the popularity of the browsers that support cookies). Cookies can have expiration dates, so you can choose the period of time over which data about specific users is logged. This doesn't tell you anything about an individual--unless you use a database to make a correspondence between cookies and registration information--but it does tell you about the unique visits by an unknown unique visitor.
- By hostname or IP number. This is much less reliable, but for the majority of locations on the Net (especially online services) it will identify a unique simultaneous user. As noted in last issue's column, simultaneous users from America Online, Netcom, and other services will have unique hostnames or IP numbers. This method will only give you unique visits; there's no way for you to quantify unique visitors.
Whether you track unique visits by cookie or by hostname or IP number, you'll need to analyze timestamps.
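As a rough illustration of timestamp analysis, the following Python sketch turns a CLF datestamp such as "28/Feb/1996:06:09:53 -0800" into a timezone-aware value and checks whether two requests fall close enough together to count as one visit. The function names and the 30-minute default are assumptions made for the example; month abbreviations are parsed according to the current locale, so the sketch assumes an English or C locale.

```python
from datetime import datetime, timedelta

# Layout of the bracketed CLF datestamp: day/month/year:hour:min:sec zone
CLF_DATE_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_datestamp(stamp):
    """Convert a CLF datestamp string into a timezone-aware datetime."""
    return datetime.strptime(stamp, CLF_DATE_FORMAT)


def same_visit(previous, current, timeout=timedelta(minutes=30)):
    """True if `current` follows `previous` by no more than `timeout`."""
    return current - previous <= timeout


if __name__ == "__main__":
    first = parse_datestamp("28/Feb/1996:06:09:53 -0800")
    second = parse_datestamp("28/Feb/1996:06:29:53 -0800")
    print(second - first, same_visit(first, second))
```

Because the parsed values carry their zone offset, requests logged under different offsets still compare correctly.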
The algorithm for measuring unique visits and visitors is pretty straightforward; programming it is a damn sight more complex. At one point I attempted to build my own simple user analysis program; the flaw in the program is that it really requires some kind of DBMS.
The logic is sound, though. You want to break down the top-level information: log users, then log visits. The algorithm is essentially:
Analyze line of logfile
    If there's a LOGNAME, find record associated with it
    Otherwise if there's a cookie, find that record
    Otherwise if there's a hostname, find that record
    Otherwise use the IP number as the record index
Analyze datestamp
    Is the most recent request within 30 minutes?
        Yes: add duration and other info to record
        No: log new visit for this user
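The outline above might be sketched in Python roughly as follows. This is not the program I wrote; it is a minimal illustration that keeps every record in an in-memory dictionary, which stands in for the DBMS a real tool would need. The entry keys ("logname", "cookie", "host", "ip", "time") and the function names are hypothetical, and entries are assumed to arrive already parsed and in time order.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)


def visitor_key(entry):
    """Pick the most reliable identifier present: LOGNAME first, then
    cookie, then hostname, then IP number."""
    for field in ("logname", "cookie", "host", "ip"):
        value = entry.get(field)
        if value and value != "-":
            return (field, value)
    return ("unknown", "")


def group_visits(entries):
    """Map each visitor key to a list of visits.

    Each visit is a dict with "start", "last", and "requests" keys.
    Entries must be sorted by time, as they are in a log file.
    """
    visits = {}
    for entry in entries:
        when = entry["time"]
        history = visits.setdefault(visitor_key(entry), [])
        if history and when - history[-1]["last"] <= VISIT_TIMEOUT:
            # Most recent request was within 30 minutes: same visit.
            history[-1]["last"] = when
            history[-1]["requests"] += 1
        else:
            # Otherwise log a new visit for this user.
            history.append({"start": when, "last": when, "requests": 1})
    return visits


if __name__ == "__main__":
    t0 = datetime(1996, 2, 28, 6, 0, 0)
    log = [
        {"logname": "-", "cookie": "abc", "host": "a.example.com", "time": t0},
        {"logname": "-", "cookie": "abc", "host": "a.example.com",
         "time": t0 + timedelta(minutes=50)},
    ]
    for key, history in group_visits(log).items():
        print(key, len(history), "visit(s)")
```

The duration of a visit is simply last minus start. Holding everything in memory is fine for a day's log on a small site, but it is exactly the part that a database would replace at scale.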
Depending on how much detail you want, you can log successes and failures by HTTP status code (with 200 as a success, and other codes indicating other results); total bytes transferred; the duration of the visit; the pages visited; the path taken through those pages; and so on.
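As one possible way to collect that detail, the Python sketch below summarizes a single visit from its list of requests. The request keys ("time", "path", "status", "bytes") and the function name are hypothetical, "time" is assumed to be a number of seconds, and counting codes of 400 and above as errors is my own simplifying assumption rather than a rule from any log format.

```python
from collections import Counter


def summarize_visit(requests):
    """Summarize one visit from its requests, given in time order."""
    if not requests:
        raise ValueError("a visit needs at least one request")
    statuses = Counter(r["status"] for r in requests)
    return {
        # Requests per status code, e.g. {200: 2, 404: 1}.
        "by_status": dict(statuses),
        "successes": statuses[200],
        # Assumption: 4xx and 5xx codes are treated as errors.
        "errors": sum(n for code, n in statuses.items() if code >= 400),
        "total_bytes": sum(r["bytes"] for r in requests),
        # Seconds between the first and last request of the visit.
        "duration": requests[-1]["time"] - requests[0]["time"],
        # Pages in the order they were requested.
        "path": [r["path"] for r in requests],
    }


if __name__ == "__main__":
    visit = [
        {"time": 0, "path": "/", "status": 200, "bytes": 1000},
        {"time": 90, "path": "/film/", "status": 200, "bytes": 3828},
        {"time": 150, "path": "/missing.html", "status": 404, "bytes": 0},
    ]
    print(summarize_visit(visit))
```

Note that the duration measured this way ends at the last request; the log cannot tell you how long the visitor spent reading the final page.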