Web Log Analysis: Who's Doing What, When?
There are plenty of other kinds of analysis you can do with a Web log. For example, you might be interested in the number of bytes transferred per directory--especially if you're running multiple servers that write to the same log file--and in the frequency of retrievals by browser and referer. I have included two Perl scripts here, called Bytecount and Quickdirty, that can perform these tasks.
The Bytecount script does something very simple. It takes a Web log file name, gunzips the file if it's gzipped, then analyzes each request by its path. It grabs the first bounded directory--i.e., whatever it finds in /blah/--and uses that as the key to an associative array where the transferred bytes are accumulated. When the file's done, it prints a short summary of usage by top-level directory. The default cutoff point is 100 kilobytes.
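The steps described for Bytecount can be sketched roughly as follows. This is a minimal Python illustration, not the Perl script itself. It assumes the log is in Common Log Format, and the names open_log, top_directory, bytes_by_directory, and summarize are hypothetical helpers invented for this sketch.

```python
import gzip
import re
from collections import defaultdict

# Assumption: Common Log Format, i.e.
#   host ident user [date] "METHOD /path PROTOCOL" status bytes
LINE_RE = re.compile(r'"\S+ (\S+)[^"]*" \d{3} (\d+|-)')

CUTOFF_BYTES = 100 * 1024  # the 100-kilobyte default mentioned in the text


def open_log(filename):
    """Open a log as text, decompressing on the fly if the name ends in .gz."""
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt", encoding="latin-1")
    return open(filename, "r", encoding="latin-1")


def top_directory(path):
    """Return the first bounded directory, e.g. '/blah/' for '/blah/x.html'."""
    parts = path.split("/")
    # '/blah/x.html' splits into ['', 'blah', 'x.html']; a directory only
    # counts if a second slash closes it, otherwise the file sits at the root.
    if len(parts) >= 3 and parts[1]:
        return "/" + parts[1] + "/"
    return "/"


def bytes_by_directory(lines):
    """Accumulate transferred bytes per top-level directory."""
    totals = defaultdict(int)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not look like log entries
        path, size = match.groups()
        if size != "-":  # '-' means no body was sent
            totals[top_directory(path)] += int(size)
    return dict(totals)


def summarize(totals, cutoff=CUTOFF_BYTES):
    """Return (directory, bytes) pairs at or above the cutoff, largest first."""
    rows = [(d, n) for d, n in totals.items() if n >= cutoff]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

To run it against a real file, you would pass the open file to bytes_by_directory, for example `with open_log(name) as log: print(summarize(bytes_by_directory(log)))`.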
Quickdirty does a bit more. It's been useful for a few of my company's clients who receive daily summaries of where people come from and what they're using. This script helps them make decisions about tailoring content to users and browsers.
Quickdirty lives up to its name by avoiding any deep analysis. The script just summarizes, by request, the number of times a given browser or referer URL shows up, and spits this data out. We use a crontab to generate these reports in the middle of the night, which is why the script calls for dumping to STDOUT if a command-line argument is supplied to the program.
The two variables at the top of the file, named clientthres and refthres, provide a minimum cutoff point for the browser and referer summaries. On a given day or week, you might have thousands of unique referers and several hundred browsers, but in general, you and your clients will probably care only about the most frequently accessed referring links and the most commonly used browsers. These variables let you set the number of top responses you're interested in--say, the "top 10" browsers or the "top 100" referring links.
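The kind of tally Quickdirty produces can be sketched like this. It is a Python illustration under stated assumptions rather than the script itself: it assumes each log line ends with a quoted referer and a quoted user-agent string after the status and byte count, and it treats clientthres and refthres as "top N" limits, which is one reading of the description above. The function names tally and top_rows are hypothetical.

```python
import re
from collections import Counter

# Assumption: each line ends with  ... status bytes "referer" "user-agent"
TAIL_RE = re.compile(r'" \d{3} (?:\d+|-) "([^"]*)" "([^"]*)"\s*$')

clientthres = 10   # report the "top 10" browsers
refthres = 100     # report the "top 100" referring links


def tally(lines):
    """Count how often each browser and each referer URL appears."""
    browsers, referers = Counter(), Counter()
    for line in lines:
        match = TAIL_RE.search(line)
        if not match:
            continue  # line carries no referer/agent fields
        referer, agent = match.groups()
        if agent and agent != "-":
            browsers[agent] += 1
        if referer and referer != "-":
            referers[referer] += 1
    return browsers, referers


def top_rows(counter, limit):
    """Return the `limit` most frequent (value, count) pairs, largest first."""
    return counter.most_common(limit)
```

A nightly report would then print `top_rows(browsers, clientthres)` and `top_rows(referers, refthres)` in whatever layout suits the client.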
To give you an idea of what the commercial programs are capable of, I've included a couple of tables generated by Intersé Market Focus. The system contains information about different kinds of browsers, as well as a lookup database that maps domains to their registering organizations. Table 1 shows the breakdown of visits by browser. Table 2 is a list of top referer organizations--that is, the organizations that sent the most users our way. You can also graph daily visits to the site; this could be presented in table form, too, but such information is better conveyed to the reader through a graph.
Table 1: Visits by browser

     Browser product        No. of visits   % of visits
  1. Netscape Navigator           146,876         62.12
  2. Unknown browser               65,092         27.53
  3. CompuServe Mosaic              6,737          2.85
  4. America Online                 5,953          2.52
  5. Lynx                           5,165          2.18
  6. Internet Explorer              3,234          1.37
  7. NCSA Mosaic                    1,207          0.51
  8. IBM WebExplorer                  876          0.37
  9. Prodigy                          722          0.31
 10. Netcom Netcruiser                559          0.24
     Totals:                      236,421        100.00

Table 2: Visits by referer organization

     Referer organization               No. of visits   % of visits
  1. Yahoo                                     45,895         19.41
  2. Infoseek                                  43,513         18.40
  3. Carnegie-Mellon University                 6,273          2.65
  4. Pittsburgh Supercomputer Center            4,018          1.70
  5. Mississippi State University               3,621          1.53
  6. Wake Forest University                     2,665          1.13
  7. Webcrawler Search Engine (AOL)             2,104          0.89
  8. OpenText Corp.                             1,298          0.55
  9. PGH.PA.US                                  1,247          0.53
 10. CF.AC.UK                                   1,087          0.46
     Totals:                                  116,121         49.12
You may draw some inspiration from Home Improvement's enterprising Tim "Toolman" Taylor: If your Web server doesn't log referers and clients, your first impulse may be to rewire it.
I didn't do the actual rewiring of my company's CERN server myself; for that I have our talented contract programmer Raj Vaswani to thank. He rewired the CERN http daemon to provide a simple solution for logging all Web information in one file.
With this server, you can actually use a logging component to record any client variable in a separate file using the following directives. The variables' names must end with an equal sign.
    EnvLog /usr/local/cern_httpd/env.cstoll
    EnvLogVar SCRIPT_NAME=
    EnvLogVar HTTP_USER_AGENT=
    EnvLogVar REFERER_URL=
You can use the directive
to turn the regular log file into an "extended log format" file per the above discussion.
Beating the Underbrush
These bits and pieces are certainly not the definitive way to suck every bit of information out of the mass that is a Web log. Be sure to also check out, for instance:
This information should give you the impetus to get started on Web log analysis, along with a better appreciation of how to customize the site for your user population, provide tracking information to customers, and--for some of us--better target and sell advertising.