Web Log Analysis: Who's Doing What, When? Part 2

by Glenn Fleishman

Other Analysis

There are plenty of other kinds of analysis you can do with a Web log. For example, you might be interested in the number of bytes transferred per directory--especially if you're logging multiple servers to the same file--and the frequency of retrievals by browser and referer. I have included two Perl scripts here, called Bytecount and Quickdirty, that perform these tasks.

The Bytecount script accomplishes some very simple actions. It reads in a Web log file name, gunzips the file if it's gzip-compressed, and then analyzes each request by its path. It grabs the first bounded directory--i.e., whatever it finds in /blah/--and uses that as the key of an associative array in which the transferred bytes are accumulated. When it reaches the end of the file, it prints a short summary of usage by top-level directory; the default cutoff point is 100 kilobytes.
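
To make that concrete, here is a minimal sketch in Perl of the same approach. This is not the Bytecount script itself; the Common Log Format field positions, the gunzip handling, and the 100-kilobyte cutoff shown are assumptions on my part.

#!/usr/bin/perl
# bytecount-sketch.pl -- a minimal sketch of the approach described above.
# Field positions assume the Common Log Format; the gunzip handling and
# the 100 KB cutoff are illustrative assumptions, not the original code.

use strict;
use warnings;

my $file = shift or die "usage: $0 access_log[.gz]\n";

# Gunzip on the fly if the file name says it's compressed.
my $fh;
if ($file =~ /\.gz$/) {
    open $fh, '-|', 'gunzip', '-c', $file or die "can't gunzip $file: $!";
} else {
    open $fh, '<', $file or die "can't open $file: $!";
}

my %bytes;    # associative array: first bounded directory => bytes sent

while (<$fh>) {
    # CLF: host ident user [date] "METHOD /path HTTP/x.x" status bytes
    next unless m{"(?:GET|POST|HEAD) (\S+)[^"]*" \d{3} (\d+)};
    my ($path, $size) = ($1, $2);
    # The first bounded directory, e.g. /blah/ from /blah/page.html;
    # requests at the document root are lumped together under "/".
    my $dir = ($path =~ m{^(/[^/]+/)}) ? $1 : '/';
    $bytes{$dir} += $size;
}
close $fh;

# Report directories that moved at least the cutoff amount, biggest first.
my $cutoff = 100 * 1024;    # 100 kilobytes
for my $dir (sort { $bytes{$b} <=> $bytes{$a} } keys %bytes) {
    next if $bytes{$dir} < $cutoff;
    printf "%-30s %12d bytes\n", $dir, $bytes{$dir};
}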

Quickdirty does a bit more. It's been useful for a few of my company's clients who receive daily summaries of where people come from and what they're using. This script helps them make decisions about tailoring content to users and browsers.

Quickdirty lives up to its name by avoiding any deep analysis. The script just summarizes, by request, the number of times a given browser or referer URL shows up, and spits this data out. We use a crontab to generate these reports in the middle of the night, which is why the script calls for dumping to STDOUT if a command-line argument is supplied to the program.

The two variables at the top of the file, named clientthres and refthres, provide a minimum cutoff point for the browser and referer summaries. On a given day or week, you might have thousands of unique referers and several hundred browsers, but in general, you and your clients will probably care only about the most frequently accessed referring links and the most commonly used browsers. These variables let you set the number of top responses you're interested in--say, the "top 10" browsers or the "top 100" referring links.
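
A rough sketch of how such a script might be put together follows. Again, this is not Quickdirty itself: the combined-format parsing and the output layout are assumptions, and only the clientthres/refthres idea comes from the description above.

#!/usr/bin/perl
# quickdirty-sketch.pl -- a rough sketch of the counting approach described
# above. It assumes the combined log format, in which each line ends with
# quoted "referer" and "user-agent" fields; the variable names clientthres
# and refthres echo the article, but the rest is an illustrative assumption.

use strict;
use warnings;

my $clientthres = 10;     # show the "top 10" browsers
my $refthres    = 100;    # show the "top 100" referring links

my (%client, %referer);

while (<>) {
    # Grab the last two quoted fields: "referer" "user-agent"
    next unless /"([^"]*)"\s+"([^"]*)"\s*$/;
    $referer{$1}++;
    $client{$2}++;
}

# Dump both summaries to STDOUT; run from cron, the output can be
# redirected or mailed wherever the client wants it.
report('Browsers', \%client,  $clientthres);
report('Referers', \%referer, $refthres);

sub report {
    my ($title, $counts, $limit) = @_;
    print "$title\n", '-' x length($title), "\n";
    my @top = sort { $counts->{$b} <=> $counts->{$a} } keys %$counts;
    splice @top, $limit if @top > $limit;
    printf "%8d  %s\n", $counts->{$_}, $_ for @top;
    print "\n";
}

Run nightly from cron--for example, with the output piped to mail--a script like this hands a client a fresh summary every morning without anyone having to remember to run it.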

Commercial Analysis

To give you an idea of what the commercial programs are capable of, I've included a couple of charts generated by Intersé Market Focus. The system contains information about different kinds of browsers, as well as a lookup database that maps domains to their registering organizations. Table 1 shows the breakdown of visits by browser. Table 2 lists the top referer organizations--that is, the organizations that sent the most users our way. You can also graph daily visits to the site; this could be presented in table form, too, but such information is better conveyed to the reader through a graph.

Table 1:

     Browser product                 No. of visits   % of visits
  1. Netscape Navigator                   146,876         62.12
  2. Unknown browser                       65,092         27.53
  3. CompuServe Mosaic                      6,737          2.85
  4. America Online                         5,953          2.52
  5. Lynx                                   5,165          2.18
  6. Internet Explorer                      3,234          1.37
  7. NCSA Mosaic                            1,207          0.51
  8. IBM WebExplorer                          876          0.37
  9. Prodigy                                  722          0.31
 10. Netcom Netcruiser                        559          0.24
     Totals:                              236,421        100.00

Table 2:

     Referer organization                No. of visits   % of visits
  1. Yahoo                                    45,895         19.41
  2. Infoseek                                 43,513         18.40
  3. Carnegie-Mellon University                6,273          2.65
  4. Pittsburgh Supercomputer Center           4,018          1.70
  5. Mississippi State University              3,621          1.53
  6. Wake Forest University                    2,665          1.13
  7. Webcrawler Search Engine (AOL)            2,104          0.89
  8. OpenText Corp.                            1,298          0.55
  9. PGH.PA.US                                 1,247          0.53
 10. CF.AC.UK                                  1,087          0.46
     Totals:                                 116,121         49.12

CERN Rewiring

You may draw some inspiration from Home Improvement's enterprising Tim "Toolman" Taylor: If your Web server doesn't log referers and clients, your first impulse may be to rewire it.

I didn't do the actual rewiring of my company's CERN server myself; for that I have our talented contract programmer Raj Vaswani to thank. He rewired the CERN http daemon to provide a simple solution for logging all Web information in one file.

With the rewired server, you can use a logging component to record any client variable in a separate file via the following directives. Note that each variable's name must end with an equal sign.

 EnvLog /usr/local/cern_httpd/env.cstoll
 EnvLogVar SCRIPT_NAME=
 EnvLogVar HTTP_USER_AGENT=
 EnvLogVar REFERER_URL=

You can use the directive

 LogFormat Extended 

to turn the regular log file into an "extended log format" file per the above discussion.
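
To show what the extra fields buy you, here is a small sketch of pulling one extended-format line apart. The layout assumed here--the common log fields followed by quoted referer and user-agent--is the usual convention rather than a guarantee of exactly what the CERN server writes, and the sample line is invented for illustration.

#!/usr/bin/perl
# Split one "extended"-format line into its fields (layout assumed).
use strict;
use warnings;

my $line = '192.0.2.1 - - [10/Oct/1996:13:55:36 -0700] '
         . '"GET /blah/index.html HTTP/1.0" 200 2326 '
         . '"http://www.yahoo.com/" "Mozilla/3.0 (X11; I; SunOS)"';

if ($line =~ m{^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"}) {
    my ($host, $when, $request, $status, $bytes, $referer, $agent) =
        ($1, $2, $3, $4, $5, $6, $7);
    print "$host requested '$request', referred by $referer, using $agent\n";
}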

Beating the Underbrush

These bits and pieces are certainly not the definitive way to suck every bit of information out of the mass that is a Web log. Be sure to also check out, for instance:

Internet Profiles
Intersé
NetCount

This information should give you the impetus to get started on Web log analysis and a better appreciation of how to customize your site for your user population, provide tracking information to customers, and--for some of us--better target and sell advertising.

[ < Web Log Analysis: Who's Doing What, When? Part 1 ]
[ Web Log Analysis: Who's Doing What, When? Part 3 > ]



