Web Log Analysis: Who's Doing What, When? Part 3

BYTECOUNT

    #!/usr/local/bin/perl
    # Totals the bytes served per top-level directory and appends a
    # report to the file "bytereport".

    $root = "/usr/logs";

    # Take the log file name from the command line, or prompt for it.
    if ($ARGV[0]) {
        $x = $ARGV[0];
    } else {
        print "File name? ";
        $x = <STDIN>;
        chomp $x;
    }
    while (!-e "$root/$x") {
        print "Bad file name\nFile name? ";
        $x = <STDIN>;
        chomp $x;
    }

    # Read compressed logs through gunzip.
    if ($x =~ /\.gz$/i) {
        open (IN, "gunzip -c $root/$x |") || die "Can't run gunzip on $root/$x: $!\n";
    } else {
        open (IN, "< $root/$x") || die "Can't open $root/$x: $!\n";
    }

    # $2 is the first component of the path, $4 is the byte count.
    # Lines that do not match are skipped.
    while (<IN>) {
        if (/\"(GET|POST|HEAD) \/([^\/\s\-A-Z]+)\/[^\"]*\" ([0-9]*) ([0-9]*)/) {
            $clients{$2} += $4;
        }
    }
    close IN;

    open (OUT, ">> bytereport") || die "Can't open bytereport: $!\n";
    select (OUT);
    print scalar(localtime), "\n\n";

    # Report only the directories that served more than 100,000 bytes.
    foreach (sort keys %clients) {
        if ($clients{$_} > 100000) {
            printf "%-15.15s : %15.15s\n", $_, $clients{$_};
        }
    }
    select (STDOUT);
    close OUT;

QUICKDIRTY

    #!/usr/local/bin/perl
    # Counts external referers and browsers in an extended log file.

    $root = "/usr/logs";
    $clientthres = 1000;    # minimum hits for a browser to be listed
    $refthres = 10;         # minimum hits for a referer to be listed

    # Take the log file name from the command line, or prompt for it.
    if ($ARGV[0]) {
        $x = $ARGV[0];
    } else {
        print "File name? ";
        $x = <STDIN>;
        chomp $x;
    }
    while (!-e "$root/$x") {
        print "Bad file name\nFile name? ";
        $x = <STDIN>;
        chomp $x;
    }

    # Derive the site's own host name from the log file name so that
    # internal referers can be ignored. "niente" is a placeholder for
    # file names that do not follow the httpd-log.NAME. pattern.
    if ($x =~ /httpd\-log\.([^\.]*)\./ && $1) {
        $nomain = "www.${1}";
    } else {
        $nomain = "niente";
    }

    open (ENV, "< $root/$x") || die "Can't open file\n";
    while (<ENV>) {
        chomp;
        # The referer and the browser are the first two of the three
        # quoted fields at the end of each line.
        if (/\"([^\"]*)\" \"([^\"]*)\" \"[^\"]*\"$/) {
            ($ref, $cli) = ($1, $2);
        } else {
            ($ref, $cli) = ("", "");
        }
        if ($ref ne "-" && $ref =~ /http/i && $ref !~ /\Q$nomain\E/i) {
            $url{$ref}++;
        }
        if ($cli) {
            $browser{$cli}++;
        }
    }
    close ENV;

    # Without a command-line argument, write the report to env.temp.
    if (!$ARGV[0]) {
        open (OUT, "> env.temp") || die "Can't open env.temp: $!\n";
        select (OUT);
    }

    print "\nREFERERS:\n";
    # Group the referers by hit count, then drop the counts that fall
    # below the threshold.
    foreach (sort keys %url) {
        $urlnum{$url{$_}} .= "$_\n";
    }
    for $i (0..($refthres - 1)) {
        delete $urlnum{$i};
    }
    foreach $num (sort numerically keys %urlnum) {
        foreach $val (split(/\n/, $urlnum{$num})) {
            printf ("%-70.70s : %5.5s\n", $val, $num);
        }
    }

    print "\n\nBROWSERS:\n";
    foreach (sort keys %browser) {
        if ($browser{$_} > $clientthres) {
            printf ("%-60.60s : %10.10s\n", $_, $browser{$_});
        }
    }
    select (STDOUT);

    sub numerically { $a <=> $b; }

Common Log Format

Each entry in the common log format has the following layout:

    host/ip rfcname logname [DD/MMM/YYYY:HH:MM:SS -0000] "METHOD /PATH HTTP/1.0" code bytes

| Field | Description |
| --- | --- |
| host/ip | If DNS lookup is enabled and reverse DNS works, the hostname of the client is logged; otherwise the IP address appears. |
| rfcname | If you enable identd (see Web Developer® Spring 1996 issue, p. 23), you can retrieve a name for the user from the remote server. If no value is present, a "-" is substituted. |
| logname | If you're using local authentication and registration, the user's login name appears. Likewise, if no value is present, a "-" is substituted. |
| datestamp | The format is day, month (three-letter abbreviation), year, hour on the 24-hour clock, minute, second, and the offset from Greenwich Mean Time (for example, Pacific Standard Time is -0800). |
| retrieval | The method is GET, PUT, POST, or HEAD; the path is the path and file retrieved; HTTP/1.0 identifies the protocol. |
| code | HTTP completion code. 200 is successful, 304 means the file has not been modified and the client uses its cached copy, 404 is file not found, and so forth. |
| bytes | Number of bytes in the file retrieved. |

Here's an example:

    sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] "GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278
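The seven fields of the common log format can be picked apart with a single regular expression. The following is a minimal sketch, not part of the original listings: the helper name `parse_clf` and the hash keys are assumptions chosen for illustration, and the pattern assumes well-formed lines like the example entry.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the article): split one common log
# format line into its seven fields. Returns an empty list when the
# line does not match.
sub parse_clf {
    my ($line) = @_;
    my @f = $line =~ m{
        ^(\S+)\s+           # host name or IP address
        (\S+)\s+            # rfcname from identd, or "-"
        (\S+)\s+            # logname of an authenticated user, or "-"
        \[([^\]]+)\]\s+     # datestamp, without the brackets
        "([^"]*)"\s+        # retrieval: METHOD /PATH PROTOCOL
        (\d{3})\s+          # HTTP completion code
        (\d+|-)             # bytes, or "-" when nothing was sent
    }x or return;
    my %rec;
    @rec{qw(host rfcname logname date request code bytes)} = @f;
    return %rec;
}

# The example entry from the article.
my $line = 'sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] '
         . '"GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278';

my %rec = parse_clf($line);
my (undef, $path) = split ' ', $rec{request};
print "$rec{host} fetched $path ($rec{bytes} bytes, status $rec{code})\n";
```

Returning a hash keeps later report code readable (`$rec{bytes}` rather than `$7`), and a line that fails to match yields an empty hash instead of stale captures from an earlier match.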