Web Log Analysis: Who's Doing What, When?
Part 3
BYTECOUNT
#!/usr/local/bin/perl
require 'ctime.pl';
$root = "/usr/logs/";
if ($ARGV[0]) { $x = $ARGV[0]; }
else { print "File name? "; $x = <STDIN>; chop $x; }
while (!-e "$root/$x") {
print "Bad file name\nFile name? ";
$x = <STDIN>;
chop $x;
}
if ($x =~ /\.gz$/i) {
open (IN, "gunzip -c $root/$x |") || die "Can't run gunzip on $root/$x\n";
} else {
open (IN, "< $root/$x") || die "Can't open $root/$x\n";
}
while (<IN>) {
# Add the bytes sent to the total for the first directory in the path.
if (/\"(GET|POST|HEAD) \/([^\/\s\-A-Z]+)\/[^\"]*\" ([0-9]*) ([0-9]*)/) {
$clients{$2} += $4;
}
$i++;
}
close IN;
open (OUT, ">> bytereport") || die "Can't open bytereport\n";
select (OUT);
print &ctime(time) . "\n";
foreach (sort keys %clients) {
if ($clients{$_} > 100000)
{
printf "%-15.15s : %15.15s\n", $_, $clients{$_};
}
}
select (STDOUT);
close OUT;
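To see what the pattern in BYTECOUNT captures, here is a minimal sketch that applies it to a single entry, using the sample line from the Common Log Format sidebar. The helper name `top_dir_bytes` is made up for this illustration and is not part of the listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in BYTECOUNT): apply the listing's pattern to one
# log line and return (first path component, status code, byte count),
# or an empty list when the line does not match.
sub top_dir_bytes {
    my ($line) = @_;
    if ($line =~ m{"(GET|POST|HEAD) /([^/\s\-A-Z]+)/[^"]*" ([0-9]*) ([0-9]*)}) {
        return ($2, $3, $4);
    }
    return;
}

my $sample = 'sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] '
           . '"GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278';

my ($dir, $code, $bytes) = top_dir_bytes($sample);
print "$dir $code $bytes\n";    # prints: film 200 278
```

The pattern requires a second slash after the first path component, so a request for a top-level file such as /index.html does not match and adds nothing to any total.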
QUICKDIRTY
#!/usr/local/bin/perl
$root = "/usr/logs/";
$clientthres = 1000;
$refthres = 10;
if ($ARGV[0]) { $x = $ARGV[0]; }
else { print "File name? "; $x = <STDIN>; chop $x; }
while (!-e "$root/$x") {
print "Bad file name\nFile name? ";
$x = <STDIN>;
chop $x;
}
if ($x =~ /httpd\-log\.([^\.]+)\./) { $nomain = "www.${1}"; }
else { $nomain = "niente"; }
open (ENV, "< $root/$x") || die "Can't open file\n";
while (<ENV>) {
/()()/;
chop;
/\"([^\"]*)\" \"([^\"]*)\" \"[^\"]*\"$/;
$ref = $1;
$cli = $2;
/()()/;
/\"(GET|POST|HEAD) \/([^\/]*)\//;
$head = $2;
if ($ref ne "-" && $ref =~ /http/i &&
$ref !~ /\Q$nomain\E/i)
{ $url{$ref}++; }
if ($cli) { $browser{$cli}++; }
}
if (!$ARGV[0]) {
open (OUT, "> env.temp");
select (OUT);
}
print "\nREFERERS:\n";
foreach (sort keys %url)
{ $urlnum{$url{$_}} .= "$_\n"; }
for $i (0..($refthres - 1)) { $urlnum{$i} = ""; }
foreach $num (sort numerically keys %urlnum) {
foreach $val (split('\n', $urlnum{$num})) {
printf ("%-70.70s : %5.5s\n", $val, $num);
}
}
print "\n\nBROWSERS:\n";
foreach (sort keys %browser) {
if ($browser{$_} > $clientthres)
{
printf ("%-60.60s : %10.10s\n", $_, $browser{$_});
}
}
select (STDOUT);
sub numerically { $a <=> $b; }
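QUICKDIRTY expects every entry to end with three quoted fields; it reads the first as the referrer and the second as the browser. The sketch below runs that pattern on one invented entry. The helper names `referer_and_browser` and `is_external`, the host names, the agent string, and the contents of the third quoted field are all assumptions for illustration; the listing does not say what the third field holds.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: extract (referrer, browser) from one entry using the
# same trailing-fields pattern as QUICKDIRTY. Returns an empty list when the
# entry does not end in three quoted fields.
sub referer_and_browser {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /"([^"]*)" "([^"]*)" "[^"]*"$/;
    return ($1, $2);
}

# Hypothetical helper: the listing's referrer test written with logical &&.
# Returns 1 when the referrer is a URL that does not contain our own name.
sub is_external {
    my ($ref, $own) = @_;
    return ($ref ne '-' && $ref =~ /http/i && $ref !~ /\Q$own\E/i) ? 1 : 0;
}

# Invented sample entry (placeholder values throughout).
my $entry = 'sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] '
          . '"GET /film/index.html HTTP/1.0" 200 278 '
          . '"http://search.example.com/find?q=film" "Mozilla/2.0 (X11)" "-"';

my ($ref, $cli) = referer_and_browser($entry);
print "referrer: $ref\n";
print "browser:  $cli\n";
print "external: ", is_external($ref, 'www.foobar'), "\n";
```

Quoting the site name with `\Q...\E` keeps the dots in it from matching arbitrary characters.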
Common Log Format
Each entry in the common log format is a single line laid out as follows (wrapped here to fit):
host/ip rfcname logname [DD/MMM/YYYY:HH:MM:SS -0000]
"METHOD /PATH HTTP/1.0" code bytes
host/ip
    If DNS lookup is enabled and the reverse lookup succeeds, the client's
    hostname is logged; otherwise its IP address appears.

RFC name
    If you enable identd (see Web Developer® Spring 1996 issue, p. 23), you
    can retrieve a name for the user from the remote server. If no value is
    present, a "-" is substituted.

logname
    If you're using local authentication and registration, the user's login
    name appears; if no value is present, a "-" is substituted.

datestamp
    Day, month (three-letter abbreviation), year, hour on the 24-hour clock,
    minute, second, and the offset from Greenwich Mean Time (for example,
    Pacific Standard Time is -0800).

retrieval
    The method is GET, PUT, POST, or HEAD; the path is the path and file
    retrieved; HTTP/1.0 identifies the protocol.

code
    The HTTP completion code: 200 is successful, 304 means the file has not
    changed and the client can reuse its cached copy, 404 is file not found,
    and so forth.

bytes
    The number of bytes in the file retrieved.

Here's an example:
sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800]
"GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278
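As a rough illustration of the Common Log Format fields, the following sketch splits one entry into named parts. The function name `parse_clf` and the hash keys are made up for this example. Accepting "-" in the bytes position is an assumption based on common server behavior when no body is sent; the description of the format here does not mention it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one Common Log Format entry into its fields.
# Returns a hash reference, or undef when the line does not fit the format.
sub parse_clf {
    my ($line) = @_;
    return undef unless $line =~
        /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\d+|-)\s*$/;
    my %entry = (
        host      => $1,
        rfcname   => $2,
        logname   => $3,
        datestamp => $4,
        request   => $5,
        code      => $6,
        bytes     => $7,
    );
    # The retrieval field holds the method, the path, and the protocol.
    @entry{qw(method path protocol)} = split ' ', $entry{request};
    return \%entry;
}

my $sample = 'sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] '
           . '"GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278';

my $e = parse_clf($sample);
print "$e->{host} $e->{method} $e->{path} $e->{code} $e->{bytes}\n";
```

Run against the sample entry, this prints the host, method, path, completion code, and byte count on one line.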