Help needed for Perl script


Andy
06-16-2003, 06:53 PM
Hi all,

I am new to this group. I need help with a Perl script that parses the web server log file, access_log.

The format of the access_log is:

127.0.0.1 - - [15/Jun/2003:13:54:02 -0100] "GET /xxxx HTTP/1.1" 200 34906

The goal is to:

1. Perform a count of the pages for each timestamp. Multiple pages may share the same timestamp (such as the timestamp shown above).
2. For each time interval of, say, 15 minutes, starting from the timestamp of the first line in the log file, compute the average, minimum, and maximum number of pages in that interval.
3. Produce output like the following. This is just an example.

Time                   Average Pages   Min Pages   Max Pages
--------------------   -------------   ---------   ---------
15/Jun/2003:14:09:02   6.5             3           10
15/Jun/2003:14:24:02   5.5             4           7


I shall appreciate an early response.

Thanks in advance

Regards
Andy
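
For goal 1, a minimal sketch of counting pages per timestamp is shown here. It assumes the log lines follow the format quoted above; the sub name `count_pages` and the sample request paths are made up for illustration, and the asset filter (.js, .gif, .css) is only a guess at what should not count as a page.

```perl
use strict;
use warnings;

# Count page requests per timestamp string.
# Returns (\@order, \%count): timestamps in first-seen order, and their counts.
sub count_pages {
    my (@lines) = @_;
    my (@order, %count);
    for my $line (@lines) {
        # Capture the timestamp (without the UTC offset) and the request path.
        my ($stamp, $path) = $line =~ m#\[([^\s\]]+)[^\]]*\]\s+"\S+\s+(\S+)#
            or next;
        # Assumption: static assets are not "pages" and are skipped.
        next if $path =~ /\.(?:js|gif|css)$/;
        # Remember each timestamp the first time it is seen.
        push @order, $stamp unless $count{$stamp}++;
    }
    return (\@order, \%count);
}

# Hypothetical sample lines in the access_log format described above.
my @sample = (
    '127.0.0.1 - - [15/Jun/2003:13:54:02 -0100] "GET /a.html HTTP/1.1" 200 34906',
    '127.0.0.1 - - [15/Jun/2003:13:54:02 -0100] "GET /b.html HTTP/1.1" 200 120',
    '127.0.0.1 - - [15/Jun/2003:13:54:02 -0100] "GET /logo.gif HTTP/1.1" 200 512',
    '127.0.0.1 - - [15/Jun/2003:13:54:05 -0100] "GET /c.html HTTP/1.1" 200 77',
);

my ($order, $count) = count_pages(@sample);
print "$_: $count->{$_}\n" for @$order;
```

Keeping the first-seen order in a separate array avoids sorting the timestamp strings, which would not sort chronologically across month boundaries.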

Andy
06-18-2003, 01:36 PM
Hi all,

I developed a Perl script to parse the web log, access_log. However, I am having difficulty getting the output that I want.

The output that I am looking for is:


Time                   Average Pages   Min Pages   Max Pages
--------------------   -------------   ---------   ---------
15/Jun/2003:14:09:02   6.5             3           10
15/Jun/2003:14:24:02   5.5             4           7
-----------------

Please look at the Perl script and suggest any changes.

Thanks in advance

Andy
:-)
-----------
Here's the Perl script:

#!/usr/bin/perl

use Getopt::Long;
use Time::Local;

my $file="access_log_modified";
my $line;
my $count;
my $begin_time = "";
my $end_time;
my %seen = ();
my @visual_pages = ();
my ($datetime, $get_post, $Day, $Month, $Year, $Hour, $Minute, $Second);
my $interval = 60; #An interval of 1 minute
my @pages_processed;

count_recs();

sub count_recs {

    open (INFILE, "<$file") || die "Cannot read from $file";
    WHILELOOP: while (<INFILE>) {
        $line = $_;
        chomp;
        ($datetime,$get_post) = (split / /) [3,6];
        $datetime =~ s/\[//;
        ($Day,$Month,$Year,$Hour,$Minute,$Second)= $datetime =~m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;

        next WHILELOOP if ($get_post =~ /\.js$/ || $get_post =~ /\.gif$/ || $get_post =~ /\.css$/);

        unless ($begin_time) {
            $begin_time = $datetime;
        }
        $end_time = $datetime;

        &calculate_time($begin_time, $end_time);
    } #while

    foreach $visual_page (sort by_seen keys %seen) {
        push (@{$pages_processed{$visual_page}}, $seen{$visual_page});
    }

    foreach $page_processed (sort keys %pages_processed) {
        print "$page_processed: @{$pages_processed{$page_processed}}\n";
    }

    close(INFILE);
}

sub calculate_time {

    my @visual_pages = ();
    my @processed_visual_pages = ();

    ###Break up the date time into Day, Month, Year, Hour, Minute and Second.

    ($begin_Day,$begin_Month,$begin_Year,$begin_Hour,$begin_Minute,$begin_Second)= $begin_time =~m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;

    ($end_Day,$end_Month,$end_Year,$end_Hour,$end_Minute,$end_Second)= $end_time =~m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#;

    ###Since the Month above is in alpha format (Jan, Feb, ...) and not numeric
    ###format (01, 02, 03, ...), we need to convert it to a numeric format.
    ###Otherwise, we cannot pass the Month to timelocal. That's why the Initialize
    ###subroutine is called. It converts Jan into 01 and so on.

    &Initialize;

    my $begin_seconds = timelocal($begin_Second, $begin_Minute, $begin_Hour, $begin_Day, $MonthToNumber{$begin_Month}, $begin_Year-1900);

    my $end_seconds = timelocal($end_Second, $end_Minute, $end_Hour, $end_Day, $MonthToNumber{$end_Month}, $end_Year-1900);

    ###elapsed time is the difference between two timestamps of two consecutive
    ###records in the log file.

    my $elapsed = $end_seconds - $begin_seconds;

    ###We check whether the elapsed time is greater than the interval that we
    ###choose, 1 minute or 15 minutes. If yes, then we need to start counting the
    ###records into a new 15 minute interval. If no, count the number of records
    ###in the same interval. Also, reset the begin_time and end_time, for the new
    ###count. Store all the interval periods into an array, processed_visual_pages.

    if ( $elapsed > $interval ){
        $count = 0;
        $begin_time = $end_time;
        $end_time = $datetime;
        push (@processed_visual_pages, $end_time);
    } else {
        push (@visual_pages, $end_time);
        foreach $visual_page (@visual_pages) {
            $seen{$visual_page}++;
        }
    }
}

sub Initialize {
    my %MonthToNumber=(
        'Jan', '01',
        'Feb', '02',
        'Mar', '03',
        'Apr', '04',
        'May', '05',
        'Jun', '06',
        'Jul', '07',
        'Aug', '08',
        'Sep', '09',
        'Oct', '10',
        'Nov', '11',
        'Dec', '12',
    );

    my %NumberToMonth=(
        '01', 'Jan',
        '02', 'Feb',
        '03', 'Mar',
        '04', 'Apr',
        '05', 'May',
        '06', 'Jun',
        '07', 'Jul',
        '08', 'Aug',
        '09', 'Sep',
        '10', 'Oct',
        '11', 'Nov',
        '12', 'Dec',
    );

}

sub by_seen () {
    ( $seen{$b} cmp $seen{$a} );
}
-----------------
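
A note on the month lookup in the script above: %MonthToNumber is declared with "my" inside Initialize, so it goes out of scope when that sub returns, and the $MonthToNumber{...} lookups in calculate_time read a different, empty package variable. timelocal also expects the month as 0..11 rather than 01..12. A minimal sketch of one way around both, assuming a file-scoped hash and a helper named `to_seconds` (a name made up for this example):

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# File-scoped, so every sub below can see it. Values are 0-based
# because timelocal() expects months in the range 0..11.
my %MonthToNumber = (
    Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
);

# Hypothetical helper: convert "25/Apr/2003:13:54:02" to epoch seconds
# in the local time zone. Returns undef if the stamp cannot be parsed.
sub to_seconds {
    my ($stamp) = @_;
    my ($day, $mon, $year, $hour, $min, $sec) =
        $stamp =~ m#^(\d\d)/(\w\w\w)/(\d\d\d\d):(\d\d):(\d\d):(\d\d)#
        or return undef;
    return undef unless exists $MonthToNumber{$mon};
    # A four-digit year is passed through as-is.
    return timelocal($sec, $min, $hour, $day, $MonthToNumber{$mon}, $year);
}

my $elapsed = to_seconds('25/Apr/2003:13:55:02')
            - to_seconds('25/Apr/2003:13:54:02');
print "elapsed: $elapsed seconds\n";
```

With "use strict" enabled, the out-of-scope hash in the original would be reported at compile time instead of silently yielding an undefined month.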

The output I get is:

25/Apr/2003:13:54:02: 3
25/Apr/2003:13:54:19: 2
25/Apr/2003:13:54:22: 4
25/Apr/2003:13:54:34: 3
25/Apr/2003:13:54:38: 5
25/Apr/2003:13:54:41: 3
25/Apr/2003:13:54:43: 6
25/Apr/2003:13:54:44: 3
25/Apr/2003:13:54:46: 5
25/Apr/2003:13:54:47: 2
25/Apr/2003:13:54:48: 3
25/Apr/2003:13:54:50: 7
25/Apr/2003:13:54:51: 4
25/Apr/2003:13:54:53: 2
25/Apr/2003:13:54:58: 3
25/Apr/2003:13:55:01: 2
25/Apr/2003:13:55:02: 4
25/Apr/2003:13:55:05: 4
25/Apr/2003:13:55:08: 1
25/Apr/2003:13:55:14: 3
25/Apr/2003:13:55:15: 1
25/Apr/2003:13:56:13: 5
25/Apr/2003:13:56:27: 5
25/Apr/2003:13:56:35: 4
25/Apr/2003:13:56:40: 4
25/Apr/2003:13:56:45: 1
25/Apr/2003:13:56:51: 5
-----------------------------

I would like to group the output by interval, say a 1-minute interval. So I want to see all the entries from 25/Apr/2003:13:54:02 through 25/Apr/2003:13:55:02 grouped as:

Time                   Average Pages   Min Pages   Max Pages
25/Apr/2003:13:55:02   5.5             4           7
----------------------
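
The grouping described above could be sketched roughly as follows. This is only one interpretation: it assumes the average, minimum, and maximum are taken over the per-timestamp counts inside each interval, that intervals are anchored at the first timestamp and are half-open (a record exactly at the end time falls into the next interval), and that each row is labelled with the interval's end time. The names `summarize` and `log_stamp` are made up for this example, and the sample data is a handful of the per-second counts from the output listed above.

```perl
use strict;
use warnings;
use List::Util qw(min max sum);
use Time::Local qw(timegm);

my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Group per-timestamp counts into fixed-width intervals.
#   $counts   : hash ref, epoch seconds => pages seen at that second
#   $interval : interval width in seconds
# Returns a list of [interval_end, average, min, max], oldest first.
# Intervals that contain no records produce no row.
sub summarize {
    my ($counts, $interval) = @_;
    my @times = sort { $a <=> $b } keys %$counts;
    return unless @times;
    my $start = $times[0];    # intervals are anchored at the first record
    my %bucket;
    for my $t (@times) {
        my $idx = int(($t - $start) / $interval);
        push @{ $bucket{$idx} }, $counts->{$t};
    }
    return map {
        my @c = @{ $bucket{$_} };
        [ $start + ($_ + 1) * $interval, sum(@c) / scalar(@c), min(@c), max(@c) ]
    } sort { $a <=> $b } keys %bucket;
}

# Format epoch seconds in the access_log style (treated as UTC, no offset).
sub log_stamp {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $day, $mon, $year) = gmtime($epoch);
    return sprintf '%02d/%s/%04d:%02d:%02d:%02d',
        $day, $MONTHS[$mon], $year + 1900, $hour, $min, $sec;
}

# A few of the per-second counts from the output above, keyed by epoch.
my $base = timegm(2, 54, 13, 25, 3, 2003);    # 25/Apr/2003:13:54:02
my %counts = (
    $base       => 3,    # 13:54:02
    $base + 17  => 2,    # 13:54:19
    $base + 48  => 7,    # 13:54:50
    $base + 60  => 4,    # 13:55:02
    $base + 66  => 1,    # 13:55:08
    $base + 131 => 5,    # 13:56:13
);

printf "%-22s %-15s %-11s %s\n", 'Time', 'Average Pages', 'Min Pages', 'Max Pages';
for my $row (summarize(\%counts, 60)) {
    my ($end, $avg, $min, $max) = @$row;
    printf "%-22s %-15.1f %-11d %d\n", log_stamp($end), $avg, $min, $max;
}
```

With this subset, the first row is labelled 25/Apr/2003:13:55:02 and covers the counts 3, 2, and 7. Changing the second argument of summarize to 900 would give 15-minute intervals.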