"I Changed My Mind"

by Glenn Fleishman

Have you ever received a memo from a client (at the last minute before going live with a new site, of course) that read something like this? "You know that decision we made three months ago to have individual buttons instead of button bars? It was wrong. Please change to imagemapped button bars before the site goes up." That's when you wish you had a magic wand that you could wave and make everything all better.

Let's be honest, and in the spirit of confession, I'll stand up first. Despite a number of new site management tools that allow you to make global changes with just one click, most of the sites my company maintains consist of static pages. The reason is rooted in history, cost, and practicality.

The vast majority of Web sites don't involve dynamic management, because they were built for less than $100,000, built more than a year ago, or don't change often enough to require a back-end program to manage the data. In this column, we'll discuss how to think about rebuilding a site in place, as well as present a full-fledged Perl script for doing the recursive task at hand.

The Ideal Tool

An ideal World Wide Web would have had rich management tools from day one. As with most ad hoc efforts, however, the key to the Web is its functionality, not the ease of managing the beast.

Visualizing an ideal tool will help us focus on those aspects of site management that can be done--though it does take some effort--with a recursive script. The perfect tool would:

  • Have a header and footer setting. Every page would be built from a template for the top and bottom text. Changing that template would rebuild all the associated pages.

  • Allow insertions of text groupings that can be replaced easily, such as advertising banners or "this week's news." Servers that support "server-side includes," in which a token can be replaced with specific text or with the output of a CGI program, allow this today. In most cases, however, the performance hit on the server to generate these pages on the fly may be unnecessary if the information changes infrequently.

  • Enforce or create consistent referencing. Some applications do this today, such as Adobe SiteMill and Microsoft FrontPage. When links get out of sync, time gets wasted. Sometimes the same link might be referenced as "../filename," "filename," and "/path/to/filename." Updating that link and references to it then becomes tedious, and errors become easy to miss.

  • Enforce relative, not absolute, links within the site. Root-relative links start from the root of the site but leave out the server name itself. Instead of "http://www.bemyguest.com/bell/book/candle.html," you'd reference "/bell/book/candle.html." The key attraction here is mobility: migrating a site that uses only relative links can be a drag-and-drop operation (or a tar and extract), rather than another rewrite of the entire site. It also makes it possible to test a site locally off a hard drive, or take it on the road. (A minimal substitution illustrating this appears after the list.)

  • Perform code-checking and correcting. Several versions of weblint, an HTML validator, allow individual pages to be checked easily and thoroughly. The perfect tool would catch errors in hand-coding or referencing, and rewrite them to spec.

  • Have consistency enforcement. In a recent job of mine, pages had been assembled in Adobe PageMill, with Internet Assistant for Microsoft Word, and by hand. The resultant code used curly quotes (" ") in some places, the HTML entity &quot; to insert a quotation mark in others, and straight double quotes (") elsewhere. Common elements like this should be itemized and made standard.

  • Fix all line endings. Macintosh, Windows/DOS, and Unix machines all have a different idea of what signals a line ending (a hard return). Macs like the carriage return (ASCII decimal 13 or hex 0D), Unix desires the linefeed (ASCII decimal 10 or hex 0A), while DOS and Windows stick them both in for good luck. Adobe PageMill and SiteMill allow retargeting by platform through a pop-up menu, while other applications, like BBEdit, provide controls for rewriting the ASCII codes for appropriate platforms.

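As a small taste of what such a tool would do for the relative-link item above, here is a minimal Perl sketch that rewrites absolute URLs for one host into root-relative links. It reads from standard input and writes to standard output; the host name is just the example used above, and the filter itself isn't part of any tool discussed here.

#!/usr/bin/perl
# Minimal sketch: rewrite absolute URLs for one host into root-relative links.
# The host name is only the example from the text; substitute your own server.
while (<>) {
	s{="http://www\.bemyguest\.com(/[^"]*)"}{="$1"}gi;	# ="http://host/path" -> ="/path"
	s{="http://www\.bemyguest\.com/?"}{="/"}gi;		# bare host -> site root
	print;
}
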
And, of course, you should have all these ideal features through a remote Web interface as well. Now that we've outlined the ideal case, let's look at how we can implement most of these features through good planning, or insert them in a migration from an old site plan to a new one.

Planning a Modular Site

To paraphrase Shakespeare and a folk saying, "An ounce of planning is worth a pound of flesh." Anytime you can build something into your HTML files, your scripts and programs, or your databases that will help you later--tagging, commenting, adding extra unused fields for future development--it will save you hours or days down the road.

A key element of planning is certainly to start with a comprehensive site map. For Film.com, one of our more complex clients, the Web developers, graphic designers, and content producers devoted a total of about 40 hours to charting out the new structure of the site, including specific trees, menu items, and section names. This process allowed a very rapid progression to finished design and made the transition a snap, since every player on the team had the same picture of the site to work from.

The site map doesn't have to be an electronic document--butcher paper is easier to work with for conceptualizing. On the other hand, a producer at Studio Archetype (formerly Clement Mok Designs) noted that the company creates a perfect visualization in Adobe Illustrator of the entire site for client approval before beginning the work.

Repeating Elements

Most sites repeat menus, icons, branding elements, and navigation tools from page to page, varying by section. Making a consistent set of templates, in the form of headers, footers, and complete blank pages, will save time later (variation from these templates should be punishable by cleaning the main server with a toothbrush).

In rebuilding Film.com, we created six standard templates, one for each major section. These templates were used at the time of transition, where old headers and footers (which were inconsistent in the extreme) were removed, and the new ones inserted as the site was rewritten. Now, new pages are created in various applications using the templates formed by merging those tops and bottoms.

As an aid to updating these elements later, we used comment tags to stand them off from the rest of the copy. After the body tag, we put

<!-- start of header -->

and after the full header appears, we add

<!-- end of header -->

These comments stand on lines by themselves for easier parsing later. The body of each file and the footer have similar tags for differentiating them.
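
When a header needs to change later, these markers reduce the job to a single substitution per file. Here's a minimal sketch of that step, assuming the page is small enough to read in whole and that the replacement header text is already in $newhead; both $file and $newhead are hypothetical names, not part of the script presented below.

# Minimal sketch: swap a new header in between the comment markers.
# Assumes one header per page; $file and $newhead are hypothetical.
open (IN, "< $file");
$page = join("", <IN>);
close IN;
$page =~ s/<!-- start of header -->.*<!-- end of header -->/<!-- start of header -->\n$newhead<!-- end of header -->/s;
open (OUT, "> $file");
print OUT $page;
close OUT;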

Tools

Once you've mapped out your plan of attack, you'll need the tools to make it happen. The tool I'm providing here is essentially a framework on which you can hang different kinds of behavior. This Perl script, which we'll call "Migrate.pl," takes the existing structure of the site; drops the files by category into new directories; rewrites all URLs referenced in the files to reflect the new hierarchy and organization; and corrects a multitude of common coding errors.

At the same time, it sucks out the old, poorly delineated footers and headers--there were no comments or other marks to identify them--and drops in the new, neatly tagged bits as appropriate.

Migrate.pl (which you can find at migrate.html) doesn't rewrite files so much as it creates a new, parallel directory structure into which the rewritten files are placed. Later, I'll point out the minor changes necessary to allow files to be rewritten in place. That method is just a lot riskier if you make errors.

#!/usr/bin/perl

First you define the source directory ($rootdir) and the destination directory the new files will be written to ($destdir). The $infodir is where the log of changes is written.

$rootdir = "/usr/www/film";
$destdir = "/usr/www/filmbeta";
$infodir = "/usr/construction/little_fixers";
$done = 0;

A lot of complex activity is taking place here; the elements of the %sub associative array are indexed by the new directories, into which the old directories are being folded. For example, in the old structure of Film.com, "video," "scarecrow," and "raincity" were all separate directories at the top level of the site. In the new structure, they are all folded into the hv (home viewing) directory, and all URLs need to be rewritten to reflect this.

This corresponds to changes we made in our server configuration file as well, to preserve backward compatibility: a request for "/scarecrow" on the old site was automatically redirected to "/hv/scarecrow" on the new one. We swapped out the directories and the configuration files simultaneously.

The $files variable holds special cases--files that were formerly at the root level of the site, but which should now be nested into the "admin" directory to get them out of the way.

$sub{'store'} = " store ";
$sub{'screening'} = " screening ";
$sub{'filmchat'} = " filmchat poll ";
$sub{'industry'} = " cinecism news filmfests interviews misc oscar96 ";
$sub{'reviews'} = " 1995 1996 craft archives capsules current reviews ";
$sub{'hv'} = " video scarecrow raincity ";
$sub{'images'} = " logos images ";
$files = " AT-filmquery.html SIFFwomensFilm96.html ad.rates.html about.film.html " .
         "error.alt.html curtain.gif error.html error.pl exit.html index.html " .
         "index.response.html main.html rules.road.html survey.html ";

The new directory structure must be seeded with the top-level directory names, so the script iterates through all the keys in %sub and creates each of those directories in turn.

foreach (keys %sub) {
	if (!(-d "$destdir/$_")) { system("mkdir $destdir/$_"); }
}

A simple seeder calls the recursive routine once; from there it recurses until the entire directory structure beneath $rootdir has been processed.

$rootnow = "";
open (LOG, "> $infodir/film.log");
if (-d "$rootdir/$rootnow") {
	print LOG "\nNow working on $rootdir/$rootnow directory\n";
	close LOG;
	&dir_recur($rootnow);
}
open (LOG, ">> $infodir/film.log");
print LOG "Total HTML files: $html\n";
close LOG;

This external routine cleaned up some loose ends, fixed some file permissions, and so forth. It's external for ease of use.

system ("$infodir/fix.bits");

The "big momma" recursive routine:

sub dir_recur {
local($localdir) = @_;
local(@dircontents, $diropen, $dirxfer, $intdirmatch);
open (LOG, ">> $infodir/film.log");
$localdir =~ s/^\///;

The script needs to add in the correct new main directory name whenever the $localdir variable contains the name of one of the old top-level directories. This little bit of code inserts that where needed, but doesn't hurt the next level of recursion.

if ($localdir !~ /\//) {
	$intdirmatch = 0;
	foreach (keys %sub) {
		if ($sub{$_} =~ / $localdir / && $_ ne $localdir) {
			$intdir = "\/$_";
			$realhead = $_;
			$intdirmatch = 1;
		}
	}
	if (!$intdirmatch) { $intdir = ""; $realhead = $localdir; }
}

The source and destination directories are set, and the destination directory is created if it doesn't already exist.

$diropen = "$rootdir/$localdir";
$dirxfer = "${destdir}$intdir/$localdir";
if (!(-d $dirxfer)) { system ("mkdir $dirxfer"); }

The current directory is read, and superfluous files are omitted. In this case, we omit anything named "Network Trash Folder" or beginning with ".hs" (special and irrelevant directories created by our Unix-based AppleShare server), as well as symbolic links (the -l file test) and the . and .. special files.

print LOG "Now working on $diropen \n";
opendir (CURRDIR, $diropen);
@dircontents = sort (grep(!/Network Trash Folder/ && !/^\.hs/i && !(-l $_) && !/^\.\.?$/, readdir(CURRDIR)));
closedir CURRDIR;

This is the heart of the recursive processing:

foreach (@dircontents) {
	$file = $_;

If the file is a directory and isn't a symbolic link, the new directory structure is passed on to the next level of recursion to this same routine.

	if ($file && (-d "$diropen/$file") && !(-l "$diropen/$file")) {
		$next_dir = "$localdir/$file";
		$_ = $next_dir;
		&dir_recur($next_dir);

If the file is named .htm, .html, or .pl, and the Unix system sees it as a text file, the processing begins.

	} elsif ((/\.html?$/i || /\.pl/) && (-T "$diropen/$file")) {
		$html++;
		open (IN, "< $diropen/$file");

If the file lives at the top level of the old site, the script rewrites the path into the admin directory of the new structure for neatness' sake. This way, we're not left with detritus at the root level. The script uses a system call to set new ownership and permissions, which keeps the intent explicit; it can also be done inside Perl if you look up the numeric user and group IDs you need to specify (a sketch follows the code below).

if ($diropen eq "$rootdir/" && $files =~ / $file /) {
	if (!(-d "$destdir/admin")) { system ("mkdir $destdir/admin"); }
	open (OUT, "> $destdir/admin/$file");
	system("chown root.verybest $destdir/admin/$file; chmod ug+w $destdir/admin/$file");
} else {
	open (OUT, "> $dirxfer/$file");
	system("chown root.verybest $dirxfer/$file; chmod ug+w $dirxfer/$file");
}
$body = $perl = $htmlprint = $headyet = 0;
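
As mentioned above, the chown and chmod can also happen without a system call. This is only a sketch of the Perl-native equivalent, using the same user and group names as the shell command; it is not part of the original listing.

# Sketch: the same ownership and permission change done inside Perl.
# getpwnam and getgrnam supply the numeric IDs the built-in chown needs.
$uid = (getpwnam("root"))[2];
$gid = (getgrnam("verybest"))[2];
chown $uid, $gid, "$dirxfer/$file";
chmod 0664, "$dirxfer/$file";	# roughly "ug+w" on a file that started out as 644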

Quite a lot of the code goes to normalizing the HTML that is being moved over. Since the site, in its first year of operation, was created at various times and in various applications, the code has drifted a bit. The $htmlprint variable makes sure that every file begins with an <HTML> tag, for instance. Other checks ensure that HEAD and BODY tags are in place. If tags are already there, we want to wipe out a lot of them so we can stick our new, improved tags in place.

while (<IN>) {
	if (!$htmlprint) { print OUT "<HTML>\n"; $htmlprint = 1; }
	if (/<\/head>/i && !$headyet) { $headyet = 1; }
	elsif (/<\/head>/i) { s/<\/head>//i; $headyet = 1; }
	s/<\/?html[^>]*>?//gi;
	if (/<body/i) { $body += 1; }
	s/<\/?body[^>]*>?//gi;
	if (/<\/title/i) { $body += 1; }

The first two substitutions try to find any occurrence of mismatched quotation marks, or missing quotation marks inside HTML tags. The next four try to normalize all appearances of the site's absolute URL.

s/\=([^\"]{1,1}[^> ]*)\">/\=\"\1\">/gi;
s/\=([^\"]{1,1}[^> ]*)>/\=\"\1\">/gi;
s/(\=\"http\:\/\/www\.film\.com\/)film\//\1/gi;
s/(\=\"http\:\/\/www\.film\.com\/)film\"/\1\"/gi;
s/\=\"http\:\/\/www\.film\.com\"/\=\"\/\"/gi;
s/\=\"http\:\/\/www\.film\.com\/\"/\=\"\/\"/gi;

The script now, having normalized everything into a standard form, rewrites all absolute URLs to relative ones, simultaneously inserting the correct directory structure for the new organization of the site as defined in the %sub array noted earlier.

/()/;
while (/\=\"http\:\/\/www\.film\.com\/([^\"]*)\"/i) {
	$ref = $1;
	$head = $tail = "";
	/()()/;
	$ref =~ /^([^\/]*)(\/.*)/;
	if ($1) {
		$head = $1; $tail = $2;
		foreach (keys %sub) {
			if ($sub{$_} =~ / $head / && $head ne $_) { $head = "$_\/$head"; }
		}
	}
	if ($head) { $ref = "${head}$tail"; }
	s/\=\"http\:\/\/www\.film\.com\/([^\"]*)\"/\=\"\/$ref\"/i;
}
	/()/;
	if (/(action|href)\=\"\/([^\/]*)\"/ && $files =~ / $2 /) {
		s/(action|href)(\=\"\/)([^\/]*)\"/\1\2admin\/\3\"/gi;
	}

Here, old images that are not part of the new design of the site needed to be removed.

s/<img[^>]*src\=[^>]*logos\/(archives|interviews|video|spot|the\.movies|theater|chairs|craft|cinecism|film|hot|logo|news|current)[^>]*gif[^>]*>//gi;

I left in a few examples of old text that needed to be just zeroed out; these statements come after everything else has already been sorted out.

s/<[^>]*>Produced by Point of Presence Company<[^>]*>//i;
s/<i><B>Pages by Film\.com\, Inc\.<\/B><\/i>//i;
s/<[^>]*>Pages by.*Presence Company<[^>]*>//i;
s/<[^>]*>Produced by .*ompany<[^>]*>//i;

/()()/;
$remains = "";
/(.*<\/head>)(.*)/i;
if ($1 && $2) { $_ = $1; $remains = $2; }

Anything with cgi-bin gets logged, so you can make sure later that you haven't broken a reference to a script that's called from the site, and that the script still works even with all the material in different locations.

if (/cgi\-bin/i) {
print LOG "$dirxfer/$file : $_\n";
}

The modified string finally gets written to the new destination file.

print OUT;

If the header hasn't been written yet, we now insert it. The new headers live in a directory of files named after the keys in %sub (filmhead.hv, filmhead.reviews, and so on, referenced as filmhead.$folder below), so the correct one is inserted for each section.

if ($body < 10 && $body > 0 && !$perl) {
	if (!$headyet) {
		print OUT "</head>\n";
		$headyet = 1;
	}
	if ($realhead) { $folder = $realhead; }
	else { $folder = "siteinfo"; }
	open (NEWHEAD, "< $infodir/film/filmhead.$folder");
	while (<NEWHEAD>) { print OUT; }
	close NEWHEAD;
	$body = 10;
}
	if ($remains) { print OUT $remains; $remains = ""; }
}

When the end of the input file is reached, we append the footer. In Film.com's case, the same footer served every section. Section-specific footers can be used by appending ".$folder" to the source name (filmfoot, below).

close IN;
open (NEWFOOT, "< $infodir/film/filmfoot");
while (<NEWFOOT>) { print OUT; }
close NEWFOOT;
print OUT "\n</HTML>\n";
close OUT;

If the file is a symbolic link, you probably want to preserve it exactly. Extracting the target of a symbolic link takes a bit of a kludge here: the script parses the output of ls -l to grab the link's name and where it points (a readlink-based alternative is sketched after the code). This information is also logged explicitly so it can be reviewed to make sure it's still correct when the site is finished.

} elsif (-l "$diropen/$file") {
	print LOG "symbolic link $diropen/$file needs to be looked at\n";
	open (LINK, "ls -l $diropen/$file |");
	$link = <LINK>;
	close LINK;
	chop $link;
	/()/;
	$link =~ /\-> (.*)/;
	if ($1) { system("ln -s $1 $dirxfer/$file"); }
} else { system("cp -p $diropen/$file $dirxfer/$file"); }
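
For what it's worth, Perl's built-in readlink can pull the target out of a symbolic link directly, avoiding the ls parsing. A minimal alternative to the block above, using the same variables, might look like this; it is a sketch, not the original approach.

# Sketch: recreate a symbolic link with readlink/symlink instead of parsing ls -l.
$target = readlink("$diropen/$file");
if ($target) {
	symlink($target, "$dirxfer/$file");
	print LOG "symbolic link $dirxfer/$file -> $target created\n";
}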

$found = 0;

I hate leaving core files lying around, and when programs crash with a segmentation fault, you can wind up with them in strange places on occasion. So we just remove any we find.

if (-e "$diropen/core") { unlink "$diropen/core"; }
}
close LOG;
}

After reviewing the changed site, it's as easy as

mv /usr/www/film /usr/www/film.old
mv /usr/www/filmbeta /usr/www/film

to take the rewritten site online.

Changing

You'll note that despite the relative complexity of Migrate.pl, whole chunks of it are specific to migrating a site. If you remove all the migration-specific code, you're left with a core structure. There's a version of this at rewrite.html called Rewrite.pl. This version allows substitution from the root down (or any root you specify on the site) for smaller changes.

Looking back at the list of ideal features for a site management tool of this kind, you'll see that several of the items--fixing line endings, making things consistent across a site--can be accomplished by putting substitution statements in the input loop to rewrite the code. For instance, to fix the quote problem, you'd have something that looks like

open (IN, "< $diropen/$file");
@in = <IN>;
close IN;
open (OUT, "> $diropen/$file");
foreach (@in) {
	s/&quot;/"/gi;
	print OUT;
}
close OUT;
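
The same pattern takes care of the line-ending item from the wish list. Here's a minimal sketch; it slurps the whole file rather than reading line by line, so stray carriage returns get caught no matter which platform wrote the file.

# Sketch: force Unix (LF) line endings, whatever platform wrote the file.
open (IN, "< $diropen/$file");
$text = join("", <IN>);
close IN;
$text =~ s/\r\n/\n/g;	# DOS/Windows CR+LF becomes LF
$text =~ s/\r/\n/g;	# old Mac bare CR becomes LF
open (OUT, "> $diropen/$file");
print OUT $text;
close OUT;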

This is also the generic--and dangerous!--way to rewrite files "live." Typically, you should make a tar archive before attempting the high-wire act.

cd /usr/www/
tar cf /usr/mass-storage/site.tar ./sitepath

If you cause problems rather than solutions when rewriting the site, just

cd /usr/www
tar xf /usr/mass-storage/site.tar

Home Free

Typically, when I make the switchover, it's a big relief. Champagne doesn't pop because we always have the next project in the queue, but mental corks go off, for sure.

We like to stage our revised sites at separate Web addresses to test them--Film.com's new look and format were staged at beta.film.com for a few weeks while we fine-tuned the transition process. With Migrate.pl, we were able to try lots of different schemes, since we were writing to a separate directory, and see how successful they were before making the big plunge.

Dozens of users were actively browsing when we made the switch on that site, and they were a little shocked to click and find themselves suddenly "through the looking glass."

Just a Few Comments

HTML comment tags are woefully underused. I've read other technology writers recently discussing the lack of annotation on HTML pages, and it's a problem. The more HTML that's created directly from applications, the less commenting you'll need to insert. But if you're doing it by hand, or even tweaking parts of the code by hand, take advantage of the ability to remind yourself later what in blazes you were intending. There's nothing worse than coming back to a statement in a program like

$$array{++$index[$i--]} = *point{sort keys %frog};

and trying to figure out from context what it means. HTML is more human-readable, but it still can be opaque.

In Perl, a "#" comments out everything from that point to the end of the line. In C and C++, a comment is opened with /* and closed with */.

HTML works more like C in that regard. The opener is <!-- and the closer is -->. Everything in between is ignored by the browser.

Java, JavaScript, and other embedded languages also use comment tags to hide code from browsers that don't support that particular language (or even understand any language but HTML).

Comments can't be hidden from a user who views the source code for the HTML document, so anything you put in there is clearly visible, including code fragments. Some smart alecks insert "Hi, Mom!" or even technical hints in their comments.


Recursive: See Recursive

The migration script is highly recursive, so it's useful to do a little Programming 101 before you tackle its use. Recursion as a concept is simple: a routine may call itself--that is, information gleaned at one level of a subroutine is passed down to the next level through that same subroutine.

In the case of this program, we're running down subdirectories of a filesystem until we hit the bottom of the chain--that is, a directory without any directories in it. When the routine reaches that point, it runs through all the files and processes them, then jumps up a level. In schematic terms, what's happening is like this:

directory a1
	directory b1
		directory c1
			file c1-1
			file c1-2
			file c1-3
		directory c2
			file c2-1
		file b1-1
		file b1-2
	directory b2
		file b2-1
	file a1-1
	file a1-2
directory a2
	etc.

Regardless of the alphabetizing or number of directories, the above is always how the routine proceeds. This particular script also creates a new directory structure as it goes, creating directories on the way down (so it would create a1, then b1, then c1, then populate c1 with the rewritten files; create directory c2, and so on).

You'll note in the subroutine dir_recur that several variables are declared "local." This preserves their state at each level of the recursion: the value is saved for that particular level and restored when the routine returns, so a deeper call can't clobber its caller's copy.
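
A stripped-down skeleton of the same pattern (hypothetical, not part of Migrate.pl) shows the shape of the recursion. Each call gets its own local copy of the directory listing, so finishing a deeper level never disturbs the loop that called it.

# Skeleton: depth-first walk; local() keeps each level's listing separate.
sub walk {
	local($dir) = @_;
	local(@entries, $entry);
	opendir (DIR, $dir);
	@entries = sort (grep(!/^\.\.?$/, readdir(DIR)));
	closedir DIR;
	foreach $entry (@entries) {
		if (-d "$dir/$entry" && !(-l "$dir/$entry")) {
			&walk("$dir/$entry");		# recurse into the subdirectory
		} else {
			print "file $dir/$entry\n";	# a plain file: process it here
		}
	}
}
&walk("/usr/www/film");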


