Perl valid HTML syntax correcter


Ultimater
10-17-2005, 04:35 AM
FYI, I'm trying to improve my understanding of regular expressions in Perl by writing a validator that finds missing HTML tags. Since my goal in this project is simply to learn and get more experience with regular expressions, the subroutine I'm after only needs to detect missing closing TD tags, and I'm not worried about making it perfect. I'm just curious how to approach this.

What I would like Perl to do is start with a string of HTML that is missing closing TD tags all over the place, and fill in every missing closing TD tag before its TR ends. After applying the changes to the string, it should print the corrected HTML to the client.


#!/usr/bin/perl -w
use strict;
use CGI qw(:standard);
use warnings FATAL => 'all';
use CGI::Carp qw(fatalsToBrowser);
print header;


my @mydata=<DATA>;
my $body=join('',@mydata);

=for comment
At this point, run a regular expression and/or some loops etc.
that will modify the variable $body and add a
closing TD tag before each closing TR tag,
but only if a TD is still open and would not otherwise be closed,
so that valid HTML is printed to the client.
(This example omits the DOCTYPE and other required tags on purpose,
to make the code smaller and more isolated. I'm trying to learn Perl
and I already know HTML.)
=cut

print $body;

__DATA__
<head>
<style type="text/css">
.t{padding:0;margin:0;border:0; width:200px;}
.q{color:green; background-color:red;font-weight:bold;}
.s{color:red; background-color:green;font-weight:bold;}
</style>
</head>
<body>
<p>
<table class="t" cellspacing="0" cellpadding="0">
<tr>
<td width="25%" class=s>
abc
<td width="25%" class="q">
def
<td width="25%" class="s">
ghi</td>
<td width="25%" class="q">
jkl
</tr>
</table>
</p>
</body>

fireartist
10-17-2005, 07:14 AM
If you're interested in learning regular expressions, I recommend O'Reilly's Mastering Regular Expressions; it's very thorough.

Ultimater
10-17-2005, 10:44 AM
Thanks for the reply, fireartist! Looks like a great book to buy; I'll most likely pick it up soon.

But until then, I'm still curious how to approach the above.

Jeff Mott
10-17-2005, 11:19 PM
If you're attempting to parse an entire language then you probably don't want regular expressions. They would work, but the expression would be massive. Even a regular expression for properly parsing a URI would span several pages horizontally. It would be much easier, and ultimately more maintainable, to read your input one token at a time and pass each token to subroutines that test whether it matches part of the grammar. It would take a while to explain, and I don't know a good book for it at the moment. If you're interested at all, you could google for "parsing techniques".
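A minimal sketch of the token-at-a-time idea, in Perl. The helper names tokenize and classify are hypothetical, and the tag pattern assumes that no attribute value contains a ">" character, so this is an illustration rather than a real HTML tokenizer.

```perl
use strict;
use warnings;

# Hypothetical helper: split HTML into tag tokens and text tokens.
sub tokenize {
    my ($html) = @_;
    # The capture group makes split return the tags as separate tokens;
    # grep drops the empty strings split leaves between adjacent tags.
    return grep { length } split /(<[^>]*>)/, $html;
}

# Hypothetical classifier: one small test per kind of token.
sub classify {
    my ($token) = @_;
    return ('close', lc $1) if $token =~ m{\A</\s*([a-z][a-z0-9]*)}i;
    return ('open',  lc $1) if $token =~ m{\A<([a-z][a-z0-9]*)}i;
    return ('text',  '');
}

my $html = "<tr><td>abc<td>def</td></tr>";
for my $token (tokenize($html)) {
    my ($kind, $name) = classify($token);
    printf "%-5s %-3s %s\n", $kind, $name, $token;
}
```

Each token is handled by a small, separate test, so supporting another tag or rule means adding a branch instead of growing one large expression.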

fireartist
10-18-2005, 05:08 AM
"What I would like Perl to do is start out with a string containing HTML, which will be missing closing TD tags all over the place, and then auto-fill-in all the needed closing TD tags before ending the TRs. Then after applying the changes to the string, print the validated HTML to the client."


As long as you realise that it's only a learning experience, that there's no way it could handle the real invalid HTML that's in the wild, and that there are too many browser quirks to keep track of, then I think you're already halfway there, as you've already broken the problem down quite well.


Are we in a TR?
  Y - Are we in a TD?
    Y - Is there a /TD before the next TD or /TR?
      Y - go to next TD
      N - Add one
    N - go to next TD
  N - go to next TR
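
The questions above can be sketched as a small loop over the tags. This is only an illustration under assumptions: close_open_tds is a hypothetical name, tags are assumed to contain no ">" inside attribute values, and nested tables are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the tags in order and add a </td> wherever a
# TD is still open when the next <td> or </tr> arrives.
sub close_open_tds {
    my ($html) = @_;
    my ($out, $in_tr, $in_td) = ('', 0, 0);

    # The capture group makes split return the tags as separate tokens.
    for my $token (split /(<[^>]*>)/, $html) {
        if ($token =~ m{\A<tr\b}i) {
            $in_tr = 1;                            # we are in a TR
        }
        elsif ($token =~ m{\A<td\b}i) {
            $out .= '</td>' if $in_tr && $in_td;   # previous TD never closed
            $in_td = 1;
        }
        elsif ($token =~ m{\A</\s*td\b}i) {
            $in_td = 0;                            # TD closed normally
        }
        elsif ($token =~ m{\A</\s*tr\b}i) {
            $out .= '</td>' if $in_tr && $in_td;   # last TD never closed
            ($in_tr, $in_td) = (0, 0);
        }
        $out .= $token;
    }
    return $out;
}

print close_open_tds("<tr><td>abc<td>def</td><td>ghi</tr>"), "\n";
```

The two flags $in_tr and $in_td stand in for the "Are we in a TR?" and "Are we in a TD?" questions; the added tag is placed directly before the tag that reveals it was missing.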


Perl regular expressions quick start (http://perldoc.perl.org/perlrequick.html)
Perl regular expressions tutorial (http://perldoc.perl.org/perlretut.html)
Perl Regular Expressions Reference (http://perldoc.perl.org/perlreref.html)

Ultimater
10-20-2005, 12:59 AM
Alright guys, I got something going here:

#!/usr/bin/perl -w
use strict;
use CGI qw(:standard);
use warnings FATAL => 'all';
use CGI::Carp qw(fatalsToBrowser);
print header;


my @mydata=<DATA>;
my $body=join('',@mydata);

my $body_clone=$body;
my @changes2make;
my $openTABLEs=0;
my $openTRs=0;
my $openTDs=0;
my $openTHEADs=0;
my $openTFOOTs=0;
my $openTBODYs=0;
my $openTHs=0;


while ($body_clone =~ /((<table[^>]*>)|(<thead[^>]*>)|(<\/thead>)|(<tfoot[^>]*>)|(<\/tfoot>)|(<tbody[^>]*>)|(<tr[^>]*>)|(<td[^>]*>)|(<\/td>)|(<\/tr>)|(<\/tbody>)|(<\/table>))/gi) {
my $l=length($1); # length of the matched tag
my $e=pos $body_clone; # offset just past the end of the match
my $s=$e-$l; # offset where the match starts

if($2){$openTABLEs++;} #(<table[^>]*>)
if($3){$openTHEADs++;} #(<thead[^>]*>)
if($4){$openTHEADs--;} #(<\/thead>)
if($5){$openTFOOTs++;} #(<tfoot[^>]*>)
if($6){$openTFOOTs--;} #(<\/tfoot>)
if($7){$openTBODYs++;} #(<tbody[^>]*>)
if($8){$openTRs++;} #(<tr[^>]*>)
if($9){$openTDs++;} #(<td[^>]*>)
if($10){$openTDs--;} #(<\/td>)
if($11){$openTRs--;} #(<\/tr>)
if($12){$openTBODYs--;} #(<\/tbody>)
if($13){$openTABLEs--;} #(<\/table>)

}

#print $body;
print "\$openTABLEs: $openTABLEs<br>
\$openTHEADs: $openTHEADs<br>
\$openTFOOTs: $openTFOOTs<br>
\$openTBODYs: $openTBODYs<br>
\$openTRs: $openTRs<br>
\$openTDs: $openTDs<br>
\$openTHs: $openTHs
";

__DATA__
<head>
<style type="text/css">
.t{padding:0;margin:0;border:0; width:200px;}
.q{color:green; background-color:red;font-weight:bold;}
.s{color:red; background-color:green;font-weight:bold;}
</style>
</head>
<body>
<p>
<table class="t" cellspacing="0" cellpadding="0">
<tr>
<td width="25%" class=s>
abc
<td width="25%" class="q">
def
<td width="25%" class="s">
ghi</td>
<td width="25%" class="q">
jkl
</tr>
</table>
</p>
</body>


All it does is count all the open tags. I hope to come up with a solution real soon... My Perl string manipulation skills are wack.
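
The same counting can be sketched more compactly by capturing the optional slash and the tag name, then keeping the totals in a hash. count_open_tags is a hypothetical name, and the sketch still assumes tags contain no ">" inside attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: net count of unclosed tags, keyed by lowercase tag name.
sub count_open_tags {
    my ($html) = @_;
    my %open;
    while ($html =~ m{<(/?)(table|thead|tfoot|tbody|tr|td|th)\b[^>]*>}gi) {
        # $1 is "/" for a closing tag and "" for an opening tag.
        $open{lc $2} += $1 ? -1 : 1;
    }
    return %open;
}

my %open = count_open_tags("<table><tr><td>abc<td>def</td></tr></table>");
print "$_: $open{$_}\n" for sort keys %open;
```

For that sample string the td entry ends at 1, since two TDs are opened and only one is closed. The \b after the tag name keeps "<th" from matching the start of "<thead".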

fireartist
10-20-2005, 08:49 AM
I wouldn't trust this without testing it on a very wide range of input, but as a practical exercise, it 'works' with your html.

use strict;
use warnings;

my $html = do {local $/; <DATA>};

$html =~
s{
( # $1
<td\b # opening TD
)
( # $2
(?: # non-capturing
(?! </?td ) # negative lookahead if it's not an opening or
. # closing TD, then match anything '.'
)* # zero or more times (greedy)
)
(?= # positive lookahead
(?: <td\b ) # either opening TD
| # or
(?: </\s*tr\b ) # closing TR
)
}
{$1$2</td>\n}sxg; # replace with our match + the closing TD

print $html;

__DATA__
<head>
<style type="text/css">
.t{padding:0;margin:0;border:0; width:200px;}
.q{color:green; background-color:red;font-weight:bold;}
.s{color:red; background-color:green;font-weight:bold;}
</style>
</head>
<body>
<p>
<table class="t" cellspacing="0" cellpadding="0">
<tr>
<td width="25%" class=s>
abc
<td width="25%" class="q">
def
<td width="25%" class="s">
ghi</td>
<td width="25%" class="q">
jkl
</tr>
</table>
</p>
</body>

outputs...

<head>
<style type="text/css">
.t{padding:0;margin:0;border:0; width:200px;}
.q{color:green; background-color:red;font-weight:bold;}
.s{color:red; background-color:green;font-weight:bold;}
</style>
</head>
<body>
<p>
<table class="t" cellspacing="0" cellpadding="0">
<tr>
<td width="25%" class=s>
abc
</td>
<td width="25%" class="q">
def
</td>
<td width="25%" class="s">
ghi</td>
<td width="25%" class="q">
jkl
</td>
</tr>
</table>
</p>
</body>

fireartist
10-20-2005, 08:54 AM
OK, and if you really wanted to keep the cuddled /TD like in your example, then:

use strict;
use warnings;

my $html = do {local $/; <DATA>};

$html =~
s{
( # $1
<td\b # opening TD
)
( # $2
(?: # non-capturing
(?! </?td ) # negative lookahead if it's not an opening or
. # closing TD, then match anything '.'
)*? # zero or more times (non-greedy)
)
( \r?\n? ) # $3
(?= # positive lookahead
(?: <td\b ) # either opening TD
| # or
(?: </\s*tr\b ) # closing TR
)
}
{$1$2</td>$3}sxg; # replace with our match + the closing TD

print $html;

__DATA__
<head>
<style type="text/css">
.t{padding:0;margin:0;border:0; width:200px;}
.q{color:green; background-color:red;font-weight:bold;}
.s{color:red; background-color:green;font-weight:bold;}
</style>
</head>
<body>
<p>
<table class="t" cellspacing="0" cellpadding="0">
<tr>
<td width="25%" class=s>
abc
<td width="25%" class="q">
def
<td width="25%" class="s">
ghi</td>
<td width="25%" class="q">
jkl
</tr>
</table>
</p>
</body>

...

<head>
<style type="text/css">
.t{padding:0;margin:0;border:0; width:200px;}
.q{color:green; background-color:red;font-weight:bold;}
.s{color:red; background-color:green;font-weight:bold;}
</style>
</head>
<body>
<p>
<table class="t" cellspacing="0" cellpadding="0">
<tr>
<td width="25%" class=s>
abc</td>
<td width="25%" class="q">
def</td>
<td width="25%" class="s">
ghi</td>
<td width="25%" class="q">
jkl</td>
</tr>
</table>
</p>
</body>

Ultimater
10-20-2005, 08:58 AM
Thanks for the reply, fireartist! That is a freaking awesome regular expression! I was working with around a hundred lines of code to do that! Wow, what a powerful regular expression!

Ultimater
10-20-2005, 09:20 AM
"I wouldn't trust this without testing it on a very wide range of input, but as a practical exercise, it 'works' with your html."
I gave it a bunch of testing and it works like a dream in each case; it seems all it lacks is the "i" flag. I should really look into the ?! extended construct, since it seems mighty powerful. It just saved me 100 lines of code, easy.
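
A sketch of the "cuddled" substitution with the i flag added, so that upper-case tags are handled too. The wrapper name add_closing_tds is hypothetical; the pattern is the one posted above, condensed.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from this thread, with /i added.
sub add_closing_tds {
    my ($html) = @_;
    $html =~ s{
        ( <td\b )                   # group 1: an opening TD
        ( (?: (?! </?td ) . )*? )   # group 2: any text that does not start a TD tag
        ( \r?\n? )                  # group 3: an optional line break, kept after the new tag
        (?= <td\b | </\s*tr\b )     # only if an opening TD or a closing TR comes next
    }{$1$2</td>$3}sxgi;
    return $html;
}

print add_closing_tds("<TR>\n<TD>abc\n<TD>def</TD>\n<td>ghi\n</TR>\n");
```

The (?! </?td ) part is a negative lookahead: it consumes nothing and only checks that the next characters do not begin a TD tag before the dot is allowed to match one more character. That is what stops group 2 from running past an existing closing TD.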