HTML Parser Logic


Ofekmeister
11-07-2010, 11:06 PM
Hello everyone! I've written a program that parses HTML files to see if the tags are balanced, i.e. every start tag has a matching end tag. Basically, it reads the file line by line, prints every valid tag it finds, and puts it in an array (called queue). It then iterates over the queue: if it's a start tag, it's added to a different array (called stack) and removed from queue; if it's a self-closing tag, it's printed and removed from queue; if it's an end tag, it is compared to the most recently added start tag in the stack. If the start tag minus attributes and white space equals the end tag minus the forward slash, they're printed and removed from queue and stack; otherwise the program aborts. I have a bunch of error checking that goes on too.
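For readers following along, the queue-and-stack idea described above can be sketched roughly as follows. This is not the code from parser.py; the function name check_balance and the TAG_NAME pattern are made up for illustration, and the sketch assumes the tags have already been extracted as strings.

```python
import re

# Hypothetical helper pattern: grabs the tag name from "<div class='x'>" or "</div>".
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")


def check_balance(tags):
    """Return True if every start tag in `tags` is closed in the right order."""
    stack = []
    for tag in tags:
        match = TAG_NAME.match(tag)
        if match is None:
            return False  # not something we recognise as a tag
        name = match.group(1).lower()
        if tag.startswith("</"):
            # End tag: it must match the most recently opened start tag.
            if not stack or stack.pop() != name:
                return False
        elif tag.endswith("/>"):
            continue  # self-closing tag, nothing to balance
        else:
            stack.append(name)
    return not stack  # leftover start tags were never closed
```

Something like check_balance(["<b>", "<i>", "</b>", "</i>"]) reports a mismatch because the end tag does not match the top of the stack.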

I know that was a lot :)

Now my question:

How do actual parsers like the one in Firefox do it? My program works extremely well but, since it goes line by line, it doesn't recognize individual tags that span multiple lines as tags, so I have to throw an error for that. I was thinking about putting the whole file in one string, but that would be very inefficient. Another idea: when a < is encountered, everything gets stored in a temporary array until a > is encountered. At that point, the strings are concatenated and checked to see if it's a valid tag, then put in queue.
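The second idea (buffer everything from a < until the next >) can be sketched like this. It is only an illustration under simple assumptions: the name iter_tags is hypothetical, and the sketch ignores comments, scripts, and any > inside quoted attribute values.

```python
def iter_tags(lines):
    """Yield each '<...>' chunk from an iterable of lines, even across line breaks."""
    buffer = []      # characters collected since the last "<"
    inside = False   # True while we are between "<" and ">"
    for line in lines:
        for ch in line:
            if not inside:
                if ch == "<":
                    inside = True
                    buffer = [ch]
            elif ch == ">":
                buffer.append(ch)
                # Collapse line breaks and runs of spaces inside the tag.
                yield " ".join("".join(buffer).split())
                inside = False
            else:
                buffer.append(ch)
```

Because the buffer lives outside the per-line loop, the file can still be read one line at a time without loading it all into memory.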

Any thoughts?

This is just a general question about how the pros do it, but here's my Python program if you want to see it: parser.py (http://userpages.umbc.edu/~ofek1/parser.py)

zimonyi
11-08-2010, 06:33 AM
Why would it be inefficient to put the whole file into one string?

How do you handle tags with no end tag (like <br>)?

Many HTML end tags are optional, so you cannot count on them all being there. For what reason are you doing this? Would it not be easier (albeit perhaps less fun) to simply send your file to one of the online validators and work with the result of the validation?
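One possible way to deal with tags like <br> is to keep a set of element names that never take an end tag and skip them when balancing. The sketch below uses the void elements listed in the HTML standard; the names VOID_ELEMENTS and needs_end_tag are hypothetical, and elements whose end tag is merely optional (such as <p> or <li>) are not covered here.

```python
# Elements that never have an end tag in HTML (the "void elements").
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


def needs_end_tag(name):
    """Return True if a start tag with this name should be pushed on the stack."""
    return name.lower() not in VOID_ELEMENTS
```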

Archie

eval(BadCode)
11-09-2010, 09:25 PM
HTML does not use \n.
When you view the source code, only humans care about the \n.

There's no point in trying to break the file up line by line. The lines mean almost nothing to an HTML parser (OK, JavaScript might get messed up); the only important thing is that a line break does not land in the middle of a tag name:

<sp
an>
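As an illustration of how little line boundaries matter to a parser, Python's standard html.parser module accepts input in arbitrary chunks and holds on to an unfinished tag until the rest arrives. The class name TagCollector and the function collect_tags below are made up for this sketch.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start and end tag names in the order they are seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def collect_tags(chunks):
    """Feed chunks (lines, blocks, or a whole file) and return the tag events."""
    parser = TagCollector()
    for chunk in chunks:
        parser.feed(chunk)  # an incomplete tag is buffered until the next feed
    parser.close()
    return parser.events
```

Feeding the two lines "<div\n" and " id='x'>hi</div>\n" separately should still report a single div start tag followed by its end tag.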

I would suggest two fixes for your program.
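First, a function to check for "\n" inside of tags, so that after every "<" a ">" comes before the next "\n". If those all check out, then

source = source.replace("\n", "")

(strings are immutable in Python, so the result has to be assigned back) and the program will only have 1 line to read and 1 array to make.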
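A minimal sketch of those two steps, assuming the whole file is already in a string. The function names are hypothetical. Note that this check is stricter than HTML itself, which does allow line breaks between a tag's attributes.

```python
def newline_inside_tag(source):
    """Return True if a line break appears between a '<' and its closing '>'."""
    inside = False
    for ch in source:
        if ch == "<":
            inside = True
        elif ch == ">":
            inside = False
        elif ch == "\n" and inside:
            return True
    return False


def flatten(source):
    """Strip line breaks once we know none of them falls inside a tag."""
    if newline_inside_tag(source):
        raise ValueError("a tag spans more than one line")
    # str.replace returns a new string; the original is left untouched.
    return source.replace("\n", "")
```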

An XHTML validator would be harder to make; a plain HTML validator shouldn't be as hard. You might also have some trouble if your markup calls JavaScript functions and passes arguments like this:

<div id="parse_this" onmouseover="javascript_function(">","<",">>>>");"></div>
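One way to cope with a > inside an attribute value is to track whether the scanner is currently inside quotes. The sketch below is an assumption about how that could look, not a description of any real parser; iter_tags_quote_aware is a hypothetical name. It only helps when the inner quotes differ from the outer ones (for example single quotes inside a double-quoted attribute); with the same quote character nested, as in the example above, the attribute value ends at the first inner quote.

```python
def iter_tags_quote_aware(source):
    """Yield '<...>' chunks, ignoring any '>' inside a quoted attribute value."""
    start = None   # index of the "<" that opened the current tag, if any
    quote = None   # the quote character we are currently inside, if any
    for i, ch in enumerate(source):
        if start is None:
            if ch == "<":
                start = i
        elif quote is not None:
            if ch == quote:
                quote = None   # closing quote ends the attribute value
        elif ch in ('"', "'"):
            quote = ch         # opening quote starts an attribute value
        elif ch == ">":
            yield source[start:i + 1]
            start = None
```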

You might also consider forming an array of lines, then an array with all tags from all lines, then an array of validated tags, and using recursion to validate them and pop those tags from the queue. I would think it involves picking the last opening tag and the first closing tag for unique tag pairs.
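One reading of the recursive idea is: find the first closing tag, check that the opening tag right before it (the last one opened so far) matches, remove the pair, and repeat on what is left. The sketch below assumes the tags have already been reduced to bare names such as "div" and "/div"; the function name balanced is made up.

```python
def balanced(names):
    """Recursively strip innermost pairs from a list like ["div", "p", "/p", "/div"]."""
    for i, name in enumerate(names):
        if name.startswith("/"):
            # The first closing tag must directly follow its own opening tag.
            if i == 0 or names[i - 1] != name[1:]:
                return False
            # Remove the matched pair and check the remainder.
            return balanced(names[:i - 1] + names[i + 1:])
    return not names  # no closing tags left: fine only if nothing is left open
```

This gives the same answer as the stack approach; the recursion depth grows with the number of tag pairs, so a loop with a stack is likely the safer choice for large files.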

Best of luck

Ofekmeister
11-10-2010, 10:49 PM
Thank you both for the feedback! Good thinking eval, I wish I had checked WD sooner. A few hours after posting I did exactly that :) I completely forgot about JS calls though, thanks for that. RegEx is my friend haha. Why would an XHTML validator be harder to make, purely because there are more rules? Also, I LOVE (seriously) criticism regarding my code so I know what's efficient, etc.

parser.py (http://userpages.umbc.edu/~ofek1/parser.py)