Ofekmeister
11-07-2010, 11:06 PM
Hello everyone! I've written a program that parses HTML files to check whether the tags are balanced, i.e. every start tag has a matching end tag. Basically, it reads the file line by line, prints all valid tags, and puts them in an array (called queue). It then iterates over the queue: if a tag is a start tag, it's added to a different array (called stack) and removed from queue; if it's a self-closing tag, it's printed and removed from queue; if it's an end tag, it's compared to the most recently added start tag in the stack. If the start tag minus attributes and whitespace equals the end tag minus the slash, both are printed and removed from queue and stack; otherwise the program aborts. I also do a bunch of error checking along the way.
I know that was a lot :)
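For anyone skimming, the queue/stack part described above boils down to something like the following. This is only a minimal sketch, not the actual parser.py: the names tag_name and is_balanced are made up for illustration, it assumes the tags have already been pulled out into a list, and it skips the printing and most of the error checking.

```python
import re

def tag_name(tag):
    """Return the lowercase element name of a tag such as '<a href="x">' or '</a>'.

    Hypothetical helper: strips the angle bracket, optional slash, attributes
    and whitespace, leaving only the name. Returns None if no name is found.
    """
    match = re.match(r"</?\s*([A-Za-z][A-Za-z0-9]*)", tag)
    return match.group(1).lower() if match else None

def is_balanced(queue):
    """Return True if every start tag in queue is closed in the right order."""
    stack = []
    for tag in queue:
        name = tag_name(tag)
        if name is None:
            return False              # not a recognizable tag
        if tag.startswith("</"):
            # End tag: must match the most recently opened start tag.
            if not stack or stack.pop() != name:
                return False
        elif tag.endswith("/>"):
            continue                  # self-closing tag: nothing to balance
        else:
            stack.append(name)        # start tag: remember it
    return not stack                  # leftover start tags mean unbalanced
```

For example, is_balanced(["<b>", "<i>", "</b>", "</i>"]) comes back False, because </b> arrives while <i> is still on top of the stack.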
Now my question:
How do actual parsers, like the one in Firefox, do it? My program works extremely well, but since it goes line by line, it doesn't recognize a tag that spans multiple lines as a tag, so I have to throw an error for that. I was thinking about putting the whole file in one string, but that seems inefficient. Alternatively, when a < is encountered, everything gets stored in a temporary array until a > is encountered. At that point, the pieces are concatenated and checked to see if they form a valid tag, which is then put in queue.
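The second idea (buffer everything from < until >) could look roughly like this. Again, just a sketch under assumptions: iter_tags is a made-up name, the input is any iterable of lines (an open file works), and it does not handle a > inside a quoted attribute value, comments, or script contents.

```python
def iter_tags(lines):
    """Yield each '<...>' tag from an iterable of lines.

    The buffer survives across line boundaries, so a tag that spans
    several lines is still returned as a single string.
    """
    buffer = None  # None means we are currently outside a tag
    for line in lines:
        for ch in line:
            if buffer is None:
                if ch == "<":
                    buffer = [ch]     # start collecting a new tag
            else:
                # Turn line breaks inside a tag into spaces.
                buffer.append(" " if ch in "\r\n" else ch)
                if ch == ">":
                    yield "".join(buffer)
                    buffer = None     # back to plain text
    if buffer is not None:
        raise ValueError("unterminated tag at end of input")
```

Since it only keeps the current tag in memory, this avoids loading the whole file into one string; something like `with open(path) as f: queue = list(iter_tags(f))` would fill the queue, where path is whatever file is being checked.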
Any thoughts?
This is just a general question about how the pros do it, but here's my Python program if you want to see it: parser.py (http://userpages.umbc.edu/~ofek1/parser.py)