I'm trying to develop a tool that will allow me to uniquely ID a web page, even if the page doesn't have a unique ID built into it.
I can't just look at the URL, because the tool needs to tell the difference between the different layouts that are possible within the same URL. For example, if the content of the page changes due to user input, then I need to be able to tell the page with the new layout apart from the original page.
I tried stripping the dynamic content from the page and generating a hash from the source code but the IDs that I obtained changed when the dynamic content changed. I suppose this is because I was stripping all content from container objects (divs etc), but obviously container objects can be added as dynamic content thus changing the hash!
Any ideas on how to go about this would be greatly appreciated! Thanks.
Perhaps if you specify what need you have for this sort of ID, someone could be of greater assistance.
In the meantime, I suggest doing the same type of thing you're doing now. Just run the stripped down version of the page through a set of "normalizing" routines. That should mostly just mean condensing whitespace. I can't imagine what other variances would be insignificant enough to ignore in your hash.
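As a rough sketch of that normalizing step (not a definitive implementation), something like the following Python would do. The function name `page_id` is made up for illustration, SHA-256 is just one reasonable choice of hash, and it assumes you have already stripped the dynamic content yourself.

```python
import hashlib
import re


def page_id(stripped_html: str) -> str:
    """Return a hex digest that ignores whitespace-only differences.

    `stripped_html` is assumed to be the page source after the dynamic
    content has already been removed by the caller.
    """
    # Collapse every run of whitespace (spaces, tabs, newlines) to a
    # single space, then trim the ends.
    normalized = re.sub(r"\s+", " ", stripped_html).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two copies of a page that differ only in indentation or line breaks then get the same ID, while any change to the remaining markup or text produces a different one.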
Though, you could use a more complex approach if need be: parse the page as XML and generate a "hash" from the resulting data structure itself.
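Here is a minimal sketch of the XML idea in Python, under some loud assumptions: the page is well-formed XML (e.g. XHTML), since `xml.etree.ElementTree` will raise a `ParseError` on typical tag-soup HTML, and "layout" means only the tag names and how they nest. The names `skeleton` and `structure_id` are hypothetical, and ignoring text and attributes is my choice for the example, not something you have to do.

```python
import hashlib
import xml.etree.ElementTree as ET


def skeleton(element: ET.Element) -> str:
    """Serialize only tag names and nesting, dropping text and attributes."""
    children = "".join(skeleton(child) for child in element)
    return f"<{element.tag}>{children}</{element.tag}>"


def structure_id(xml_source: str) -> str:
    """Hash the tag structure of a well-formed XML/XHTML document."""
    root = ET.fromstring(xml_source)  # raises ParseError if not well-formed
    return hashlib.sha256(skeleton(root).encode("utf-8")).hexdigest()
```

With this, changing the text inside a div leaves the ID alone, but adding or removing an element changes it. Whether that matches what you mean by "layout" depends on the need you have for the ID.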