I need something that will remove that garbage and leave proper HTML code alone.
Who doesn't?
You can TEST the HTML as xhtml by doing this:
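-- CAST raises an error when the value is not well-formed XML;
-- [table] and html are placeholder names.
SELECT TOP (1) CAST(html AS XML)
FROM [table];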
If it fails, it's not valid (well-formed) XHTML. If it passes, it is. Unfortunately that's all that comes to mind as far as an easy solution goes. You can try parsing the HTML yourself, using the self-closing or end tags to signal where an element closes, and checking each one against a list of valid elements. If you go this route I would strongly consider doing a depth-first search; a broken parent DOM element will likely mean its child DOM elements are also going to be parsed as broken.
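To make the two ideas above concrete, here is a rough Python sketch using only the standard library. It is not a finished cleaner, just an illustration: `is_well_formed`, `TagBalanceChecker`, `find_tag_problems` and the `VOID_TAGS` list are names made up for this example, and the check only looks at tag nesting, not at attributes or at which elements are allowed where.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumed list of HTML elements that never take an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in one root."""
    try:
        ET.fromstring("<root>" + fragment + "</root>")
        return True
    except ET.ParseError:
        return False


class TagBalanceChecker(HTMLParser):
    """Walks tags in document order (depth-first) and tracks open elements."""

    def __init__(self):
        super().__init__()
        self.stack = []      # elements opened but not yet closed
        self.problems = []   # human-readable descriptions of mismatches

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # The end tag does not match the innermost open element.
            self.problems.append("unexpected </%s>" % tag)


def find_tag_problems(fragment):
    """Return a list of nesting problems; an empty list means balanced tags."""
    checker = TagBalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.problems + ["unclosed <%s>" % t for t in checker.stack]
```

Note how one missing end tag gets reported against both the parent and the child: `find_tag_problems("<div><p>text</div>")` flags the `</div>` as unexpected and then lists both `div` and `p` as unclosed, which is the "broken parent means broken child" effect in practice.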
Look around for XPath forums or threads. This is an issue that people using XPath will have, so you might be able to find something that has already been created. BeautifulSoup for Python already builds a parse tree, and from what I hear it's amazing. If you know Python you might try translating it (and sharing it!!)
I use (, ; : -) as I please- instead of learning the English language specification: I decided to learn Scheme and Java;
Thanks for the suggestions, eval. The XPATH and Beautiful Soup are appealing. I do NOT know Python; but I can learn. And if there is a way to translate it into SQL, I'll do my best.