I usually find that extracting the word content as plain text and then manually adding the HTML tags in is much faster than trying to strip out all of the code that Word places there.
Another alternative is to open the Word document in OpenOffice and export it to HTML from there, which puts far less garbage into the HTML than Word does (most of Word's extra markup is intended for rebuilding the document in a different version of Word).
First, you need to download and install the Microsoft Office HTML Filter. Then open your document in Word, go to the File menu, select "Export To" and then "Compact HTML". In the Office HTML Filter, click the Add button and add the saved HTML document, then click the Apply button. Just beware that stripping this way produces strange results on more complicated documents, and you'll still have to go through it with an HTML editor to get it right.
Now that that answers your question directly, I would do it Fengall's way. Actually, I do it by saving my document in HTML form, opening it in my HTML editor, highlighting all the Office stuff, using the Delete key a lot, then using the Remove/Replace function in the editor to clean out a lot of the other inline styling for tags I don't want. Then I customize the page the way I want, validate it, and I'm done. My way is probably longer, but I love this stuff so I don't mind.
The filters for producing so-called "compact" HTML from Word remove much of the code intended for rebuilding Word documents, but they still leave you with a significant quantity of garbage that needs to be recoded manually. The HTML options in Word are primarily there to allow Word documents to be converted into a format that can be successfully read into different versions of Word and are not intended for actual web use (that's why they are full of conditional comments testing for which version of Microsoft Office you are opening the files in).
OpenOffice will give a cleaner HTML output from a Word document than using a filter on Word itself.
You have to be careful when copying from Word into NVU that you don't end up copying all the garbage along with the text. The best way is to extract to a .txt file first before pasting into NVU.
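To give a rough idea of what that leftover garbage looks like in practice, here is a minimal Python sketch that strips a few of the more common kinds: the conditional comments, the Office-namespaced tags such as <o:p>, and the Mso class and style attributes. The function name and the patterns are my own illustration and are far from complete; regular expressions are not a real HTML parser, so check the output by hand.

```python
import re

# Illustrative, incomplete patterns for common Word-export residue.
# Hidden blocks such as <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL_COMMENT = re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE)
# Bare markers such as <![if !supportLists]> and <![endif]> (their content is kept).
REVEALED_MARKER = re.compile(r"<!\[(?:if[^\]]*|endif)\]>", re.IGNORECASE)
# Office-namespaced tags such as <o:p>, <v:shape>, <w:WordDocument>.
OFFICE_TAG = re.compile(r"</?[ovw]:[^>]*>", re.IGNORECASE)
# class=MsoNormal and friends, quoted or unquoted.
MSO_CLASS = re.compile(r"""\s+class\s*=\s*["']?Mso\w*["']?""", re.IGNORECASE)
# Any style attribute that mentions an mso- property. Note that this drops
# the whole attribute, including any non-mso declarations inside it.
MSO_STYLE = re.compile(r"""\s+style\s*=\s*(?:"[^"]*mso-[^"]*"|'[^']*mso-[^']*')""", re.IGNORECASE)


def strip_word_markup(html: str) -> str:
    """Remove some Word-specific markup from an HTML string (hypothetical helper)."""
    html = CONDITIONAL_COMMENT.sub("", html)
    html = REVEALED_MARKER.sub("", html)
    html = OFFICE_TAG.sub("", html)
    html = MSO_CLASS.sub("", html)
    html = MSO_STYLE.sub("", html)
    return html
```

Even after a pass like this you would normally still have empty spans, font tags and inline styling to tidy up by hand or in an HTML editor.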
If you go the text editor route, I've found that saving as type ANSI from Notepad successfully converts many garbage characters like commas, quotes, hyphens, ghost spaces, and apostrophes. However, it does not convert (TM) or copyright symbols, and instead of converting an ellipsis (...) to its HTML character entity it changes it to three periods.
The fastest method I've found so far is writing a script on the test server and dumping the Word-formatted HTML docs into a watch folder, where a server-side script runs through each file automatically and handles the clean conversion. Mine was done with PHP.
Not a bad tool to have, really: lots of clients want old content ported from old sites to new sites, and this mass conversion is a very common problem.
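For the characters that the Notepad route misses, one option is to map them to HTML entities explicitly. This is only a sketch in Python: the table below is my own guess at the usual culprits (curly quotes, dashes, ellipsis, trademark and copyright signs, non-breaking spaces), and the name to_entities is made up for the example.

```python
# Typical "smart" punctuation from Word mapped to named HTML entities.
# The selection is an assumption; extend it for whatever your documents contain.
WORD_CHARS = {
    "\u2018": "&lsquo;",   # left single quote
    "\u2019": "&rsquo;",   # right single quote / apostrophe
    "\u201c": "&ldquo;",   # left double quote
    "\u201d": "&rdquo;",   # right double quote
    "\u2013": "&ndash;",   # en dash
    "\u2014": "&mdash;",   # em dash
    "\u2026": "&hellip;",  # ellipsis
    "\u2122": "&trade;",   # trademark sign
    "\u00a9": "&copy;",    # copyright sign
    "\u00ae": "&reg;",     # registered sign
    "\u00a0": "&nbsp;",    # non-breaking ("ghost") space
}
TABLE = str.maketrans(WORD_CHARS)


def to_entities(text: str) -> str:
    """Replace Word-style punctuation with HTML entities (hypothetical helper)."""
    return text.translate(TABLE)
```

This assumes the text has already been decoded correctly into Unicode; if the file was read with the wrong encoding, the characters will not match the table.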
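The PHP script itself isn't posted here, so the following is not that script, just a minimal Python sketch of the same general idea: walk a folder (including subfolders) and run every .htm/.html file through a cleaning function. The names clean_tree and cleaner are hypothetical. Files are decoded as latin-1 purely as a byte-preserving round trip, on the assumption that the cleaner only touches ASCII markup.

```python
from pathlib import Path
from typing import Callable


def clean_tree(root: str, cleaner: Callable[[str], str]) -> list[Path]:
    """Apply `cleaner` to every .htm/.html file under `root`, in place.

    Returns the list of files that were actually changed.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".htm", ".html"}:
            continue
        # latin-1 maps every byte to one character and back again, so bytes
        # the cleaner does not touch are written out unchanged. Encoding will
        # fail if the cleaner inserts characters outside latin-1.
        original = path.read_bytes().decode("latin-1")
        cleaned = cleaner(original)
        if cleaned != original:
            path.write_bytes(cleaned.encode("latin-1"))
            changed.append(path)
    return changed
```

Pointed at a copy of the site (never the only copy), something like clean_tree("site_copy", my_cleaner) would handle the whole tree in one go rather than a file at a time.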
Is there a way to get something like your PHP script that goes through the whole website and cleans the garbage out?
Is there a way to clean up a whole site at a time, instead of just a file at a time?
Or, do you have or know of a tool that can do that?
I know of lots of code cleaners, but they are all for individual pages. And they also only work with creating NEW pages, so you need the original Word Doc usually.
What I need is a way to clean up existing code. The new servers that are hosting our content are showing the pages with ?, ??, ??? all throughout the pages. So I need something that can do the find and replace of ALL these errors at once, by specifying the website or the folders throughout the site.
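One possible cause of stray ?, ??, ??? marks after a server move is an encoding mismatch: pages saved in Windows-1252 (common for Word output) being served as UTF-8. That is a guess about this site, not a diagnosis. If it is the cause, a folder-wide conversion along these lines might help. It is a Python sketch; the function name and the cp1252 default are assumptions, and it should be run on a backup copy first.

```python
from pathlib import Path


def reencode_tree(root: str, source_encoding: str = "cp1252") -> list[Path]:
    """Rewrite every .htm/.html file under `root` as UTF-8, in place.

    Files that already decode as valid UTF-8 are left alone.
    Returns the list of files that were converted.
    """
    converted = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in {".htm", ".html"}:
            continue
        raw = path.read_bytes()
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8 (plain ASCII included)
        except UnicodeDecodeError:
            pass
        # Assumed source encoding; bytes it does not define become U+FFFD.
        text = raw.decode(source_encoding, errors="replace")
        path.write_bytes(text.encode("utf-8"))
        converted.append(path)
    return converted
```

Any charset declared in a page's meta tag or in the server headers would need to agree with the new encoding. If the files already contain literal question marks, the original characters are gone and no script can recover them.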