How to copy WORD document to HTML without dirty code?


toplisek
12-22-2007, 05:18 AM
I have many documents in Word and need to know how to convert them to HTML, as Word produces dirty code full of errors.

Need help if this is possible?
thank you

felgall
12-22-2007, 02:11 PM
I usually find that extracting the word content as plain text and then manually adding the HTML tags in is much faster than trying to strip out all of the code that Word places there.

Another alternative is to open the Word document in OpenOffice and export it to HTML from there, which puts far less garbage into the HTML than Word does (most of Word's extra code is intended for rebuilding the document in a different version of Word).
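
For the plain-text route described above, the tagging step can be partly automated. What follows is only a minimal sketch in Python, assuming the exported text separates paragraphs with blank lines; the function name `text_to_html` is made up for this illustration, and headings, lists and links would still need to be tagged by hand.

```python
import html


def text_to_html(text: str) -> str:
    """Wrap each blank-line-separated block of plain text in <p> tags."""
    # Normalise Windows line endings, then split on blank lines.
    blocks = text.replace("\r\n", "\n").split("\n\n")
    paragraphs = []
    for block in blocks:
        # Collapse hard line breaks and repeated spaces inside a paragraph.
        collapsed = " ".join(block.split())
        if not collapsed:
            continue  # skip empty blocks left by extra blank lines
        # Escape &, < and > so the text cannot be mistaken for markup.
        paragraphs.append(f"<p>{html.escape(collapsed, quote=False)}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(text_to_html("Hello & welcome\n\nSecond\nline"))
```

The output contains nothing except paragraph tags, so there is nothing to strip afterwards.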

Major Payne
12-23-2007, 02:44 PM
First, you need to download and install Microsoft Office HTML Filter (http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN). Then open your document in Word, go to the File menu, select "Export To" and select "Compact HTML". In Office HTML Filter, click the Add button and add the HTML document you saved, then click the Apply button. Just beware that stripping this way produces strange results on more complicated documents, and you'll still have to go through the result with an HTML editor to get it right.

Now that that answers your question directly, I would do it felgall's way. Actually, I do it by saving my document in HTML form, opening it in my HTML editor, highlighting all the Office stuff, using the Delete key a lot, then using the Remove/Replace function in the editor to clean out a lot of the other inline styling for tags I don't want. Then I customize the page the way I want, validate it and I'm done. My way is probably longer, but I love this stuff so I don't mind.

Ron

kiwibrit
12-24-2007, 06:25 AM
I usually find that extracting the word content as plain text and then manually adding the HTML tags in is much faster than trying to strip out all of the code that Word places there......

Exactly what I do, too.

felgall
12-25-2007, 05:35 PM
The filters for producing so-called "compact" HTML from Word remove much of the code intended for rebuilding Word documents, but they still leave you with a significant quantity of garbage that needs to be recoded manually. The HTML options in Word are primarily there to allow Word documents to be converted into a format that can be successfully read into different versions of Word, and are not intended for actual web use (that's why the files are full of conditional comments testing which version of Microsoft Office you are opening them in).

OpenOffice will give cleaner HTML output from a Word document than using a filter on Word itself.
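
To make the kind of leftover garbage concrete, here is a rough Python sketch that strips the conditional comments mentioned above along with a few other common Word leftovers (Office-namespaced tags such as `<o:p>`, `Mso...` class names, inline `style` attributes). The patterns and the function name `strip_word_markup` are assumptions made for this illustration, not a complete list of what Word emits, and regular expressions are a blunt tool for HTML, so the result still needs checking by eye.

```python
import re

# Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
CONDITIONAL_COMMENT = re.compile(
    r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE
)
# Office-namespaced tags such as <o:p> and </o:p> (the tags go, any text stays).
OFFICE_TAG = re.compile(r"</?[ovwm]:[^>]*>", re.IGNORECASE)
# class=MsoNormal or class="MsoNormal" attributes.
MSO_CLASS = re.compile(r'\s+class="?Mso[^\s">]*"?', re.IGNORECASE)
# Quoted inline style attributes. This removes ALL inline styling, wanted or not.
STYLE_ATTR = re.compile(r"\s+style=(\"[^\"]*\"|'[^']*')", re.IGNORECASE)


def strip_word_markup(markup: str) -> str:
    """Remove a few well-known Word leftovers from an HTML string."""
    for pattern in (CONDITIONAL_COMMENT, OFFICE_TAG, MSO_CLASS, STYLE_ATTR):
        markup = pattern.sub("", markup)
    return markup


if __name__ == "__main__":
    dirty = (
        '<p class=MsoNormal style="margin:0cm">Hi<o:p></o:p></p>'
        "<!--[if gte mso 9]><xml>junk</xml><![endif]-->"
    )
    print(strip_word_markup(dirty))
```

Even after a pass like this, font tags, empty spans and odd nesting tend to remain, which is why recoding by hand is still needed.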

toicontien
12-26-2007, 02:43 PM
You could try downloading the Nvu editor, then copying the text in Word and pasting it into a new HTML document in Nvu. Nvu is free. Download (http://nvudev.com/download.php)

Major Payne
12-26-2007, 04:21 PM
Another good idea. LOL

Ron

felgall
12-26-2007, 05:12 PM
You have to be careful when copying from Word into Nvu that you don't end up copying all the garbage along with the text. The best way is to extract to a .txt file first before pasting into Nvu.

infinityspiral
12-26-2007, 05:58 PM
If you go the text editor route, I've found that saving as type ANSI from Notepad successfully converts many garbage characters like commas, quotes, hyphens, ghost spaces, and apostrophes. However, it does not convert (TM) or copyright symbols, and instead of converting (...) to its HTML character it changes it to 3 periods.
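
For the symbols that a save-as-ANSI pass leaves alone, an explicit lookup table is one option. This is a minimal Python sketch, assuming the text has already been read in as Unicode; the table holds only a handful of characters Word commonly inserts, and the names `WORD_CHARS` and `word_chars_to_entities` are made up for this illustration.

```python
# Characters Word commonly substitutes, mapped to named HTML entities.
WORD_CHARS = {
    "\u2018": "&lsquo;",   # left single quote
    "\u2019": "&rsquo;",   # right single quote / apostrophe
    "\u201c": "&ldquo;",   # left double quote
    "\u201d": "&rdquo;",   # right double quote
    "\u2013": "&ndash;",   # en dash
    "\u2014": "&mdash;",   # em dash
    "\u2026": "&hellip;",  # ellipsis
    "\u2122": "&trade;",   # trademark sign
    "\u00a9": "&copy;",    # copyright sign
    "\u00a0": "&nbsp;",    # non-breaking ("ghost") space
}
_TABLE = str.maketrans(WORD_CHARS)


def word_chars_to_entities(text: str) -> str:
    """Replace Word's typographic characters with HTML entities."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(word_chars_to_entities("\u201cHi\u201d\u2026 Acme\u2122 \u00a9 2007"))
```

Unlike the Notepad route, this keeps the ellipsis, trademark and copyright symbols as entities instead of dropping or flattening them.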

The fastest method I've found so far is writing a script on the test server and dumping the Word-formatted HTML docs into a watch folder, where a server-side script runs through each file automatically and handles the clean conversion. Mine was done with PHP.

Not a bad tool to have, really. Lots of clients want old content ported from old sites to new sites, and this mass conversion is a very common problem.
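
The original PHP script was not posted, so the following is not that script. It is a rough Python sketch of the same watch-folder idea, under these assumptions: the folder names are hypothetical, the files are Windows-1252 encoded, and `clean_markup` is a stand-in that only drops conditional comments where a real cleaner would do far more.

```python
import re
from pathlib import Path


def clean_markup(markup: str) -> str:
    """Placeholder cleaner: drops Word's conditional comments only."""
    return re.sub(
        r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", markup, flags=re.DOTALL
    )


def process_folder(inbox: Path, outbox: Path, encoding: str = "cp1252") -> list:
    """Clean every .htm/.html file in inbox and write the result to outbox.

    The encoding is kept unchanged so any charset declared inside the
    file stays accurate. Returns the names of the files processed.
    """
    outbox.mkdir(parents=True, exist_ok=True)
    processed = []
    for source in sorted(inbox.glob("*.htm*")):
        if not source.is_file():
            continue
        markup = source.read_text(encoding=encoding, errors="replace")
        (outbox / source.name).write_text(clean_markup(markup), encoding=encoding)
        processed.append(source.name)
    return processed


if __name__ == "__main__":
    # Hypothetical folder names; a scheduler such as cron could run this.
    print(process_folder(Path("watch"), Path("cleaned")))
```

Writing to a separate output folder leaves the originals untouched, so a bad cleaning rule can be fixed and the batch rerun.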

root123
12-28-2007, 04:55 AM
Try an HTML editor such as Dreamweaver.

ehcraig
06-07-2010, 02:18 PM
Hi Chris,

Is there a way to get something like your php type file that goes through the whole website & cleans the garbage out?

Is there a way to clean up a whole site at a time, instead of just a file at a time?

Or, do you have or know of a tool that can do that?


toplisek
06-08-2010, 06:37 AM
Interesting that nobody has coded a form to remove all the so-called junk from Word documents...

ehcraig
06-08-2010, 07:23 AM
I know of lots of code cleaners, but they are all for individual pages. And they also only work with creating NEW pages, so you need the original Word Doc usually.

What I need is a way to clean up existing code. The new servers that are hosting our content are showing the pages with ?, ??, ??? all throughout the pages. So I need something that can do the find & replace of ALL these errors at once, by specifying the website or the folders throughout the site.

I am sure I am not the first to need it?? :eek:
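
One possible shape for a whole-site pass is sketched below in Python. It rests on an unconfirmed assumption: that the ? marks come from pages saved in Windows-1252 (Word's usual encoding) being served by the new host as UTF-8. If that is the cause, re-encoding the files fixes every page in one run. The function name `convert_site_to_utf8` is made up for this illustration, and it should only ever be run on a copy of the site.

```python
from pathlib import Path


def convert_site_to_utf8(root: Path, source_encoding: str = "cp1252") -> list:
    """Re-encode every .htm/.html file under root that is not valid UTF-8.

    ASSUMPTION: files that fail to decode as UTF-8 are in source_encoding.
    Returns the paths (relative to root) of the files that were rewritten.
    """
    converted = []
    for page in sorted(root.rglob("*.htm*")):
        if not page.is_file():
            continue
        raw = page.read_bytes()
        try:
            raw.decode("utf-8")
            continue  # already valid UTF-8 (plain ASCII included): leave alone
        except UnicodeDecodeError:
            pass
        text = raw.decode(source_encoding, errors="replace")
        page.write_bytes(text.encode("utf-8"))
        converted.append(page.relative_to(root).as_posix())
    return converted


if __name__ == "__main__":
    # Hypothetical path to a COPY of the site.
    for name in convert_site_to_utf8(Path("site_copy")):
        print("converted", name)
```

Pages that declare a charset in a meta tag would need that declaration updated to match, which this sketch does not do.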

criterion9
06-08-2010, 07:42 AM
Well... usually companies will hire an HTML coder to fix the bad Word mistakes. At least that is how it works at the Fortune 500 company I work at.

toplisek
06-08-2010, 07:45 AM
I agree.

ehcraig
06-08-2010, 10:21 AM
Yes, we are looking into that in the long run, but right now we have repurposed our search and are looking for the following:

We are looking for Apache server experts' opinions. Are you one, or do you know of any you can recommend? Can I possibly just run a few questions by this group?

Basically, “MS formatting code in submitted articles is showing up with errors on our new hosting server.”
Any recommendations to fix via server configurations would be appreciated.

criterion9
06-08-2010, 12:43 PM
You might try setting an appropriate encoding-type header for the content. It should limit the ? symbols at least for IE in the appropriate region. Still, the best solution is to actually fix the codes (which isn't quite as easy as direct translating because MS throws a bunch of junk styles into all exported documents that have to be manually altered to achieve the desired look).
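
A small Python illustration of why the declared encoding matters. It assumes the pages contain Windows-1252 bytes from Word; the helper names `content_type` and `render` are made up, and `render` only imitates what a browser does with the charset it is told to use.

```python
def content_type(charset: str) -> str:
    """Build the Content-Type header value a server would send."""
    return f"text/html; charset={charset}"


def render(raw: bytes, charset: str) -> str:
    """Decode page bytes the way a browser would for the declared charset."""
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    # Word's curly quotes stored as Windows-1252 bytes: b'\x93quoted\x94'
    raw = "\u201cquoted\u201d".encode("cp1252")
    for charset in ("windows-1252", "utf-8"):
        print(content_type(charset), "->", render(raw, charset))
```

With the matching charset the curly quotes come back intact; with the wrong one each of those bytes becomes a replacement character, which many browsers display as a question mark. On Apache, the `AddDefaultCharset` directive is one place where a default charset for the header can be set, though whether it applies here depends on the hosting setup.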