www.webdeveloper.com

Thread: How to copy a Word document to HTML without dirty code?

  1. #1
    Join Date
    Sep 2005
    Posts
    1,636

    How to copy a Word document to HTML without dirty code?

    I have many documents in Word and need to know how to convert them to HTML, since Word produces dirty code full of errors.

    Is this possible? Any help is appreciated.
    thank you

  2. #2
    Join Date
    Mar 2005
    Location
    Sydney, Australia
    Posts
    7,974
    I usually find that extracting the word content as plain text and then manually adding the HTML tags in is much faster than trying to strip out all of the code that Word places there.

    Another alternative is to open the Word document in OpenOffice and export to HTML from that program instead, which puts far less garbage into the HTML than Word does (most of Word's garbage is intended for rebuilding the document in a different version of Word).
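
    For anyone who would rather script the "plain text first" route than retype every tag, here is a minimal Python sketch. It assumes the .txt export separates paragraphs with blank lines, and `text_to_html` is a made-up helper name, not part of any tool mentioned in this thread. Headings, lists and links would still need to be tagged by hand.

```python
import html


def text_to_html(text: str) -> str:
    """Wrap each blank-line-separated block of plain text in <p> tags."""
    paragraphs = []
    for block in text.replace("\r\n", "\n").split("\n\n"):
        block = block.strip()
        if not block:
            continue
        # Escape &, < and > so the text cannot be mistaken for markup,
        # then join hard-wrapped lines inside the paragraph with spaces.
        escaped = html.escape(block, quote=False)
        paragraphs.append("<p>" + " ".join(escaped.split()) + "</p>")
    return "\n".join(paragraphs)
```

    Feeding it the contents of the saved .txt file gives one `<p>` element per paragraph, which is a clean starting point for adding the rest of the markup manually.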

  3. #3
    Join Date
    May 2005
    Location
    Gold Coast (MS)
    Posts
    2,219
    First, you need to download and install Microsoft Office HTML Filter. Then open your document in Word, go to the File menu, select "Export To" and select "Compact HTML". In Office HTML Filter, click the Add button and add the HTML document you saved, then click the Apply button. Just beware that stripping this way produces strange results on more complicated documents, and you'll still have to go through the result with an HTML editor to get it right.

    Now that that answers your question directly, I would do it felgall's way. Actually, I do it by saving my document in HTML form, opening it in my HTML editor, highlighting all the Office stuff, using the Delete key a lot, then using the Remove/Replace function in the editor to clean out a lot of the other inline styling for tags I don't want. Then I customize the page the way I want, validate it and I'm done. My way is probably longer, but I love this stuff so I don't mind.

    Ron
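
    The "Remove/Replace" pass described above can be roughly approximated with a few regular expressions. This is only a sketch in Python: the patterns are assumptions about what typical Word output looks like (`<o:p>` tags, `class=MsoNormal`, inline `style` attributes), `strip_word_markup` is a hypothetical name, and the sketch removes every inline style rather than a chosen subset. Regexes are a blunt tool for HTML, so the result still needs checking in an editor and a validator.

```python
import re

# Assumed patterns for typical Word-generated markup; real files may need more.
O_TAGS = re.compile(r"</?o:[^>]*>", re.IGNORECASE)            # <o:p> and </o:p>
MSO_CLASS = re.compile(r'\s+class="?Mso[^\s">]*"?', re.IGNORECASE)
STYLE_ATTR = re.compile(r"""\s+style=("[^"]*"|'[^']*')""", re.IGNORECASE)


def strip_word_markup(markup: str) -> str:
    """Remove Office namespace tags, Mso* classes and inline style attributes."""
    markup = O_TAGS.sub("", markup)
    markup = MSO_CLASS.sub("", markup)
    markup = STYLE_ATTR.sub("", markup)
    return markup
```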

  4. #4
    Join Date
    Jun 2005
    Location
    United Kingdom
    Posts
    1,043
    Quote Originally Posted by felgall
    I usually find that extracting the word content as plain text and then manually adding the HTML tags in is much faster than trying to strip out all of the code that Word places there......
    Exactly what I do, too.

  5. #5
    Join Date
    Mar 2005
    Location
    Sydney, Australia
    Posts
    7,974
    The filters for producing so-called "compact" HTML from Word remove much of the code intended for rebuilding Word documents, but they still leave you with a significant quantity of garbage that needs to be recoded manually. The HTML options in Word are primarily there to allow Word documents to be converted into a format that can be successfully read into different versions of Word and are not intended for actual web use (that's why they are full of conditional comments testing for which version of Microsoft Office you are opening the files in).

    OpenOffice will give cleaner HTML output from a Word document than using a filter on Word itself.
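
    To illustrate the conditional comments mentioned above, here is a small Python sketch that strips them. It assumes the two forms Word commonly emits: hidden blocks such as `<!--[if gte mso 9]> ... <![endif]-->`, which are dropped entirely, and revealed markers such as `<![if !supportLists]> ... <![endif]>`, where only the markers are removed and the content between them is kept. `strip_conditional_comments` is an illustrative name only.

```python
import re

# Hidden form: the whole block is Office-only, so drop it completely.
COND_HIDDEN = re.compile(
    r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE
)
# Revealed form: browsers show the content, so remove only the markers.
COND_MARKER = re.compile(r"<!\[(?:if[^\]]*|endif)\]>", re.IGNORECASE)


def strip_conditional_comments(markup: str) -> str:
    """Remove Office conditional comments from Word-generated HTML."""
    markup = COND_HIDDEN.sub("", markup)
    return COND_MARKER.sub("", markup)
```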

  6. #6
    Join Date
    Feb 2003
    Location
    Michigan, USA
    Posts
    5,774

  7. #7
    Join Date
    May 2005
    Location
    Gold Coast (MS)
    Posts
    2,219
    Another good idea. LOL

    Ron

  8. #8
    Join Date
    Mar 2005
    Location
    Sydney, Australia
    Posts
    7,974
    You have to be careful when copying from Word into NVU that you don't end up copying all the garbage along with the text. The best way is to extract to a .txt file first before pasting into NVU.

  9. #9
    Join Date
    Mar 2007
    Location
    USA
    Posts
    449
    If you go the text editor route, I've found that saving as type ANSI from Notepad successfully converts many garbage characters such as commas, quotes, hyphens, ghost spaces, and apostrophes. However, it does not convert (TM) or copyright symbols, and instead of converting (...) to its HTML character entity it changes it to three periods.
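
    As a rough alternative to the Notepad trick, the same characters can be mapped to HTML entities in a script, which also covers the (TM), copyright and ellipsis cases. A minimal Python sketch follows; the table is an assumption about which characters Word usually inserts, not a complete list, and `to_entities` is a hypothetical name.

```python
# Characters Word commonly substitutes while typing, mapped to HTML entities.
REPLACEMENTS = {
    "\u2018": "&lsquo;",   # left single quote
    "\u2019": "&rsquo;",   # right single quote / apostrophe
    "\u201c": "&ldquo;",   # left double quote
    "\u201d": "&rdquo;",   # right double quote
    "\u2013": "&ndash;",   # en dash
    "\u2014": "&mdash;",   # em dash
    "\u2026": "&hellip;",  # ellipsis
    "\u2122": "&trade;",   # trademark sign
    "\u00a9": "&copy;",    # copyright sign
    "\u00a0": "&nbsp;",    # non-breaking ("ghost") space
}
_TABLE = str.maketrans(REPLACEMENTS)


def to_entities(text: str) -> str:
    """Replace typographic characters with their named HTML entities."""
    return text.translate(_TABLE)
```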

    The fastest method I've found so far is writing a script on the test server and dumping the Word-formatted HTML docs into a watch folder, where a server-side script runs through each file automatically and handles the clean conversion. Mine was done with PHP.

    Not a bad tool to have, really. Lots of clients want old content ported from old sites to new sites, and this mass conversion is a very common problem.
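
    The PHP script itself was not posted, so the following is not that script, just a sketch of the watch-folder idea in Python. It makes one pass over an inbox folder and writes cleaned copies to an outbox; the folder names, the `clean_folder` name and the `clean` callback are all assumptions for illustration. It also assumes the Word files were saved as Windows-1252.

```python
from pathlib import Path


def clean_folder(inbox, outbox, clean):
    """Run `clean` over each .htm/.html file in inbox and save it to outbox."""
    inbox, outbox = Path(inbox), Path(outbox)
    outbox.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(inbox.iterdir()):
        if not src.is_file() or src.suffix.lower() not in {".htm", ".html"}:
            continue
        # Assumption: Word saved the file as Windows-1252 (cp1252).
        text = src.read_text(encoding="cp1252", errors="replace")
        (outbox / src.name).write_text(clean(text), encoding="utf-8")
        done.append(src.name)
    return done
```

    Calling `clean_folder("inbox", "outbox", my_cleaner)` from a scheduled job would approximate the "watch" behaviour, where `my_cleaner` is whatever function turns one dirty HTML string into a clean one. The cleaned files are written as UTF-8, so any charset declaration in the pages has to agree with that.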

  10. #10
    Join Date
    Apr 2007
    Posts
    33
    Use an HTML editor such as Dreamweaver.

  11. #11
    Join Date
    Jun 2004
    Location
    Washington, DC
    Posts
    8

    Word to HTML cleaner for whole website?

    Hi Chris,

    Is there a way to get something like your PHP script that goes through the whole website and cleans the garbage out?

    Is there a way to clean up a whole site at a time, instead of just a file at a time?

    Or, do you have or know of a tool that can do that?


    Quote Originally Posted by infinityspiral
    If you go the text editor route, I've found that saving as type ANSI from Notepad successfully converts many garbage characters such as commas, quotes, hyphens, ghost spaces, and apostrophes. However, it does not convert (TM) or copyright symbols, and instead of converting (...) to its HTML character entity it changes it to three periods.

    The fastest method I've found so far is writing a script on the test server and dumping the Word-formatted HTML docs into a watch folder, where a server-side script runs through each file automatically and handles the clean conversion. Mine was done with PHP.

    Not a bad tool to have, really. Lots of clients want old content ported from old sites to new sites, and this mass conversion is a very common problem.

  12. #12
    Join Date
    Sep 2005
    Posts
    1,636
    Interesting that nobody has coded a form to remove all the so-called junk from Word documents...

  13. #13
    Join Date
    Jun 2004
    Location
    Washington, DC
    Posts
    8

    Re: Word to HTML cleaner for whole website?

    I know of lots of code cleaners, but they are all for individual pages. They also only work when creating NEW pages, so you usually need the original Word doc.

    What I need is a way to clean up existing code. The new servers that are hosting our content are showing the pages with ?, ??, ??? all throughout the pages. So I need something that can do the find and replace of ALL these errors at once, by specifying the website or the folders throughout the site.

    I am sure I am not the first to need it??
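
    Stray ? marks like these often mean the pages were saved in one character encoding and are being served as another. If that is the cause here (an assumption, since the server settings are not given), one site-wide fix is to rewrite every non-ASCII character as a numeric character reference, which displays the same under any charset. A hedged Python sketch, with `to_ascii_html` and `fix_site` as made-up names and Windows-1252 assumed as the original encoding:

```python
import os


def to_ascii_html(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode the page, then write non-ASCII characters as &#NNNN; references."""
    text = raw.decode(source_encoding, errors="replace")
    return text.encode("ascii", errors="xmlcharrefreplace")


def fix_site(root):
    """Rewrite every .htm/.html file under root in place; return changed paths."""
    changed = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith((".htm", ".html")):
                continue
            path = os.path.join(folder, name)
            with open(path, "rb") as fh:
                raw = fh.read()
            fixed = to_ascii_html(raw)
            if fixed != raw:  # leave pure-ASCII files untouched
                with open(path, "wb") as fh:
                    fh.write(fixed)
                changed.append(path)
    return changed
```

    Since this overwrites files in place, it is safer to run it on a copy of the site first and spot-check a few pages in a browser.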

  14. #14
    Join Date
    Jan 2009
    Posts
    3,346
    Well... usually companies will hire an HTML coder to fix the bad Word mistakes. At least that is how it works at the Fortune 500 company I work at.

  15. #15
    Join Date
    Sep 2005
    Posts
    1,636
    I agree.
