Word can save a document directly as HTML, but the output is full of junk markup. I used to clean it up with Dreamweaver's Clean Up HTML command, first removing the Word-specific tags and then deleting stray font, b, and span tags. After that I ran a few regular-expression replacements in EditPlus to finally get the clean HTML I wanted. Of course, the cleanest approach is still to copy the plain text into a text editor and write the HTML tags yourself.
Today I saw that Lifehacker lists several ways to turn Word output into clean HTML:
1. Process the file with the open-source HTML Tidy Library Project.
2. Microsoft's official site also offers the Office HTML Filter 2.0 tool, which removes the unwanted code produced when Word 2000 saves a document as HTML.
3. Use the online Word HTML Cleaner tool. It can only process files from Word 2000 or earlier versions.
4. Use regular expressions (in fact, the tools listed above essentially solve the problem the same way).
Delete unwanted tags:
<[/]?(font|span|xml|[ovwxp]:\w+)[^>]*?>
- Replace every match with the empty string.
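As a rough illustration, the tag-removal step might look like this in Python. This is only a sketch: it assumes the pattern is meant to read `[ovwxp]:\w+` with no stray spaces, the function name `strip_word_tags` and the sample string are made up for the example, and the case-insensitive flag is my own addition rather than part of the original pattern.

```python
import re

# Tag-removal pattern from the text, written without stray spaces and with \w.
# re.IGNORECASE is an assumption, added so <SPAN> and <span> are both caught.
WORD_TAGS = re.compile(r"<[/]?(font|span|xml|[ovwxp]:\w+)[^>]*?>", re.IGNORECASE)


def strip_word_tags(html: str) -> str:
    """Delete matching opening and closing tags; text between them is kept."""
    return WORD_TAGS.sub("", html)


# Hypothetical sample of Word-style markup.
sample = '<p><span style="color:red"><font face="Arial">Hello</font></span><o:p></o:p></p>'
print(strip_word_tags(sample))  # <p>Hello</p>
```

Note that only the tags themselves are deleted, so any text sitting inside an `<xml>` block would survive and may still need to be removed by hand.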
Delete class, style, and other unwanted attributes:
<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|"[^"]*"|[^>]+)([^>]*)>
- Replace every match with <$1$2>.
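The attribute-removal step can be sketched in Python as well. Because each match consumes a whole tag, a single pass strips at most one attribute per tag, so this sketch repeats the replacement until nothing changes. The function name `strip_word_attrs`, the sample string, the case-insensitive flag, and the final whitespace tidy-up are all assumptions added for the example, and the pattern is assumed to read `"[^"]*"` for double-quoted values.

```python
import re

# Attribute-removal pattern from the text; Python uses \1\2 where the text has $1$2.
WORD_ATTRS = re.compile(
    r"""<([^>]*)(?:class|lang|style|size|face|[ovwxp]:\w+)=(?:'[^']*'|"[^"]*"|[^>]+)([^>]*)>""",
    re.IGNORECASE,  # assumption: match CLASS= as well as class=
)


def strip_word_attrs(html: str) -> str:
    """Remove the listed attributes, repeating until no tag changes any more."""
    previous = None
    while previous != html:
        previous = html
        html = WORD_ATTRS.sub(r"<\1\2>", html)
    # Extra tidy-up (not in the original recipe): drop the spaces left
    # behind in tags that lost all of their attributes.
    return re.sub(r"<(\w+)\s+>", r"<\1>", html)


# Hypothetical sample of Word-style markup.
sample = '<p class="MsoNormal" style="margin:0" lang="EN-US">Hello</p>'
print(strip_word_attrs(sample))  # <p>Hello</p>
```

Attributes that are not in the list, such as `href`, are left alone.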
A detailed explanation is given in the article "Clean Word HTML using Regular Expressions".