Word can save a document as an HTML file, which makes it quick to display Word content on a web page. This is especially handy for tables: they contain tr, td, th, rowspan, colspan and so on, and writing them by hand is tedious.
However, the HTML that Word produces carries a lot of formatting code by default. How can this redundant code be removed so that only the main content remains?
I originally planned to find a tool online, but there was nothing ready-made. The usual recommendation is to strip the code with a text editor's find-and-replace, which cannot be reused. So I wrote a small piece of Node.js code to remove the redundant code.
The main steps are:
- Read the text content of the HTML file with Node.js
- Extract the table with the substring function
- Remove redundant tags with regular expressions
- Remove redundant attributes with regular expressions
- Remove extra whitespace with regular expressions
var fs = require('fs');

// Asynchronous read
fs.readFile('static/detail/county-hhz.html', function (err, data) {
  if (err) {
    return console.error(err);
  }

  // Step 1: get the table content
  var content = data.toString();
  content = content.substring(content.indexOf('<table'), content.indexOf('</table>') + 8);

  // Step 2: remove the redundant tags, keeping their inner content
  ['span', 'p', 'o', 'font'].forEach(item => {
    content = content.replace(new RegExp(`<${item}(.*?)>(.*?)<\/${item}.*?>`, 'gi'), function (match, p1, p2) {
      return p2;
    });
  });

  // Step 3: remove the redundant attributes
  content = content.replace(/style=".*?"/g, ''); // remove the style attribute
  // [^\s>]* stops at whitespace or the closing ">", so the rest of the tag is not swallowed
  content = content.replace(/(class|border|cellspacing|MsoNormalTable|valign|width|center)(=[^\s>]*)/g, '');

  // Step 4: remove the extra whitespace
  // collapse each run of whitespace into a single space
  content = content.replace(/(\S+)(\s+)/g, function (match, p1, p2) {
    return p1 + ' ';
  });
  // drop a whitespace character that sits directly before "<" or ">"
  content = content.replace(/(\s)(>|<)/g, function (match, p1, p2) {
    return p2;
  });

  console.log(content);
});
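To try the four steps on a small input without touching the file system, they can be wrapped in a function that takes an HTML string and returns the cleaned table. The following is only a sketch under some assumptions: the name cleanWordTable, its options parameter, and the sample markup are all hypothetical and not part of the script above. It also strips opening and closing tags one at a time instead of matching them in pairs, which is a variation on the approach above meant to cope with content that spans several lines.

```javascript
// Hypothetical helper: the name cleanWordTable and the options object are
// illustrative assumptions, not part of the original script.
function cleanWordTable(html, options = {}) {
  const tags = options.tags || ['span', 'p', 'o:p', 'font'];
  const attrs = options.attrs || ['style', 'class', 'border', 'cellspacing', 'valign', 'width'];

  // Step 1: keep only the first <table>...</table> block.
  const start = html.indexOf('<table');
  const end = html.indexOf('</table>', start);
  if (start === -1 || end === -1) {
    return '';
  }
  let content = html.substring(start, end + '</table>'.length);

  // Step 2: delete the listed opening and closing tags, leaving their inner text.
  for (const tag of tags) {
    content = content.replace(new RegExp(`</?${tag}(\\s[^>]*)?>`, 'gi'), '');
  }

  // Step 3: delete the listed attributes, whether the value is double-quoted,
  // single-quoted, or unquoted. Attributes not listed (rowspan, colspan) stay.
  const attrPattern = new RegExp(
    `\\s(${attrs.join('|')})=("[^"]*"|'[^']*'|[^\\s>]*)`,
    'gi'
  );
  content = content.replace(attrPattern, '');

  // Step 4: collapse whitespace runs, then drop whitespace around "<" and ">".
  content = content.replace(/\s+/g, ' ');
  content = content.replace(/\s*([<>])\s*/g, '$1');

  return content;
}

// Made-up sample resembling Word's HTML output (an assumption for illustration).
const sample = [
  '<html><body>',
  "<table class=MsoNormalTable border=1 cellspacing=0 style='border-collapse:collapse'>",
  ' <tr>',
  "  <td width=100 colspan=2 style='padding:0cm'>",
  "  <p class=MsoNormal><span style='font-size:10pt'>Name</span><o:p></o:p></p>",
  '  </td>',
  ' </tr>',
  '</table>',
  '</body></html>'
].join('\n');

console.log(cleanWordTable(sample));
// prints: <table><tr><td colspan=2>Name</td></tr></table>
```

Because the function only works on strings, it can be combined with fs.readFile as in the script above by passing data.toString() to it. Note that regular expressions are a rough tool for HTML: text inside a cell that happens to look like a listed attribute (for example " width=5") would be removed as well.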