Problem description: We have recently been using JTidy and XSLT to extract structured information from web pages. On pages that contain many scripts, JTidy fails to clean the HTML and throws the exception in the title. We eventually traced the failure to script parsing: nonstandard content inside the scripts triggers the exception.

Solution: Our first idea was to fix this by modifying the JTidy source code, but that turned out to be impractical: modifying the source might introduce other problems, and understanding the source would take considerable time. We therefore chose to preprocess the page and delete the scripts before handing it to JTidy.

Code:

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.tags.ScriptTag;
import org.htmlparser.tags.StyleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Parse the page with HtmlParser and strip <script> and <style> tags
public static String getFilterBody(String strBody) {
    Parser parser = Parser.createParser(strBody, "UTF-8");
    NodeList list;
    String reValue = strBody;
    try {
        list = parser.parse(null);
        visitNodeList(list);
        reValue = list.toHtml();
    } catch (ParserException e1) {
        // on parse failure, fall back to the original page
    }
    return reValue;
}

// Recursively filter the node tree
private static void visitNodeList(NodeList list) {
    for (int i = 0; i < list.size(); i++) {
        Node node = list.elementAt(i);
        if (node instanceof Tag) {
            if (node instanceof ScriptTag) {
                list.remove(i);
                i--; // stay on the same index after removal
                continue;
            }
            if (node instanceof StyleTag) {
                list.remove(i);
                i--;
                continue;
            }
            // add any other tags to delete here
        }
        NodeList children = node.getChildren();
        if (children != null && children.size() > 0)
            visitNodeList(children);
    }
}
```

However, the same problem showed up while deleting the scripts: script parsing got confused, and tags appearing inside scripts were identified as normal tags. For example, a '<span></span>' written inside a <script> block was identified as the end of the script, so the script was captured, and therefore deleted, incompletely. After searching online we found the solution: the following two HtmlParser parameters control how scripts are parsed.
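The truncation just described can be illustrated with a small, self-contained sketch. This is not HtmlParser's actual implementation; `EtagoDemo` and `extractScript` are hypothetical names used only for illustration. The strict strategy stops at the first ETAGO ("</") wherever it appears; the lax strategy tracks quotes, so an ETAGO inside a string literal does not end the script.

```java
// Illustration only: why quote-blind script scanning truncates scripts.
public class EtagoDemo {

    // Return the script body up to its end tag. In strict mode we stop at the
    // first "</" anywhere; in lax mode, "</" inside quoted strings is ignored.
    static String extractScript(String body, boolean strict) {
        boolean inSingle = false, inDouble = false;
        for (int i = 0; i + 1 < body.length(); i++) {
            char c = body.charAt(i);
            if (!strict) {
                if (c == '\'' && !inDouble)
                    inSingle = !inSingle;   // track single-quoted strings
                else if (c == '"' && !inSingle)
                    inDouble = !inDouble;   // track double-quoted strings
            }
            if (c == '<' && body.charAt(i + 1) == '/' && !inSingle && !inDouble)
                return body.substring(0, i); // ETAGO found: script ends here
        }
        return body; // no unshielded ETAGO: the whole body is script
    }

    public static void main(String[] args) {
        String script = "document.write(\"<span></span>\"); var done = true;";
        // strict mode cuts the script off inside the string literal
        System.out.println(extractScript(script, true));
        // lax mode keeps the whole script because the quotes shield "</"
        System.out.println(extractScript(script, false));
    }
}
```

With the sample script above, the strict scan returns only `document.write("<span>` because the "</" of the quoted `</span>` is taken as the end of the script, while the lax scan returns the full script. This is exactly the difference the parameter below toggles.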
```java
org.htmlparser.scanners.ScriptScanner.STRICT = false;
org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;
```

In our case, setting just one of them was enough. The following is the official description of `ScriptScanner.STRICT`:

```java
/**
 * Strict parsing of CDATA flag.
 * If this flag is set true, the parsing of script is performed without
 * regard to quotes. This means that erroneous script such as:
 * <pre>
 * document.write ("</script>");
 * </pre>
 * will be parsed in strict accordance with appendix
 * <a href="http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data">
 * B.3.2 Specifying non-HTML data</a> of the
 * <a href="http://www.w3.org/TR/html4/">HTML 4.01 Specification</a> and
 * hence will be split into two or more nodes. Correct javascript would
 * escape the ETAGO:
 * <pre>
 * document.write ("<\/script>");
 * </pre>
 * If true, CDATA parsing will stop at the first ETAGO ("</") no matter
 * whether it is quoted or not. If false, balanced quotes (either single or
 * double) will shield ETAGO. Because of the possibility of quotes within
 * single or multiline comments, these are also parsed. In most cases,
 * users prefer non-strict handling since there is so much broken script
 * out in the wild.
 */
```

And of `Lexer.STRICT_REMARKS`:

```java
/**
 * Process remarks strictly flag.
 * If <code>true</code>, remarks are not terminated by ---&gt;
 * or --!&gt;, i.e. more than two dashes. If <code>false</code>,
 * a more lax (and closer to typical browser handling) remark parsing
 * is used.
 * Default <code>true</code>.
 */
```

By default, HtmlParser parses according to the strict HTML standard, so errors can occur when a nonstandard tag is encountered. With the two parameters above set to false, parsing is no longer strict and can cope with these messy real-world pages.
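The `STRICT_REMARKS` behavior can also be sketched in a few lines. Again this is an illustration, not HtmlParser's code; `RemarkDemo` and `remarkEnd` are hypothetical names. In strict mode a comment (remark) must end with exactly two dashes and a `>`; in lax mode, as in typical browsers, a longer dash run such as `--->` also terminates it.

```java
// Illustration only: strict vs lax HTML comment (remark) termination.
public class RemarkDemo {

    // Find where a comment that starts at the beginning of `html` ("<!--...")
    // ends. Returns the index just past the terminating '>', or -1 if the
    // comment is unterminated under the chosen rule.
    static int remarkEnd(String html, boolean strict) {
        int i = html.indexOf("--", 4); // skip the opening "<!--"
        while (i >= 0) {
            int j = i;
            while (j < html.length() && html.charAt(j) == '-')
                j++;                       // measure the whole dash run
            int dashes = j - i;
            if (j < html.length() && html.charAt(j) == '>'
                    && (!strict || dashes == 2))
                return j + 1;              // lax accepts any run of 2+ dashes
            i = html.indexOf("--", j);     // skip past this dash run
        }
        return -1;
    }
}
```

So for `<!-- note ---> x`, strict parsing reports the comment as unterminated (and would keep swallowing the rest of the page), while lax parsing ends it at the `--->`, which is why turning strictness off helps on pages with sloppy comments.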