Problem description: We have recently been using JTidy and XSLT to extract structured information from web pages. On pages that contain many scripts, JTidy fails to clean the HTML and throws the exception in the title. We eventually traced the failure to script parsing: nonstandard content inside the scripts triggers the exception.

Solution: Our first idea was to fix this by modifying the JTidy source code, but that turned out to be impractical: modifying the source might introduce other problems, and understanding the source would take considerable time. We therefore chose to preprocess the page and delete the scripts before handing it to JTidy.

Code:

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.tags.ScriptTag;
import org.htmlparser.tags.StyleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Parse the page with HtmlParser and strip <script> and <style> tags
public static String getFilterBody(String strBody) {
    Parser parser = Parser.createParser(strBody, "UTF-8");
    NodeList list;
    String reValue = strBody;
    try {
        list = parser.parse(null);
        visitNodeList(list);
        reValue = list.toHtml();
    } catch (ParserException e1) {
        // on parse failure, fall back to the original page
    }
    return reValue;
}

// Recursively filter the node tree
private static void visitNodeList(NodeList list) {
    for (int i = 0; i < list.size(); i++) {
        Node node = list.elementAt(i);
        if (node instanceof Tag) {
            if (node instanceof ScriptTag) {
                list.remove(i);
                i--; // stay on the same index after removal
                continue;
            }
            if (node instanceof StyleTag) {
                list.remove(i);
                i--;
                continue;
            }
            // add any other tags to delete here
        }
        NodeList children = node.getChildren();
        if (children != null && children.size() > 0)
            visitNodeList(children);
    }
}
```

However, the same problem showed up while deleting the scripts: script parsing got confused, and tags appearing inside scripts were identified as normal tags. For example, a '<span></span>' written inside a <script> block was identified as the end of the script, so the script was captured, and therefore deleted, incompletely. After searching online we found the solution: the following two HtmlParser parameters control how scripts are parsed.
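The truncation just described can be illustrated with a small, self-contained sketch. This is not HtmlParser's actual implementation; `EtagoDemo` and `extractScript` are hypothetical names used only for illustration. The strict strategy stops at the first ETAGO ("</") wherever it appears; the lax strategy tracks quotes, so an ETAGO inside a string literal does not end the script.

```java
// Illustration only: why quote-blind script scanning truncates scripts.
public class EtagoDemo {

    // Return the script body up to its end tag. In strict mode we stop at the
    // first "</" anywhere; in lax mode, "</" inside quoted strings is ignored.
    static String extractScript(String body, boolean strict) {
        boolean inSingle = false, inDouble = false;
        for (int i = 0; i + 1 < body.length(); i++) {
            char c = body.charAt(i);
            if (!strict) {
                if (c == '\'' && !inDouble)
                    inSingle = !inSingle;   // track single-quoted strings
                else if (c == '"' && !inSingle)
                    inDouble = !inDouble;   // track double-quoted strings
            }
            if (c == '<' && body.charAt(i + 1) == '/' && !inSingle && !inDouble)
                return body.substring(0, i); // ETAGO found: script ends here
        }
        return body; // no unshielded ETAGO: the whole body is script
    }

    public static void main(String[] args) {
        String script = "document.write(\"<span></span>\"); var done = true;";
        // strict mode cuts the script off inside the string literal
        System.out.println(extractScript(script, true));
        // lax mode keeps the whole script because the quotes shield "</"
        System.out.println(extractScript(script, false));
    }
}
```

With the sample script above, the strict scan returns only `document.write("<span>` because the "</" of the quoted `</span>` is taken as the end of the script, while the lax scan returns the full script. This is exactly the difference the parameter below toggles.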
```java
org.htmlparser.scanners.ScriptScanner.STRICT = false;
org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;
```

In our case, setting just one of them was enough. The following is the official description of `ScriptScanner.STRICT`:

```java
/**
 * Strict parsing of CDATA flag.
 * If this flag is set true, the parsing of script is performed without
 * regard to quotes. This means that erroneous script such as:
 * <pre>
 * document.write ("</script>");
 * </pre>
 * will be parsed in strict accordance with appendix
 * <a href="http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data">
 * B.3.2 Specifying non-HTML data</a> of the
 * <a href="http://www.w3.org/TR/html4/">HTML 4.01 Specification</a> and
 * hence will be split into two or more nodes. Correct javascript would
 * escape the ETAGO:
 * <pre>
 * document.write ("<\/script>");
 * </pre>
 * If true, CDATA parsing will stop at the first ETAGO ("</") no matter
 * whether it is quoted or not. If false, balanced quotes (either single or
 * double) will shield ETAGO. Because of the possibility of quotes within
 * single or multiline comments, these are also parsed. In most cases,
 * users prefer non-strict handling since there is so much broken script
 * out in the wild.
 */
```

And of `Lexer.STRICT_REMARKS`:

```java
/**
 * Process remarks strictly flag.
 * If <code>true</code>, remarks are not terminated by ---&gt;
 * or --!&gt;, i.e. more than two dashes. If <code>false</code>,
 * a more lax (and closer to typical browser handling) remark parsing
 * is used.
 * Default <code>true</code>.
 */
```

By default, HtmlParser parses according to the strict HTML standard, so errors can occur when a nonstandard tag is encountered. With the two parameters above set to false, parsing is no longer strict and can cope with these messy real-world pages.
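The `STRICT_REMARKS` behavior can also be sketched in a few lines. Again this is an illustration, not HtmlParser's code; `RemarkDemo` and `remarkEnd` are hypothetical names. In strict mode a comment (remark) must end with exactly two dashes and a `>`; in lax mode, as in typical browsers, a longer dash run such as `--->` also terminates it.

```java
// Illustration only: strict vs lax HTML comment (remark) termination.
public class RemarkDemo {

    // Find where a comment that starts at the beginning of `html` ("<!--...")
    // ends. Returns the index just past the terminating '>', or -1 if the
    // comment is unterminated under the chosen rule.
    static int remarkEnd(String html, boolean strict) {
        int i = html.indexOf("--", 4); // skip the opening "<!--"
        while (i >= 0) {
            int j = i;
            while (j < html.length() && html.charAt(j) == '-')
                j++;                       // measure the whole dash run
            int dashes = j - i;
            if (j < html.length() && html.charAt(j) == '>'
                    && (!strict || dashes == 2))
                return j + 1;              // lax accepts any run of 2+ dashes
            i = html.indexOf("--", j);     // skip past this dash run
        }
        return -1;
    }
}
```

So for `<!-- note ---> x`, strict parsing reports the comment as unterminated (and would keep swallowing the rest of the page), while lax parsing ends it at the `--->`, which is why turning strictness off helps on pages with sloppy comments.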