Parse HTML documents using jsoup

Last Update:2015-09-12 Source: Internet

Author: User


1. Parse and traverse an HTML document

How to parse an HTML document: String html = "

Use the Jsoup.parseBodyFragment(String html) method:

    String html = "<div><p>Lorem ipsum.</p>";
    Document doc = Jsoup.parseBodyFragment(html);
    Element body = doc.body();
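As a rough, self-contained sketch of the call above: the class name BodyFragmentSketch and the helper parseFragment are made up for illustration, and the code assumes jsoup is on the classpath. It shows that the fragment ends up inside the body of a shell document even though the div is never closed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BodyFragmentSketch {

    /** Parses markup as a body fragment and returns the body element (hypothetical helper). */
    static Element parseFragment(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.body();
    }

    public static void main(String[] args) {
        // The <div> is never closed; jsoup closes it while building the tree.
        Element body = parseFragment("<div><p>Lorem ipsum.</p>");
        System.out.println(body.tagName());           // the element returned is the body itself
        System.out.println(body.children().size());   // number of direct children of the body
        System.out.println(body.select("p").text());  // text of the paragraph
    }
}
```

Keeping the parse step in one small method makes it easy to swap in Jsoup.parse(String html) and compare the resulting trees.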

The parseBodyFragment method creates an empty shell document and inserts the parsed HTML into the body element. The normal Jsoup.parse(String html) method usually returns the same result, but explicitly treating user input as a body fragment ensures that any bad HTML provided by the user is parsed into the body element.

The Document.body() method returns the document's body element, which holds the parsed child elements; it gives you the same element that doc.getElementsByTag("body") finds.

Stay safe: If you allow users to input HTML content, be careful to avoid cross-site scripting attacks. Use the Whitelist-based cleaner and the clean(String bodyHtml, Whitelist whitelist) method to strip malicious content entered by users.

4. Load a Document from a URL

Problem: You need to fetch and parse an HTML document from a website and search it for relevant data.

Solution: Use the Jsoup.connect(String url) method:

    Document doc = Jsoup.connect("http://example.com/").get();
    String title = doc.title();

Description: The connect(String url) method creates a new Connection, and get() fetches and parses the HTML file. If an error occurs while fetching the URL, an IOException is thrown, which you should handle as appropriate.

The Connection interface is designed for method chaining, so you can build more specific requests, as follows:

    Document doc = Jsoup.connect("http://example.com")
        .data("query", "Java")
        .userAgent("Mozilla")
        .cookie("auth", "token")
        .timeout(3000)
        .post();

This method only supports web URLs (the http and https protocols). If you need to load data from a file, use parse(File in, String charsetName) instead.
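The IOException handling mentioned above could look roughly like the following sketch. The class ConnectionSketch and its two helpers are hypothetical names, and the user agent, cookie, and timeout values are simply the ones from the article's example. Building the request is separated from sending it, so the configuration can be inspected without touching the network.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ConnectionSketch {

    /** Builds the request without sending it; nothing touches the network here. */
    static Connection buildRequest(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla")
                .cookie("auth", "token")
                .timeout(3000); // milliseconds
    }

    /** Fetches the page title, or returns the fallback if the request fails. */
    static String titleOrDefault(String url, String fallback) {
        try {
            Document doc = buildRequest(url).get();
            return doc.title();
        } catch (IOException e) {
            // Covers timeouts, unreachable hosts, and HTTP error statuses.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Connection.Request request = buildRequest("http://example.com/").request();
        System.out.println(request.timeout());
        System.out.println(request.cookie("auth"));
        System.out.println(titleOrDefault("http://example.com/", "(unavailable)"));
    }
}
```

Returning a fallback is only one way to handle the exception; depending on the application, rethrowing or logging may be more appropriate.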

5. Load a Document from a file

Problem: You have an HTML file on the local hard disk and need to parse it to extract data or modify it.

Solution: Use the static Jsoup.parse(File in, String charsetName, String baseUri) method:

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Description: The parse(File in, String charsetName, String baseUri) method loads and parses an HTML file. If an error occurs while loading the file, an IOException is thrown, which you should handle as appropriate.

The baseUri parameter is used to resolve relative URLs in the file. If you do not need it, you can pass an empty string.

A sibling method, parse(File in, String charsetName), uses the file's location as the baseUri. This is useful if you are working on a local file system copy of a website whose relative links also point to the file system.

6. Use DOM methods to navigate a document

Problem: You have an HTML document that you want to extract data from, and you know its general structure.

Solution: After parsing the HTML into a Document, use the DOM-like methods to work with it. Sample code:

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }
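The sample above reads from a file on disk. A minimal variant that runs on an inline string might look like this; the class DomTraversalSketch, the helper linksInside, and the sample markup are all illustrative assumptions. It also guards against getElementById returning null when the id is missing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class DomTraversalSketch {

    /** Collects "text -> href" for every link inside the element with the given id. */
    static List<String> linksInside(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element container = doc.getElementById(id);
        List<String> result = new ArrayList<>();
        if (container == null) {
            return result; // getElementById returns null when the id is absent
        }
        for (Element link : container.getElementsByTag("a")) {
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical markup: two links inside #content, one outside it.
        String html = "<div id=content><a href='/one'>One</a> <a href='/two'>Two</a></div>"
                + "<a href='/outside'>Outside</a>";
        linksInside(html, "content").forEach(System.out::println);
    }
}
```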

Description: Element objects provide a range of DOM-like methods for finding elements and for extracting and manipulating their data. The details are as follows:

    getElementById(String id)
    getElementsByTag(String tag)
    getElementsByClass(String className)
    getElementsByAttribute(String key) (and related methods)
    Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
    Graph: parent(), children(), child(int index)

Element data

    attr(String key) to get an attribute, attr(String key, String value) to set an attribute
    attributes() to get all attributes
    id(), className() and classNames()
    text() to get the text content, text(String value) to set the text content
    html() to get the inner HTML, html(String value) to set the inner HTML
    outerHtml() to get the outer HTML of the element
    data() to get data content (for example, of script and style tags)
    tag() and tagName()

Manipulating HTML and text

    append(String html), prepend(String html)
    appendText(String text), prependText(String text)
    appendElement(String tagName), prependElement(String tagName)
    html(String value)

7. Use selector syntax to find elements

Problem: You want to find or manipulate elements using a CSS or jQuery-like selector syntax.

Solution: Use the Element.select(String selector) and Elements.select(String selector) methods:

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    Elements links = doc.select("a[href]"); // a elements with an href attribute
    Elements pngs = doc.select("img[src$=.png]"); // img elements whose src ends with .png
    Element masthead = doc.select("div.masthead").first(); // the first div with class=masthead
    Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of an h3 with class=r
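To try a few selectors without a file on disk, a small sketch along these lines may help. The class SelectorSketch, the helper count, and the sample markup are hypothetical; the selectors themselves are standard jsoup syntax.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorSketch {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
            "<div class=masthead><a href='/home'>Home</a><a>No href</a></div>"
          + "<div class=content><p>Intro to jsoup</p><p>Details</p>"
          + "<img src='logo.png'><img src='photo.jpg'></div>";

    /** Returns how many elements in the sample markup match the selector. */
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        Elements matches = doc.select(selector);
        return matches.size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a[href]",           // links that carry an href attribute
            "img[src$=.png]",    // images whose src ends with .png
            "div.content > p",   // paragraphs directly under div.content
            "div:has(img)",      // divs that contain an image
            "p:contains(jsoup)"  // paragraphs whose text contains "jsoup"
        };
        for (String selector : selectors) {
            System.out.println(selector + " matches " + count(selector) + " element(s)");
        }
    }
}
```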

Description: jsoup elements support a selector syntax similar to CSS (or jQuery) for finding matching elements, which allows very powerful and flexible queries.

The select method is available on Document, Element, and Elements objects. It is contextual, so you can filter by selecting from a specific element or by chaining select calls.

select returns an Elements collection, which provides a range of methods to extract and manipulate the results.

Selector overview

    tagname: find elements by tag, e.g. a
    ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
    #id: find elements by ID, e.g. #logo
    .class: find elements by class name, e.g. .masthead
    [attribute]: elements with the attribute, e.g. [href]
    [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
    [attr=value]: elements with the attribute value, e.g. [width=500]
    [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
    [attr~=regex]: elements with attribute values that match the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
    *: all elements

Selector combinations

    el#id: element with ID, e.g. div#logo
    el.class: element with class, e.g. div.masthead
    el[attr]: element with attribute, e.g. a[href]
    Any combination, e.g. a[href].highlight
    ancestor child: descendant elements of an ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
    parent > child: direct children of a parent, e.g. div.content > p finds p elements, and body > * finds all direct children of the body tag
    siblingA + siblingB: a sibling B element immediately preceded by sibling A, e.g. div.head + div
    siblingA ~ siblingX: a sibling X element preceded by sibling A, e.g. h1 ~ p
    el, el, el: a group of selectors; finds the unique elements that match any of them, e.g. div.masthead, div.logo

Pseudo selectors

    :lt(n): elements whose sibling index (their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3) finds the first three cells of each row
    :gt(n): elements whose sibling index is greater than n, e.g. div p:gt(2)
    :eq(n): elements whose sibling index is equal to n, e.g. form input:eq(1)
    :has(selector): elements that contain elements matching the selector, e.g. div:has(p) finds divs that contain p elements
    :not(selector): elements that do not match the selector, e.g. div:not(.logo) finds all divs that do not have the class "logo"
    :contains(text): elements that contain the given text, e.g. p:contains(jsoup)
    :containsOwn(text): elements that directly contain the given text
    :matches(regex): elements whose text matches the specified regular expression, e.g. div:matches((?i)login)
    :matchesOwn(regex): elements whose own text matches the specified regular expression

Note: The index-based pseudo selectors are 0-based; the first element is at index 0, the second at index 1, and so on.

See the Selector API reference for more details.

8. Extract attributes, text, and HTML from elements

Problem: After parsing a document and finding some elements, you want to get at the data inside those elements.

Solution: To get the value of an attribute, use the Node.attr(String key) method. For the text of an element, use Element.text(). For HTML, use Element.html() or Node.outerHtml(), as appropriate. For example:

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html); // parse the HTML string into a Document
    Element link = doc.select("a").first(); // find the first a element

    String text = doc.body().text(); // "An example link."
    String linkHref = link.attr("href"); // "http://example.com/"
    String linkText = link.text(); // "example"

    String linkOuterH = link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
    String linkInnerH = link.html(); // "<b>example</b>"

The methods above are the core of the element data access methods. In addition, you can use:

    Element.id()
    Element.tagName()
    Element.className() and Element.hasClass(String className)

All of these accessors have corresponding setter methods to change the data.

See also the reference documentation for Element and the Elements collection class, section 9 on working with URLs, and section 7 on finding elements with the CSS selector syntax.

9. Work with URLs

Problem: You have an HTML document that contains relative URLs, and you need to resolve them to absolute URLs.

Solution: Make sure you specify a base URI when parsing the document, then use the abs: attribute prefix to get the absolute URL, resolved against the base URI. The code is as follows:

    Document doc = Jsoup.connect("http://www.open-open.com").get();

    Element link = doc.select("a").first();
    String relHref = link.attr("href"); // == "/"
    String absHref = link.attr("abs:href"); // "http://www.open-open.com/"
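The same resolution can be tried offline by passing a base URI to Jsoup.parse(String html, String baseUri). In this sketch the class AbsoluteUrlSketch, the helper resolveFirstLink, and the example URLs are assumptions chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteUrlSketch {

    /** Parses markup against baseUri and returns the resolved href of its first link. */
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link.absUrl("href"); // same result as link.attr("abs:href")
    }

    public static void main(String[] args) {
        String html = "<a href='/download'>Download</a>";
        System.out.println(resolveFirstLink(html, "http://example.com/docs/"));
        // Without a base URI the relative href cannot be resolved, so the result is empty.
        System.out.println(resolveFirstLink(html, "").isEmpty());
    }
}
```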

Note: In HTML elements, URLs are often written relative to the document's location: <a href="/download">...</a>. When you use the Node.attr(String key) method to get the href attribute of an a element, it returns the value exactly as specified in the HTML source.

If you want the absolute URL, add the abs: prefix to the attribute name, as in attr("abs:href"). The value is then resolved against the document's base URI. This is why it is important to specify a base URI when parsing HTML documents.

If you do not want to use the abs: prefix, the Node.absUrl(String key) method does the same thing.

10. Example program: list links

This example program shows how to fetch a page from a URL, then extract the links, images, and other pointers on the page, and print their URLs and text. To run the following program, you must specify a URL as the parameter.

    import org.jsoup.Jsoup;
    import org.jsoup.helper.Validate;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.IOException;

    /**
     * Example program to list links from a URL.
     */
    public class ListLinks {
        public static void main(String[] args) throws IOException {
            // Validate.isTrue(args.length == 1, "usage: supply url to fetch");
            String url = "http://news.ycombinator.com/";
            print("Fetching %s...", url);

            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a[href]"); // a elements with an href attribute
            Elements media = doc.select("[src]"); // any element with a src attribute
            Elements imports = doc.select("link[href]");

            print("\nMedia: (%d)", media.size());
            for (Element src : media) {
                if (src.tagName().equals("img"))
                    print(" * %s: <%s> %sx%s (%s)",
                            src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                            trim(src.attr("alt"), 20));
                // src.attr("src") result: <y18.gif> 18x18 ()
                // src.attr("abs:src") results:
11. Set attribute values

Problem: After parsing a document, you want to modify some of its attribute values before saving it to disk or sending it on to a page.

Solution: Use the attribute setter methods Element.attr(String key, String value) and Elements.attr(String key, String value).

If you need to modify the class attribute of an element, use the Element.addClass(String className) and Element.removeClass(String className) methods.

The Elements collection provides methods to operate on attributes and classes in bulk. For example, to add a rel="nofollow" attribute to every a element inside a div:

    doc.select("div.comments a").attr("rel", "nofollow");

Description: Like the other methods in Element, the attr methods return the current Element (or the Elements collection when working on the result of a select). This allows convenient method chaining. For example:

    doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");

12. Set the HTML of an element

Problem: You need to modify the HTML content of an element.

Solution: Use the HTML setter methods in Element as follows:
    Element div = doc.select("div").first(); // <div></div>
    div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
    div.prepend("<p>First</p>"); // adds HTML at the start of the div's content
    div.append("<p>Last</p>"); // adds HTML at the end of the div's content
    // now: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

    Element span = doc.select("span").first(); // <span>One</span>
    span.wrap("<li><a href='http://example.com/'></a></li>");
    // now: <li><a href="http://example.com/"><span>One</span></a></li>
Description:

    Element.html(String html) clears any existing inner HTML in the element and replaces it with the supplied HTML.
    Element.prepend(String first) and Element.append(String last) add HTML to the start or end of the element's inner HTML, respectively.
    Element.wrap(String around) wraps HTML around the outer HTML of the element.

For more information, see the Element.prependElement(String tag) and Element.appendElement(String tag) methods in the API reference.

13. Set the text content of an element

Problem: You need to modify the text content of an HTML document.

Solution: Use the text setter methods of Element:
    Element div = doc.select("div").first(); // <div></div>
    div.text("five > four"); // <div>five &gt; four</div>
    div.prepend("First ");
    div.append(" Last");
    // now: <div>First five &gt; four Last</div>
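The difference between setting text and setting HTML can be seen in a small sketch like the following. The class TextVersusHtmlSketch and the helper emptyDiv are hypothetical, and the input string is just an example of markup-like user text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextVersusHtmlSketch {

    /** Returns a fresh, empty div to experiment on. */
    static Element emptyDiv() {
        Document doc = Jsoup.parseBodyFragment("<div></div>");
        return doc.select("div").first();
    }

    public static void main(String[] args) {
        String input = "<b>bold</b>";

        Element asText = emptyDiv().text(input); // stored as a single text node
        Element asHtml = emptyDiv().html(input); // parsed into a child element

        System.out.println(asText.children().size()); // child elements created by text()
        System.out.println(asText.html());            // markup characters come back escaped
        System.out.println(asHtml.children().size()); // child elements created by html()
    }
}
```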
The text setter methods mirror the HTML setter methods:

    Element.text(String text) clears any existing inner HTML in the element and replaces it with the supplied text.
    Element.prepend(String first) and Element.append(String last) add text nodes to the start or end of the element's inner HTML, respectively.

Text passed to text(String text) is treated as literal text, not HTML, so characters such as < and > are escaped rather than parsed. Because prepend and append parse their argument as HTML, use prependText(String text) and appendText(String text) when the string may contain such characters.

14. Sanitize untrusted HTML (to prevent XSS attacks)

Problem: Websites often let users submit comments. Some malicious users may embed scripts in their comments that break the behavior of the whole page or, more seriously, steal confidential information. You need to clean this HTML to avoid cross-site scripting (XSS) attacks.

Solution: Use the jsoup HTML cleaner with a configuration specified by a Whitelist.
    String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
    String safe = Jsoup.clean(unsafe, Whitelist.basic());
    // now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
XSS (cross-site scripting, sometimes abbreviated CSS) is an attack in which a malicious user inserts HTML or script code into a web page. When another user views the page, the embedded code is executed, which lets the attacker act against that user. XSS is a passive attack: it is relatively hard to exploit, so many people underestimate its dangers.

One way to avoid it is to accept only plain text from users, but that makes for a poor user experience. A better solution is to use a rich-text WYSIWYG editor such as CKEditor or TinyMCE. These editors output HTML and let users edit visually. Although they can validate input on the client side, that is not safe enough. You must also validate and clean harmful HTML on the server side to ensure that the HTML entered into your website is safe. Otherwise, attackers can bypass the client-side JavaScript validation and inject unsafe HTML into your website.

jsoup's whitelist cleaner filters user-supplied HTML on the server side and outputs only the tags and attributes that the whitelist allows.

jsoup provides a range of basic Whitelist configurations that meet most requirements. You can modify them if necessary, but take care when you do.

The cleaner is useful not only for avoiding XSS attacks, but also for limiting the range of tags that users can enter.
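A minimal server-side cleaning helper might be sketched as follows. One assumption matters here: in recent jsoup releases (1.14.2 and later, as far as I know) the Whitelist class used above is named Safelist, so this sketch uses Safelist; on an older jsoup, Whitelist.basic() would take its place. The class CleanerSketch and the helper sanitize are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanerSketch {

    /** Keeps only the tags and attributes permitted by the basic list. */
    static String sanitize(String untrustedHtml) {
        // Assumption: jsoup 1.14.2+ (Safelist); older versions call this class Whitelist.
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String comment = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
                + "<script>alert(1)</script>";
        // The onclick handler and the script element are not on the basic list,
        // so they should be absent from the cleaned output.
        System.out.println(sanitize(comment));
    }
}
```

Routing every user-supplied fragment through one helper like this keeps the choice of allowed tags in a single place.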

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
