Parsing and manipulating HTML documents with jsoup

Source: Internet
Author: User
Tags: html, interpreter

From http://www.ibm.com/developerworks/cn/java/j-lo-jsouphtml/

Introduction to jsoup

Most developers who parse HTML documents in Java have come across the open-source htmlparser project. I have published two articles about htmlparser on IBM developerWorks: "Extract the information you need from HTML" and "Extend htmlparser's handling of custom tags". I no longer use htmlparser, however: it is rarely updated and, more importantly, jsoup is now available.

jsoup is a Java HTML parser that can parse a URL or a piece of HTML text directly. It provides a very convenient API for retrieving and manipulating data through DOM methods, CSS selectors, and jQuery-like operations.

The main functions of jsoup are as follows:

1. Parse HTML from a URL, file, or string;

2. Use DOM traversal or CSS selectors to find and retrieve data;

3. Manipulate HTML elements, attributes, and text.

jsoup is released under the MIT license and can be safely used in commercial projects.

The main class hierarchy of jsoup is shown in Figure 1:

Figure 1. jsoup class hierarchy
 

Next we will illustrate how jsoup handles HTML documents elegantly in several common application scenarios.

Document input

jsoup can load an HTML document from a string, a URL, or a local file and produce a Document object instance.

The following code shows how:

    // Input an HTML document directly from a string
    String html = "<html>
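Because the listing above is cut off, here is a minimal sketch of the three input methods the text names (string, URL, local file). It is an illustration rather than the article's original listing: the class name, the sample markup, and the temporary file are assumptions, and the URL variant is defined but not called because it needs network access.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class DocumentInputSketch {

    // 1. Parse an HTML document held in a string (the markup is made up).
    static Document fromString() {
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello, jsoup</p></body></html>";
        return Jsoup.parse(html);
    }

    // 2. Fetch and parse a document from a URL (requires network access).
    static Document fromUrl(String url) throws IOException {
        return Jsoup.connect(url).timeout(3000).get();
    }

    // 3. Parse a local file, passing its charset and a base URI.
    static Document fromFile(File input, String baseUri) throws IOException {
        return Jsoup.parse(input, "UTF-8", baseUri);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fromString().title());

        // Write a temporary file so the file-based variant runs on any machine.
        Path tmp = Files.createTempFile("jsoup-sketch", ".html");
        Files.write(tmp, "<p id=content>From a file</p>".getBytes(StandardCharsets.UTF_8));
        Document doc = fromFile(tmp.toFile(), "http://www.oschina.net/");
        System.out.println(doc.getElementById("content").text());
        Files.delete(tmp);
    }
}
```

Only the file-based variant takes three arguments; the third one is the base URI discussed next.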

Note the third parameter of parse in the last input method. Why specify a URL here (the parameter is optional, as the first method shows)? HTML documents contain many references, such as links, images, external scripts, and CSS files. The third parameter, baseUri, tells jsoup which URL to use as the prefix when the document refers to external files by relative path.

For example, the link in <a href="/project">Open Source Software</a> resolves to http://www.oschina.net/project when you ask jsoup for its absolute URL, for instance with link.absUrl("href"). The href attribute stored in the document keeps its original relative value.
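The effect of the base URI can be checked with a small sketch. The class and method names here are hypothetical; the calls used are attr, absUrl, and the abs: attribute prefix from the jsoup Element API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriSketch {

    // Parses a one-link fragment against the given base URI and returns the link.
    static Element parseLink(String baseUri) {
        String html = "<a href=\"/project\">Open Source Software</a>";
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("a").first();
    }

    public static void main(String[] args) {
        Element link = parseLink("http://www.oschina.net/");
        System.out.println(link.attr("href"));     // the value as written in the document
        System.out.println(link.absUrl("href"));   // the value resolved against the base URI
        System.out.println(link.attr("abs:href")); // shorthand for the same resolution
    }
}
```

If no base URI is supplied, absUrl has nothing to resolve a relative path against, so passing one matters whenever you need absolute links.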

Parse and extract HTML elements

This part covers the most basic function of an HTML parser, but jsoup approaches it differently from other open-source projects: it uses selectors. The last part of this article covers the jsoup selector in detail; in this section you will see how little code jsoup needs.

jsoup also provides the traditional DOM methods for locating elements. Look at the following code:

    File input = new File("D:/test.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");
    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }

These jsoup methods may look familiar, and for good reason: getElementById and getElementsByTag mirror the JavaScript DOM methods getElementById and getElementsByTagName and do the same job. They return the matching element, or list of elements, for a given element ID or tag name.

Unlike the htmlparser project, jsoup does not define a separate class for each kind of HTML element. An HTML element generally consists of a tag name, attributes, and text, and jsoup provides simple methods for retrieving each of them. This is also why jsoup stays so small.
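As a minimal sketch of that point, the following reads the tag name, ID, an attribute, and the text from a generic Element. The markup and the describe helper are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDataSketch {

    // Summarizes the three kinds of data an element carries: name, attributes, text.
    static String describe(Element el) {
        return el.tagName() + " | id=" + el.id()
                + " | href=" + el.attr("href")
                + " | text=" + el.text();
    }

    // Finds the link inside the element whose ID is "content".
    static Element sampleLink() {
        Document doc = Jsoup.parse(
                "<div id=\"content\"><a id=\"home\" href=\"/index.html\">Home <b>page</b></a></div>");
        return doc.getElementById("content").getElementsByTag("a").first();
    }

    public static void main(String[] args) {
        // prints: a | id=home | href=/index.html | text=Home page
        System.out.println(describe(sampleLink()));
    }
}
```

Note that text() includes the text of nested elements, which is why the <b> content appears in the result.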

When it comes to element retrieval, jsoup selectors are extremely versatile:

    File input = new File("D:/test.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

    Elements links = doc.select("a[href]");          // links that have an href attribute
    Elements pngs = doc.select("img[src$=.png]");    // all img elements that reference a PNG image

    Element masthead = doc.select("div.masthead").first();
    // the first div element with class="masthead"

    Elements resultLinks = doc.select("h3.r > a");   // a elements directly inside an h3 with class r

 

This is where jsoup really impressed me: it retrieves elements with the same kind of selectors as jQuery. With another HTML parser, each of the lookups above would take many lines of code, while jsoup needs only one.

The jsoup selector also supports filter expressions. This selector syntax is covered in detail in the last section.
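To try the selectors from this section without a local file, the sketch below runs them against an inline document. The markup and class name are assumptions chosen so that each selector has something to match.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorIntroSketch {

    // A small made-up page with links, images, and a masthead.
    static Document sample() {
        String html = "<div class=masthead><h1>Site</h1></div>"
                + "<h3 class=r><a href='/one'>One</a></h3>"
                + "<h3 class=r><span><a href='/nested'>Nested</a></span></h3>"
                + "<a name=top>anchor without href</a>"
                + "<img src='logo.png'><img src='photo.jpg'>";
        return Jsoup.parse(html, "http://www.oschina.net/");
    }

    public static void main(String[] args) {
        Document doc = sample();
        System.out.println(doc.select("a[href]").size());          // links with an href attribute
        System.out.println(doc.select("img[src$=.png]").size());   // images whose src ends in .png
        System.out.println(doc.select("div.masthead").first() != null);
        System.out.println(doc.select("h3.r > a").text());         // only the direct child link
    }
}
```

The last selector skips the link wrapped in a span, because > matches direct children only.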

Modify data

While parsing a document, we may also need to modify some of its elements. For example, we might wrap every image in a clickable link, change link addresses, or change text.

Below are some simple examples:

    doc.select("div.comments a").attr("rel", "nofollow");
    // add the rel=nofollow attribute to all links
    doc.select("div.comments a").addClass("mylinkclass");
    // add class=mylinkclass to all links
    doc.select("img").removeAttr("onclick");      // remove the onclick attribute from all images
    doc.select("input[type=text]").val("");       // clear the text in all text input boxes

The principle is simple: use a jsoup selector to find the elements, then modify them with the methods above. Apart from the tag name, which cannot be changed (you can delete the element and insert a new one instead), both the attributes and the text of an element can be modified.

After making changes, call the html() method of the Element or Elements object to obtain the modified HTML.
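The following sketch applies the same modifications to an inline document and then prints the result with html(). The sample markup and the class name are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ModifySketch {

    // Parses a small made-up page and applies the modifications shown above.
    static Document modified() {
        String html = "<div class=comments><a href='/u/1'>user</a></div>"
                + "<img src='a.png' onclick='track()'>"
                + "<input type=text value='old'>";
        Document doc = Jsoup.parse(html);

        // Setters on Elements return the same Elements, so calls can be chained.
        doc.select("div.comments a").attr("rel", "nofollow").addClass("mylinkclass");
        doc.select("img").removeAttr("onclick");
        doc.select("input[type=text]").val("");
        return doc;
    }

    public static void main(String[] args) {
        // Print the body markup after all changes have been applied.
        System.out.println(modified().body().html());
    }
}
```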

HTML document cleanup

Besides a powerful API, jsoup also offers practical help with safety. Websites often let users post comments, and some users will try to slip scripts into them. Such scripts can break the behavior of the whole page or, more seriously, steal confidential information. This is known as a cross-site scripting (XSS) attack.

jsoup makes it easy to guard against this. Take a look at the following code:

    String unsafe = "<p><a href='http://www.oschina.net/' onclick='stealCookies()'>open source Chinese community</a></p>";
    String safe = Jsoup.clean(unsafe, Whitelist.basic());
    // output:
    // <p><a href="http://www.oschina.net/" rel="nofollow">open source Chinese community</a></p>

jsoup uses the Whitelist class to filter HTML documents. This class provides several predefined filters:

Table 1. Common methods:

Method name          Description
none()               Only text is allowed
basic()              Allows the tags a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, ul, and appropriate attributes
simpleText()         Allows only the tags b, em, i, strong, and u
basicWithImages()    Adds image tags on top of basic()
relaxed()            Allows the most tags, including a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul

 

If none of these five filters meets your requirements, for example because you want to let users insert Flash animations, that is not a problem. Whitelist can be extended, for example with whitelist.addTags("embed", "object", "param", "span", "div"), and addAttributes adds allowed attributes to specific elements.

What sets jsoup apart: selectors

We have briefly shown how jsoup uses selectors to retrieve elements. This section focuses on the selector syntax itself. The following tables list the syntax of the jsoup selector.
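The sketch below shows both a predefined filter and an extended one. One assumption needs stating loudly: recent jsoup releases renamed the Whitelist class used in this article to Safelist, and the code assumes such a release; with an older release, substitute Whitelist for Safelist. The class name, method names, and sample inputs are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class CleanSketch {

    // Cleans untrusted HTML with the predefined basic() filter.
    static String cleanBasic(String unsafe) {
        return Jsoup.clean(unsafe, Safelist.basic());
    }

    // Extends basic() so that span and div survive, and span may carry a title.
    static String cleanExtended(String unsafe) {
        Safelist extended = Safelist.basic()
                .addTags("span", "div")
                .addAttributes("span", "title");
        return Jsoup.clean(unsafe, extended);
    }

    public static void main(String[] args) {
        String comment = "<p><a href='http://www.oschina.net/' onclick='stealCookies()'>link</a>"
                + "<script>alert(1)</script></p>";
        System.out.println(cleanBasic(comment));

        String styled = "<span title='t' onclick='x()'>hi</span>";
        System.out.println(cleanBasic(styled));     // span is not in basic(), only its text remains
        System.out.println(cleanExtended(styled));  // span and title are kept, onclick is dropped
    }
}
```

Starting from a strict filter and adding only the tags and attributes you need keeps the allowed set explicit.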

 
Table 2. Basic usage:

Selector                                       Description
tagname                                        Locate elements by tag name
ns|tag                                         Locate elements by tag name within a namespace, for example fb|name finds <fb:name> elements
#id                                            Locate an element by its ID, for example #logo
.class                                         Locate elements by class name, for example .head
[attribute]                                    Locate elements that have the attribute, for example [href] finds all elements with an href attribute
[^attr]                                        Locate elements by attribute name prefix, for example [^data-] finds elements with HTML5 dataset attributes
[attr=value]                                   Locate elements by attribute value, for example [width=500] finds all elements whose width attribute is 500
[attr^=value], [attr$=value], [attr*=value]    Locate elements whose attribute value starts with, ends with, or contains value
[attr~=regex]                                  Filter attribute values with a regular expression, for example img[src~=(?i)\.(png|jpe?g)]
*                                              Locate all elements
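A quick way to get a feel for the basic syntax is to count matches against a small document. This is a sketch with made-up markup; the count helper is hypothetical and simply wraps select.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BasicSelectorSketch {

    // Made-up markup with one element per selector kind.
    private static final String HTML =
            "<div id=logo class=head data-role=banner>Logo</div>"
            + "<a href='http://example.com/a.html'>A</a>"
            + "<a name=x>B</a>"
            + "<img src='pic.PNG' width=500>"
            + "<img src='photo.jpeg' width=300>"
            + "<img src='anim.gif'>";

    // Returns how many elements of the sample document match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a", "#logo", ".head", "[href]", "[^data-]", "[width=500]",
            "[href^=http]", "[src$=.gif]", "[src*=photo]",
            "img[src~=(?i)\\.(png|jpe?g)]"
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

The regular-expression selector matches both pic.PNG and photo.jpeg, because (?i) makes the pattern case-insensitive.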

These are the most basic selector syntaxes, and they can also be combined. jsoup supports the following combinations:

 

Table 3. Combined usage:

Selector                        Description
el#id                           Locate an element by tag name and ID, for example a#logo -> <a id="logo" href=...>
el.class                        Locate elements by tag name and class, for example div.head -> <div class="head">xxxx</div>
el[attr]                        Locate elements of a given tag that define an attribute, for example a[href]
Any combination of the above    For example, a[href]#logo or a[name].outerlink
ancestor child                  Descendant relationship: child elements anywhere below ancestor
parent > child                  Parent-child relationship: child elements that are direct children of parent
siblingA + siblingB             Sibling relationship: siblingB elements immediately preceded by siblingA
siblingA ~ siblingX             Sibling relationship: siblingX elements preceded by siblingA
el, el, el                      Grouping: elements that match any of the listed selectors
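The combinations in Table 3 can be compared side by side with a sketch like the following. The markup and the count helper are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CombinatorSketch {

    // A made-up structure: a div with a heading and three paragraphs, plus an outer link.
    private static final String HTML =
            "<div class=head>"
            + "<h1>Title</h1>"
            + "<p>One <a id=logo href='/x'>x</a></p>"
            + "<p>Two</p>"
            + "<p>Three</p>"
            + "</div>"
            + "<a name=n class=outerlink>o</a>";

    // Returns how many elements of the sample document match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a#logo", "div.head", "a[href]#logo", "a[name].outerlink",
            "div a",    // descendant: the link is nested inside a p
            "div > a",  // direct child: no match, the link is a grandchild
            "div > p",
            "h1 + p",   // only the paragraph right after the heading
            "h1 ~ p",   // every paragraph that follows the heading
            "h1, a"     // grouping: the heading plus both links
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

Comparing div a with div > a, and h1 + p with h1 ~ p, shows the difference between the loose and strict forms of each relationship.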

In addition to the basic syntax and these combinations, jsoup also supports expressions for filtering the selected elements. jsoup supports the following expressions:

Table 4. Expressions:

Expression            Description
:lt(n)                Elements whose sibling index is less than n (the index starts at 0), for example td:lt(3) finds the first three cells of each row
:gt(n)                Elements whose sibling index is greater than n, for example div p:gt(2) finds p elements inside a div that come after the third sibling
:eq(n)                Elements whose sibling index equals n, for example form input:eq(1) finds an input inside a form that is the second child of its parent
:has(selector)        Elements that contain an element matching the selector, for example div:has(p) finds div elements that contain a p element
:not(selector)        Elements that do not match the selector, for example div:not(.logo) finds all div elements that do not have class="logo"
:contains(text)       Elements that contain the text, case-insensitive, for example p:contains(oschina)
:containsOwn(text)    Elements whose own text, not that of their descendants, contains the text
:matches(regex)       Elements whose text matches the regular expression, for example div:matches((?i)login)
:matchesOwn(regex)    Elements whose own text matches the regular expression
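Because the index-based expressions are easy to misread, the sketch below counts matches for each expression against a small document. The markup and the count helper are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PseudoSelectorSketch {

    // A five-cell table row plus two div elements with different text layouts.
    private static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td></tr></table>"
            + "<div class=logo><p>Visit oschina</p></div>"
            + "<div><span>Login <b>here</b></span></div>";

    // Returns how many elements of the sample document match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "td:lt(3)",                  // cells at index 0, 1, 2
            "td:gt(2)",                  // cells at index 3, 4
            "td:eq(1)",                  // the second cell
            "div:has(p)",
            "div:not(.logo)",
            "p:contains(OSCHINA)",       // text match ignores case
            "span:contains(here)",       // text of descendants counts
            "span:containsOwn(here)",    // own text only: "here" belongs to <b>
            "b:containsOwn(here)",
            "div:matches((?i)login)",
            "span:matchesOwn((?i)login)",
            "div:matchesOwn((?i)login)"  // the div itself has no own text
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

The contains/containsOwn and matches/matchesOwn pairs differ only in whether the text of descendant elements is taken into account.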

Summary

This concludes the overview of jsoup's basic features. Thanks to its well-designed, extensible API, you can build very powerful HTML processing on top of its selectors. The jsoup project itself is also under active development. So if you work in Java and need to process HTML, give it a try.

 

 

 
