Jsoup Usage Explanation

Last Update:2014-07-31 Source: Internet

Author: User

Tags html interpreter


Listing 1

      // Load an HTML document directly from a string
      String html = "

Note the third parameter of parse in the last input method. Why specify a URL here (it is optional, as the first method shows)? An HTML document usually contains many links, images, and references to external scripts and CSS files. The third parameter, the base URL (baseUri), is the prefix that Jsoup uses to resolve a relative path in the document into an absolute URL.

For example, the href of <a href=/project>open source software</a> resolves to http://www.oschina.net/project when it is read with absUrl("href") or attr("abs:href"); the stored attribute value itself is left unchanged.
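
The effect of the base URL can be sketched as follows. This is a minimal example that assumes jsoup is on the classpath; the class name BaseUriDemo, the helper resolveFirstLink, and the inline HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; only the Jsoup calls come from the library.
class BaseUriDemo {

    // Parses an HTML string with a base URI and returns the absolute form
    // of the first link's href.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href=/project>open source software</a>";
        Document doc = Jsoup.parse(html, "http://www.oschina.net/");
        Element link = doc.select("a").first();

        System.out.println(link.attr("href"));      // the value as written in the document
        System.out.println(link.absUrl("href"));    // resolved against the base URI
        System.out.println(link.attr("abs:href"));  // same resolution via the abs: prefix
        System.out.println(resolveFirstLink(html, "http://www.oschina.net/"));
    }
}
```

Keeping the raw value and the resolved value separate means the document can still be written back out unchanged, while code that follows links asks for the absolute form.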

Parsing and extracting HTML elements

This section covers the most basic function of an HTML parser. Jsoup differs from other open source projects here by using selectors. The last section covers selectors in detail; for now, you will see how Jsoup does the job with very little code.

Jsoup also provides the traditional DOM approach to retrieving elements. Consider the following code:

Listing 2.

      File input = new File("d:/test.html");
      Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

      Element content = doc.getElementById("content");   // element with id="content"
      Elements links = content.getElementsByTag("a");    // all links inside it
      for (Element link : links) {
          String linkHref = link.attr("href");
          String linkText = link.text();
      }

The Jsoup methods may look familiar: getElementById and getElementsByTag have the same names as their JavaScript counterparts and do the same thing. You can retrieve an element, or a list of elements, by node name or by the ID of the HTML element.

Unlike the HTMLParser project, Jsoup does not define a separate class for each kind of HTML element. An HTML element generally consists of a node name, attributes, and text, and Jsoup provides simple methods for retrieving each of these yourself, which is why Jsoup stays lightweight.
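
The three parts of an element mentioned above (node name, attributes, text) can be read as in the following sketch. It assumes jsoup is on the classpath; the class DomTraversalDemo, the helper describeLinks, and the sample HTML are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; only the Jsoup calls come from the library.
class DomTraversalDemo {

    // Collects "tagName href text" for every link inside the element with the given id.
    static List<String> describeLinks(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element container = doc.getElementById(id);
        List<String> result = new ArrayList<>();
        if (container == null) {
            return result; // getElementById returns null when nothing matches
        }
        for (Element link : container.getElementsByTag("a")) {
            result.add(link.tagName() + " " + link.attr("href") + " " + link.text());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div id=content>"
                + "<a href='/news'>News</a> <a href='/blog'>Blog</a>"
                + "</div>";
        for (String line : describeLinks(html, "content")) {
            System.out.println(line);
        }
    }
}
```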

When it comes to retrieving elements, Jsoup selectors can do almost anything:

Listing 3.

      File input = new File("D:/test.html");
      Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");

      Elements links = doc.select("a[href]");                 // links with an href attribute
      Elements pngs = doc.select("img[src$=.png]");           // all elements referencing PNG images
      Element masthead = doc.select("div.masthead").first();  // first div with class="masthead"

This is where Jsoup really won me over: it retrieves elements with the same selectors as jQuery. With other HTML parsers, each of the retrievals above would take many lines of code; with Jsoup, each takes a single line.

Jsoup selectors also support expressions, which we introduce in the last section.

Modifying data

While parsing a document, we may need to modify some of its elements, for example to add clickable links to all images, change link addresses, or change text.

Here are some simple examples:

Listing 4.

      doc.select("div.comments a").attr("rel", "nofollow");   // add rel=nofollow to every link
      doc.select("div.comments a").addClass("mylinkclass");   // add class="mylinkclass" to every link
      doc.select("img").removeAttr("onclick");                // remove the onclick attribute from every image
      doc.select("input[type=text]").val("");                 // clear the text of every text input

The idea is simple: use a Jsoup selector to find the elements, then modify them with the methods above. Everything except the tag name can be modified, including an element's attributes and text (to change a tag name, delete the element and insert a new one).

After the modifications, call the html() method of Element(s) to get the modified HTML.
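
The select, modify, and html() sequence can be sketched end to end as below. This assumes jsoup is on the classpath; the class ModifyDemo, the helper tidy, and the sample markup (including the onclick handler name) are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; only the Jsoup calls come from the library.
class ModifyDemo {

    // Adds rel=nofollow to links inside div.comments, strips onclick from
    // images, and returns the modified body HTML.
    static String tidy(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("div.comments a").attr("rel", "nofollow");
        doc.select("img").removeAttr("onclick");
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div class=comments><a href='/u/1'>Tom</a></div>"
                + "<img src='a.png' onclick='track()'>";
        System.out.println(tidy(html));
    }
}
```

Each modifier applies to every element in the selection, so no explicit loop is needed.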

HTML Document Cleanup

Besides a powerful API, Jsoup is also thoughtful about everyday needs. Websites often let users post comments. Some users misbehave and put scripts into their comments. These scripts may break the behavior of the whole page or, worse, steal confidential information, as in cross-site scripting (XSS) attacks.

Jsoup's support in this area is powerful and simple to use. Take a look at the following code:

Listing 5.

      String unsafe = "<p><a href='http://www.oschina.net/'>open source China community</a></p>";
      String safe = Jsoup.clean(unsafe, Whitelist.basic());
      // Output:
      // <p><a href="http://www.oschina.net/" rel="nofollow">open source China community</a></p>

Jsoup uses a Whitelist class to filter the HTML document, which provides several common methods:

Table 1. Common methods:

Method name	Description

none()	Allows text only
basic()	Allows these tags: a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, ul, and the appropriate attributes
simpleText()	Allows only the b, em, i, strong, and u tags
basicWithImages()	Same as basic(), plus images
relaxed()	Allows the most tags: a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul

If none of these five filters meets your requirements, for example because you want to let users embed Flash animations, Whitelist can be extended: whitelist.addTags("embed", "object", "param", "span", "div"); You can also call addAttributes to allow additional attributes on specific elements.
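
Cleaning with an extended filter might look like the sketch below. It uses Safelist, the name that replaced Whitelist in jsoup 1.14.2, so on older versions substitute Whitelist. The class CleanDemo, the helper cleanComment, the choice of extra tag and attribute, and the sample input are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; only the Jsoup calls come from the library.
class CleanDemo {

    // Cleans untrusted HTML with the basic filter, additionally allowing
    // <div> elements and their class attribute.
    static String cleanComment(String unsafe) {
        Safelist safelist = Safelist.basic()
                .addTags("div")
                .addAttributes("div", "class");
        return Jsoup.clean(unsafe, safelist);
    }

    public static void main(String[] args) {
        String unsafe = "<div class='note'>"
                + "<p><a href='http://www.oschina.net/' onclick='alert(1)'>link</a></p>"
                + "<script>alert(2)</script>"
                + "</div>";
        // The onclick attribute and the script element are dropped;
        // the div and its class attribute survive.
        System.out.println(cleanComment(unsafe));
    }
}
```

Starting from a preset and adding only what is needed keeps the filter an allow-list: anything not named explicitly is still removed.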

Jsoup's most extraordinary feature: selectors

Earlier, we briefly described how Jsoup uses selectors to retrieve elements. This section focuses on the powerful syntax of the selectors themselves. The following tables list the syntax of the Jsoup selector in detail.

Table 2. Basic usage:

tagname	Locate by tag name, for example a
ns|tag	Locate by namespaced tag, for example fb|name finds <fb:name> elements
#id	Locate by element ID, for example #logo
.class	Locate by the element's class attribute, for example .head
[attribute]	Locate by attribute, for example [href] retrieves all elements that have an href attribute
[^attr]	Locate by attribute name prefix, for example [^data-] finds HTML5 dataset attributes
[attr=value]	Locate by attribute value, for example [width=500] finds all elements whose width attribute is 500
[attr^=value], [attr$=value], [attr*=value]	The attribute value starts with, ends with, or contains value, respectively
[attr~=regex]	Filter attribute values with a regular expression, for example img[src~=(?i)\.(png|jpe?g)]
*	Locate all elements
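
A few of the attribute selectors from Table 2 can be tried with the sketch below. It assumes jsoup is on the classpath; the class AttributeSelectorDemo, the helper count, and the sample markup (including the example.com URL) are invented for illustration.

```java
import org.jsoup.Jsoup;

// Hypothetical demo class; only the Jsoup calls come from the library.
class AttributeSelectorDemo {

    static final String HTML =
            "<a href='http://example.com/' data-id='1'>home</a>"
            + "<img src='logo.png' width=500>"
            + "<img src='photo.JPEG'>"
            + "<img src='anim.gif'>";

    // Returns how many elements of the sample document match the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("[href]"));          // elements with an href attribute
        System.out.println(count("[^data-]"));        // elements with a data-* attribute
        System.out.println(count("[width=500]"));     // width attribute equal to 500
        System.out.println(count("[href^=http]"));    // href starts with "http"
        System.out.println(count("img[src$=.gif]"));  // src ends with ".gif"
        System.out.println(count("[src*=oto]"));      // src contains "oto"
        // Case-insensitive regular expression on the src value
        System.out.println(count("img[src~=(?i)\\.(png|jpe?g)]"));
    }
}
```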

These are the most basic selectors. They can also be combined; Jsoup supports the following combinations:

Table 3. Combined usage:

el#id	Locate an element by tag name and ID, for example a#logo matches <a id=logo href= ... >
el.class	Locate elements by tag name and class, for example div.head matches <div class="head">xxxx</div>
el[attr]	Locate elements by tag name and attribute, for example a[href]
Any combination of the above three	For example a[href]#logo, a[name].outerlink
ancestor child	Hierarchical relationship: child elements anywhere below ancestor
parent > child	Parent-child relationship: child elements directly below parent
siblingA + siblingB	Sibling relationship: siblingB immediately preceded by siblingA
siblingA ~ siblingX	Sibling relationship: siblingX preceded anywhere by siblingA
el, el, el	Merge relationship: elements that match any of the listed selectors
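
The relationship selectors in Table 3 can be compared on one small document, as in this sketch. It assumes jsoup is on the classpath; the class CombinatorDemo, the helper count, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;

// Hypothetical demo class; only the Jsoup calls come from the library.
class CombinatorDemo {

    // div.head contains: p(one), ul > li > p(two), p(three)
    static final String HTML =
            "<div class=head><p>one</p><ul><li><p>two</p></li></ul><p>three</p></div>";

    // Returns how many elements of the sample document match the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div.head"));  // 1: the div with class="head"
        System.out.println(count("div p"));     // 3: every p below the div
        System.out.println(count("div > p"));   // 2: only p elements directly below the div
        System.out.println(count("p + ul"));    // 1: the ul right after p(one)
        System.out.println(count("p ~ p"));     // 1: p(three), which follows sibling p(one)
        System.out.println(count("ul, li"));    // 2: the ul and the li
    }
}
```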

Besides the basic syntax and its combinations, Jsoup also supports filtering elements with expressions. The following table lists the expressions that Jsoup supports:

Table 4. Expressions:

:lt(n)	Elements whose sibling index is less than n (indexes start at 0), for example td:lt(3) matches the first three cells of a row
:gt(n)	Elements whose sibling index is greater than n, for example div p:gt(2) matches p elements inside a div with an index above 2
:eq(n)	Elements whose sibling index equals n, for example form input:eq(1) matches an input inside a form at index 1
:has(selector)	Elements that contain a match for the selector, for example div:has(p) matches div elements that contain a p element
:not(selector)	Elements that do not match the selector, for example div:not(.logo) matches all div elements without class="logo"
:contains(text)	Elements that contain the given text, case-insensitive, for example p:contains(oschina)
:containsOwn(text)	Elements whose own text, excluding the text of their descendants, contains the given text
:matches(regex)	Elements whose text matches the regular expression, for example div:matches((?i)login)
:matchesOwn(regex)	Elements whose own text matches the regular expression
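
The expressions in Table 4 can be checked against a small document, as in this sketch. It assumes jsoup is on the classpath; the class PseudoSelectorDemo, the helper texts, and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; only the Jsoup calls come from the library.
class PseudoSelectorDemo {

    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
            + "<div><p>Login here</p></div>"
            + "<div class=logo>plain</div>";

    // Returns the text of every element matching the selector, joined by commas.
    static String texts(String selector) {
        List<String> out = new ArrayList<>();
        for (Element e : Jsoup.parse(HTML).select(selector)) {
            out.add(e.text());
        }
        return String.join(",", out);
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(2)"));                 // a,b
        System.out.println(texts("td:gt(2)"));                 // d
        System.out.println(texts("td:eq(1)"));                 // b
        System.out.println(texts("div:has(p)"));               // Login here
        System.out.println(texts("div:not(.logo)"));           // Login here
        System.out.println(texts("p:contains(login)"));        // Login here
        System.out.println(texts("div:containsOwn(login)"));   // empty: the text belongs to the p
        System.out.println(texts("p:matches((?i)login)"));     // Login here
    }
}
```

The difference between :contains and :containsOwn shows up on the first div: its descendant p holds the text, so only the plain :contains form would match the div itself.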

