Jsoup Parsing HTML Information

Last Update:2016-02-17 Source: Internet

Author: User

Jsoup Introduction

Jsoup is a Java HTML parser that can parse HTML directly from a URL or from text content. It provides a very convenient API: data can be fetched and manipulated through DOM methods, CSS selectors, and jQuery-like operations.

The main functions of Jsoup are as follows:

1. Parse HTML from a URL, file, or string
2. Use DOM traversal or CSS selectors to find and extract data
3. Manipulate HTML elements, attributes, and text

Jsoup's main class hierarchy (figure not reproduced here):

Document input

Jsoup can load HTML documents from strings, URLs, and local files, and generates a Document object instance in each case.

// Parse an HTML document directly from a string
String html = "<body id='body'><p>Parse and traverse an HTML document.</p></body>";
Document doc = Jsoup.parse(html);

// Load an HTML document directly from a URL
Document doc = Jsoup.connect("http://itmyhome.com/").get();
String title = doc.title();

// Load an HTML document from a file
File input = new File("D:/index.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://itmyhome.com");

In the third form, the third parameter of the parse method can also be omitted. It is useful because HTML documents usually contain many links, images, and references to external scripts, CSS files, and so on.

The third parameter, the base URI, is used when an HTML document references an external resource with a relative path.

Jsoup resolves such relative URLs against the base URI whenever an absolute URL is requested, for example through the abs: attribute prefix or the absUrl method.

For example, for <a href=/project>itmyhome</a>, calling link.absUrl("href") returns http://itmyhome.com/project.
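The base URI behaviour can be sketched as follows. This is a minimal illustration, assuming jsoup is on the classpath; the class name BaseUriDemo and the helper rawAndAbsolute are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriDemo {
    // Hypothetical helper: returns the raw href and the resolved href of the first link.
    static String[] rawAndAbsolute(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        // attr("href") returns the value as written; absUrl resolves it against the base URI
        return new String[] { link.attr("href"), link.absUrl("href") };
    }

    public static void main(String[] args) {
        String[] r = rawAndAbsolute("<a href=\"/project\">itmyhome</a>", "http://itmyhome.com/");
        System.out.println(r[0]); // /project
        System.out.println(r[1]); // http://itmyhome.com/project
    }
}
```

The stored attribute itself is left unchanged; only the value returned by absUrl (or by attr("abs:href")) is absolute.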

Data extraction

Use DOM methods to traverse a document

String html = "<body id='content'><a href='itmyhome.com'>hello</a>"
        + "<a href='blog.itmyhome.com'>jsoup</a></body>";
Document doc = Jsoup.parse(html);
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
    System.out.println(linkHref + ", " + linkText);
}

itmyhome.com, hello
blog.itmyhome.com, jsoup

Description

The Element object provides a series of DOM-like methods to find elements and to extract and manipulate the data in them, as follows:

Find element

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)
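The navigation methods listed above can be tried on a small list. This is only a sketch, assuming jsoup is on the classpath; the markup and the names TraversalDemo and secondItem are made up for the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TraversalDemo {
    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
        "<ul id=\"menu\"><li class=\"first\">one</li><li>two</li><li>three</li></ul>";

    // Returns the middle <li> of the sample list.
    static Element secondItem() {
        Document doc = Jsoup.parse(HTML);
        Element menu = doc.getElementById("menu");
        return menu.child(1); // child elements are indexed from 0
    }

    public static void main(String[] args) {
        Element second = secondItem();
        System.out.println(second.text());                          // two
        System.out.println(second.previousElementSibling().text()); // one
        System.out.println(second.nextElementSibling().text());     // three
        System.out.println(second.siblingElements().size());        // 2
        System.out.println(second.parent().id());                   // menu
        System.out.println(second.parent().children().size());      // 3
    }
}
```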

Element data

attr(String key) gets an attribute; attr(String key, String value) sets an attribute
attributes() gets all attributes
id(), className(), and classNames()
text() gets the text content; text(String value) sets the text content
html() gets the HTML inside the element; html(String value) sets the HTML inside the element
outerHtml() gets the HTML of the element including its own tag
data() gets data content (for example, of script and style tags)
tag() and tagName()
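A short sketch of these accessors follows, assuming jsoup is on the classpath. The sample markup and the names ElementDataDemo and sample are hypothetical and exist only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementDataDemo {
    // Hypothetical sample document: one div with text and one script with data.
    static Document sample() {
        return Jsoup.parse("<div id=\"box\" class=\"a b\">Hi <b>there</b></div>"
            + "<script>var x = 1;</script>");
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element box = doc.getElementById("box");

        box.attr("title", "demo");                            // set an attribute
        System.out.println(box.attr("title"));                // demo
        System.out.println(box.attributes().size());          // 3 (id, class, title)
        System.out.println(box.classNames().contains("b"));   // true
        System.out.println(box.tagName());                    // div
        System.out.println(box.text());                       // Hi there

        // script content is data, not text, so it is read with data()
        System.out.println(doc.select("script").first().data()); // var x = 1;

        box.text("replaced");                                 // replaces all children with text
        System.out.println(box.text());                       // replaced
    }
}
```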

Manipulating HTML and text

append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
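The manipulation methods above might be combined as in the following sketch. It assumes jsoup is on the classpath; the markup and the names ManipulationDemo and build are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ManipulationDemo {
    // Builds a hypothetical div and modifies its children step by step.
    static Element build() {
        Document doc = Jsoup.parse("<div id=\"box\"><p>middle</p></div>");
        Element box = doc.getElementById("box");
        box.prepend("<p>first</p>");            // parsed as HTML, added as first child
        box.append("<p>last</p>");              // parsed as HTML, added as last child
        box.appendElement("span").text("tail"); // appendElement returns the new element
        box.appendText(" & more");              // added as a text node, not parsed as HTML
        return box;
    }

    public static void main(String[] args) {
        Element box = build();
        System.out.println(box.children().size());  // 4 (three p elements and one span)
        System.out.println(box.childNodeSize());    // 5 (the four elements plus one text node)
        System.out.println(box.child(0).text());    // first
        System.out.println(box.child(3).tagName()); // span

        box.html("<em>reset</em>");                 // replaces all existing children
        System.out.println(box.children().size());  // 1
    }
}
```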

Use selector syntax to find elements

Document doc = Jsoup.connect("http://itmyhome.com/").get();
Elements links = doc.select("a[href]"); // a elements that have an href attribute
Elements pngs = doc.select("img[src$=.png]"); // images whose src ends with .png
Element icons = doc.select("span.icon").first(); // first span element with class icon
Elements resultLinks = doc.select("#header p"); // p elements inside the element with id header

As shown above, Jsoup uses jQuery-style selectors to retrieve elements, and the Jsoup selector syntax also supports pseudo-selector expressions.

The following tables list the Jsoup selector syntax in detail.

Table 1. Basic usage:

Selector	Description
tagname	Locate elements by tag name, for example a
ns|tag	Locate elements by tag name within a namespace, for example fb|name finds <fb:name> elements
#id	Locate an element by ID, for example #logo
.class	Locate elements by class name, for example .head
[attr]	Locate elements that have the attribute, for example [href] retrieves all elements that have an href attribute
[^attr]	Locate elements by attribute name prefix, for example [^data-] finds elements with HTML5 dataset attributes
[attr=value]	Locate elements by attribute value, for example [width=500] locates all elements whose width attribute is 500
[attr^=value], [attr$=value], [attr*=value]	Locate elements whose attribute value begins with, ends with, or contains value, respectively
[attr~=regex]	Locate elements whose attribute value matches a regular expression, for example img[src~=(?i)\.(png|jpe?g)]
*	Locate all elements
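The selectors in Table 1 can be checked against a small document. The sketch below assumes jsoup is on the classpath; the markup and the names BasicSelectorDemo and count are hypothetical.

```java
import org.jsoup.Jsoup;

public class BasicSelectorDemo {
    // Hypothetical sample markup covering the attribute selectors from Table 1.
    static final String HTML =
        "<div id=\"logo\" class=\"head\" data-role=\"banner\" width=\"500\">"
        + "<img src=\"a.png\"><img src=\"b.JPG\"><img src=\"c.gif\">"
        + "<a href=\"http://itmyhome.com/x\">x</a><a name=\"n\">n</a></div>";

    // Returns how many elements of the sample document match the query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("img"));              // 3
        System.out.println(count("#logo"));            // 1
        System.out.println(count(".head"));            // 1
        System.out.println(count("[href]"));           // 1
        System.out.println(count("[^data-]"));         // 1
        System.out.println(count("[width=500]"));      // 1
        System.out.println(count("[href^=http]"));     // 1
        System.out.println(count("[src$=.png]"));      // 1
        System.out.println(count("[href*=itmyhome]")); // 1
        // (?i) makes the regular expression case-insensitive, so b.JPG matches too
        System.out.println(count("img[src~=(?i)\\.(png|jpe?g)]")); // 2
    }
}
```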

The above is the most basic selector syntax. These selectors can also be combined; Jsoup supports the following combinations:

Table 2: Combination usage:

Selector	Description
el#id	Locate an element by tag name and ID, for example a#logo matches <a id=logo href= ... >
el.class	Locate elements by tag name and class, for example div.head matches <div class=head>xxxx</div>
el[attr]	Locate elements by tag name that have the attribute, for example a[href]
Any combination of the above three	For example a[href]#logo, a[name].outerlink
ancestor child	Descendant relationship: child elements anywhere below ancestor
parent > child	Parent-child relationship: child elements that are direct children of parent
siblingA + siblingB	Adjacent sibling: siblingB elements immediately preceded by siblingA
siblingA ~ siblingX	General sibling: siblingX elements preceded by siblingA
el, el, el	Grouping: elements that match any of the listed selectors
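The relationship selectors in Table 2 can be compared side by side on one fragment. This is a sketch that assumes jsoup is on the classpath (and a version that provides Elements.eachText); the markup and the names CombinatorDemo and texts are made up for the example.

```java
import org.jsoup.Jsoup;

public class CombinatorDemo {
    // Hypothetical sample markup: direct and nested p elements plus a span.
    static final String HTML =
        "<div id=\"header\"><p>a</p><section><p>b</p></section>"
        + "<span>s</span><p>c</p><p>d</p></div>";

    // Returns the texts of all matching elements, joined by commas.
    static String texts(String cssQuery) {
        return String.join(",", Jsoup.parse(HTML).select(cssQuery).eachText());
    }

    public static void main(String[] args) {
        System.out.println(texts("#header p"));     // a,b,c,d  (any descendant)
        System.out.println(texts("#header > p"));   // a,c,d    (direct children only)
        System.out.println(texts("span + p"));      // c        (immediately after span)
        System.out.println(texts("span ~ p"));      // c,d      (any later sibling of span)
        System.out.println(texts("section, span")); // b,s      (either selector)
    }
}
```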

In addition to the basic syntax and combinations, Jsoup also supports filtering elements with pseudo-selector expressions. The following is a list of the expressions supported by Jsoup:

Table 3: Expressions:

Selector	Description
:lt(n)	Elements whose sibling index (zero-based, relative to the parent) is less than n, for example td:lt(3) selects the first three cells of a row
:gt(n)	Elements whose sibling index is greater than n, for example div p:gt(2) selects p elements inside a div whose sibling index is greater than 2
:eq(n)	Elements whose sibling index equals n, for example form input:eq(1)
:has(selector)	Elements that contain an element matching the selector, for example div:has(p) selects div elements that contain a p element
:not(selector)	Elements that do not match the selector, for example div:not(.logo) selects all div elements without class=logo
:contains(text)	Elements that contain the given text, case-insensitive, for example p:contains(oschina)
:containsOwn(text)	Elements that directly contain the given text, ignoring text in descendant elements
:matches(regex)	Elements whose text matches the regular expression, for example div:matches((?i)login)
:matchesOwn(regex)	Elements whose own text matches the regular expression
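The pseudo-selectors in Table 3 can be exercised as follows. The sketch assumes jsoup is on the classpath (and a version that provides Elements.eachText); the markup and the names PseudoSelectorDemo and texts are hypothetical.

```java
import org.jsoup.Jsoup;

public class PseudoSelectorDemo {
    // Hypothetical sample markup: one table row and two div elements.
    static final String HTML =
        "<table><tr><td>A</td><td>B</td><td>C</td><td>D</td></tr></table>"
        + "<div class=\"logo\"><p>Login here</p></div><div>plain</div>";

    // Returns the texts of all matching elements, joined by commas.
    static String texts(String cssQuery) {
        return String.join(",", Jsoup.parse(HTML).select(cssQuery).eachText());
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(3)"));               // A,B,C
        System.out.println(texts("td:gt(2)"));               // D
        System.out.println(texts("td:eq(1)"));               // B
        System.out.println(texts("div:has(p)"));             // Login here
        System.out.println(texts("div:not(.logo)"));         // plain
        System.out.println(texts("p:contains(login)"));      // Login here (case-insensitive)
        // the text "Login here" belongs to the p, not directly to the div
        System.out.println(texts("div:containsOwn(login)").isEmpty()); // true
        System.out.println(texts("div:containsOwn(plain)")); // plain
        System.out.println(texts("div:matches((?i)login)")); // Login here
    }
}
```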

Extract attributes from elements, text and HTML

To get the value of an attribute, use the Node.attr(String key) method
For the text in an element, use the Element.text() method
For HTML content, use Element.html() for the HTML inside the element, or Node.outerHtml() for the HTML including the element itself

Example:

String html = "<p>my <a href='http://itmyhome.com/'><b>blog</b></a> link.</p>";
Document doc = Jsoup.parse(html); // parse the HTML string into a Document
Element link = doc.select("a").first(); // find the first a element
String text = doc.body().text(); // "my blog link." the text of the body
String linkHref = link.attr("href"); // "http://itmyhome.com/" the link address
String linkText = link.text(); // "blog" the text inside the link
String linkOuterH = link.outerHtml(); // "<a href="http://itmyhome.com/"><b>blog</b></a>"
String linkInnerH = link.html(); // "<b>blog</b>" the HTML inside the link
System.out.println(text);
System.out.println(linkHref);
System.out.println(linkText);
System.out.println(linkOuterH);
System.out.println(linkInnerH);

Print:

my blog link.
http://itmyhome.com/
blog
<a href="http://itmyhome.com/"><b>blog</b></a>
<b>blog</b>

Description

The methods above are the core methods for accessing element data. In addition, the following methods are available:

Element.id()
Element.tagName()
Element.className() and Element.hasClass(String className)
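A brief sketch of these additional accessors, assuming jsoup is on the classpath; the markup and the names ElementInfoDemo and sample are invented for the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ElementInfoDemo {
    // Returns a hypothetical div with an id and two class names.
    static Element sample() {
        return Jsoup.parse("<div id=\"main\" class=\"box wide\">x</div>")
                    .getElementById("main");
    }

    public static void main(String[] args) {
        Element el = sample();
        System.out.println(el.id());               // main
        System.out.println(el.tagName());          // div
        System.out.println(el.className());        // box wide
        System.out.println(el.hasClass("wide"));   // true
        System.out.println(el.hasClass("narrow")); // false
    }
}
```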

Modifying data

While parsing a document, we may need to modify some of its elements. For example, we can add clickable links to all the images in the document, change link addresses, or change text.

Here are some simple examples:

doc.select("div.comments a").attr("rel", "nofollow"); // add the rel=nofollow attribute to all links
doc.select("div.comments a").addClass("mylinkclass"); // add the class mylinkclass to all links
doc.select("img").removeAttr("onclick"); // remove the onclick attribute from all images
doc.select("input[type=text]").val(""); // clear the text in all text input boxes

The principle is simple: use a Jsoup selector to find the elements, then change them with the methods above.

After the modification, call the html() method of the Element(s) directly to get the modified HTML.
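The modification calls can be verified on a small fragment by reading the values back. This sketch assumes jsoup is on the classpath; the markup and the names ModifyDemo and modified are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifyDemo {
    // Parses a hypothetical comment block and applies the modifications.
    static Document modified() {
        Document doc = Jsoup.parse(
            "<div class=\"comments\"><a href=\"http://itmyhome.com/\">link</a>"
            + "<img src=\"a.png\" onclick=\"x()\">"
            + "<input type=\"text\" value=\"old\"></div>");
        // Elements setters return the same Elements, so calls can be chained
        doc.select("div.comments a").attr("rel", "nofollow").addClass("mylinkclass");
        doc.select("img").removeAttr("onclick");
        doc.select("input[type=text]").val("");
        return doc;
    }

    public static void main(String[] args) {
        Document doc = modified();
        System.out.println(doc.select("a").first().attr("rel"));             // nofollow
        System.out.println(doc.select("a").first().hasClass("mylinkclass")); // true
        System.out.println(doc.select("img").first().hasAttr("onclick"));    // false
        System.out.println(doc.select("input").first().val().isEmpty());     // true
        System.out.println(doc.body().html()); // the modified markup
    }
}
```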

HTML Document Cleanup

Websites often let users post comments. Some users with bad intentions insert scripts into the comment content.

These scripts can disrupt the behavior of the entire page and, more seriously, steal confidential information. This is known as a cross-site scripting (XSS) attack.

The Jsoup HTML cleaner removes such content, as in the following code:

String unsafe = "<p><a href='http://itmyhome.com/' onclick='stealCookies()'>itmyhome</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
System.out.println(safe);
// Output: <p><a href="http://itmyhome.com/" rel="nofollow">itmyhome</a></p>

Jsoup uses the Whitelist class (named Safelist in newer jsoup releases) to filter the HTML document. It provides several predefined configurations:

none()	Allows only text, no tags
basic()	Allows the tags a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, ul, and their appropriate attributes
simpleText()	Allows only the tags b, em, i, strong, u
basicWithImages()	Allows img tags in addition to basic()
relaxed()	Allows the most tags, including a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul
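The difference between the predefined configurations can be sketched as follows. This example assumes a recent jsoup release in which the class is called Safelist (to my knowledge it replaced Whitelist from version 1.14.2 onward; with older releases, substitute org.jsoup.safety.Whitelist). The input string and the names CleanerDemo and clean are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanerDemo {
    // Hypothetical untrusted input containing an event handler and a script.
    static final String UNSAFE =
        "<p><b>bold</b> <a href='http://itmyhome.com/' onclick='stealCookies()'>itmyhome</a></p>"
        + "<script>alert(1)</script>";

    // Cleans the sample input with the given safelist.
    static String clean(Safelist safelist) {
        return Jsoup.clean(UNSAFE, safelist);
    }

    public static void main(String[] args) {
        System.out.println(clean(Safelist.none()));       // text only, no tags
        System.out.println(clean(Safelist.simpleText())); // keeps <b>, drops <p> and <a>
        System.out.println(clean(Safelist.basic()));      // keeps <p>, <b>, <a href>; adds rel="nofollow"
        // isValid reports whether the input already conforms to the safelist
        System.out.println(Jsoup.isValid(UNSAFE, Safelist.basic())); // false
    }
}
```

In all three cases the onclick attribute and the script element are removed; the configurations differ only in which formatting tags survive.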

Itmyhome
