Jsoup Parsing HTML Information

Source: Internet
Author: User

Jsoup Introduction

Jsoup is a Java HTML parser that can parse HTML directly from a URL or from text content. It provides a very convenient API: data can be fetched and manipulated via DOM, CSS, and jQuery-like operations.

The main functions of Jsoup are as follows:
    • 1. Parse HTML from a URL, file, or string

    • 2. Find and extract data using DOM traversal or CSS selectors

    • 3. Manipulate HTML elements, attributes, and text

Jsoup's main class hierarchy: (figure not reproduced)

Document input

Jsoup can load HTML documents from several sources, including strings, URLs, and local files, and it generates a Document object instance from each.

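    // Parse an HTML document directly from a string
    String html = "<body id='body'><p>Parse and traverse an HTML document.</p></body>";
    Document doc = Jsoup.parse(html);

    // Load an HTML document directly from a URL
    Document urlDoc = Jsoup.connect("http://itmyhome.com/").get();
    String title = urlDoc.title();

    // Load an HTML document from a file
    File input = new File("D:/index.html");
    Document fileDoc = Jsoup.parse(input, "UTF-8", "http://itmyhome.com");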

In the third approach, the third argument to parse is optional. It matters because HTML documents usually contain links, images, and references to external scripts and CSS files.

This third parameter, the base URI, tells Jsoup how to resolve relative paths when an HTML document references an external resource. Jsoup resolves such a URL against the base URI when you ask for its absolute form, for example with absUrl("href") or attr("abs:href"); attr("href") still returns the value as written in the document.

For example, with the base URI http://itmyhome.com, <a href=/project>itmyhome</a> resolves to http://itmyhome.com/project.
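
The base URI behaviour can be tried without touching the network. The following is a minimal sketch, assuming Jsoup is on the classpath; the class name BaseUriSketch and the helper absoluteHref are made up for this illustration, while Jsoup.parse(String, String) and absUrl are part of the Jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriSketch {
    // Hypothetical helper: parses the markup with a base URI and returns
    // the absolute form of the first link's href attribute.
    static String absoluteHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link.absUrl("href");
    }

    // Hypothetical helper: returns the href exactly as written in the markup.
    static String rawHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("a").first().attr("href");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/project\">itmyhome</a>";
        String baseUri = "http://itmyhome.com/";
        System.out.println(rawHref(html, baseUri));      // the relative path as written
        System.out.println(absoluteHref(html, baseUri)); // the path resolved against the base URI
    }
}
```

If the document is parsed without a base URI, absUrl has nothing to resolve a relative path against and returns an empty string.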

Data extraction

Use DOM methods to traverse a document

    String html = "<body id='content'><a href='itmyhome.com'>hello</a>"
        + "<a href='blog.itmyhome.com'>jsoup</a></body>";
    Document doc = Jsoup.parse(html);
    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println(linkHref + ", " + linkText);
    }

Output:

    itmyhome.com, hello
    blog.itmyhome.com, jsoup

Description

The Element object provides a series of DOM-like methods to find elements and to extract and manipulate the data in them. The details are as follows:

Finding elements

    • getElementById(String id)

    • getElementsByTag(String tag)

    • getElementsByClass(String className)

    • getElementsByAttribute(String key) (and related methods)

    • Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()

    • Graph: parent(), children(), child(int index)

Element data

    • attr(String key) gets an attribute; attr(String key, String value) sets an attribute

    • attributes() gets all attributes

    • id(), className() and classNames()

    • text() gets the text content; text(String value) sets the text content

    • html() gets the HTML inside the element; html(String value) sets the HTML inside the element

    • outerHtml() gets the HTML of the element including its own tag

    • data() gets data content (for example, of script and style tags)

    • tag() and tagName()

Manipulating HTML and text

    • append(String html), prepend(String html)

    • appendText(String text), prependText(String text)

    • appendElement(String tagName), prependElement(String tagName)

    • html(String value)
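
The finding, reading, and manipulating methods listed above can be combined as in the following minimal sketch. It assumes Jsoup is on the classpath; the class name DomMethodsSketch, the helper buildMenu, and the sample markup are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomMethodsSketch {
    // Hypothetical helper: parses a two-item list, then adds one item at the
    // end with appendElement and one at the start with prepend.
    static Element buildMenu() {
        Document doc = Jsoup.parse(
            "<ul id=\"menu\"><li class=\"item\">one</li><li class=\"item\">two</li></ul>");
        Element menu = doc.getElementById("menu");
        menu.appendElement("li").addClass("item").text("three"); // new last child
        menu.prepend("<li class=\"item\">zero</li>");            // new first child
        return menu;
    }

    public static void main(String[] args) {
        Element menu = buildMenu();
        Element first = menu.child(0);
        System.out.println(menu.children().size());               // number of li children
        System.out.println(first.text());                         // text of the first item
        System.out.println(first.nextElementSibling().text());    // text of the item after it
        System.out.println(menu.getElementsByClass("item").size());
        System.out.println(first.parent().id());                  // id of the enclosing ul
    }
}
```

appendElement returns the newly created child, which is why addClass and text can be chained onto it, whereas prepend returns the element it was called on.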

Use selector syntax to find elements

    Document doc = Jsoup.connect("http://itmyhome.com/").get();
    Elements links = doc.select("a[href]");            // a elements that have an href attribute
    Elements pngs = doc.select("img[src$=.png]");      // images whose src ends in .png
    Element icons = doc.select("span.icon").first();   // first span element with class icon
    Elements resultLinks = doc.select("#header p");    // p elements inside the element with id header

As the example shows, Jsoup uses the same kind of selectors as jQuery to retrieve elements, and Jsoup selectors also support expressions.

The following tables list the syntax of the Jsoup selector in detail.

Table 1. Basic usage:

Selector         Description
tagname          Find elements by tag name, for example a
ns|tag           Find elements by tag in a namespace, for example fb|name finds <fb:name> elements
#id              Find elements by ID, for example #logo
.class           Find elements by class name, for example .head
[attr]           Find elements that have the attribute, for example [href] retrieves all elements with an href attribute
[^attr]          Find elements by attribute name prefix, for example [^data-] finds elements with HTML5 dataset attributes
[attr=value]     Find elements by attribute value, for example [width=500] finds all elements whose width attribute is 500
[attr^=value]    Find elements whose attribute value starts with value
[attr$=value]    Find elements whose attribute value ends with value
[attr*=value]    Find elements whose attribute value contains value
[attr~=regex]    Find elements whose attribute value matches the regular expression, for example img[src~=(?i)\.(png|jpe?g)]
*                Find all elements

The selectors above are the most basic syntax. They can also be combined; the following combinations are supported by Jsoup:

Table 2: Combination usage:

Selector               Description
el#id                  Find elements with the given tag and ID, for example a#logo matches <a id=logo href=...>
el.class               Find elements with the given tag and class, for example div.head matches <div class=head>xxxx</div>
el[attr]               Find elements with the given tag and attribute, for example a[href]
(any combination)      Any combination of the above three, for example a[href]#logo or a[name].outerlink
ancestor child         Find child elements that descend from ancestor at any depth
parent > child         Find child elements that are direct children of parent
siblingA + siblingB    Find the siblingB element immediately preceded by siblingA
siblingA ~ siblingX    Find siblingX elements preceded by siblingA
el, el, el             Group several selectors; finds elements that match any of them

In addition to the basic syntax and combinations, Jsoup also supports filtering elements with expressions (pseudo-selectors). The following expressions are supported by Jsoup:

Table 3: Expressions:

Expression            Description
:lt(n)                Elements whose sibling index is less than n, for example td:lt(3) finds the first three cells of each row
:gt(n)                Elements whose sibling index is greater than n, for example div p:gt(2)
:eq(n)                Elements whose sibling index equals n, for example form input:eq(1)
:has(selector)        Elements that contain an element matching the selector, for example div:has(p) finds div elements containing a p element
:not(selector)        Elements that do not match the selector, for example div:not(.logo) finds all div elements without class=logo
:contains(text)       Elements that contain the given text, case-insensitive, for example p:contains(oschina)
:containsOwn(text)    Elements that directly contain the given text (text of descendants is ignored)
:matches(regex)       Elements whose text matches the regular expression, for example div:matches((?i)login)
:matchesOwn(regex)    Elements whose own text matches the regular expression
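
A few of the combinations and expressions from the tables can be checked against a small document. The following is a minimal sketch, assuming Jsoup is on the classpath; the class name SelectorSketch, the helper count, and the sample markup are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorSketch {
    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
        "<div id=\"header\"><p>Welcome</p><p class=\"note\">Login here</p></div>"
      + "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
      + "<img src=\"logo.PNG\"><img src=\"photo.jpg\"><img src=\"doc.gif\">";

    // Hypothetical helper: counts the elements a CSS query matches in the sample.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "#header > p",                     // direct p children of #header
            "td:lt(3)",                        // cells with sibling index 0, 1, or 2
            "img[src~=(?i)\\.(png|jpe?g)]",    // case-insensitive match on the file extension
            "p:contains(login)",               // case-insensitive text match
            "div:has(p.note)",                 // div elements containing a p with class note
            "p:not(.note)"                     // p elements without class note
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```

Sibling indexes are zero-based, so td:lt(3) selects the first three cells of the row and leaves out the fourth.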

Extract attributes, text, and HTML from elements

    • To get the value of an attribute, use the Node.attr(String key) method

    • To get the text in an element, use the Element.text() method

    • To get the HTML content of an element, use the Element.html() method, or the Node.outerHtml() method to include the element's own tag

Example:

    String html = "<p>my <a href='http://itmyhome.com/'><b>blog</b></a> link.</p>";
    Document doc = Jsoup.parse(html);         // parse the HTML string into a Document
    Element link = doc.select("a").first();   // find the first a element
    String text = doc.body().text();          // "my blog link." (all text in the body)
    String linkHref = link.attr("href");      // "http://itmyhome.com/" (the link address)
    String linkText = link.text();            // "blog" (the text inside the link)
    String linkOuterH = link.outerHtml();     // "<a href="http://itmyhome.com/"><b>blog</b></a>"
    String linkInnerH = link.html();          // "<b>blog</b>" (the HTML inside the link)
    System.out.println(text);
    System.out.println(linkHref);
    System.out.println(linkText);
    System.out.println(linkOuterH);
    System.out.println(linkInnerH);

Output:

    my blog link.
    http://itmyhome.com/
    blog
    <a href="http://itmyhome.com/"><b>blog</b></a>
    <b>blog</b>

Description

The methods above are the core of element data access. In addition, the following methods are available:

    • Element.id()

    • Element.tagName()

    • Element.className() and Element.hasClass(String className)
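
These accessors can be tried on a single element. The following is a minimal sketch, assuming Jsoup is on the classpath; the class name ElementInfoSketch, the helper sample, and the markup are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ElementInfoSketch {
    // Hypothetical helper: parses one div and returns it.
    static Element sample() {
        return Jsoup.parse("<div id=\"main\" class=\"box wide\">text</div>")
                    .getElementById("main");
    }

    public static void main(String[] args) {
        Element div = sample();
        System.out.println(div.id());               // value of the id attribute
        System.out.println(div.tagName());          // name of the tag
        System.out.println(div.className());        // the class attribute as one string
        System.out.println(div.hasClass("wide"));   // whether "wide" is one of the classes
        System.out.println(div.hasClass("narrow"));
    }
}
```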

Modifying data

While parsing a document, we may need to modify some of its elements. For example, we can wrap all images in clickable links, change link addresses, or change text.

Here are some simple examples:

    doc.select("div.comments a").attr("rel", "nofollow");    // add rel=nofollow to every link
    doc.select("div.comments a").addClass("mylinkclass");    // add the class mylinkclass to every link
    doc.select("img").removeAttr("onclick");                 // remove the onclick attribute from all images
    doc.select("input[type=text]").val("");                  // clear the text in all text input boxes

The principle is simple: use a Jsoup selector to find the elements, then change them with the methods above.

After the modifications, call the html() method of the Element(s) directly to get the modified HTML.
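
The whole cycle of selecting, modifying, and reading back the HTML can be sketched as follows. This assumes Jsoup is on the classpath; the class name ModifySketch, the helper modifiedBody, and the sample markup are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifySketch {
    // Hypothetical helper: applies the modifications and returns the body HTML.
    static String modifiedBody() {
        Document doc = Jsoup.parse(
            "<div class=\"comments\"><a href=\"http://itmyhome.com/\">home</a>"
          + "<img src=\"a.png\" onclick=\"track()\"></div>");

        // Elements methods return the same Elements, so calls can be chained.
        doc.select("div.comments a").attr("rel", "nofollow").addClass("mylinkclass");
        doc.select("img").removeAttr("onclick");

        // html() on the body returns the markup with all changes applied.
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(modifiedBody());
    }
}
```

The exact whitespace in the printed HTML depends on the pretty-print settings of the Jsoup version in use, so only the attributes should be relied on, not the line breaks.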

HTML Document Cleanup

Websites often let users post comments. Some malicious users will put scripts into the comment content.

These scripts can disrupt the behavior of the entire page and, more seriously, steal confidential information; this is known as a cross-site scripting (XSS) attack.

Use the Jsoup HTML Cleaner to remove such content, as in the following code:

    String unsafe = "<p><a href='http://itmyhome.com/' onclick='stealCookies()'>itmyhome</a></p>";
    String safe = Jsoup.clean(unsafe, Whitelist.basic());
    // Output: <p><a href="http://itmyhome.com/" rel="nofollow">itmyhome</a></p>

Jsoup uses a Whitelist class to filter the HTML document (newer Jsoup releases name this class Safelist). It provides several predefined configurations:

Method               Description
none()               Allows only text; all HTML is removed
basic()              Allows a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, ul, and appropriate attributes
simpleText()         Allows only the b, em, i, strong, and u tags
basicWithImages()    Same as basic(), plus img tags
relaxed()            Allows the most tags, including a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul
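
The difference between two of these configurations can be seen by cleaning the same input twice. The following is a minimal sketch that assumes a Jsoup release of 1.14.2 or later, where the class is called Safelist (in org.jsoup.safety) rather than Whitelist; on older releases the same calls exist on Whitelist. The class name CleanSketch and the helper clean are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanSketch {
    // Untrusted input with an inline event handler and a script element.
    static final String UNSAFE =
        "<p><a href='http://itmyhome.com/' onclick='stealCookies()'>itmyhome</a>"
      + "<script>alert(1)</script></p>";

    // Hypothetical helper: cleans the input against the given safelist.
    static String clean(Safelist safelist) {
        return Jsoup.clean(UNSAFE, safelist);
    }

    public static void main(String[] args) {
        // basic() keeps the p and a tags but drops onclick and the script element.
        System.out.println(clean(Safelist.basic()));
        // none() drops every tag and keeps only the text.
        System.out.println(clean(Safelist.none()));
    }
}
```

Choosing the narrowest configuration that still supports the formatting users need keeps the accepted markup, and so the attack surface, as small as possible.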
