Jsoup Introduction
Jsoup is a Java HTML parser that can parse HTML directly from a URL or from text content. It provides a very convenient API
for fetching and manipulating data using DOM methods, CSS selectors, and jQuery-like operations.
The main functions of Jsoup are as follows:
1. Parse HTML from a URL, file, or string
2. Find and extract data using DOM traversal or CSS selectors
3. Manipulate HTML elements, attributes, and text
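Jsoup's main class hierarchy: Node is the base class, Element extends Node, and Document extends Element. Elements is a list of Element objects.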
Document input
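Jsoup can load HTML documents from several sources, including strings, URLs, and local files, and produces a Document object in each case.
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Parse an HTML document directly from a string
String html = "<body id='body'><p>Parse and traverse an HTML document.</p></body>";
Document doc = Jsoup.parse(html);

// Load an HTML document directly from a URL
Document urlDoc = Jsoup.connect("http://itmyhome.com/").get();
String title = urlDoc.title();

// Load an HTML document from a file
File input = new File("D:/index.html");
Document fileDoc = Jsoup.parse(input, "UTF-8", "http://itmyhome.com");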
In the third form, the third parameter of the parse method is optional. It exists because an HTML document usually contains many links, images, and references to external scripts and CSS files.
This third parameter, the base URI, tells Jsoup which prefix to use when the document references an external resource with a relative path.
For example, with the base URI http://itmyhome.com, the relative link <a href=/project>itmyhome</a>
resolves to http://itmyhome.com/project when it is read with link.absUrl("href") or link.attr("abs:href").
A plain link.attr("href") still returns the value as written, /project.
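The difference between the raw attribute and the resolved URL can be sketched as follows. This is a minimal illustration, assuming the jsoup library is on the classpath; the class name BaseUriSketch and its helper methods are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class: shows how a base URI affects relative links.
class BaseUriSketch {
    // Parses a one-link document against the given base URI and returns the link.
    static Element firstLink(String baseUri) {
        String html = "<a href='/project'>itmyhome</a>";
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("a").first();
    }

    // The href exactly as written in the markup.
    static String rawHref(String baseUri) {
        return firstLink(baseUri).attr("href");
    }

    // The href resolved against the base URI.
    static String absHref(String baseUri) {
        return firstLink(baseUri).absUrl("href");
    }

    public static void main(String[] args) {
        String base = "http://itmyhome.com";
        System.out.println(rawHref(base));
        System.out.println(absHref(base));
        // The abs: prefix is an alternative spelling of absUrl.
        System.out.println(firstLink(base).attr("abs:href"));
    }
}
```

When no base URI is supplied for a string, absUrl has nothing to resolve against and returns an empty string for relative links.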
Data extraction
Using DOM methods to traverse a document
String html = "<body id='content'><a href='itmyhome.com'>hello</a>"
        + "<a href='blog.itmyhome.com'>jsoup</a></body>";
Document doc = Jsoup.parse(html);
Element content = doc.getElementById("content"); // the element with id "content"
Elements links = content.getElementsByTag("a");  // all a elements inside it
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
    System.out.println(linkHref + ", " + linkText);
}
Print
itmyhome.com, hello
blog.itmyhome.com, jsoup
Description
Element objects provide a series of DOM-like methods for finding elements and for extracting and manipulating their data, as follows:
Find element
getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)
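A few of these lookup and traversal methods can be tried on a small document. The sketch below assumes jsoup is on the classpath; the class name TraversalSketch and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class: DOM-style lookup and sibling/parent traversal.
class TraversalSketch {
    static final String HTML =
        "<ul id='menu'><li class='item'>one</li><li class='item'>two</li><li>three</li></ul>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element menu = doc.getElementById("menu");
        Elements items = doc.getElementsByClass("item");
        Element second = menu.child(1); // child indexes are zero-based

        System.out.println(items.size());                           // li elements with class "item"
        System.out.println(second.text());                          // text of the second li
        System.out.println(second.previousElementSibling().text()); // the li before it
        System.out.println(second.nextElementSibling().text());     // the li after it
        System.out.println(second.parent().id());                   // id of the enclosing ul
        System.out.println(second.siblingElements().size());        // siblings, excluding itself
    }
}
```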
Element data
attr(String key) gets an attribute; attr(String key, String value) sets an attribute
attributes() gets all attributes
id(), className() and classNames()
text() gets the text content; text(String value) sets the text content
html() gets the HTML inside the element; html(String value) sets the HTML inside the element
outerHtml() gets the HTML of the element including its own tag
data() gets data content (for example, of script and style tags)
tag() and tagName()
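The accessors listed above can be exercised as in the following sketch, which assumes jsoup is on the classpath. The class name ElementDataSketch and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class: reading attributes, text, HTML and script data.
class ElementDataSketch {
    static final String HTML =
        "<div id='main' class='box wide' title='t'>"
        + "<p>Hello <b>world</b></p>"
        + "<script>var x = 1;</script>"
        + "</div>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        Element div = doc.getElementById("main");
        Element p = doc.select("p").first();
        Element script = doc.select("script").first();

        System.out.println(div.id());                // value of the id attribute
        System.out.println(div.tagName());           // tag name of the element
        System.out.println(div.className());         // class attribute as one string
        System.out.println(div.classNames());        // class names as a set
        System.out.println(div.attr("title"));       // a single attribute
        System.out.println(div.attributes().size()); // number of attributes
        System.out.println(p.text());                // combined text without tags
        System.out.println(p.html());                // inner HTML
        System.out.println(p.outerHtml());           // HTML including the p tag itself
        System.out.println(script.data());           // script content is data, not text
    }
}
```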
Manipulating HTML and text
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
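As a rough illustration of these manipulation methods, the sketch below builds up the content of a div step by step. It assumes jsoup is on the classpath; the class name ManipulationSketch and the markup are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class: adding HTML, elements and text to an existing element.
class ManipulationSketch {
    // Returns a div whose content has been extended with the methods listed above.
    static Element build() {
        Document doc = Jsoup.parse("<div id='box'><p>middle</p></div>");
        Element box = doc.getElementById("box");
        box.prepend("<p>first</p>");             // parsed HTML inserted before existing children
        box.append("<p>last</p>");               // parsed HTML inserted after existing children
        box.appendElement("span").text("tail");  // new empty element, then its text is set
        box.appendText(" end");                  // plain text node, not an element
        return box;
    }

    public static void main(String[] args) {
        Element box = build();
        System.out.println(box.children().size()); // element children only; text nodes are not counted
        System.out.println(box.html());
    }
}
```

Note that append and prepend parse their argument as HTML, while appendText and prependText insert the argument as literal text.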
Use selector syntax to find elements
Document doc = Jsoup.connect("http://itmyhome.com/").get();
Elements links = doc.select("a[href]");           // a elements that have an href attribute
Elements pngs = doc.select("img[src$=.png]");     // images whose src ends with .png
Element icon = doc.select("span.icon").first();   // the first span with class icon
Elements resultLinks = doc.select("#header p");   // p elements inside the element with id header
As shown above, Jsoup uses a selector syntax similar to jQuery's to retrieve elements, and its selectors also support filter expressions.
The following tables list the syntax supported by the Jsoup selector.
Table 1. Basic usage:
Selector | Description
tagname | Find elements by tag name, for example a
ns|tag | Find elements by tag in a namespace, for example fb|name finds <fb:name> elements
#id | Find elements by ID, for example #logo
.class | Find elements by class name, for example .head
[attribute] | Find elements that have the attribute, for example [href] finds all elements with an href attribute
[^attr] | Find elements by attribute name prefix, for example [^data-] finds elements with HTML5 dataset attributes
[attr=value] | Find elements by attribute value, for example [width=500] finds all elements whose width attribute is 500
[attr^=value], [attr$=value], [attr*=value] | Find elements whose attribute value starts with, ends with, or contains value, respectively
[attr~=regex] | Find elements whose attribute value matches a regular expression, for example img[src~=(?i)\.(png|jpe?g)]
* | Find all elements
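The basic selectors in Table 1 can be checked against a small document. The following is a sketch assuming jsoup is on the classpath; the class name BasicSelectorSketch, the count helper, and the markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class: counts how many elements each basic selector matches.
class BasicSelectorSketch {
    static final String HTML =
        "<div id='logo' class='head' data-role='banner' width='500'>"
        + "<img src='a.png'><img src='b.JPG'><img src='c.gif'>"
        + "<a href='/x'>x</a><a name='y'>y</a>"
        + "</div>";

    // Number of elements in the sample document matched by the CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a", "#logo", ".head", "[href]", "[^data-]", "[width=500]",
            "img[src^=a]", "img[src$=.png]", "img[src*=.]",
            "img[src~=(?i)\\.(png|jpe?g)]"
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```

The backslash in the regular expression is doubled only because it sits inside a Java string literal.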
The above is the most basic selector syntax. These selectors can also be combined; Jsoup supports the following combinations:
Table 2: Combination usage:
Selector | Description
el#id | Find an element by tag and ID, for example a#logo matches <a id=logo href= ... >
el.class | Find elements by tag and class, for example div.head matches <div class=head>xxxx</div>
el[attr] | Find elements by tag that have the attribute, for example a[href]
Any combination of the above three | For example a[href]#logo, a[name].outerlink
ancestor child | Find child elements that descend from ancestor at any depth
parent > child | Find child elements that are direct children of parent
siblingA + siblingB | Find sibling B elements immediately preceded by sibling A
siblingA ~ siblingX | Find sibling X elements preceded by sibling A
el, el, el | Group several selectors; finds elements that match any of them
The last five rows are the combinators, which express relationships between elements: descendant, parent-child, sibling, and grouping.
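To make the difference between the combinators concrete, here is a sketch that counts matches on a small document. It assumes jsoup is on the classpath; the class name CombinatorSketch, the count helper, and the markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class: compares descendant, child, sibling and grouping selectors.
class CombinatorSketch {
    static final String HTML =
        "<div class='body'>"
        + "<p>one</p>"
        + "<section><p>two</p></section>"
        + "<h1>title</h1>"
        + "<p>three</p>"
        + "<p>four</p>"
        + "<a id='logo' href='/'>home</a>"
        + "</div>";

    // Number of elements in the sample document matched by the CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count(".body p"));      // every p under .body, at any depth
        System.out.println(count(".body > p"));    // only p elements that are direct children
        System.out.println(count("h1 + p"));       // the p immediately after the h1
        System.out.println(count("h1 ~ p"));       // every p sibling that follows the h1
        System.out.println(count("h1, section"));  // elements matching either selector
        System.out.println(count("a[href]#logo")); // tag, attribute and id combined
    }
}
```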
In addition to the basic syntax and combinations, Jsoup also supports filtering elements with expressions. The following expressions are supported (sibling indexes are zero-based):
Table 3: Expressions:
Expression | Description
:lt(n) | Find elements whose sibling index is less than n, for example td:lt(3) finds cells with sibling index 0, 1, or 2
:gt(n) | Find elements whose sibling index is greater than n, for example div p:gt(2) finds p elements inside a div with sibling index above 2
:eq(n) | Find elements whose sibling index is equal to n, for example form input:eq(1) finds inputs with sibling index 1
:has(selector) | Find elements that contain an element matching the selector, for example div:has(p) finds divs that contain a p element
:not(selector) | Find elements that do not match the selector, for example div:not(.logo) finds all divs that do not have class logo
:contains(text) | Find elements that contain the given text, case-insensitively, in their own text or in their descendants, for example p:contains(oschina)
:containsOwn(text) | Find elements whose own text, excluding descendants, contains the given text
:matches(regex) | Find elements whose text matches a regular expression, for example div:matches((?i)login)
:matchesOwn(regex) | Find elements whose own text, excluding descendants, matches the regular expression
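The expressions in Table 3 can be tried out as follows. This is only a sketch, assuming jsoup is on the classpath; the class name PseudoSelectorSketch, the count helper, and the markup are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class: index, structural and text-based filter expressions.
class PseudoSelectorSketch {
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
        + "<div id='x'><p>Login here</p></div>"
        + "<div id='y' class='logo'>plain</div>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // Number of elements in the sample document matched by the CSS query.
    static int count(String cssQuery) {
        return parse().select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(3)"));                     // cells with index 0, 1, 2
        System.out.println(count("td:gt(2)"));                     // cells with index above 2
        System.out.println(parse().select("td:eq(1)").text());     // the cell with index 1
        System.out.println(count("div:has(p)"));                   // divs containing a p
        System.out.println(count("div:not(.logo)"));               // divs without class logo
        System.out.println(count("div:contains(login)"));          // text anywhere inside the div
        System.out.println(count("div:containsOwn(login)"));       // text directly in the div only
        System.out.println(count("p:containsOwn(login)"));         // text directly in the p
        System.out.println(count("div:matches((?i)login)"));       // regex on the div's full text
        System.out.println(count("div:matchesOwn((?i)login)"));    // regex on the div's own text
    }
}
```

The contrast between contains and containsOwn shows up on the first div: its text comes entirely from the nested p, so only the non-Own variants match it.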
Extract attributes, text, and HTML from elements
To get the value of an attribute, use the Node.attr(String key) method
To get the text of an element, use the Element.text() method
To get the HTML of an element, use the Element.html() method for its inner HTML or the Node.outerHtml() method for the HTML including the element itself
Example:
String html = "<p>my <a href='http://itmyhome.com/'><b>blog</b></a> link.</p>";
Document doc = Jsoup.parse(html);       // parse the HTML string into a Document
Element link = doc.select("a").first(); // find the first a element
String text = doc.body().text();        // "my blog link." : the text of the body
String linkHref = link.attr("href");    // "http://itmyhome.com/" : the link address
String linkText = link.text();          // "blog" : the text inside the link
String linkOuterH = link.outerHtml();   // "<a href="http://itmyhome.com/"><b>blog</b></a>"
String linkInnerH = link.html();        // "<b>blog</b>" : the HTML inside the link
System.out.println(text);
System.out.println(linkHref);
System.out.println(linkText);
System.out.println(linkOuterH);
System.out.println(linkInnerH);
Print:
my blog link.
http://itmyhome.com/
blog
<a href="http://itmyhome.com/"><b>blog</b></a>
<b>blog</b>
Description
The methods above are the core of element data access. Other accessors are also available, such as Element.id(), Element.tagName(), Element.className() and Element.hasClass(String className).
Modifying data
After parsing a document, we may need to modify some of its elements. For example, we can wrap every image in a clickable link, change link addresses, or change text.
Here are some simple examples:
doc.select("div.comments a").attr("rel", "nofollow");   // add rel=nofollow to every link in div.comments
doc.select("div.comments a").addClass("mylinkclass");   // add the class mylinkclass to those links
doc.select("img").removeAttr("onclick");                // remove the onclick attribute from all images
doc.select("input[type=text]").val("");                 // clear the value of all text inputs
The principle is simple: use a Jsoup selector to find the elements, then change them with the methods shown above.
After the changes, calling the html() method of the Element(s) returns the modified HTML.
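The effect of these modifications can be verified by selecting on the changed attributes afterwards. The sketch below assumes jsoup is on the classpath; the class name ModifySketch and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class: applies bulk modifications through Elements.
class ModifySketch {
    static final String HTML =
        "<div class='comments'><a href='/u1'>u1</a><a href='/u2'>u2</a></div>"
        + "<img src='a.png' onclick='x()'>"
        + "<input type='text' value='abc'>";

    // Parses the sample markup, applies the modifications and returns the document.
    static Document modified() {
        Document doc = Jsoup.parse(HTML);
        // Elements methods return the same Elements, so calls can be chained.
        doc.select("div.comments a").attr("rel", "nofollow").addClass("mylinkclass");
        doc.select("img").removeAttr("onclick");
        doc.select("input[type=text]").val("");
        return doc;
    }

    public static void main(String[] args) {
        Document doc = modified();
        System.out.println(doc.select("a[rel=nofollow].mylinkclass").size()); // links that were changed
        System.out.println(doc.select("img[onclick]").size());                // images still carrying onclick
        System.out.println(doc.body().html());                                // the modified markup
    }
}
```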
HTML Document Cleanup
Websites often let users post comments. Some malicious users inject scripts into their comment content.
These scripts can disrupt the behavior of the whole page and, more seriously, steal confidential information. This is known as a cross-site scripting (XSS) attack.
Jsoup's HTML cleaner removes such content, as in the following code:
String unsafe = "<p><a href='http://itmyhome.com/' onclick='stealCookies()'>itmyhome</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// Output: <p><a href="http://itmyhome.com/" rel="nofollow">itmyhome</a></p>
Jsoup uses the Whitelist class (renamed Safelist as of jsoup 1.14.2) to filter the HTML document. It provides several ready-made configurations:
Method | Description
none() | Allows only text; all tags are removed
basic() | Allows the tags a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, ul, and the appropriate attributes
simpleText() | Allows only the tags b, em, i, strong, u
basicWithImages() | Allows everything in basic() plus img tags
relaxed() | Allows the widest range of tags, including a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul
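The difference between two of these configurations can be seen by cleaning the same input with each. The sketch below assumes jsoup 1.14.2 or later, where the class is named Safelist; on older versions, Whitelist from the same package would be substituted. The class name CleanSketch is invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical demo class: cleans untrusted HTML with two different safelists.
class CleanSketch {
    static final String UNSAFE =
        "<p><a href='http://itmyhome.com/' onclick='stealCookies()'>itmyhome</a></p>";

    // Keeps basic formatting tags and links, drops event handlers such as onclick.
    static String cleanBasic() {
        return Jsoup.clean(UNSAFE, Safelist.basic());
    }

    // Keeps text only; every tag is removed.
    static String cleanNone() {
        return Jsoup.clean(UNSAFE, Safelist.none());
    }

    public static void main(String[] args) {
        System.out.println(cleanBasic());
        System.out.println(cleanNone());
    }
}
```

Cleaning should be applied to any user-supplied HTML before it is stored or rendered, choosing the narrowest configuration that still meets the site's needs.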
Itmyhome