Detailed Jsoup Select selector syntax
This article is based on the Jsoup Chinese documentation.
Problem
You want to find or manipulate elements using a CSS or jQuery-like selector syntax.
Method
Use the Element.select(String selector) and Elements.select(String selector) methods:
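import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Load an HTML file from the local disk
File input = new File("/tmp/input.html");
// Character encoding of the file, and the base URL used to resolve relative links
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a elements that have an href attribute
Elements pngs = doc.select("img[src$=.png]"); // img elements whose src ends with .png
Element masthead = doc.select("div.masthead").first(); // first div with class=masthead
Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of h3.r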
Description
jsoup elements support a selector syntax similar to CSS (or jQuery) for finding matching elements, which makes very powerful and flexible queries possible.
The select method is available on Document, Element, and Elements objects. It is context-sensitive, so you can filter by selecting from a specific element, or chain select calls.
The select method returns an Elements collection, which provides a range of methods to extract and manipulate the results.
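As a minimal sketch of the context-sensitive and chained use described above, the following example selects first on a Document, then on a single Element, then on an Elements collection. The sample markup, the class name ContextSelectDemo, and the method names are assumptions made up for illustration; they are not part of the original article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.List;

class ContextSelectDemo {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<div class='masthead'><a href='/home'>Home</a><a>No link</a></div>"
      + "<div class='footer'><a href='/about'>About</a></div>";

    /** Selects inside one element, so only links under div.masthead are returned. */
    static List<String> mastheadLinkTexts() {
        Document doc = Jsoup.parse(HTML);
        Element masthead = doc.select("div.masthead").first(); // Document.select
        Elements links = masthead.select("a[href]");           // Element.select, scoped to masthead
        return links.eachText();
    }

    /** Chains select calls on an Elements collection to narrow the result step by step. */
    static List<String> chainedLinkHrefs() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("div").select("a[href]").eachAttr("href"); // Elements.select
    }

    public static void main(String[] args) {
        System.out.println(mastheadLinkTexts());
        System.out.println(chainedLinkHrefs());
    }
}
```

Selecting from the masthead element limits the search to that subtree, while the chained call searches within every div matched by the first select.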
Selector overview
tagname: finds elements by tag name, e.g. a
ns|tag: finds elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: finds elements by ID, e.g. #logo
.class: finds elements by class name, e.g. .masthead
[attribute]: finds elements that have the attribute, e.g. [href]
[^attr]: finds elements by attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: finds elements by attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: find elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
[attr~=regex]: finds elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: matches all elements
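To make the basic selectors above concrete, here is a small sketch that counts the matches of several of them against a tiny document. The markup, the class name AttributeSelectDemo, and the helper count are hypothetical and exist only for this example. Note that inside a Java string literal the regex backslash has to be doubled.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class AttributeSelectDemo {
    // Hypothetical markup, invented for this example.
    static final String HTML =
        "<a href='/path/page'>in path</a>"
      + "<a href='http://example.com/'>external</a>"
      + "<img src='logo.png' width='500'>"
      + "<img src='photo.JPEG'>"
      + "<img src='anim.gif'>"
      + "<div id='logo' class='masthead' data-id='7'></div>";

    /** Returns how many elements of the sample document match the selector. */
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a", "#logo", ".masthead", "[href]", "[^data-]", "[width=500]",
            "[href^=http:]", "img[src$=.png]", "[href*=/path/]",
            "img[src~=(?i)\\.(png|jpe?g)]" // regex \. written as \\. in Java source
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```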
Selector combinations
el#id: element with ID, e.g. div#logo
el.class: element with class, e.g. div.masthead
el[attr]: element with attribute, e.g. a[href]
Any combination of the above, e.g. a[href].highlight
ancestor child: finds child elements that descend from the ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child: finds child elements that descend directly from the parent, e.g. div.content > p finds p elements, and body > * finds the direct children of the body tag
siblingA + siblingB: finds a sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: finds a sibling X element preceded by sibling A, e.g. h1 ~ p
el, el, el: groups multiple selectors and finds the unique elements that match any of them, e.g. div.masthead, div.logo
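The difference between the descendant, child, and sibling combinators listed above is easiest to see side by side. The sketch below is illustrative only: the markup, the class name CombinatorDemo, and the helper texts are assumptions, not part of the original article.

```java
import org.jsoup.Jsoup;

import java.util.List;

class CombinatorDemo {
    // Hypothetical markup: one nested p, three sibling divs, and an h1 followed by siblings.
    static final String HTML =
        "<div class='content'><p>direct</p><blockquote><p>nested</p></blockquote></div>"
      + "<div class='head'>head</div><div>after head</div><div>later</div>"
      + "<h1>Title</h1><p>one</p><ul></ul><p>two</p>";

    /** Returns the text of every element matched by the selector. */
    static List<String> texts(String selector) {
        return Jsoup.parse(HTML).select(selector).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("div.content p"));   // any descendant p
        System.out.println(texts("div.content > p")); // direct child p only
        System.out.println(texts("div.head + div"));  // div immediately after div.head
        System.out.println(texts("div.head ~ div"));  // every later sibling div
        System.out.println(texts("h1 ~ p"));          // p siblings that follow the h1
        System.out.println(texts("div.head, h1"));    // elements matching either selector
    }
}
```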
Pseudo selectors
:lt(n): finds elements whose sibling index (their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3) matches the cells in the first three positions of a row
:gt(n): finds elements whose sibling index is greater than n, e.g. div p:gt(2) matches p elements inside a div whose sibling index is greater than 2
:eq(n): finds elements whose sibling index is equal to n, e.g. form input:eq(1) matches input elements inside a form whose sibling index is 1
:has(selector): finds elements that contain an element matching the selector, e.g. div:has(p) matches every div that contains a p element
:not(selector): finds elements that do not match the selector, e.g. div:not(.logo) matches every div that does not have class=logo
:contains(text): finds elements that contain the given text; the search is case insensitive, e.g. p:contains(jsoup)
:containsOwn(text): finds elements that directly contain the given text
:matches(regex): finds elements whose text matches the specified regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): finds elements whose own text matches the specified regular expression
Note: the indexes used by the pseudo selectors above are 0-based, so the first element has index 0, the second has index 1, and so on
See the Selector API reference for more details
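The following sketch exercises the index-based and filtering pseudo selectors described above on a four-cell table row and two divs. Everything in it (markup, the class name PseudoSelectDemo, the helper texts) is a hypothetical example rather than code from the original article.

```java
import org.jsoup.Jsoup;

import java.util.List;

class PseudoSelectDemo {
    // Hypothetical markup: one table row with four cells, plus two divs.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
      + "<div class='logo'><p>Jsoup rocks</p></div>"
      + "<div>Login here</div>";

    /** Returns the text of every element matched by the selector. */
    static List<String> texts(String selector) {
        return Jsoup.parse(HTML).select(selector).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(2)"));               // sibling index 0 and 1
        System.out.println(texts("td:gt(1)"));               // sibling index 2 and 3
        System.out.println(texts("td:eq(0)"));               // sibling index 0
        System.out.println(texts("div:has(p)"));             // divs that contain a p
        System.out.println(texts("div:not(.logo)"));         // divs without class logo
        System.out.println(texts("p:contains(jsoup)"));      // case-insensitive text search
        System.out.println(texts("div:matches((?i)login)")); // regex on the element text
    }
}
```

Because the index is 0-based, td:lt(2) covers the first two cells and td:gt(1) covers the cells from the third onward.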
How to select elements with multiple class values
Example: <ul class="ul-ss-3 jb-xx-ks">
Method:
Elements select = document.select(".ul-ss-3").select(".jb-xx-ks");
Or
Elements select = document.getElementsByClass("ul-ss-3 jb-xx-ks");
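An alternative that the article does not show is to chain the class selectors in a single query, such as .ul-ss-3.jb-xx-ks, which only matches elements that carry both classes. The sketch below compares it with the chained select call. The markup, including the extra li that reuses one of the class names, and the names MultiClassDemo, compound, and chained are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

class MultiClassDemo {
    // Hypothetical markup: the li reuses one class name to show how the two queries differ.
    static final String HTML =
        "<ul class='ul-ss-3 jb-xx-ks'><li class='jb-xx-ks'>item</li></ul>"
      + "<ul class='ul-ss-3'></ul>";

    /** One compound selector: the element itself must have both classes. */
    static Elements compound() {
        return Jsoup.parse(HTML).select(".ul-ss-3.jb-xx-ks");
    }

    /** Chained select: the second call searches each matched element and its descendants. */
    static Elements chained() {
        return Jsoup.parse(HTML).select(".ul-ss-3").select(".jb-xx-ks");
    }

    public static void main(String[] args) {
        System.out.println(compound().size());
        System.out.println(chained().size());
    }
}
```

With this markup the chained form also returns the nested li, because a select on an Elements collection looks at the matched elements and everything beneath them. The compound selector may be the safer choice when only the element that has both classes is wanted.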
Selector API documentation
Official API source: Selector (jsoup Java HTML Parser 1.11.3 API)
Pattern | Matches | Example
* | any element | *
tag | elements with the given tag name | div
*|E | elements of type E in any namespace | *|name finds <fb:name> elements
ns|E | elements of type E in the namespace ns | fb|name finds <fb:name> elements
#id | elements with an ID attribute of "id" | div#wrap, #logo
.class | elements with a class name of "class" | div.left, .result
[attr] | elements with an attribute named "attr" (with any value) | a[href], [title]
[^attrPrefix] | elements with an attribute name starting with "attrPrefix"; use it to find elements with HTML5 datasets | [^data-], div[^data-]
[attr=val] | elements with an attribute named "attr" and a value equal to "val" | img[width=500], a[rel=nofollow]
[attr="val"] | elements with an attribute named "attr" and a value equal to "val" | span[hello="Cleveland"][goodbye="Columbus"], a[rel="nofollow"]
[attr^=valPrefix] | elements with an attribute named "attr" and a value starting with "valPrefix" | a[href^=http:]
[attr$=valSuffix] | elements with an attribute named "attr" and a value ending with "valSuffix" | img[src$=.png]
[attr*=valContaining] | elements with an attribute named "attr" and a value containing "valContaining" | a[href*=/search/]
[attr~=regex] | elements with an attribute named "attr" and a value matching the regular expression | img[src~=(?i)\\.(png|jpe?g)]
 | The above may be combined in any order | div.header[title]
Combinators
Pattern | Matches | Example
E F | an F element descended from an E element | div a, .logo h1
E > F | an F element that is a direct child of E | ol > li
E + F | an F element immediately preceded by sibling E | li + li, div.head + div
E ~ F | an F element preceded by sibling E | h1 ~ p
E, F, G | all elements matching E, F, or G | a[href], div, h3
Pseudo Selectors
Pattern | Matches | Example
:lt(n) | elements whose sibling index is less than n | td:lt(3) finds the first 3 cells of each row
:gt(n) | elements whose sibling index is greater than n | td:gt(1) finds cells after skipping the first two
:eq(n) | elements whose sibling index is equal to n | td:eq(0) finds the first cell of each row
:has(selector) | elements that contain at least one element matching the selector | div:has(p) finds divs that contain p elements
:not(selector) | elements that do not match the selector. See also Elements.not(String) | div:not(.logo) finds all divs that do not have the "logo" class. div:not(:has(div)) finds divs that do not contain divs.
:contains(text) | elements that contain the specified text. The search is case insensitive. The text may appear in the found element, or in any of its descendants. | p:contains(jsoup) finds p elements containing the text "jsoup".
:matches(regex) | elements whose text matches the specified regular expression. The text may appear in the found element, or in any of its descendants. | td:matches(\\d+) finds table cells containing digits. div:matches((?i)login) finds divs containing the text, case insensitively.
:containsOwn(text) | elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not in any of its descendants. | p:containsOwn(jsoup) finds p elements with own text "jsoup".
:matchesOwn(regex) | elements whose own text matches the specified regular expression. The text must appear in the found element, not in any of its descendants. | td:matchesOwn(\\d+) finds table cells directly containing digits. div:matchesOwn((?i)login) finds divs directly containing the text, case insensitively.
:containsData(data) | elements that contain the specified data. The contents of script and style elements, and comment nodes (etc.), are considered data nodes, not text nodes. The search is case insensitive. The data may appear in the found element, or in any of its descendants. | script:containsData(jsoup) finds script elements containing the data "jsoup".
 | The above may be combined in any order and with other selectors | .light:contains(name):eq(0)
:matchText | treats text nodes as elements, so that you can match and select text nodes. Note that using this selector modifies the DOM, so you may want to clone the document before using it. | p:matchText:firstChild with the input <p>One<br />Two</p> returns one PseudoTextElement with the text "One".
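The distinction between the text of an element, its own text, and its data can be checked with a short sketch like the one below. The markup, the class name TextMatchDemo, and the helper count are hypothetical and serve only as an illustration of the table rows above.

```java
import org.jsoup.Jsoup;

class TextMatchDemo {
    // Hypothetical markup: "great" sits in a nested b, and the script holds data, not text.
    static final String HTML =
        "<div><p>jsoup is <b>great</b></p></div>"
      + "<script>var lib = 'jsoup';</script>";

    /** Returns how many elements of the sample document match the selector. */
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("p:contains(great)"));          // text in a descendant counts
        System.out.println(count("p:containsOwn(great)"));       // own text of p only
        System.out.println(count("b:containsOwn(great)"));       // b holds the text directly
        System.out.println(count("p:matchesOwn(^jsoup)"));       // regex on the own text of p
        System.out.println(count("script:contains(jsoup)"));     // script content is not text
        System.out.println(count("script:containsData(jsoup)")); // script content is data
    }
}
```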
Structural Pseudo Selectors
Pattern | Matches | Example
:root | the element that is the root of the document. In HTML, this is the html element | :root
:nth-child(an+b) | elements that have an+b-1 siblings before them in the document tree, for any positive integer or zero value of n, and that have a parent element. For values of a and b greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder) and selects the bth element of each group. For example, this lets a selector address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The a and b values must be integers (positive, negative, or zero). The index of the first child of an element is 1. In addition, :nth-child() can take odd and even as arguments: odd means the same as 2n+1, and even the same as 2n. | tr:nth-child(2n+1) finds every odd row of a table. :nth-child(10n-1) finds the 9th, 19th, 29th, etc. element. li:nth-child(5) finds the 5th li
:nth-last-child(an+b) | elements that have an+b-1 siblings after them in the document tree. Otherwise like :nth-child() | tr:nth-last-child(-n+2) finds the last two rows of a table
:nth-of-type(an+b) | elements that have an+b-1 siblings with the same expanded element name before them in the document tree, for any zero or positive integer value of n, and that have a parent element | img:nth-of-type(2n+1)
:nth-last-of-type(an+b) | elements that have an+b-1 siblings with the same expanded element name after them in the document tree, for any zero or positive integer value of n, and that have a parent element | img:nth-last-of-type(2n+1)
:first-child | elements that are the first child of some other element | div > p:first-child
:last-child | elements that are the last child of some other element | ol > li:last-child
:first-of-type | elements that are the first sibling of their type in the list of children of their parent element | dl dt:first-of-type
:last-of-type | elements that are the last sibling of their type in the list of children of their parent element | tr > td:last-of-type
:only-child | elements that have a parent element whose only element child is this element |
:only-of-type | elements that have a parent element with no other element children of the same expanded element name |
:empty | elements that have no children at all |
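The an+b notation in the table above can be tried out on a five-item list. The sketch below is a hedged illustration: the markup, the class name StructuralDemo, and the helper texts are assumptions and do not come from the jsoup documentation.

```java
import org.jsoup.Jsoup;

import java.util.List;

class StructuralDemo {
    // Hypothetical markup: an ordered list with five items and a div with a single child.
    static final String HTML =
        "<ol><li>1</li><li>2</li><li>3</li><li>4</li><li>5</li></ol>"
      + "<div><p>only</p></div>";

    /** Returns the text of every element matched by the selector. */
    static List<String> texts(String selector) {
        return Jsoup.parse(HTML).select(selector).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("li:nth-child(2n+1)"));      // odd positions, counting from 1
        System.out.println(texts("li:nth-child(even)"));      // same as 2n
        System.out.println(texts("li:nth-child(5)"));         // exactly the fifth child
        System.out.println(texts("li:nth-last-child(-n+2)")); // the last two children
        System.out.println(texts("ol > li:first-child"));
        System.out.println(texts("ol > li:last-child"));
        System.out.println(texts("p:only-child"));
    }
}
```

Unlike :lt, :gt, and :eq, which use a 0-based sibling index, the nth-child family counts the first child as 1.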