Jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
Jsoup is particularly strong at finding elements in a document. The select method returns an Elements collection, which provides a set of methods for extracting and manipulating the results. To master Jsoup, first become familiar with its selector syntax.
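As a minimal sketch of the parse-then-select flow described above, the following example parses an HTML string and collects the href value of every link. It assumes the jsoup library is on the classpath; the class name SelectIntro, the helper linkTargets, and the sample HTML are made up for illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectIntro {
    // Parses the HTML and returns the href value of every <a> element that has one.
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);        // parse an HTML string into a document
        Elements links = doc.select("a[href]");  // select returns an Elements collection
        return links.eachAttr("href");           // extract one attribute from each match
    }

    public static void main(String[] args) {
        // Illustrative sample input, not taken from a real page.
        String html = "<div id='logo'><a href='/home'>Home</a><a>No target</a></div>";
        System.out.println(linkTargets(html));
    }
}
```

The second link has no href attribute, so the `a[href]` selector skips it.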
1. Basic selector syntax
- tagname: finds elements by tag name, for example: a
- ns|tag: finds elements by tag in a namespace, for example: fb|name finds <fb:name> elements
- #id: finds elements by ID, for example: #logo
- .class: finds elements by class name, for example: .masthead
- [attribute]: finds elements that have the attribute, for example: [href]
- [^attr]: finds elements with an attribute name prefix, for example: [^data-] finds elements with HTML5 dataset attributes
- [attr=value]: finds elements whose attribute has the given value, for example: [width=500]
- [attr^=value], [attr$=value], [attr*=value]: finds elements whose attribute value starts with, ends with, or contains the given value, for example: [href*=/path/]
- [attr~=regex]: finds elements whose attribute value matches the regular expression, for example: img[src~=(?i)\.(png|jpe?g)]
- *: matches all elements
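The basic selectors listed above can be tried against a small document. This is a sketch that assumes jsoup is on the classpath; the class BasicSelectors, the helper count, and the sample HTML are hypothetical and exist only to show how each selector is passed to select.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BasicSelectors {
    // Illustrative sample: one div holding a link and two images.
    static final String HTML =
        "<div id='logo' class='masthead' data-role='banner'>"
        + "<a href='/path/page.html'>Page</a>"
        + "<img src='photo.JPG' width='500'>"
        + "<img src='icon.gif'>"
        + "</div>";

    // Returns how many elements in the sample match the given selector.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "img",                           // by tag name
            "#logo",                         // by ID
            ".masthead",                     // by class name
            "[href]",                        // has the attribute
            "[^data-]",                      // attribute name prefix
            "[width=500]",                   // exact attribute value
            "[href*=/path/]",                // attribute value contains
            "[src$=.gif]",                   // attribute value ends with
            "img[src~=(?i)\\.(png|jpe?g)]"   // attribute value matches a regex
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```

Note that the backslash in the regular expression has to be doubled inside a Java string literal. Because of the (?i) flag, the regex selector matches photo.JPG but not icon.gif.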
2. Selector combination syntax
- el#id: element with an ID, for example: div#logo
- el.class: element with a class, for example: div.masthead
- el[attr]: element with an attribute, for example: a[href]
- Any combination of the above, for example: a[href].highlight
- ancestor child: finds descendants of an element at any depth, for example: .body p finds all p elements anywhere under an element with class "body"
- parent > child: finds direct children of a parent element, for example: div.content > p finds p elements that are direct children of div.content, and body > * finds all direct children of the body tag
- siblingA + siblingB: finds a sibling B element that immediately follows sibling A, for example: div.head + div
- siblingA ~ siblingX: finds sibling X elements that come after sibling A, for example: h1 ~ p
- el, el, el: groups multiple selectors and finds the unique elements that match any of them, for example: div.masthead, div.logo
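The difference between the descendant, child, and sibling combinators above is easiest to see on a concrete tree. The following sketch assumes jsoup is on the classpath; the class Combinators, the helper texts, and the sample HTML are invented for this illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class Combinators {
    // Illustrative sample: two divs followed by a heading and some siblings.
    static final String HTML =
        "<div class='head'>Head</div>"
        + "<div class='content'><p>One</p><section><p>Nested</p></section></div>"
        + "<h1>Title</h1><p>After1</p><span>x</span><p>After2</p>";

    // Returns the text of each element in the sample that matches the selector.
    static List<String> texts(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("body p"));          // descendants at any depth
        System.out.println(texts("div.content > p")); // direct children only
        System.out.println(texts("div.head + div"));  // the div right after div.head
        System.out.println(texts("h1 ~ p"));          // p siblings that follow the h1
        System.out.println(texts("div.head, h1"));    // elements matching either selector
    }
}
```

Here `body p` returns all four paragraphs, while `div.content > p` returns only "One" because "Nested" sits one level deeper inside the section. `h1 ~ p` returns "After1" and "After2", since only those paragraphs share a parent with the h1 and come after it.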
3. Pseudo-selector syntax
- :lt(n): finds elements whose sibling index (their position among the child elements of their parent in the DOM tree) is less than n, for example: td:lt(3) finds the first three cells of each row
- :gt(n): finds elements whose sibling index is greater than n, for example: div p:gt(2) finds p elements under a div whose sibling index is greater than 2
- :eq(n): finds elements whose sibling index is equal to n, for example: form input:eq(1) finds input elements under a form whose sibling index is 1
- :has(selector): finds elements that contain at least one element matching the selector, for example: div:has(p) finds divs that contain a p element
- :not(selector): finds elements that do not match the selector, for example: div:not(.logo) finds all divs that do not have class="logo"
- :contains(text): finds elements that contain the given text, either directly or in a descendant; the search is case-insensitive, for example: p:contains(jsoup)
- :containsOwn(text): finds elements that directly contain the given text
- :matches(regex): finds elements whose text matches the specified regular expression, for example: div:matches((?i)login)
- :matchesOwn(regex): finds elements whose own text matches the specified regular expression
Note: The indexes used by the pseudo-selectors above start at 0, which means that the first element has an index of 0, the second element has an index of 1, and so on.
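To illustrate the zero-based indexes and the text-matching pseudo-selectors, here is a small sketch. It assumes jsoup is on the classpath; the class PseudoSelectors, the helper texts, and the sample HTML are hypothetical.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PseudoSelectors {
    // Illustrative sample: one table row with four cells, then two divs.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
        + "<div class='logo'><p>Jsoup rocks</p></div>"
        + "<div><span>Login here</span></div>";

    // Returns the text of each element in the sample that matches the selector.
    static List<String> texts(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(3)"));                // cells with index 0, 1, 2
        System.out.println(texts("td:gt(2)"));                // cells with index above 2
        System.out.println(texts("td:eq(1)"));                // the cell with index 1
        System.out.println(texts("div:has(p)"));              // divs containing a p
        System.out.println(texts("div:not(.logo)"));          // divs without class logo
        System.out.println(texts("p:contains(jsoup)"));       // case-insensitive text search
        System.out.println(texts("div:matches((?i)login)"));  // regex on the element text
        System.out.println(texts("div:containsOwn(login)"));  // own text only
        System.out.println(texts("span:containsOwn(login)")); // own text only
    }
}
```

Because indexes start at 0, `td:eq(1)` selects the second cell ("b"), and `td:lt(3)` selects "a", "b", and "c". The last two queries show the difference between the plain and "Own" variants: the text "Login here" belongs to the span, so `div:containsOwn(login)` matches nothing while `span:containsOwn(login)` matches the span.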