Jsoup: a usage summary for parsing HTML
1. Parsing methods
(1) Parsing from a string
String html = "<body><p>Parse HTML into a doc.</p></body>";
Document doc = Jsoup.parse(html);
(2) Parsing from a URL
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
Document doc = Jsoup.connect("http://example.com")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post();
(3) Parsing from a file
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
2. DOM-Based Element Traversal
(1) Search Elements
getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key)
siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
parent(), children(), child(int index)
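The lookup and traversal methods listed above can be tried out with a small sketch like the following. It assumes jsoup is on the classpath; the class name DomLookupSketch and the sample markup are hypothetical and exist only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomLookupSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<div id=\"main\"><h1 class=\"title\">Hello</h1>"
        + "<p class=\"note\">first</p><p lang=\"en\">second</p></div>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    // Look up a single element by its id attribute.
    static Element mainDiv() {
        return doc().getElementById("main");
    }

    // Count all <p> elements in the document.
    static int paragraphCount() {
        return doc().getElementsByTag("p").size();
    }

    // Text of the first element carrying class "note".
    static String noteText() {
        return doc().getElementsByClass("note").first().text();
    }

    // Text of the first element that has a lang attribute.
    static String langText() {
        return doc().getElementsByAttribute("lang").first().text();
    }

    // Text of the element sibling that follows the first child (<h1>).
    static String afterHeading() {
        return mainDiv().child(0).nextElementSibling().text();
    }

    public static void main(String[] args) {
        System.out.println(paragraphCount());            // 2
        System.out.println(noteText());                  // first
        System.out.println(langText());                  // second
        System.out.println(afterHeading());              // first
        System.out.println(mainDiv().children().size()); // 3
        System.out.println(mainDiv().parent().tagName()); // body
    }
}
```

Note that getElementById returns a single Element (or null when nothing matches), while the getElementsBy... methods return an Elements collection.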
(2) Retrieving Element Data
attr(String key) - gets the value of the attribute named key
attributes() - gets all attributes
id(), className(), classNames()
text() - gets the text content
html() - gets the HTML inside the element
outerHtml() - gets the HTML including the element itself
data() - gets the content of a <script> or <style> tag
tag(), tagName()
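As a rough illustration of how these accessors differ, here is a minimal sketch. It assumes jsoup is on the classpath; the class name ElementDataSketch and the sample markup are hypothetical. Exact whitespace in html() and outerHtml() depends on the output settings, so the sketch does not rely on it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementDataSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<div id=\"box\" class=\"card wide\" title=\"demo\">"
        + "<p>Hi <b>there</b></p></div>"
        + "<script>var x = 1;</script>";

    static final Document DOC = Jsoup.parse(HTML);

    static Element box() {
        return DOC.getElementById("box");
    }

    // data() returns the raw content of a <script> or <style> element.
    static String scriptData() {
        return DOC.select("script").first().data();
    }

    public static void main(String[] args) {
        Element box = box();
        System.out.println(box.attr("title"));        // demo
        System.out.println(box.attributes().size());  // 3 (id, class, title)
        System.out.println(box.id());                 // box
        System.out.println(box.className());          // card wide
        System.out.println(box.classNames().size());  // 2
        System.out.println(box.text());               // Hi there
        System.out.println(box.html());               // inner HTML: the <p>...</p>
        System.out.println(box.outerHtml());          // the <div> itself plus its content
        System.out.println(box.tagName());            // div
        System.out.println(scriptData());             // var x = 1;
    }
}
```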
3. Selector syntax (what sets jsoup apart from other parsers is that you can use jQuery-like selector syntax to find and filter the elements you need)
(1) Basic Selector
tagname: finds elements by tag name
ns|tag: finds elements by tag name within a namespace, e.g. fb|name finds <fb:name> elements
#id: finds the element with the specified id
.class: finds elements with the specified class name
[attr]: finds elements that have the attribute attr
[^attr]: finds elements that have an attribute whose name starts with attr
[attr=value]: finds elements whose attr attribute equals value
[attr^=value], [attr$=value], [attr*=value]: find elements whose attr attribute value starts with, ends with, or contains value, e.g. [href*=/path/]
[attr~=regex]: finds elements whose attr attribute value matches the regular expression regex
*: finds all elements
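The basic selectors above can be checked against a small document. The following is only a sketch: it assumes jsoup is on the classpath, and the class name BasicSelectorSketch, the count helper, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicSelectorSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<div id=\"nav\" class=\"menu\">"
        + "<a href=\"/path/a.html\">A</a>"
        + "<a href=\"http://example.com/b.pdf\" data-kind=\"doc\">B</a>"
        + "<a name=\"top\">C</a>"
        + "</div><p class=\"menu\">text</p>";

    static final Document DOC = Jsoup.parse(HTML);

    // Number of elements matched by a CSS-style selector.
    static int count(String css) {
        return DOC.select(css).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a"));                // 3: by tag name
        System.out.println(count("#nav"));             // 1: by id
        System.out.println(count(".menu"));            // 2: by class (the div and the p)
        System.out.println(count("[href]"));           // 2: has an href attribute
        System.out.println(count("[^data-]"));         // 1: attribute name starts with data-
        System.out.println(count("[name=top]"));       // 1: attribute equals value
        System.out.println(count("[href^=http]"));     // 1: value starts with http
        System.out.println(count("[href$=.pdf]"));     // 1: value ends with .pdf
        System.out.println(count("[href*=/path/]"));   // 1: value contains /path/
        System.out.println(count("[href~=\\.html$]")); // 1: value matches the regex
        System.out.println(count("*"));                // every element, including html, head, body
    }
}
```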
(2) Selector combinations
el#id: matches on both tag name and id
el.class: matches on both tag name and class
el[attr]: matches on both tag name and attribute name
Any combination of the three above, e.g. a[href].highlight
ancestor child: descendant, e.g. div.content p finds <p> elements anywhere inside <div class="content">
parent > child: direct child, e.g. div.content > p finds <p> elements directly under <div class="content">; div.content > * finds all elements directly under <div class="content">
siblingA + siblingB: adjacent sibling, e.g. div.head + div finds the <div> that immediately follows <div class="head">
siblingA ~ siblingX: general sibling, e.g. h1 ~ p finds <p> elements that follow an <h1> with the same parent
el, el, el: groups multiple selectors and finds elements that match any of them
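The difference between the descendant, child, and sibling combinators is easiest to see on a concrete tree. The sketch below assumes jsoup is on the classpath; the class name CombinatorSketch, the count helper, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CombinatorSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<div class=\"head\">Header</div>"
        + "<div id=\"main\" class=\"content\">"
        + "<p>one</p><section><p>two</p></section>"
        + "<a href=\"/x\" class=\"highlight\">link</a>"
        + "</div>"
        + "<h1>Title</h1><p>three</p><p>four</p>";

    static final Document DOC = Jsoup.parse(HTML);

    // Number of elements matched by a CSS-style selector.
    static int count(String css) {
        return DOC.select(css).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div#main"));          // 1: tag plus id
        System.out.println(count("div.content"));       // 1: tag plus class
        System.out.println(count("a[href].highlight")); // 1: tag, attribute, and class
        System.out.println(count("div.content p"));     // 2: any depth ("one" and "two")
        System.out.println(count("div.content > p"));   // 1: direct children only ("one")
        System.out.println(count("div.content > *"));   // 3: p, section, a
        System.out.println(count("div.head + div"));    // 1: the div right after div.head
        System.out.println(count("h1 ~ p"));            // 2: "three" and "four"
        System.out.println(count("h1, section"));       // 2: matches either selector
    }
}
```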
(3) Pseudo selectors (conditional selectors)
:lt(n): finds elements whose sibling index is less than n
:gt(n): finds elements whose sibling index is greater than n
:eq(n): finds elements whose sibling index equals n
:has(selector): finds elements that contain an element matching the selector
:not(selector): finds elements that do not match the selector
:contains(text): finds elements that contain the given text, including text in their descendants; the match is case insensitive
:containsOwn(text): finds elements whose own text, excluding descendants, contains the given text
:matches(regex): finds elements whose text matches the regular expression
:matchesOwn(regex): finds elements whose own text matches the regular expression
Note: the indexes used by the pseudo selectors above are 0-based. The first element has index 0, the second has index 1, and so on.
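A short sketch of the pseudo selectors on a four-item list follows. It assumes jsoup is on the classpath; the class name PseudoSelectorSketch, the count helper, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PseudoSelectorSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<ul><li>Alpha</li><li>beta <b>Gamma</b></li>"
        + "<li>delta 42</li><li><a href=\"#\">link</a></li></ul>";

    static final Document DOC = Jsoup.parse(HTML);

    // Number of elements matched by a CSS-style selector.
    static int count(String css) {
        return DOC.select(css).size();
    }

    // Text of the first match.
    static String firstText(String css) {
        return DOC.select(css).first().text();
    }

    public static void main(String[] args) {
        System.out.println(count("li:lt(2)"));               // 2: indexes 0 and 1
        System.out.println(count("li:gt(1)"));               // 2: indexes 2 and 3
        System.out.println(firstText("li:eq(0)"));           // Alpha
        System.out.println(count("li:has(a)"));              // 1: the li wrapping the link
        System.out.println(count("li:not(:has(a))"));        // 3
        System.out.println(count("li:contains(gamma)"));     // 1: case insensitive, includes <b>
        System.out.println(count("li:containsOwn(gamma)"));  // 0: "Gamma" belongs to <b>, not <li>
        System.out.println(count("li:matches(\\d+)"));       // 1: "delta 42"
        System.out.println(count("li:matchesOwn(^Alpha$)")); // 1
    }
}
```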
4. Obtain the attributes, text, and HTML of an element.
Get the value of an attribute: Node.attr(String key)
Get the text of an element, including the text of its child elements: Element.text()
Get the HTML: Element.html() or Node.outerHtml()
5. Working with URLs
Element.attr("href") - gets the URL exactly as written in the attribute
Element.attr("abs:href") or Element.absUrl("href") - gets the absolute URL. If the HTML was parsed from a file or a String, you need to supply a base URI, either by passing it to Jsoup.parse(String html, String baseUri) or by calling Node.setBaseUri(String baseUri). Otherwise a relative link resolves to an empty string.
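The effect of the base URI on relative links can be sketched as follows. This assumes jsoup is on the classpath and that the base URI is supplied either through Jsoup.parse(String html, String baseUri) or through Node.setBaseUri(String baseUri); the class name AbsUrlSketch and the sample link are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsUrlSketch {
    // Hypothetical sample markup with a relative link.
    static final String HTML = "<a href=\"/docs/page.html\">Docs</a>";
    static final String BASE = "http://example.com/";

    // Parsed without a base URI: relative links cannot be resolved.
    static String withoutBase() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("a").first().absUrl("href");
    }

    // Parsed with a base URI: relative links resolve against it.
    static String withBase() {
        Document doc = Jsoup.parse(HTML, BASE);
        return doc.select("a").first().attr("abs:href");
    }

    // Base URI set after parsing.
    static String withBaseSetLater() {
        Document doc = Jsoup.parse(HTML);
        doc.setBaseUri(BASE);
        return doc.select("a").first().absUrl("href");
    }

    // The attribute value exactly as written in the markup.
    static String raw() {
        Element link = Jsoup.parse(HTML, BASE).select("a").first();
        return link.attr("href");
    }

    public static void main(String[] args) {
        System.out.println(raw());                  // /docs/page.html
        System.out.println(withoutBase().isEmpty()); // true
        System.out.println(withBase());             // http://example.com/docs/page.html
        System.out.println(withBaseSetLater());
    }
}
```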
6. Examples
li[class=info] a[class=author] - the space denotes a descendant relationship, i.e. an <a> inside an <li>
div[class=mod-main mod-lmain]:contains(Teaching Reflection) - a div that contains the text "Teaching Reflection"; useful when several divs share the same class
/*
previousSibling() returns the node immediately before a tag.
nextSibling() returns the node immediately after a tag.
For example:
<form id=form1>
First place: Lily <br/>
Second place: Tom <br/>
Third place: Peter <br/>
</form>
*/
Elements items = doc.select("form[id=form1]");
Elements prevs = items.select("br");
for (Element p : prevs) {
    // The text node just before each <br>, e.g. "First place: Lily"
    String prevStr = p.previousSibling().toString().trim();
}
/*
The most common case: crawling links
*/
String itemTag = "div[class=mydiv]";
String linkTag = "a";
Elements items = doc.select(itemTag);
Elements links = items.select(linkTag);
for (Element l : links) {
    // Absolute URL, resolved against the document's base URI
    String absHref = l.attr("abs:href");
    // URL as written in the attribute (may be a relative path)
    String href = l.attr("href");
    String text = l.text();
    String title = l.attr("title");
}
7. jsoup online API
http://jsoup.org/apidocs/