Jsoup: A Summary of HTML Parsing Usage

Last Update:2015-06-20 Source: Internet

Author: User

1. Parsing Methods

(1) Parse strings

String html = "<body><p>Parse HTML into a doc.</p></body>";

 Document doc = Jsoup.parse(html);
(2) Parsing from a URL

Use get() for a simple fetch:

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

To send request data, a user agent, a cookie, and a timeout, chain the settings and finish with post():

Document doc = Jsoup.connect("http://example.com")
        .data("query", "Java")
        .userAgent("Mozilla")
        .cookie("auth", "token")
        .timeout(3000)
        .post();

(3) Parsing from a file

File input = new File("/tmp/input.html");

Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
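The file-based call above can be tried without an existing /tmp/input.html. The following is a minimal sketch, assuming jsoup is on the classpath; the class name ParseFileExample, the helper parseTempFile, and the sample markup are made up for illustration. It writes a small page to a temporary file, parses it, and shows that the third argument (the base URI) is what lets relative links be resolved.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFileExample {

    // Hypothetical helper: writes sample HTML to a temp file and parses it.
    static Document parseTempFile() {
        try {
            File input = File.createTempFile("jsoup-example", ".html");
            input.deleteOnExit();
            String html = "<html><head><title>Sample page</title></head>"
                    + "<body><a href=\"/about\">About</a></body></html>";
            Files.write(input.toPath(), html.getBytes(StandardCharsets.UTF_8));
            // The third argument is the base URI used to resolve relative links.
            return Jsoup.parse(input, "UTF-8", "http://example.com/");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseTempFile();
        System.out.println(doc.title());
        // Resolved against the base URI passed to Jsoup.parse.
        System.out.println(doc.select("a").first().absUrl("href"));
    }
}
```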

2. DOM-Based Element Traversal

(1) Searching for elements

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key)
siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
parent(), children(), child(int index)
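As a rough illustration of the search and navigation methods listed above, the sketch below runs them against a small in-memory fragment. It assumes jsoup is on the classpath; the class name TraversalExample, the helper menu(), and the markup are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TraversalExample {

    // Sample markup: one container with three paragraphs.
    static final String HTML =
            "<div id=\"menu\">"
          + "<p class=\"item\">one</p>"
          + "<p class=\"item\" title=\"x\">two</p>"
          + "<p>three</p>"
          + "</div>";

    // Hypothetical helper: parses the sample and returns the container element.
    static Element menu() {
        return Jsoup.parse(HTML).getElementById("menu");
    }

    public static void main(String[] args) {
        Element menu = menu();

        // Searching below an element by tag, class, and attribute name.
        Elements paragraphs = menu.getElementsByTag("p");
        Elements items = menu.getElementsByClass("item");
        Elements titled = menu.getElementsByAttribute("title");
        System.out.println(paragraphs.size() + " " + items.size() + " " + titled.size());

        // Navigating between children, siblings, and the parent.
        Element first = menu.child(0);
        System.out.println(first.nextElementSibling().text());
        System.out.println(first.siblingElements().size());
        System.out.println(first.parent().id());
        System.out.println(menu.children().size());
    }
}
```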

(2) Retrieving Element Data

attr(String key) - gets the value of the attribute named key
attributes() - gets all attributes
id(), className(), classNames()
text() - gets the text content
html() - gets the HTML inside the element
outerHtml() - gets the HTML including the element itself
data() - gets the content of a <script> or <style> tag
tag(), tagName()
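The accessors above can be compared side by side on one element. This is a small sketch, assuming jsoup is on the classpath; the class name ElementDataExample, the helper doc(), and the markup are illustrative only. Note the difference between text() and data() on a script element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementDataExample {

    static final String HTML =
            "<div id=\"box\" class=\"a b\" data-x=\"1\"><p>Hello <b>world</b></p></div>"
          + "<script>var n = 1;</script>";

    // Hypothetical helper: parses the sample markup.
    static Document doc() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = doc();
        Element box = doc.getElementById("box");

        System.out.println(box.attr("data-x"));        // one attribute value
        System.out.println(box.attributes().size());   // number of attributes on the element
        System.out.println(box.id() + " | " + box.className() + " | " + box.classNames());
        System.out.println(box.tagName());
        System.out.println(box.text());                // text of the element and its descendants
        System.out.println(box.html());                // markup inside the element
        System.out.println(box.outerHtml());           // markup including the element itself

        // Script content is exposed through data(), not text().
        Element script = doc.select("script").first();
        System.out.println(script.data());
    }
}
```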

3. Selector Syntax

What sets jsoup apart from other parsers is that you can use jQuery-like selector syntax to find and filter the elements you need.

(1) Basic selectors

tagname: finds elements by tag name
ns|tag: finds elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: finds the element with the given id
.class: finds elements with the given class
[attribute]: finds elements that have the attribute
[^attr]: finds elements that have an attribute whose name starts with attr
[attr=value]: finds elements whose attribute has the given value
[attr^=value], [attr$=value], [attr*=value]: finds elements whose attr value starts with, ends with, or contains value, e.g. [href*=/path/]
[attr~=regex]: finds elements whose attr value matches the regular expression
*: finds all elements

(2) Selector combinations

el#id: tag name and id together
el.class: tag name and class together
el[attr]: tag name and attribute together
Any combination of the three above, e.g. a[href].highlight
ancestor child: descendant, e.g. div.content p finds <p> elements anywhere under <div class="content">
parent > child: direct child, e.g. div.content > p finds <p> elements directly under <div class="content">; div.content > * finds all direct children of <div class="content">
siblingA + siblingB: adjacent sibling, e.g. div.head + div finds the <div> that immediately follows <div class="head">
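To make the basic selectors and the combinations listed so far more concrete, here is a minimal sketch that counts matches against a small fragment. It assumes jsoup is on the classpath; the class name SelectorExample, the helper count(), and the markup are hypothetical.

```java
import org.jsoup.Jsoup;

public class SelectorExample {

    static final String HTML =
            "<div class=\"head\">Head</div>"
          + "<div class=\"content\">"
          + "<p>direct</p>"
          + "<section><p>nested</p></section>"
          + "<a href=\"/path/a.html\" class=\"highlight\">A</a>"
          + "<a name=\"anchor\">B</a>"
          + "</div>";

    // Hypothetical helper: number of elements matched by a CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a",                 // by tag name
            "[^hr]",             // attribute name starts with "hr"
            "[href*=/path/]",    // attribute value contains "/path/"
            "a[href].highlight", // tag + attribute + class
            "div.content p",     // any descendant
            "div.content > p",   // direct child only
            "div.content > *",   // all direct children
            "div.head + div"     // immediately following sibling
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```

Comparing the counts for div.content p and div.content > p shows the difference between the descendant and direct-child forms: only the first also picks up the paragraph nested inside the section element.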

Two further combinations:

siblingA ~ siblingX: general sibling, e.g. h1 ~ p finds <p> elements that follow an <h1> under the same parent
el, el, el: groups several selectors and finds elements that match any one of them

(3) Pseudo selectors (condition selectors)
:lt(n): finds elements whose sibling index is less than n
:gt(n): finds elements whose sibling index is greater than n
:eq(n): finds elements whose sibling index equals n
:has(selector): finds elements that contain at least one element matching the selector
:not(selector): finds elements that do not match the selector
:contains(text): finds elements whose text, including that of their descendants, contains the given text (the search is case-insensitive)
:containsOwn(text): finds elements whose own text directly contains the given text
:matches(regex): finds elements whose text matches the regular expression
:matchesOwn(regex): finds elements whose own text matches the regular expression

Note: the indexes used by the pseudo selectors above are zero-based: the first element has index 0, the second has index 1, and so on.

4. Getting an Element's Attributes, Text, and HTML
 
 
Get an attribute value: Node.attr(String key)
Get an element's text, including that of its child elements: Element.text()
Get HTML: Element.html() or Node.outerHtml()

5. Working with URLs
 
 
Element.attr("href") - gets the URL exactly as written in the attribute
Element.attr("abs:href") or Element.absUrl("href") - gets the absolute URL

If the HTML was parsed from a file or a String, a base URI is needed to resolve relative links: pass one when parsing, e.g. Jsoup.parse(String html, String baseUri), or call Node.setBaseUri(String baseUri). Otherwise the absolute URL comes back as an empty String.

6. Test Examples
 
 
li[class=info] a[class=author] - the space expresses containment: an <a> inside an <li>
div[class=mod-main mod-lmain]:contains(Teaching Reflection) - a div whose text contains "Teaching Reflection"; useful for telling apart several divs that share the same class

/*
   previousSibling() gets the node before a tag.
   nextSibling() gets the node after a tag.
   For example:
   <form id=form1>
   First place: Lily <br/>
   Second place: Tom <br/>
   Third place: Peter <br/>
   </form>
*/
Elements items = doc.select("form[id=form1]");
Elements prevs = items.select("br");
for (Element p : prevs) {
    // The text node immediately before each <br>
    String prevStr = p.previousSibling().toString().trim();
}

/*
   Most common case: crawling links
*/
String itemTag = "div[class=mydiv]";
String linkTag = "a";
Elements items = doc.select(itemTag);
Elements links = items.select(linkTag);
for (Element l : links) {
    String absHref = l.attr("abs:href"); // absolute URL
    String href = l.attr("href");        // URL as written, possibly relative
    String text = l.text();
    String title = l.attr("title");
}

7. jsoup Online API

http://jsoup.org/apidocs/
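As a closing sketch that ties together the pseudo selectors from section 3 and the URL handling from section 5, the example below checks a few queries and compares the absolute URL with and without a base URI. It assumes jsoup is on the classpath; the class name PseudoSelectorExample, its helpers, and the markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PseudoSelectorExample {

    static final String HTML =
            "<ul>"
          + "<li>Apple</li>"
          + "<li>banana <a href=\"/b\">link</a></li>"
          + "<li>Cherry</li>"
          + "</ul>";

    // Parsed with a base URI, so relative links can be resolved.
    static Document withBase() {
        return Jsoup.parse(HTML, "http://example.com/");
    }

    // Parsed without a base URI.
    static Document withoutBase() {
        return Jsoup.parse(HTML);
    }

    // Hypothetical helper: number of elements matched by a CSS query.
    static int count(String cssQuery) {
        return withBase().select(cssQuery).size();
    }

    // Hypothetical helper: absolute URL of the first link in the document.
    static String absHref(Document doc) {
        return doc.select("a").first().absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(withBase().select("li:eq(0)").text()); // index 0 is the first item
        System.out.println(count("li:lt(2)"));             // items with index 0 and 1
        System.out.println(count("li:gt(0)"));             // items with index 1 and 2
        System.out.println(count("li:has(a)"));            // items containing a link
        System.out.println(count("li:not(:has(a))"));      // items without a link
        System.out.println(count("li:contains(BANANA)"));  // text match ignores case
        System.out.println(count("li:containsOwn(link)")); // "link" belongs to <a>, not <li>
        System.out.println(count("li:matchesOwn(^Ch)"));   // regex on the item's own text

        System.out.println(withBase().select("a").first().attr("href")); // as written
        System.out.println(absHref(withBase()));    // resolved against the base URI
        System.out.println(absHref(withoutBase())); // empty: nothing to resolve against
    }
}
```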

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
