Jsoup: A Summary of HTML Parsing Usage

Source: Internet
Author: User

1. Parsing Methods

(1) Parsing from a string

String html = "<body><p>Parse HTML into a doc.</p></body>";

Document doc = Jsoup.parse(html);

(2) Parsing from a URL

Document doc = Jsoup.connect("http://example.com/").get();

String title = doc.title();

Document doc = Jsoup.connect("http://example.com")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post();

(3) Parsing from a file

File input = new File("/tmp/input.html");

Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
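The string and file variants above can be tried without network access. The following is a minimal sketch, not a definitive implementation: the class name ParseSketch, its helper methods, and the temporary file are assumptions made for illustration, and the jsoup library is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper class, used only for this illustration.
class ParseSketch {

    // Parse an in-memory HTML string and return the text of its first <p>.
    static String firstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").first().text();
    }

    // Write the HTML to a temporary file, parse that file with an explicit
    // charset and base URI, and return the document title.
    static String titleFromFile(String html) {
        try {
            File input = File.createTempFile("jsoup-sketch", ".html");
            try {
                Files.write(input.toPath(), html.getBytes(StandardCharsets.UTF_8));
                Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
                return doc.title();
            } finally {
                input.delete(); // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parse HTML into a doc.</p></body></html>";
        System.out.println(firstParagraph(html)); // Parse HTML into a doc.
        System.out.println(titleFromFile(html));  // First parse
    }
}
```

The base URI passed as the third argument only matters later, when relative links are resolved into absolute URLs.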


2. DOM-Based Element Traversal
(1) Search Elements

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key)
siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
parent(), children(), child(int index)

(2) Retrieving Element Data

attr(String key) - get the value of the attribute named key
attributes() - get all attributes
id(), className(), classNames()
text() - get the text content
html() - get the HTML inside the element
outerHtml() - get the HTML including the element itself
data() - get the content of a <script> or <style> tag
tag(), tagName()
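As a rough illustration of the traversal and data methods listed in this section, here is a minimal sketch. The sample HTML, the class name DomSketch, and its helper methods are hypothetical, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, used only for this illustration.
class DomSketch {

    // Sample page invented for this sketch.
    static final String HTML =
            "<html><head><style>p { color: red; }</style></head><body>"
            + "<div id=\"main\" class=\"box wide\">"
            + "<p>First <b>bold</b></p><p>Second</p>"
            + "</div></body></html>";

    static Document sample() {
        return Jsoup.parse(HTML);
    }

    // Text of the first <p> under the element with id "main", children included.
    static String firstParagraphText(Document doc) {
        Element main = doc.getElementById("main");
        Elements paragraphs = main.getElementsByTag("p");
        return paragraphs.first().text();
    }

    // Text of the element sibling that follows the first <p>.
    static String nextSiblingText(Document doc) {
        Element first = doc.getElementsByTag("p").first();
        return first.nextElementSibling().text();
    }

    // data() returns the raw content of a <script> or <style> element.
    static String styleData(Document doc) {
        return doc.getElementsByTag("style").first().data();
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element main = doc.getElementById("main");
        Element first = main.child(0);

        System.out.println(main.attr("class"));   // raw attribute value
        System.out.println(main.classNames());    // class names as a set
        System.out.println(main.children().size());
        System.out.println(firstParagraphText(doc));
        System.out.println(first.html());         // HTML inside the <p>
        System.out.println(first.outerHtml());    // HTML including the <p> itself
        System.out.println(first.parent().id());
        System.out.println(nextSiblingText(doc));
        System.out.println(styleData(doc));
    }
}
```

Note that text() flattens child elements into plain text, while html() and outerHtml() keep the markup.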
3. Selector syntax (unlike other parsers, jsoup lets you use jQuery-like selector syntax to find and filter the elements you need)
(1) Basic selectors
tagname: finds elements by tag name.
ns|tag: finds elements by tag in a namespace, for example fb|name finds <fb:name> elements.
#id: finds the element with the specified id.
.class: finds elements with the specified class.
[attribute]: finds elements that have the attribute.
[^attr]: finds elements that have an attribute whose name starts with attr.
[attr=value]: finds elements that have the attribute with the specified value.
[attr^=value], [attr$=value], [attr*=value]: finds elements whose attr attribute value starts with, ends with, or contains value, for example [href*=/path/].
[attr~=regex]: finds elements whose attr attribute value matches the regular expression.
*: finds all elements.

(2) Selector combinations
el#id: specifies both the tag name and the id.
el.class: specifies both the tag name and the class.
el[attr]: specifies both the tag name and the attribute name.
Any combination of the three forms above, such as a[href].highlight.
ancestor child: descendant, such as div.content p, which finds <p> elements anywhere inside <div class="content">.
parent > child: direct child, such as div.content > p, which finds <p> elements directly under <div class="content">; div.content > * finds all elements directly under <div class="content">.
siblingA + siblingB: immediately following sibling, such as div.head + div, which finds the <div> that comes directly after <div class="head">.
siblingA ~ siblingX: following siblings, such as h1 ~ p, which finds <p> elements that follow an <h1> at the same level.
el, el, el: combines multiple selectors and finds elements that match any of them.

(3) Pseudo selectors (conditional selectors)
:lt(n): finds elements whose sibling index is less than n.
:gt(n): finds elements whose sibling index is greater than n.
:eq(n): finds elements whose sibling index equals n.
:has(selector): finds elements that contain an element matching the selector.
:not(selector): finds elements that do not match the selector.
:contains(text): finds elements that contain the specified text, in the element or any of its descendants; the search is case insensitive.
:containsOwn(text): finds elements that directly contain the specified text.
:matches(regex): finds elements whose text matches the regular expression.
:matchesOwn(regex): finds elements whose own text matches the regular expression.
Note: the indexes in the pseudo selectors above are zero-based. The first element has index 0, the second has index 1, and so on.

4. Getting the attributes, text, and HTML of an element

Get the attribute value of an element: Node.attr(String key)
Get the text of an element, including its child elements: Element.text()
Get the HTML: Element.html() or Node.outerHtml()

5. URL operations
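The difference between attr("href"), attr("abs:href"), and absUrl("href") can be sketched as follows. The class name UrlSketch, the sample link, and the base URI are assumptions for illustration only, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for this illustration.
class UrlSketch {

    // Sample markup with a relative link, invented for this sketch.
    static final String HTML = "<a href=\"/docs/index.html\">Docs</a>";

    // The href exactly as written in the markup.
    static String rawHref() {
        Element link = Jsoup.parse(HTML).select("a").first();
        return link.attr("href");
    }

    // The href resolved against the given base URI.
    // An empty base URI means jsoup cannot build an absolute URL.
    static String absoluteHref(String baseUri) {
        Document doc = Jsoup.parse(HTML, baseUri);
        Element link = doc.select("a").first();
        return link.absUrl("href"); // equivalent to link.attr("abs:href")
    }

    public static void main(String[] args) {
        System.out.println(rawHref());                           // /docs/index.html
        System.out.println(absoluteHref("http://example.com/")); // http://example.com/docs/index.html
        System.out.println(absoluteHref("").isEmpty());          // true
    }
}
```

Documents fetched with Jsoup.connect(...).get() already carry their own base URI, so the explicit base URI is only needed for strings and files.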

Element.attr("href") - get the URL exactly as written in the attribute.
Element.attr("abs:href") or Element.absUrl("href") - get the absolute URL. If the HTML was parsed from a file or a String, you need to specify the base URI, either by passing it to Jsoup.parse(String html, String baseUri) or by calling Node.setBaseUri(String baseUri) on the document. Otherwise, the absolute URL is returned as an empty String.

6. Test examples
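A minimal sketch that exercises the selector syntax from section 3 against a small document. The sample HTML, the class name SelectorSketch, and the count and firstText helpers are hypothetical, and jsoup is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

// Hypothetical helper class, used only for this illustration.
class SelectorSketch {

    // Sample markup invented for this sketch.
    static final String HTML =
            "<div class=\"content\">"
            + "<h1 id=\"top\">Title</h1>"
            + "<p>Intro with <a href=\"/path/a.html\" class=\"highlight\">a link</a></p>"
            + "<div class=\"inner\"><p>Nested paragraph</p></div>"
            + "<ul><li>one</li><li>two</li><li>three</li></ul>"
            + "</div>"
            + "<p>Outside</p>";

    // Number of elements matched by the CSS query.
    static int count(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery).size();
    }

    // Text of the first element matched by the CSS query.
    static String firstText(String cssQuery) {
        Elements found = Jsoup.parse(HTML).select(cssQuery);
        return found.first().text();
    }

    public static void main(String[] args) {
        System.out.println(count("div.content p"));      // 2: any depth below div.content
        System.out.println(count("div.content > p"));    // 1: direct children only
        System.out.println(count("h1 ~ p"));             // 1: the <p> that is a sibling after <h1>
        System.out.println(count("a[href*=/path/]"));    // 1: href contains /path/
        System.out.println(count("a[href].highlight"));  // 1: tag, attribute, and class combined
        System.out.println(firstText("li:eq(1)"));       // two: indexes are zero-based
        System.out.println(count("li:lt(2)"));           // 2: indexes 0 and 1
        System.out.println(count("p:has(a)"));           // 1: <p> that contains an <a>
        System.out.println(count("p:contains(nested)")); // 1: the match is case insensitive
    }
}
```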

li[class=info] a[class=Author] - a space denotes the descendant relationship, that is, an a inside an li
div[class=Mod-main mod-lmain]:contains(Teaching Reflection) - a div that contains the text "Teaching Reflection"; useful when several divs share the same class

/*
   previousSibling() gets the node before a tag.
   nextSibling() gets the node after a tag.
   For example:
   <form id=form1>
   First place: Lily <br/>
   Second place: Tom <br/>
   Third place: Peter <br/>
   </form>
*/
Elements items = doc.select("form[id=form1]");
Elements prevs = items.select("br");
for (Element p : prevs) {
    // The text node in front of each <br/>, for example "First place: Lily"
    String prevStr = p.previousSibling().toString().trim();
}

/*
   Most common link crawling
*/
String itemTag = "div[class=mydiv]";
String linkTag = "a";
Elements items = doc.select(itemTag);
Elements links = items.select(linkTag);
for (Element l : links) {
    String absHref = l.attr("abs:href"); // absolute URL
    String href = l.attr("href");        // URL as written, possibly relative
    String text = l.text();
    String title = l.attr("title");
}

7. jsoup online API
http://jsoup.org/apidocs/

