Getting started with jsoup: parsing and traversing an HTML document

Source: Internet
Author: User

Parsing and traversing an HTML document

How to parse an HTML document:

String html = "

(For more details, see "Parsing an HTML string".)
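
As a minimal sketch of parsing an HTML string, assuming the standard Jsoup.parse(String) entry point: the markup, the class name ParseStringExample and the helper titleOf are illustrative and do not come from the original article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseStringExample {
    // Parses an HTML string and returns the text of its <title> element.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Illustrative markup, not taken from the original article.
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```

Jsoup.parse(String) returns a Document, which is the starting point for every lookup described in the rest of this article.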

The parser makes every attempt to produce a clean parse result from the HTML you provide, regardless of whether that HTML is well-formed. For example, it handles:

    • Unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
    • Implicit tags (e.g. a bare <td>Table data</td> is wrapped in <table><tr><td>...)
    • Building a reliable document structure (the html element contains a head and a body, and only appropriate elements appear in the head)
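
The following sketch shows this leniency on a few malformed inputs. The class name LenientParseExample and the helper count are hypothetical, and the table input is my own example of implied tags rather than the one from the list above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LenientParseExample {
    // Number of elements matching a CSS query after parsing possibly malformed HTML.
    static int count(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String unclosed = "<p>Lorem <p>Ipsum";
        // Two <p> tags that are never closed still become two separate paragraphs.
        System.out.println(count(unclosed, "p"));
        // <html>, <head> and <body> are created even though the input has none.
        System.out.println(count(unclosed, "html > head"));
        System.out.println(count(unclosed, "html > body > p"));
        // A <td> placed directly inside <table> gets its implied <tbody> and <tr>.
        System.out.println(count("<table><td>Table data</td></table>",
                "table > tbody > tr > td"));
    }
}
```
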

Object model for a document

    • A document consists of Elements and TextNodes (plus a few other auxiliary node types: see the nodes package tree).
    • The inheritance chain is: Document extends Element, which extends Node. TextNode also extends Node.
    • An Element contains a list of child Nodes and has one parent Element. It also provides a filtered list containing only its child Elements.
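
To make the difference between child nodes and child elements concrete, here is a small sketch. The class ObjectModelExample, the helper describe and the sample markup are assumptions for illustration, and the helper assumes that an element with the given id exists.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class ObjectModelExample {
    // Summarises the parent, child nodes and child elements of the element with the given id.
    // Assumes the id is present; getElementById returns null otherwise.
    static String describe(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element el = doc.getElementById(id);
        int textNodes = 0;
        for (Node child : el.childNodes()) {      // all children: elements and text
            if (child instanceof TextNode) {
                textNodes++;
            }
        }
        return "parent=" + el.parent().tagName()
                + " nodes=" + el.childNodeSize()
                + " elements=" + el.children().size()   // element children only
                + " textNodes=" + textNodes;
    }

    public static void main(String[] args) {
        System.out.println(describe("<div id=box>Hello <b>bold</b> world</div>", "box"));
    }
}
```

Here childNodes() sees the two text runs as well as the <b> element, while children() returns only the <b> element.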

Data extraction

You have an HTML document that you want to extract data from, and you know its general structure. Once the HTML has been parsed into a Document, you can navigate it with DOM-like methods.

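    import java.io.File;
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.junit.Test;

    /**
     * Get HTML elements
     * @author Bling
     * @throws IOException
     * @create date: 2014-07-13
     */
    @Test
    public void getDataElement() throws IOException {
        // Parse a local file; the base URI is used to resolve relative links.
        File input = new File("tmp/input.html");
        Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

        // Find the element with id "content", then every <a> inside it.
        Element content = doc.getElementById("content");
        Elements links = content.getElementsByTag("a");
        for (Element link : links) {
            String linkHref = link.attr("href");
            String linkText = link.text();
            System.out.println("linkHref:" + linkHref + "------" + "linkText:" + linkText);
        }
    }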

Elements provide a range of DOM-like methods for finding elements and for extracting and manipulating their data. The DOM getters are contextual: called on a parent Document, they find matching elements anywhere in the document; called on a child element, they find only the elements under that child. This lets you narrow in on the data you want.
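
A short sketch of this contextual behaviour, using the same getter on the document and on a child element. The class ContextualGetterExample, the helper countLinks and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ContextualGetterExample {
    // Returns {links in the whole document, links under the element with the given id}.
    static int[] countLinks(String html, String id) {
        Document doc = Jsoup.parse(html);
        int inDocument = doc.getElementsByTag("a").size();
        Element scope = doc.getElementById(id);
        // getElementById returns null when no element has that id.
        int inScope = (scope == null) ? 0 : scope.getElementsByTag("a").size();
        return new int[] { inDocument, inScope };
    }

    public static void main(String[] args) {
        String html = "<a href=/home>Home</a>"
                + "<div id=content><a href=/one>One</a><a href=/two>Two</a></div>";
        int[] counts = countLinks(html, "content");
        System.out.println(counts[0] + " links in document, " + counts[1] + " in #content");
    }
}
```
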

    • Methods for finding elements
    1. getElementById(String id)
    2. getElementsByTag(String tag)
    3. getElementsByClass(String className)
    4. getElementsByAttribute(String key) (and related methods)
    5. Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
    6. Graph: parent(), children(), child(int index)
    • Methods for obtaining element data
    1. attr(String key) to get and attr(String key, String value) to set attributes
    2. attributes() to get all attributes
    3. id(), className() and classNames()
    4. text() to get and text(String value) to set the text content
    5. html() to get and html(String value) to set the inner HTML content
    6. outerHtml() to get the outer HTML value
    7. data() to get data content (e.g. of script and style tags)
    8. tag() and tagName()
    • Methods for manipulating HTML and text
    1. append(String html), prepend(String html)
    2. appendText(String text), prependText(String text)
    3. appendElement(String tagName), prependElement(String tagName)
    4. html(String value)
    • For selector-based data extraction, see the selector syntax reference
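
A few of the methods listed above can be combined as in the following sketch. The class ElementDataExample, its two helpers and the sample markup are illustrative assumptions. The helpers assume the input contains at least one <a> (or <script>) element, since first() returns null on an empty match.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDataExample {
    // Reads the first link, rewrites it, appends a <p> after it, and reports the result.
    static String editFirstLink(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.getElementsByTag("a").first();
        String before = link.attr("href") + "|" + link.text();   // getters

        link.attr("href", "/two");                               // set an attribute
        link.text("Two");                                        // set the text content
        Element parent = link.parent();
        parent.appendElement("p").text("Added");                 // add a new child element

        return before + " -> " + link.attr("href") + "|" + link.text()
                + "|" + parent.children().size()
                + "|" + link.nextElementSibling().tagName();
    }

    // Script content is exposed through data(), not text().
    static String firstScriptData(String html) {
        return Jsoup.parse(html).getElementsByTag("script").first().data();
    }

    public static void main(String[] args) {
        System.out.println(editFirstLink("<div id=content><a href=\"/one\">One</a></div>"));
        System.out.println(firstScriptData("<script>var x = 1;</script>"));
    }
}
```

The setters return the element itself, so calls such as appendElement("p").text("Added") can be chained.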

Example code on GitHub: https://github.com/Java-Group-Bling/Jsoup-learn
