Parsing HTML with Jsoup: an introduction to Jsoup and the HTML DOM

Source: Internet
Author: User

Jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. HTMLParser is a similar, widely used parser, but it is no longer maintained: its latest release dates from 2006 (http://sourceforge.net/projects/htmlparser/files/).
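
As a rough illustration of the two entry points mentioned above, the sketch below parses an in-memory string with Jsoup.parse and shows how a URL could be fetched with Jsoup.connect. The class name JsoupParseSketch, the helper names, and the sample markup are made up for this example, and the jsoup library is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;

public class JsoupParseSketch {

    // Parse HTML text that is already in memory.
    static Document parseText(String html) {
        return Jsoup.parse(html);
    }

    // Fetch a page over HTTP and parse the response (needs network access).
    static Document parseUrl(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    // Use a CSS selector to read the href of the first link, or "" if there is none.
    static String firstLinkHref(Document doc) {
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.attr("href");
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, used only for illustration.
        Document doc = parseText("<html><head><title>Demo</title></head>"
                + "<body><a href=\"/first\">First link</a></body></html>");
        System.out.println(doc.title());
        System.out.println(firstLinkHref(doc));
    }
}
```

Both methods return a Document, so the rest of the code does not need to care whether the HTML came from a string or from the network.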

Before parsing HTML with Jsoup, you need a basic understanding of the DOM structure of HTML. Jsoup defines classes such as Node, Element, and Document, so you need to know what each of them represents. The sections below walk through the main classes in Jsoup's org.jsoup.nodes package.

Node

Node is the most basic, abstract node model. Elements, documents, comments, and so on are all instances of Node (through its subclasses). Every component of an HTML document's DOM is a node.

The DOM defines the following:

    • The entire document is a document node
    • Each HTML tag is an element node
    • Text that is contained in an HTML element is a text node
    • Each HTML attribute is an attribute node
    • Each comment is a comment node
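
The node kinds listed above map onto classes in org.jsoup.nodes. The following sketch, with a made-up class name and sample markup, classifies the children of a parsed body by their Jsoup type. One difference from the DOM list is worth noting: in Jsoup, attributes are not Node subclasses and are read through the owning Element instead, for example with attr.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class NodeTypesSketch {

    // Name the kind of node. Document extends Element, so it is checked first.
    static String describe(Node node) {
        if (node instanceof Document) return "document";
        if (node instanceof Element) return "element <" + ((Element) node).tagName() + ">";
        if (node instanceof TextNode) return "text";
        if (node instanceof Comment) return "comment";
        return "other";
    }

    // Describe every direct child node of <body>, in document order.
    static String describeBodyChildren(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();
        for (Node child : doc.body().childNodes()) {
            if (out.length() > 0) out.append(", ");
            out.append(describe(child));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup: one element, one comment, one text node.
        String html = "<p id=\"greeting\">Hello</p><!-- a note -->plain text";
        System.out.println(describeBodyChildren(html));
        // Attributes are reached through the element that owns them.
        System.out.println(Jsoup.parse(html).select("p").first().attr("id"));
    }
}
```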

Node hierarchy

Nodes have hierarchical relationships with each other.

All the nodes in an HTML document make up a document tree (or node tree). Each element, attribute, piece of text, and so on in an HTML document is a node in that tree. The tree starts at the document node and branches out until it reaches the text nodes at the lowest level.

    <title>DOM Tutorial</title>
    <body>
      DOM Lesson One
      <p>Hello world!</p>
    </body>

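All of the nodes above are related to one another.

Each node except the document node has a parent node. For example, the parent node of the <p> element is the <body> element, and the parent node of the text node "Hello world!" is the <p> element.

Most element nodes have child nodes. For example, the <title> element has one child node: the text node "DOM Tutorial".

When nodes share the same parent node, they are siblings (peer nodes). For example, the text "DOM Lesson One" and the <p> element are siblings, because both have the <body> element as their parent.

Nodes can also have descendants, which are all the child nodes of a node, the child nodes of those child nodes, and so on. For example, all text nodes are descendants of the document's root <html> element.

Nodes can also have ancestors. An ancestor is the parent node of a node, the parent node of that parent, and so on. For example, all text nodes have the root <html> element as an ancestor.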
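
The relationships described above can be explored with Jsoup's traversal methods. The sketch below is one possible illustration, not the only way to do it: the class and helper names are invented, and the markup is the tutorial snippet with <html> and <head> wrappers added as an assumption.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class NodeHierarchySketch {

    // The tutorial snippet; the <html> and <head> wrappers are an assumption.
    static final String HTML = "<html><head><title>DOM Tutorial</title></head>"
            + "<body>DOM Lesson One<p>Hello world!</p></body></html>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // Parent: tag name of the element that directly contains the first match.
    static String parentTag(Document doc, String css) {
        return doc.select(css).first().parent().tagName();
    }

    // Children: text of the first child node, if that child is a text node.
    static String firstChildText(Document doc, String css) {
        Node child = doc.select(css).first().childNode(0);
        return child instanceof TextNode ? ((TextNode) child).text() : "";
    }

    // Siblings: text of the node just before the match under the same parent.
    static String previousSiblingText(Document doc, String css) {
        Node sibling = doc.select(css).first().previousSibling();
        return sibling instanceof TextNode ? ((TextNode) sibling).text() : "";
    }

    // Ancestors: tag names from the parent up to the root <html> element.
    static List<String> ancestorTags(Document doc, String css) {
        List<String> tags = new ArrayList<>();
        for (Element ancestor : doc.select(css).first().parents()) {
            tags.add(ancestor.tagName());
        }
        return tags;
    }

    // Descendants: getAllElements() also returns the element itself, so skip it.
    static List<String> descendantTags(Document doc, String css) {
        Element start = doc.select(css).first();
        List<String> tags = new ArrayList<>();
        for (Element e : start.getAllElements()) {
            if (e != start) tags.add(e.tagName());
        }
        return tags;
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(parentTag(doc, "p"));
        System.out.println(firstChildText(doc, "title"));
        System.out.println(previousSiblingText(doc, "p"));
        System.out.println(ancestorTags(doc, "p"));
        System.out.println(descendantTags(doc, "html"));
    }
}
```

Note that parents() and getAllElements() only deal with elements; text nodes are reached through childNode, childNodes, or the sibling methods on Node.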

Element

An HTML element consists of a tag name, attributes, and child nodes (including text nodes and other elements). From an element, you can extract data, traverse the node tree, and manipulate the HTML. The Element class in Jsoup provides several methods for finding the elements you want to work with:

    • getAllElements() returns all elements under this element: the element itself, its child elements, and their child elements in turn
    • getElementById(String id) finds an element by its ID
    • getElementsByAttribute(String key) finds elements that have the given attribute name
    • getElementsByAttributeValue(String key, String value) finds elements that have the given attribute with the given value
    • getElementsByClass(String className) finds elements by CSS class
    • getElementsByTag(String tagName) finds elements by tag name

    <div id="Imgdiv" class="Imgclass"><a href="#"><img src="http://xxxx.xx/xx.jpg"/></a></div>

In the example above, we can use getElementsByTag("img") to get the <img> element.

Similarly, you can use getElementById("Imgdiv") or getElementsByClass("Imgclass") to get the enclosing <div> element, and through it everything the <div> contains.
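
To make the lookups concrete, here is a minimal sketch that runs the finder methods against the <div> snippet, with the <img> tag written out in full. The class name ElementLookupSketch and its helpers are invented for this example, and the placeholder image URL is kept exactly as it appears in the snippet.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ElementLookupSketch {

    static final String HTML = "<div id=\"Imgdiv\" class=\"Imgclass\">"
            + "<a href=\"#\"><img src=\"http://xxxx.xx/xx.jpg\"/></a></div>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // Find the first <img> by tag name and read its src attribute.
    static String imgSrc(Document doc) {
        Elements images = doc.getElementsByTag("img");
        return images.isEmpty() ? "" : images.first().attr("src");
    }

    // Find an element by ID and report its tag name, or "" if nothing matches.
    static String tagOfId(Document doc, String id) {
        Element found = doc.getElementById(id);
        return found == null ? "" : found.tagName();
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(imgSrc(doc));
        System.out.println(tagOfId(doc, "Imgdiv"));
        // Class and attribute lookups return an Elements collection.
        System.out.println(doc.getElementsByClass("Imgclass").size());
        System.out.println(doc.getElementsByAttribute("href").first().tagName());
        System.out.println(doc.getElementsByAttributeValue("href", "#").size());
        // getAllElements() on the <div> counts the <div> itself plus <a> and <img>.
        System.out.println(doc.getElementById("Imgdiv").getAllElements().size());
    }
}
```

getElementById returns a single Element (or null when nothing matches), while the other finders return an Elements collection that may be empty, which is why the helpers check for both cases.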
