Jsoup Getting Started: Parsing and Traversing an HTML Document

Last Update:2014-07-14 Source: Internet

Author: User


Parsing and traversing an HTML document

How to parse an HTML document:

String html = "

(For more details, see "Parsing an HTML string".)
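
As a minimal sketch of parsing an HTML string, assuming jsoup is on the classpath: the sample markup, the class name ParseStringSketch, and the helper titleOf are made up for illustration and are not taken from the original article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringSketch {
    // Parses an HTML string into a Document and returns the text of its <title>.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Sample markup invented for this sketch.
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```

Jsoup.parse(String) treats its argument as a complete document, so the returned Document always has a head and a body to navigate.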

The parser makes every attempt to produce a clean parse from the HTML you provide, whether or not that HTML is well-formed. For example, it handles:

   Unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
   Implicit tags (e.g. a bare <td>Table data</td> is wrapped in <table><tr><td>)
   Creating a reliable document structure (html containing a head and body, with only appropriate elements inside the head)
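
The unclosed-tag case can be checked with a short sketch like the one below. It assumes jsoup is on the classpath; the class name LenientParseSketch and the helper paragraphCount are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseSketch {
    // Counts the <p> elements jsoup builds from a (possibly malformed) fragment.
    static int paragraphCount(String fragment) {
        return Jsoup.parse(fragment).getElementsByTag("p").size();
    }

    // Reports whether the parsed document has both a head and a body element.
    static boolean hasHeadAndBody(String fragment) {
        Document doc = Jsoup.parse(fragment);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        String broken = "<p>Lorem <p>Ipsum";
        // Each unclosed <p> ends up as its own element.
        System.out.println(paragraphCount(broken));
        // head and body are created even though the input never declared them.
        System.out.println(hasHeadAndBody(broken));
        // Print the repaired markup to inspect what the parser built.
        System.out.println(Jsoup.parse(broken).body().html());
    }
}
```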
   
Object model for a document

   A document consists of Elements and TextNodes (plus a few other auxiliary nodes: see the nodes package tree for details).
   The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
   An Element contains a list of child nodes and has one parent Element. It also provides a filtered list of its child elements only.
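
The difference between child nodes and child elements is easiest to see on a small tree. The following is a sketch under assumptions: jsoup is on the classpath, and the markup, the class name ObjectModelSketch, and the helper box are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class ObjectModelSketch {
    // Returns a <div> that holds one text node followed by one <b> element.
    static Element box() {
        Document doc = Jsoup.parse("<div id=\"box\">Hello <b>world</b></div>");
        return doc.getElementById("box");
    }

    public static void main(String[] args) {
        Element div = box();
        // childNodes() lists every node, including text nodes.
        for (Node node : div.childNodes()) {
            System.out.println(node.getClass().getSimpleName() + ": " + node.nodeName());
        }
        // children() keeps only the child elements.
        System.out.println(div.children().size());
        // Every element inside the document has a parent element.
        System.out.println(div.parent().tagName());
    }
}
```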
   
 
Data extraction
You have an HTML document that you want to extract data from, and you know its general structure. Once the HTML has been parsed into a Document, you can navigate it with DOM-like methods.
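import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

/**
 * Get HTML elements
 * @author Bling
 * @throws IOException
 * @create date: 2014-07-13
 */
@Test
public void getDataElement() throws IOException {
    // Parse a local file; the base URI is used to resolve relative links.
    File input = new File("tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    // Find the element with id "content", then every link inside it.
    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        System.out.println("linkHref:" + linkHref + "------" + "linkText:" + linkText);
    }
}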
Elements provide a range of DOM-like methods for finding elements and for extracting and manipulating their data. The DOM getters are contextual: called on a parent Document, they find matching elements under the document; called on a child element, they find matching elements under that child. In this way you can narrow down to the data you want.
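
To illustrate the contextual lookup, here is a small sketch that runs the same getter on the document and on one child element. It assumes jsoup is on the classpath; the markup, the class name ContextualLookupSketch, and both helper methods are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ContextualLookupSketch {
    // Sample markup invented for this sketch: one link in "nav", two in "content".
    static final String HTML =
            "<div id=\"nav\"><a href=\"/home\">Home</a></div>"
          + "<div id=\"content\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div>";

    // Called on the document: matches every <a> in the page.
    static int linksInDocument() {
        return Jsoup.parse(HTML).getElementsByTag("a").size();
    }

    // Called on a child element: matches only the <a> elements under it.
    static int linksInContent() {
        Element content = Jsoup.parse(HTML).getElementById("content");
        return content.getElementsByTag("a").size();
    }

    public static void main(String[] args) {
        System.out.println(linksInDocument()); // prints 3
        System.out.println(linksInContent());  // prints 2
    }
}
```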
 
  
   
Ways to get elements

   getElementById(String id)
   getElementsByTag(String tag)
   getElementsByClass(String className)
   getElementsByAttribute(String key) (and related methods)
   Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
   Graph: parent(), children(), child(int index)
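
A few of these finders can be combined as in the sketch below. It assumes jsoup is on the classpath; the list markup, the data-role attribute, the class name FindElementsSketch, and the helper current are all invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FindElementsSketch {
    // Sample markup invented for this sketch.
    static final String HTML =
            "<ul id=\"menu\">"
          + "<li class=\"item\">One</li>"
          + "<li class=\"item\" data-role=\"current\">Two</li>"
          + "<li class=\"item\">Three</li>"
          + "</ul>";

    // Returns the first element that carries a data-role attribute (the middle <li>).
    static Element current() {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsByAttribute("data-role").first();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(doc.getElementsByClass("item").size()); // prints 3

        Element two = current();
        // Sibling navigation relative to the middle item.
        System.out.println(two.previousElementSibling().text()); // prints One
        System.out.println(two.nextElementSibling().text());     // prints Three
        System.out.println(two.siblingElements().size());        // prints 2

        // Walking up to the parent and back down by index.
        Element menu = two.parent();
        System.out.println(menu.id());            // prints menu
        System.out.println(menu.child(0).text()); // prints One
    }
}
```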
   
  
  
   
Methods for obtaining element data

   attr(String key) to get and attr(String key, String value) to set attributes
   attributes() to get all attributes
   id(), className() and classNames()
   text() to get and text(String value) to set the text content
   html() to get and html(String value) to set the inner HTML content
   outerHtml() to get the outer HTML value
   data() to get data content (e.g. of script and style tags)
   tag() and tagName()
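
The getters above might be exercised as follows. This is a sketch, assuming jsoup is on the classpath; the markup, the class name ElementDataSketch, and the helpers link and scriptData are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementDataSketch {
    // Sample markup invented for this sketch.
    static final String HTML =
            "<a id=\"docs\" class=\"ext big\" href=\"/docs\"><b>Read</b> the docs</a>"
          + "<script>var n = 1;</script>";

    // Returns the <a> element from the sample markup.
    static Element link() {
        return Jsoup.parse(HTML).getElementById("docs");
    }

    // Returns the raw content of the <script> element.
    static String scriptData() {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsByTag("script").first().data();
    }

    public static void main(String[] args) {
        Element a = link();
        System.out.println(a.attr("href"));   // attribute value
        System.out.println(a.id());           // id attribute
        System.out.println(a.classNames());   // class names as a set
        System.out.println(a.text());         // text with tags stripped
        System.out.println(a.html());         // inner HTML
        System.out.println(a.outerHtml());    // element including its own tag
        System.out.println(a.tagName());      // tag name
        System.out.println(scriptData());     // script content comes from data(), not text()

        // attr(key, value) sets an attribute and returns the element.
        a.attr("href", "/manual");
        System.out.println(a.attr("href"));
    }
}
```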
   
  
  
   
Methods for manipulating HTML and text

   append(String html), prepend(String html)
   appendText(String text), prependText(String text)
   appendElement(String tagName), prependElement(String tagName)
   html(String value)
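
As a rough sketch of how these mutators behave, the example below edits a small element in place. It assumes jsoup is on the classpath; the markup, the class name ManipulateSketch, and the helpers build and replaced are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ManipulateSketch {
    // Builds a <div> and adds content before and after its original child.
    static Element build() {
        Document doc = Jsoup.parse("<div id=\"box\"><p>Middle</p></div>");
        Element box = doc.getElementById("box");
        box.prepend("<p>First</p>");              // parsed HTML inserted at the start
        box.append("<p>Last</p>");                // parsed HTML inserted at the end
        box.appendElement("span").text("tail");   // new empty element, then its text is set
        box.prependText("intro ");                // plain text node inserted at the start
        return box;
    }

    // Replaces everything inside the element with new inner HTML.
    static Element replaced() {
        return build().html("<em>replaced</em>");
    }

    public static void main(String[] args) {
        System.out.println(build().outerHtml());
        System.out.println(replaced().outerHtml());
    }
}
```

The mutators return an Element, which is why calls such as appendElement("span").text("tail") can be chained: appendElement returns the newly created child rather than the parent.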
   
  
  
   
See also: Data extraction: selector syntax.
   
 
GitHub example code: https://github.com/Java-Group-Bling/Jsoup-learn

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
