Jsoup: a very handy Java library for parsing HTML

Last Update:2018-07-26 Source: Internet

Author: User

Tags baseuri


http://www.open-open.com/jsoup/parsing-a-document.htm
Parse and traverse an HTML document

How to parse an HTML document:

String html = " 
(For more details, see the section on parsing an HTML string.)
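As a minimal sketch of parsing an HTML string, assuming jsoup is on the classpath: the class name ParseStringExample, the helper titleOf, and the sample markup are illustrative and do not come from the original article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseStringExample {
    // Parses an HTML string and returns the text of its <title> element.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Illustrative markup only (an assumption, not the article's input).
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```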

The parser makes every attempt to produce a clean parse from the HTML you provide, whether or not the HTML is well-formed. For example, it handles:

- unclosed tags (for example, <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
- implicit tags (for example, a bare <td>Table data</td> is wrapped into a <table><tr><td> structure)
- reliable creation of the document structure (an html element containing a head and a body, with only appropriate elements inside the head)

The object model of a document:

- A document consists of Elements and TextNodes (and a few other auxiliary nodes; see the nodes package tree).
- The inheritance chain is: Document extends Element, which extends Node. TextNode extends Node.
- An Element contains a list of child Nodes and has one parent Element. It also provides a filtered list containing only its child Elements.
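The following sketch illustrates the lenient parsing and the object model described above, assuming jsoup is on the classpath. The class name LenientParseExample and the helper paragraphCount are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class LenientParseExample {
    // Counts <p> elements after parsing; the parser closes unclosed tags.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).getElementsByTag("p").size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<p>Lorem <p>Ipsum");

        // head and body are created even though the input had neither.
        System.out.println(doc.head() != null && doc.body() != null);

        for (Element p : doc.getElementsByTag("p")) {
            // The first child node of each paragraph is its text.
            Node first = p.childNode(0);
            System.out.println(p.tagName() + ": " + (first instanceof TextNode) + ", " + p.text());
        }

        // Document is an Element, and an Element is a Node.
        Element asElement = doc;
        Node asNode = asElement;
        System.out.println(asNode.nodeName());
    }
}
```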
See also: Data extraction: DOM traversal; Data extraction: Selector syntax
Load a document from a URL

Problem

You need to fetch and parse an HTML document from a website and find the relevant data in it.

Solution

Use the Jsoup.connect(String url) method:

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
 
Description 
The connect(String url) method creates a new Connection, and get() fetches and parses an HTML file. If an error occurs while fetching the HTML from the URL, an IOException is thrown, which you should handle appropriately.
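One possible way to handle that IOException is sketched here, assuming jsoup is on the classpath. The class name FetchTitleExample and the helper fetchTitle are hypothetical, and returning null on failure is only one of several reasonable policies.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchTitleExample {
    // Returns the page title, or null if the page could not be fetched.
    static String fetchTitle(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            return doc.title();
        } catch (IOException e) {
            // Covers connection failures, timeouts, and HTTP error statuses.
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("http://example.com/"));
    }
}
```

Callers that need to distinguish failure causes could instead let the exception propagate and catch it at a higher level.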

The Connection interface is designed for method chaining, so you can build up a specific request, as follows:

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();
 
This method only supports web URLs (the http and https protocols); if you need to load from a file, use the parse(File in, String charsetName) method instead.
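To see what such a chain configures without sending anything over the network, the request can be built and then inspected. This is a sketch assuming jsoup is on the classpath; the class name RequestConfigExample and the helper buildRequest are hypothetical.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class RequestConfigExample {
    // Builds a request without sending it, so its settings can be inspected.
    static Connection buildRequest(String url) {
        return Jsoup.connect(url)
                .data("query", "Java")      // request data parameter
                .userAgent("Mozilla")       // sets the User-Agent header
                .cookie("auth", "token")    // cookie sent with the request
                .timeout(3000);             // timeout in milliseconds
    }

    public static void main(String[] args) {
        Connection.Request request = buildRequest("http://example.com").request();
        System.out.println(request.header("User-Agent"));
        System.out.println(request.cookie("auth"));
        System.out.println(request.timeout());
    }
}
```

Nothing is sent until get() or post() is called on the Connection.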
Load a document from a file

Problem

You have an HTML file on the local disk that you need to parse, either to extract data from it or to modify it.

Solution

Use the static Jsoup.parse(File in, String charsetName, String baseUri) method:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
 
Description 
The parse(File in, String charsetName, String baseUri) method loads and parses an HTML file. If an error occurs while loading the file, an IOException is thrown, which you should handle appropriately.
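A self-contained sketch of loading a file with error handling follows, assuming jsoup is on the classpath. It writes its own temporary file so that it does not depend on /tmp/input.html existing; the class name ParseFileExample, the helpers titleOfFile and roundTrip, and the sample markup are all hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseFileExample {
    // Loads and parses a file; returns its title, or null if it cannot be read.
    static String titleOfFile(File input) {
        try {
            Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
            return doc.title();
        } catch (IOException e) {
            System.err.println("Could not read " + input + ": " + e.getMessage());
            return null;
        }
    }

    // Writes a small sample page to a temporary file and parses it back.
    static String roundTrip() {
        try {
            Path tmp = Files.createTempFile("input", ".html");
            try {
                String html = "<html><head><title>Local page</title></head><body></body></html>";
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return titleOfFile(tmp.toFile());
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}
```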

The baseUri parameter is used to resolve relative URLs found in the file. If you do not need this, you can pass in an empty string.

There is also a parse(File in, String charsetName) method, which uses the location of the file as the baseUri. This is useful when you are working with a site stored on the local file system and its relative links also point into the file system.
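The effect of the base URI can be sketched with an in-memory document, assuming jsoup is on the classpath. The class name BaseUriExample, the helper firstAbsoluteLink, and the sample link are hypothetical; the example uses Jsoup.parse(String html, String baseUri) rather than the file variant so that it needs no file on disk.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriExample {
    // Resolves the href of the first link against the given base URI.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.getElementsByTag("a").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about.html\">About</a>";
        System.out.println(firstAbsoluteLink(html, "http://example.com/"));
        // With an empty base URI, a relative link cannot be made absolute,
        // so absUrl returns an empty string.
        System.out.println(firstAbsoluteLink(html, "").isEmpty());
    }
}
```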
Use DOM methods to navigate a document

Problem

You have an HTML document that you want to extract data from, and you know the structure of the document.

Solution

After parsing the HTML into a Document, you can work with it using DOM-like methods. Sample code:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

// Find the element with id "content", then every link inside it.
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}
 
Description 
Element objects provide a range of DOM-like methods for finding elements and for extracting and manipulating their data, as follows:

Finding elements

- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key) (and related methods)
- Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
- Graph: parent(), children(), child(int index)

Element data

- attr(String key) gets an attribute, and attr(String key, String value) sets one
- attributes() gets all attributes
- id(), className() and classNames()
- text() gets the text content, and text(String value) sets it
- html() gets the HTML inside the element, and html(String value) sets it
- outerHtml() gets the HTML of the element including its own tag
- data() gets data content (for example, of script and style tags)
- tag() and tagName()

Manipulating HTML and text

- append(String html), prepend(String html)
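A short sketch that exercises several of the methods listed above, assuming jsoup is on the classpath. The class name DomMethodsExample, its helper methods, and the sample markup are hypothetical and chosen only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DomMethodsExample {
    // Illustrative markup only (an assumption, not from the article).
    static final String HTML =
            "<div id=\"content\">"
          + "<a class=\"nav\" href=\"/one\">One</a>"
          + "<a class=\"nav\" href=\"/two\">Two</a>"
          + "</div>";

    static Document parse() {
        return Jsoup.parse(HTML, "http://example.com/");
    }

    // Finding elements: by id, then by tag within that element.
    static int linkCount() {
        Element content = parse().getElementById("content");
        return content.getElementsByTag("a").size();
    }

    // Siblings and element data: move to the next sibling and read an attribute.
    static String secondLinkHref() {
        Element first = parse().getElementsByClass("nav").first();
        return first.nextElementSibling().attr("href");
    }

    // Manipulating HTML: add a child at the start and at the end.
    static String firstAndLastChildText() {
        Element content = parse().getElementById("content");
        content.prepend("<span>Start</span>");
        content.append("<span>End</span>");
        return content.children().first().text() + "/" + content.children().last().text();
    }

    public static void main(String[] args) {
        System.out.println(linkCount());
        System.out.println(secondLinkHref());
        System.out.println(firstAndLastChildText());
    }
}
```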
