Jsoup: a very handy Java library for parsing HTML

Source: Internet
Author: User
Tags: baseuri
http://www.open-open.com/jsoup/parsing-a-document.htm
Parse and traverse an HTML document

How to parse an HTML document:

String html = "

(For more details, see "Parse a document from a String".)
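As a minimal sketch of parsing an HTML string, assuming jsoup is on the classpath: the HTML string and the class name ParseStringExample below are illustrative and do not come from the original article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringExample {
    // Parses an in-memory HTML string and returns the resulting Document.
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        // Illustrative input, made up for this sketch.
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        Document doc = parse(html);

        // title() reads the <title> element; body().text() returns the visible text.
        System.out.println(doc.title());
        System.out.println(doc.body().text());
    }
}
```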

The parser makes every attempt to produce a clean parse from the HTML you provide, whether or not the HTML is well-formed. It handles:

  • unclosed tags (for example, <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
  • implicit tags (for example, a bare <td>Table data</td> is wrapped into a <table><tr><td>...)
  • reliably creating the document structure (an html element containing a head and a body, with only appropriate elements inside the head)

The object model of a document

  • A document consists of Elements and TextNodes (and a few other auxiliary nodes; see the nodes package tree).
  • The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
  • An Element contains a list of child Nodes and has one parent Element. It also provides a filtered list containing only its child Elements.
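The unclosed-tag handling and the node hierarchy described above can be observed with a short sketch like the following. It assumes jsoup is on the classpath; the class name TolerantParseExample and the helper paragraphCount are hypothetical names chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class TolerantParseExample {
    // Counts the <p> elements jsoup builds from the given HTML.
    static int paragraphCount(String html) {
        return Jsoup.parse(html).getElementsByTag("p").size();
    }

    public static void main(String[] args) {
        // Neither <p> is closed, and there is no html, head, or body tag.
        Document doc = Jsoup.parse("<p>Lorem <p>Ipsum");
        System.out.println(paragraphCount("<p>Lorem <p>Ipsum"));

        // head and body exist even though the input never declared them.
        System.out.println(doc.head() != null && doc.body() != null);

        // Document extends Element extends Node, so this assignment compiles.
        Node asNode = doc;
        System.out.println(asNode.nodeName());

        // childNodes() includes text nodes; children() holds child elements only.
        Element firstP = doc.getElementsByTag("p").first();
        for (Node child : firstP.childNodes()) {
            if (child instanceof TextNode) {
                System.out.println("text node: " + ((TextNode) child).text());
            }
        }
        System.out.println(doc.body().children().size());
    }
}
```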
See also: "Data extraction: DOM traversal" and "Data extraction: Selector syntax".
Load a document from a URL

Problem

You need to fetch and parse an HTML document from a web site and find the relevant data in it.

Solution

Use the Jsoup.connect(String url) method:

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
Description

The connect(String url) method creates a new Connection, and get() fetches and parses an HTML file. If an error occurs while fetching the URL, an IOException is thrown, which you should handle appropriately.
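One possible way to handle that IOException is sketched below. It assumes jsoup is on the classpath; the class name FetchTitleExample, the helper fetchTitle, and the choice to return an empty Optional on failure are illustrative decisions, not part of the original article.

```java
import java.io.IOException;
import java.util.Optional;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchTitleExample {
    // Fetches the page at url and returns its title, or empty if the request fails.
    static Optional<String> fetchTitle(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            return Optional.of(doc.title());
        } catch (IOException e) {
            // Network failures, timeouts, and HTTP error statuses all arrive here.
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("http://example.com/").orElse("(no title)"));
    }
}
```

Returning an Optional keeps the failure visible to the caller without forcing every call site to catch a checked exception; rethrowing or logging may suit other programs better.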

The Connection interface is designed for method chaining, so you can build up a specific request step by step:

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

This method only supports web URLs (the http and https protocols). If you need to load from a file, use the parse(File in, String charsetName) method instead.
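To see what the chained calls configure without sending anything over the network, a sketch like the following inspects the request object before it is executed. It assumes jsoup is on the classpath; the class name RequestBuilderExample and the helper buildRequest are hypothetical.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RequestBuilderExample {
    // Builds, but does not send, a request configured through the chained setters.
    static Connection buildRequest(String url) {
        return Jsoup.connect(url)
                .data("query", "Java")     // request parameter
                .userAgent("Mozilla")      // User-Agent header
                .cookie("auth", "token")   // request cookie
                .timeout(3000);            // timeout in milliseconds
    }

    public static void main(String[] args) {
        Connection conn = buildRequest("http://example.com");
        Connection.Request req = conn.request();

        // Each chained call has been recorded on the pending request.
        System.out.println(req.timeout());
        System.out.println(req.cookie("auth"));
        System.out.println(req.header("User-Agent"));

        // Calling conn.get() or conn.post() at this point would send the request.
    }
}
```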
Load a document from a file

Problem

You have an HTML file on the local disk that you want to parse, either to extract data from it or to modify it.

Solution

Use the static Jsoup.parse(File in, String charsetName, String baseUri) method:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Description

The parse(File in, String charsetName, String baseUri) method loads and parses an HTML file. If an error occurs while loading the file, an IOException is thrown, which you should handle appropriately.

The baseUri parameter is used to resolve relative URLs found in the file. If you do not need that, you can pass an empty string.

There is also a method parse(File in, String charsetName), which uses the location of the file as the baseUri. It is useful when you are working with a local copy of a web site on the file system and its relative links also point to the file system.
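The effect of baseUri can be sketched by writing a small file and asking for a link's absolute URL. The example assumes jsoup is on the classpath; the class name ParseFileExample, the helper firstLinkAbsUrl, the temporary file, and the sample markup are all made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFileExample {
    // Writes html to a temporary file, parses it with the given baseUri,
    // and returns the absolute URL of the first link in the document.
    static String firstLinkAbsUrl(String html, String baseUri) {
        File input = null;
        try {
            input = File.createTempFile("jsoup-example", ".html");
            Files.write(input.toPath(), html.getBytes(StandardCharsets.UTF_8));
            Document doc = Jsoup.parse(input, "UTF-8", baseUri);
            // absUrl resolves the relative href against the document's base URI.
            return doc.getElementsByTag("a").first().absUrl("href");
        } catch (IOException e) {
            // Rethrown unchecked to keep this sketch short.
            throw new UncheckedIOException(e);
        } finally {
            if (input != null) {
                input.delete();
            }
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs/page.html\">Docs</a>";
        System.out.println(firstLinkAbsUrl(html, "http://example.com/"));
    }
}
```

With an empty baseUri, absUrl has nothing to resolve a relative href against and returns an empty string, which is why passing a meaningful base matters when the file contains relative links.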
Use DOM methods to traverse a document

Problem

You have an HTML document that you want to extract data from, and you know its structure.

Solution

After parsing the HTML into a Document, you can work with it using DOM-like methods. Sample code:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

// Find the element with id "content", then every link inside it.
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}
Description

Element objects provide a range of DOM-like methods for finding elements and for extracting and manipulating their data, as follows:

Finding elements

  • getElementById(String id)
  • getElementsByTag(String tag)
  • getElementsByClass(String className)
  • getElementsByAttribute(String key) (and related methods)
  • Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
  • Graph: parent(), children(), child(int index)

Element data

  • attr(String key) gets an attribute; attr(String key, String value) sets an attribute
  • attributes() gets all attributes
  • id(), className() and classNames()
  • text() gets the text content; text(String value) sets the text content
  • html() gets the HTML inside the element; html(String value) sets the HTML inside the element
  • outerHtml() gets the HTML of the element including the element itself
  • data() gets the data content (for example, of script and style tags)
  • tag() and tagName()

Manipulating HTML and text

  • append(String html), prepend(String html)
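A few of the methods listed above can be tried together in a sketch like this one. It assumes jsoup is on the classpath; the class name DomMethodsExample, the markup, and the id and class values are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class DomMethodsExample {
    // Illustrative markup; the id and class names are made up for this sketch.
    static final String HTML =
            "<div id=\"content\" class=\"main\">"
          + "<p>First <a href=\"/one\">one</a></p>"
          + "<p>Second <a href=\"/two\">two</a></p>"
          + "</div>";

    static Document parse() {
        return Jsoup.parse(HTML, "http://example.com/");
    }

    public static void main(String[] args) {
        Document doc = parse();

        // Finding elements
        Element content = doc.getElementById("content");
        Elements links = content.getElementsByTag("a");
        Element firstP = content.child(0);
        Element secondP = firstP.nextElementSibling();

        // Element data
        System.out.println(content.className());
        System.out.println(links.first().attr("href"));
        System.out.println(secondP.text());
        System.out.println(firstP.parent().id());

        // Manipulating HTML and text
        content.append("<p>Third</p>");
        links.first().attr("rel", "nofollow");
        System.out.println(content.children().size());
    }
}
```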
