Parsing HTML with jsoup: examples and Document methods in detail (Java)

Source: Internet
Author: User

Parse and traverse an HTML document

How to parse an HTML document:

The code is as follows:

String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

The parser makes every attempt to produce a clean parse from the HTML you provide, whether or not that HTML is well-formed. For example, it handles:

1. unclosed tags (for example, <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
2. implicit tags (for example, a bare <td>Table data</td> is wrapped in <table><tr><td>)
3. reliably creating the document structure (the html element contains a head and a body, and only appropriate elements appear in the head)
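
To make the lenient parsing described above concrete, here is a minimal sketch. The class name LenientParseDemo and its helper methods are made up for illustration; only Jsoup.parse and select are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseDemo {
    // Counts the <p> elements jsoup builds from the given (possibly malformed) HTML.
    static int countParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    // Checks that the parsed tree has exactly one head and one body under html,
    // even when the input contained neither.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("html > head").size() == 1
            && doc.select("html > body").size() == 1;
    }

    public static void main(String[] args) {
        String messy = "<p>Lorem <p>Ipsum"; // neither paragraph is closed
        System.out.println(countParagraphs(messy)); // 2
        System.out.println(hasHeadAndBody(messy));  // true
    }
}
```

The input never closes its paragraphs and has no html, head, or body tags, yet the resulting Document has the full structure.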

Object model for a document

1. A document consists of Elements and TextNodes (plus a few other auxiliary node types).
2. The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
3. An Element contains a list of child nodes and has one parent Element. It also provides a filtered list containing only its child Elements.
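
The difference between child nodes and child elements can be sketched as follows. The class ObjectModelDemo and the sample markup are illustrative assumptions, not part of the original article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ObjectModelDemo {
    // A div holding a text node, a p, a comment, and another p.
    static final String HTML =
        "<div id=\"box\">text<p>a</p><!-- note --><p>b</p></div>";

    static Element box() {
        Document doc = Jsoup.parse(HTML);
        Node asNode = doc; // compiles because Document extends Element extends Node
        return doc.getElementById("box");
    }

    // All child nodes: text, element, comment, element.
    static int nodeCount() { return box().childNodes().size(); }

    // Only the child elements: the two p elements.
    static int elementCount() { return box().children().size(); }

    static boolean firstChildIsText() { return box().childNode(0) instanceof TextNode; }

    static String parentTag() { return box().parent().tagName(); }

    public static void main(String[] args) {
        System.out.println(nodeCount());        // 4
        System.out.println(elementCount());     // 2
        System.out.println(firstChildIsText()); // true
        System.out.println(parentTag());        // body
    }
}
```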

Loading a document from a URL

Problem
You need to fetch and parse an HTML document from a website and find data within it.

Solution
Use the Jsoup.connect(String url) method:

The code is as follows:

Document doc = Jsoup.connect("http://www.jb51.net/").get();
String title = doc.title();

Description
The connect(String url) method creates a new Connection, and get() fetches and parses an HTML file. If an error occurs while fetching the URL, an IOException is thrown, which you should handle appropriately.

The Connection interface is designed for method chaining, so you can build up a specific request:

The code is as follows:

Document doc = Jsoup.connect("http://www.jb51.net")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post();

This method only supports web URLs (the http and https protocols). To load a document from a file, use Jsoup.parse(File in, String charsetName) instead.

Load a document from a file

Problem
There is an HTML file on the local hard disk that needs to be parsed to extract data from it or modify it.

Solution
Use the static Jsoup.parse(File in, String charsetName, String baseUri) method:

The code is as follows:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");

Description
The parse(File in, String charsetName, String baseUri) method loads and parses an HTML file. If an error occurs while loading the file, an IOException is thrown, which you should handle appropriately.
The baseUri parameter is used to resolve relative URLs found in the file. If you do not need that, you can pass in an empty string.
There is also a parse(File in, String charsetName) method, which uses the file's location as the baseUri. It is useful when you are working on a local copy of a site on the file system and its relative links also point to the file system.
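
The effect of baseUri is easiest to see with a relative link. The sketch below writes a small temporary file and parses it. The class FileParseDemo, the temp-file approach, and the sample link are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class FileParseDemo {
    // Writes html to a temp file, parses it with the given base URI,
    // and returns the first link's href resolved to an absolute URL.
    static String resolveFirstLink(String html, String baseUri) {
        File input = null;
        try {
            input = File.createTempFile("jsoup-demo", ".html");
            Files.write(input.toPath(), html.getBytes(StandardCharsets.UTF_8));
            Document doc = Jsoup.parse(input, "UTF-8", baseUri);
            Element link = doc.select("a[href]").first();
            return link.attr("abs:href"); // the abs: prefix resolves against baseUri
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (input != null) input.delete();
        }
    }

    public static void main(String[] args) {
        String html = "<a href='/docs/index.html'>Docs</a>";
        // prints http://www.jb51.net/docs/index.html
        System.out.println(resolveFirstLink(html, "http://www.jb51.net/"));
    }
}
```

A plain attr("href") call on the same link would still return the relative value /docs/index.html. Only the abs: prefix triggers resolution against the base URI.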


Using the DOM method to traverse a document

Problem
You have an HTML document that you want to extract data from, and you know its general structure.

Solution
After parsing the HTML into a Document, you can work with it using DOM-like methods. Sample code:

The code is as follows:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

Description
Element objects provide a range of DOM-like methods for finding elements and for extracting and manipulating their data. Specifically:

Find elements
getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)

Element data
attr(String key) gets an attribute; attr(String key, String value) sets it
attributes() gets all attributes
id(), className() and classNames()
text() gets the text content; text(String value) sets it
html() gets the element's inner HTML; html(String value) sets it
outerHtml() gets the element's outer HTML (the element itself plus its contents)
data() gets data content (for example, of script and style tags)
tag() and tagName()

Manipulating HTML and text
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
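
A short sketch that exercises a few of the finder, data, and manipulation methods listed above on an in-memory document. The class DomMethodsDemo and the sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class DomMethodsDemo {
    static final String HTML =
        "<div id=\"content\"><a href=\"/one\">One</a> <a href=\"/two\">Two</a></div>";

    // Finds the container by ID, then collects the href of every link inside it.
    static List<String> hrefs() {
        Document doc = Jsoup.parse(HTML);
        Element content = doc.getElementById("content");
        List<String> out = new ArrayList<>();
        for (Element link : content.getElementsByTag("a")) {
            out.add(link.attr("href"));
        }
        return out;
    }

    // Appends a paragraph and sets an attribute, then summarizes the result.
    static String afterEdit() {
        Document doc = Jsoup.parse(HTML);
        Element content = doc.getElementById("content");
        content.appendElement("p").text("Added"); // appendElement returns the new child
        content.attr("title", "links");
        return content.children().last().text()
            + "|" + content.attr("title")
            + "|" + content.children().size();
    }

    public static void main(String[] args) {
        System.out.println(hrefs());     // [/one, /two]
        System.out.println(afterEdit()); // Added|links|3
    }
}
```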


Use selector syntax to find elements
Problem
You want to find or manipulate elements using a CSS-like or jQuery-like selector syntax.

Solution
Use the Element.select(String selector) and Elements.select(String selector) methods:

The code is as follows:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");

Elements links = doc.select("a[href]"); // a elements with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // img elements whose src ends in .png

Element masthead = doc.select("div.masthead").first(); // first div with class="masthead"

Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of an h3 with class="r"

Description
jsoup elements support a CSS-like (or jQuery-like) selector syntax for finding matching elements, which allows very powerful and flexible queries.
The select method is available on Document, Element, and Elements objects. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.
select returns an Elements collection, which provides a range of methods to extract and manipulate the results.

Selector overview
tagname: find elements by tag, for example: a
ns|tag: find elements by tag in a namespace, for example: fb|name finds <fb:name> elements
#id: find elements by ID, for example: #logo
.class: find elements by class name, for example: .masthead
[attribute]: find elements that have the attribute, for example: [href]
[^attr]: find elements by attribute name prefix, for example: [^data-] finds elements with HTML5 dataset attributes
[attr=value]: find elements by attribute value, for example: [width=500]
[attr^=value], [attr$=value], [attr*=value]: find elements whose attribute value starts with, ends with, or contains the value, for example: [href*=/path/]
[attr~=regex]: find elements whose attribute value matches the regular expression, for example: img[src~=(?i)\.(png|jpe?g)]
*: matches all elements

Selector combinations
el#id: element with ID, for example: div#logo
el.class: element with class, for example: div.masthead
el[attr]: element with attribute, for example: a[href]
Any combination, for example: a[href].highlight
ancestor child: elements that descend from an ancestor, for example: .body p finds p elements anywhere under an element with class "body"
parent > child: elements that are direct children of a parent, for example: div.content > p finds p elements directly under div.content, and body > * finds all direct children of the body tag
siblingA + siblingB: finds sibling B immediately preceded by sibling A, for example: div.head + div
siblingA ~ siblingX: finds sibling X elements preceded by sibling A, for example: h1 ~ p
el, el, el: group multiple selectors; finds the unique elements that match any of them, for example: div.masthead, div.logo

Pseudo selectors
:lt(n): find elements whose sibling index (their position in the DOM tree relative to their parent) is less than n, for example: td:lt(3) finds the td elements in the first three columns
:gt(n): find elements whose sibling index is greater than n, for example: div p:gt(2) finds p elements inside a div whose sibling index is greater than 2
:eq(n): find elements whose sibling index equals n, for example: form input:eq(1) finds input elements inside a form whose sibling index is 1
:has(selector): find elements that contain elements matching the selector, for example: div:has(p) finds divs that contain a p element
:not(selector): find elements that do not match the selector, for example: div:not(.logo) finds all divs that do not have class "logo"
:contains(text): find elements that contain the given text; the search is case-insensitive, for example: p:contains(jsoup)
:containsOwn(text): find elements that directly contain the given text
:matches(regex): find elements whose text matches the regular expression, for example: div:matches((?i)login)
:matchesOwn(regex): find elements whose own text matches the regular expression
Note: the pseudo selector indexes above are 0-based, so the first element has index 0, the second has index 1, and so on.
See the Selector API reference for more detail.
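
To check how a few of these selectors behave, the sketch below counts matches against a small in-memory document. The class SelectorDemo and the sample markup are assumptions for illustration. The counts in the comments hold for this markup only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    static final String HTML =
        "<div class=\"masthead\"><h1>Jsoup news</h1><p>About jsoup</p></div>"
        + "<div class=\"logo\"><img src=\"a.png\"><img src=\"b.jpg\"></div>"
        + "<table><tr><td>0</td><td>1</td><td>2</td><td>3</td></tr></table>"
        + "<a href=\"/x\">x</a><a name=\"y\">y</a>";

    // Returns how many elements in the sample document match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a[href]"));                // 1: only one a has an href
        System.out.println(count("img[src$=.png]"));         // 1: a.png
        System.out.println(count("div.masthead > p"));       // 1: direct child of the masthead div
        System.out.println(count("h1 ~ p"));                 // 1: the p that follows the h1
        System.out.println(count("div.masthead, div.logo")); // 2: either class
        System.out.println(count("td:lt(2)"));               // 2: sibling indexes 0 and 1
        System.out.println(count("td:gt(2)"));               // 1: sibling index 3
        System.out.println(count("div:has(p)"));             // 1: the masthead div
        System.out.println(count("div:not(.logo)"));         // 1: the masthead div
        System.out.println(count("p:contains(JSOUP)"));      // 1: matching ignores case
    }
}
```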


Extract attributes from elements, text, and HTML

Problem
After parsing a document and finding some elements, you want to get at the data inside those elements.

Solution
To get the value of an attribute, use the Node.attr(String key) method
For the text of an element, use the Element.text() method
For the HTML of an element, use Element.html() for the inner HTML, or Node.outerHtml() for the element including its own tag
Example:

The code is as follows:

String html = "<p>An <a href='http://www.jb51.net/'><b>www.jb51.net</b></a> link.</p>";
Document doc = Jsoup.parse(html); // parse the HTML string into a Document
Element link = doc.select("a").first();

String text = doc.body().text(); // "An www.jb51.net link." (the text of the body)
String linkHref = link.attr("href"); // "http://www.jb51.net/" (the link address)
String linkText = link.text(); // "www.jb51.net" (the text of the link)

String linkOuterH = link.outerHtml();
    // "<a href="http://www.jb51.net/"><b>www.jb51.net</b></a>"
String linkInnerH = link.html(); // "<b>www.jb51.net</b>" (the HTML inside the link)

Description
The methods above are the core of element data access. There are several others:

Element.id()
Element.tagName()
Element.className() and Element.hasClass(String className)
These accessor methods have corresponding setter methods to change the data.
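
The accessors and their setter counterparts can be sketched like this. The class AccessorDemo and the sample paragraph are assumptions for illustration. The setters used are attr(String, String), tagName(String), addClass(String), and removeClass(String).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class AccessorDemo {
    static Element sample() {
        return Jsoup.parse("<p id=\"intro\" class=\"lead big\">Hi</p>")
                    .select("p").first();
    }

    // Reads the ID, tag name, class attribute, and a class-membership test.
    static String read() {
        Element p = sample();
        return p.id() + " " + p.tagName() + " " + p.className() + " " + p.hasClass("big");
    }

    // Changes the same data through the setter counterparts, then reads it back.
    static String write() {
        Element p = sample();
        p.attr("id", "summary");  // change the ID through its attribute
        p.tagName("div");         // rename the element
        p.addClass("seen");       // class list becomes: lead big seen
        p.removeClass("big");     // class list becomes: lead seen
        return p.id() + " " + p.tagName() + " " + p.className();
    }

    public static void main(String[] args) {
        System.out.println(read());  // intro p lead big true
        System.out.println(write()); // summary div lead seen
    }
}
```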


Sample program: list links
This sample program shows how to fetch a page from a URL, extract its links, images, and other pointers, and print their URLs and text.
Run the program with a URL as its only argument.

The code is as follows:

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        // Anything with a src attribute: images, scripts, frames, and so on.
        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        // link elements such as stylesheets and icons.
        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    // Shortens s to at most width characters, marking the cut with a trailing dot.
    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
org/jsoup/examples/ListLinks.java
