Jsoup Parsing HTML

Source: Internet
Author: User

Parsing and traversing an HTML document

How to parse an HTML document:

The code is as follows:

String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

The parser makes every attempt to produce a clean parse result from the HTML you provide, regardless of whether the HTML is well-formed. For example, it can handle:

1. unclosed tags (such as: <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
2. implicit tags (for example, a bare <td>Table data</td> is wrapped into <table><tr><td>...)
3. creating a reliable document structure (the html element contains a head and a body, and only appropriate elements appear within the head)
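
The first and third points can be checked with a short sketch. It assumes jsoup is on the classpath; the class name LenientParseDemo and its helper methods are illustrative and not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LenientParseDemo {
    // Counts the <p> elements jsoup builds from the given markup.
    static int countParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    // Returns the text of the first <p> element in the parsed document.
    static String firstParagraphText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").first().text();
    }

    // Reports whether the parsed document has both a head and a body element.
    static boolean hasHeadAndBody(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head() != null && doc.body() != null;
    }

    public static void main(String[] args) {
        String html = "<p>Lorem <p>Ipsum"; // neither paragraph is closed
        System.out.println(countParagraphs(html));
        System.out.println(firstParagraphText(html));
        System.out.println(hasHeadAndBody(html));
    }
}
```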

Object model for a document

1. A document consists of Elements and TextNodes (and a few other miscellaneous nodes).
2. The inheritance chain is: Document extends Element, which extends Node. TextNode also extends Node.
3. An Element contains a list of child nodes and has one parent Element. It also provides a filtered list containing only its child Elements.
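
The difference between child nodes and child elements can be seen in a small sketch. It assumes jsoup is on the classpath; the class ObjectModelDemo and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class ObjectModelDemo {
    // Parses a small fragment and returns the div that holds mixed content.
    static Element box() {
        Document doc = Jsoup.parse("<div id='box'>Hello <b>world</b>!</div>");
        return doc.getElementById("box");
    }

    // All child nodes: the text "Hello ", the <b> element, and the text "!".
    static int childNodeCount() {
        return box().childNodes().size();
    }

    // The filtered list holds child elements only, here just the <b>.
    static int childElementCount() {
        return box().children().size();
    }

    // The first child node is a TextNode, not an Element.
    static boolean firstChildIsText() {
        return box().childNode(0) instanceof TextNode;
    }

    public static void main(String[] args) {
        System.out.println(childNodeCount());
        System.out.println(childElementCount());
        System.out.println(firstChildIsText());
        System.out.println(box().parent().tagName());
    }
}
```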

Loading a document from a URL

Problem
You need to fetch and parse an HTML document from a website and find data within it.

Method
Use the Jsoup.connect(String url) method:

The code is as follows:
Document doc = Jsoup.connect("http://www.jb51.net/").get();
String title = doc.title();

Description
The connect(String url) method creates a new Connection, and get() fetches and parses the HTML file. If an error occurs while fetching the URL, an IOException is thrown, which you should handle appropriately.

The Connection interface is designed for method chaining to build specific requests, as follows:

The code is as follows:
Document doc = Jsoup.connect("http://www.jb51.net")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post();

This method only supports web URLs (the HTTP and HTTPS protocols); if you need to load from a file, use the parse(File in, String charsetName) method instead.

Load a document from a file

Problem
You have an HTML file on your local disk that you need to parse, to extract data from it or to modify it.

Method
Use the static Jsoup.parse(File in, String charsetName, String baseUri) method:

The code is as follows:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");

Description
The parse(File in, String charsetName, String baseUri) method loads and parses an HTML file. If an error occurs while loading the file, an IOException is thrown, which you should handle appropriately.
The baseUri parameter is used to resolve relative URLs in the file into absolute URLs. If you do not need this, pass in an empty string.
There is also a sister method, parse(File in, String charsetName), which uses the location of the file as the baseUri. This is useful when the site you are working on lives on the local file system and its relative links also point into the file system.
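
To make the role of the base URI concrete, here is a minimal sketch of relative URL resolution. It uses java.net.URI from the standard library rather than jsoup itself, so it illustrates the idea and is not jsoup's implementation; the class name BaseUriDemo and the sample paths are made up.

```java
import java.net.URI;

class BaseUriDemo {
    // Resolves a link as it appears in a document against a base URI.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.jb51.net/docs/";
        // A relative path is appended to the directory of the base URI.
        System.out.println(resolve(base, "page.html"));
        // A path starting with "/" replaces the whole path of the base URI.
        System.out.println(resolve(base, "/about"));
        // An absolute URL is returned unchanged.
        System.out.println(resolve(base, "https://example.com/x"));
    }
}
```

In jsoup the resolved form of an attribute is read with the abs: prefix, for example attr("abs:href"), as the sample program at the end of this article does.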


Using DOM methods to traverse a document

Problem
You have an HTML document that you want to extract data from, and you know the general structure of the document.

Method
After parsing the HTML into a Document, you can work with it using DOM-like methods. Example code:

The code is as follows:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

Description
Element objects provide a range of DOM-like methods to find elements, and to extract and manipulate their data. Specifically:

Finding elements
getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)

Element data
attr(String key) gets an attribute and attr(String key, String value) sets it
attributes() gets all attributes
id(), className() and classNames()
text() gets the text content and text(String value) sets it
html() gets the inner HTML content and html(String value) sets it
outerHtml() gets the outer HTML content
data() gets data content (for example, of script and style tags)
tag() and tagName()

Manipulating HTML and text
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
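
A few of the manipulation methods listed above are sketched below. The example assumes jsoup is on the classpath; the class ManipulateDemo and the sample markup are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulateDemo {
    // Builds a small document and edits the element with id "content".
    static Element editedContent() {
        Document doc = Jsoup.parse("<div id='content'><p>Middle</p></div>");
        Element content = doc.getElementById("content");
        content.prepend("<p>First</p>");             // parsed and inserted as the first child
        content.append("<p>Last</p>");               // parsed and inserted as the last child
        content.appendElement("span").text("note");  // creates a <span>, then sets its text
        return content;
    }

    public static void main(String[] args) {
        // Prints the div with its four child elements.
        System.out.println(editedContent().outerHtml());
    }
}
```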


Use selector syntax to find elements
Problem
You want to find and manipulate elements using CSS or jQuery-like selector syntax.

Method
Use the Element.select(String selector) and Elements.select(String selector) methods:

The code is as follows:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");

Elements links = doc.select("a[href]"); // a elements with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // images whose src ends in .png

Element masthead = doc.select("div.masthead").first(); // the first div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of an h3 with class r

Description
jsoup elements support a CSS (or jQuery) like selector syntax for finding matching elements, which allows very powerful and flexible queries.
The select method is available on a Document, Element, or Elements object. It is context-sensitive, so you can filter by selecting from a specific element, or by chaining select calls.
select returns an Elements collection, which provides a range of methods to extract and manipulate the results.
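
The context-sensitive behaviour can be sketched as follows, assuming jsoup is on the classpath. The class SelectContextDemo and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectContextDemo {
    static final String HTML =
        "<div class='masthead'><a href='/home'>Home</a></div>"
        + "<div class='footer'><a href='/about'>About</a><a>No href</a></div>";

    // Selecting from the document searches the whole tree.
    static int linksInDocument() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("a[href]").size();
    }

    // Selecting from one element searches only beneath that element.
    static int linksInMasthead() {
        Document doc = Jsoup.parse(HTML);
        Element masthead = doc.select("div.masthead").first();
        return masthead.select("a[href]").size();
    }

    // Chained selects: first narrow to the divs, then search inside them.
    static int linksViaChain() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("div").select("a[href]").size();
    }

    public static void main(String[] args) {
        System.out.println(linksInDocument());
        System.out.println(linksInMasthead());
        System.out.println(linksViaChain());
    }
}
```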

Selector overview
tagname: finds elements by tag, such as: a
ns|tag: finds elements by tag in a namespace, such as: fb|name finds <fb:name> elements
#id: finds elements by ID, such as: #logo
.class: finds elements by class name, such as: .masthead
[attribute]: finds elements that have the attribute, such as: [href]
[^attr]: finds elements with an attribute name prefix, such as: [^data-] finds elements with HTML5 dataset attributes
[attr=value]: finds elements with the given attribute value, such as: [width=500]
[attr^=value], [attr$=value], [attr*=value]: finds elements whose attribute value starts with, ends with, or contains the value, such as: [href*=/path/]
[attr~=regex]: finds elements whose attribute value matches the regular expression, such as: img[src~=(?i)\.(png|jpe?g)]
*: matches all elements
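
The regular expression from the [attr~=regex] example can be tried on its own with java.util.regex. This sketch assumes the selector matches when the pattern is found anywhere in the attribute value, which is why it calls find() rather than matches(); the class AttrRegexDemo and the sample file names are made up.

```java
import java.util.regex.Pattern;

class AttrRegexDemo {
    // (?i) turns on case-insensitive matching; jpe?g covers both jpg and jpeg.
    static final Pattern IMAGE_EXT = Pattern.compile("(?i)\\.(png|jpe?g)");

    // True when the pattern occurs somewhere in the src value.
    static boolean looksLikeImage(String src) {
        return IMAGE_EXT.matcher(src).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeImage("/img/Logo.PNG"));
        System.out.println(looksLikeImage("photo.jpeg"));
        System.out.println(looksLikeImage("anim.gif"));
    }
}
```

Because the pattern is not anchored, a value such as "a.png.txt" would also match; append $ to the expression if only the end of the value should count.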

Selector combinations
el#id: element with ID, such as: div#logo
el.class: element with class, such as: div.masthead
el[attr]: element with attribute, such as: a[href]
Any combination, such as: a[href].highlight
ancestor child: finds child elements that descend from the ancestor, such as: .body p finds p elements anywhere under a block with class "body"
parent > child: finds child elements that are direct children of the parent, such as: div.content > p finds p elements, and body > * finds the direct children of the body tag
siblingA + siblingB: finds the sibling B element immediately preceded by sibling A, such as: div.head + div
siblingA ~ siblingX: finds sibling X elements preceded by sibling A, such as: h1 ~ p
el, el, el: groups multiple selectors and finds the unique elements that match any of them, such as: div.masthead, div.logo
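
The child and sibling combinators are easy to confuse, so here is a short sketch of them on a flat fragment. It assumes jsoup is on the classpath; the class CombinatorDemo and the markup are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CombinatorDemo {
    static final String HTML =
        "<h1>Title</h1><p>One</p>"
        + "<div class='head'>H</div><div id='next'>N</div><p>Two</p>";

    // h1 ~ p: every p that comes after an h1 among the same siblings.
    static int paragraphsAfterHeading() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("h1 ~ p").size();
    }

    // div.head + div: only the div that immediately follows div.head.
    static String idOfDivAfterHead() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("div.head + div").first().id();
    }

    // body > *: the direct children of the body element.
    static int directChildrenOfBody() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("body > *").size();
    }

    public static void main(String[] args) {
        System.out.println(paragraphsAfterHeading());
        System.out.println(idOfDivAfterHead());
        System.out.println(directChildrenOfBody());
    }
}
```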

Pseudo selectors
:lt(n): finds elements whose sibling index (their position in the DOM tree relative to their parent) is less than n, such as: td:lt(3) finds the cells in the first three columns
:gt(n): finds elements whose sibling index is greater than n, such as: div p:gt(2) finds the p elements inside a div whose index is greater than 2
:eq(n): finds elements whose sibling index is equal to n, such as: form input:eq(1) finds input elements at index 1 inside a form
:has(selector): finds elements that contain elements matching the selector, such as: div:has(p) finds divs that contain p elements
:not(selector): finds elements that do not match the selector, such as: div:not(.logo) finds all divs that do not have class="logo"
:contains(text): finds elements that contain the given text; the search is case-insensitive, such as: p:contains(jsoup)
:containsOwn(text): finds elements that directly contain the given text
:matches(regex): finds elements whose text matches the specified regular expression, such as: div:matches((?i)login)
:matchesOwn(regex): finds elements whose own text matches the specified regular expression
Note: the pseudo selector indexes above start at 0, so the first element has index 0, the second has index 1, and so on.
See the Selector API reference for more details.
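
The zero-based indexes can be checked against a one-row table. This is a minimal sketch assuming jsoup is on the classpath; the class PseudoSelectorDemo and the table contents are made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PseudoSelectorDemo {
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>";

    // Number of elements that the selector matches in the sample table.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    // Text of the first element that the selector matches.
    static String firstText(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).first().text();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(3)"));       // cells at index 0, 1 and 2
        System.out.println(firstText("td:eq(0)"));   // index 0 is the first cell
        System.out.println(firstText("td:gt(2)"));   // only index 3 is greater than 2
        System.out.println(count("tr:has(td)"));     // rows that contain a cell
        System.out.println(count("td:contains(A)")); // text search ignores case
    }
}
```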


Extract attributes from elements, text and HTML

Problem
After parsing a document and finding some elements, you want to get at the data inside those elements.

Method
To get the value of an attribute, use the Node.attr(String key) method
For the text of an element, use the Element.text() method
For the HTML of an element, use Element.html(), or Node.outerHtml() as appropriate
Example:

The code is as follows:
String html = "<p>An <a href='http://www.jb51.net/'><b>www.jb51.net</b></a> link.</p>";
Document doc = Jsoup.parse(html); // parse the HTML string into a Document
Element link = doc.select("a").first(); // find the first a element

String text = doc.body().text(); // "An www.jb51.net link." (the text of the body)
String linkHref = link.attr("href"); // "http://www.jb51.net/" (the link address)
String linkText = link.text(); // "www.jb51.net" (the text inside the link)

String linkOuterH = link.outerHtml();
// "<a href="http://www.jb51.net/"><b>www.jb51.net</b></a>"
String linkInnerH = link.html(); // "<b>www.jb51.net</b>" (the HTML inside the link)

Description
The methods above are the core methods for accessing element data. Other methods are also available:

Element.id()
Element.tagName()
Element.className() and Element.hasClass(String className)
These accessor methods have corresponding setter methods to change the data.


Sample program: Get all Links
This sample program shows how to fetch a page from a URL, then extract all links, images, and other ancillary content from the page, and print their URLs and text.
Run the following program with a URL as its only argument.

The code is as follows:
package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        // Anything with a src attribute: images, scripts, frames, and so on.
        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        // link elements such as stylesheets and icons.
        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        // Ordinary hyperlinks; abs:href resolves each one against the page URL.
        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    // Shortens s to at most width characters, ending in "." when it was cut.
    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
org/jsoup/examples/ListLinks.java
