[Reprint] Parsing HTML pages with jsoup

Source: Internet
Author: User

I. Introduction to jsoup

In the past, when parsing HTML documents or fragments in Java, we usually used the open-source HTMLParser library (http://htmlparser.sourceforge.net/). Now there is jsoup, which is sufficient on its own for processing HTML content; it is updated more frequently and has a more convenient API.

jsoup is a Java HTML parser that can parse HTML directly from a URL or from text. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods; you can think of it as a Java counterpart to jQuery.

The main features of jsoup are:

    • Parsing HTML from a URL, file, or string;
    • Finding and extracting data using DOM traversal or CSS selectors;
    • Manipulating HTML elements, attributes, and text.

jsoup is released under the MIT license and can be used with confidence in commercial projects. Official website: http://jsoup.org/

II. Parsing and traversing HTML documents

To process HTML, jsoup parses the HTML supplied by the user into a Document object. jsoup supports the following kinds of source content:

    • Parsing an HTML string
    • Parsing a body fragment
    • Loading a Document from a URL
    • Loading a Document from a file

(i) Parsing an HTML string

When we are handed an HTML string, we may need to parse it and extract its contents, check that it is well formed, or modify it. jsoup makes all of these tasks easy.

jsoup has a static method, Jsoup.parse(String html), that converts our HTML into a Document object. For example:

    String html = "<div><p align=\"center\">This is the content of the P element</p>";
    Document document = Jsoup.parse(html);

The method above converts the HTML string into a Document object, and once we have the Document we can use its methods to do whatever we need. Note that the HTML being parsed is not valid: the div tag is never closed. This is not a problem for jsoup, which handles this kind of input very well.
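
As a rough illustration of the tolerant parsing described above, the sketch below parses the same kind of unclosed fragment and reads the paragraph back. It assumes the jsoup library is on the classpath; the class and method names (ParseStringSketch, firstParagraphText, divCount) are invented for this example and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
class ParseStringSketch {

    // Parses the HTML and returns the text of the first <p>, or "" if there is none.
    static String firstParagraphText(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first();
        return p == null ? "" : p.text();
    }

    // Counts the <div> elements in the parsed document.
    static int divCount(String html) {
        return Jsoup.parse(html).select("div").size();
    }

    public static void main(String[] args) {
        // The <div> is deliberately left unclosed.
        String html = "<div><p align=\"center\">Content of the P element</p>";
        System.out.println(firstParagraphText(html));
        System.out.println(divCount(html));
        // Print the repaired markup that jsoup built from the input.
        System.out.println(Jsoup.parse(html).body().html());
    }
}
```

Printing body().html() is a convenient way to see how jsoup has closed the missing tag.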

(ii) Parsing body fragments

Suppose we have an HTML fragment to parse (for example, a div containing a pair of p tags, rather than a complete HTML document). The fragment might be a user-submitted comment or the body section edited on a CMS page. For this we can use the Jsoup.parseBodyFragment(String html) method.

Examples are as follows:

    String html = "<div><p align=\"center\">This is the content of the P element</p>";
    Document document = Jsoup.parseBodyFragment(html);

You may wonder how this differs from the previous example, since the HTML snippet is the same. The difference is that the parseBodyFragment method creates an empty shell document and inserts the parsed HTML into its body element. The ordinary Jsoup.parse(String html) method usually gives the same result, but explicitly treating user input as a body fragment ensures that any bad HTML supplied by the user is parsed into the body element.

The Document.body() method returns the document's body element together with its children; it is the same element you would find with doc.getElementsByTag("body").
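
The following minimal sketch shows parseBodyFragment and Document.body() working together on a fragment with an unclosed tag. It assumes jsoup is on the classpath; BodyFragmentSketch, parseComment, and bodyChildCount are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
class BodyFragmentSketch {

    // Treats the input as a body fragment, e.g. a user-submitted comment.
    static Document parseComment(String userHtml) {
        return Jsoup.parseBodyFragment(userHtml);
    }

    // Number of element children that ended up inside <body>.
    static int bodyChildCount(String userHtml) {
        return parseComment(userHtml).body().children().size();
    }

    public static void main(String[] args) {
        // The second <p> is never closed; it still lands inside <body>.
        Document doc = parseComment("<p>One</p><p>Two");
        Element body = doc.body();
        System.out.println(body.tagName());
        System.out.println(body.children().size());
        System.out.println(doc.getElementsByTag("body").size());
    }
}
```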

(iii) Loading a Document from a URL

Sometimes we want to fetch the content at a URL and turn it into a Document object. Previously we might have used an HTTP client to simulate the request and then processed the response ourselves; with jsoup this is easy. For example:

    Document document = Jsoup.connect("http://www.baidu.com").get();
    String title = document.title();
    String text = document.text();

The connect(String url) method creates a new Connection, and get() fetches and parses the HTML. If an error occurs while fetching the URL, an IOException is thrown and should be handled appropriately.

The Connection interface also offers chainable methods for building more specialized requests, as follows:

    Document doc = Jsoup.connect("http://test.com")
            .data("query", "Java")
            .userAgent("Mozilla")
            .cookie("auth", "token")
            .timeout(3000)
            .post();

You can post parameters to the URL and set the user agent, cookies, timeout, and so on. The calls can be chained conveniently (anyone familiar with jQuery will recognize this style of method chaining).

(iv) Loading a Document from a file

Sometimes the HTML we need to process is in a file on disk, and we need to extract or parse some content from it. jsoup can do this too. The sample code is as follows:

    File input = new File("d:/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://test.com/");

You may wonder about the parameters: the first is the file and the second is the character encoding, but what is the third? The third parameter is the base URI, which makes it easy to resolve relative paths. If you do not need it you can leave it out, because the method is overloaded. The string-parsing methods in the previous sections also have overloads that take a base URI; this is described in more detail later.
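
To try the file-based overload without depending on a particular path such as d:/input.html, one option is to write the HTML to a temporary file first. The sketch below does that; it assumes jsoup is on the classpath, and FileLoadSketch and loadFromTempFile are hypothetical names used only here.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class for illustration only.
class FileLoadSketch {

    // Writes the HTML to a temporary file, then parses that file with an explicit
    // charset and base URI, mirroring Jsoup.parse(File, String, String).
    static Document loadFromTempFile(String html, String baseUri) {
        try {
            File input = File.createTempFile("jsoup-sketch", ".html");
            input.deleteOnExit();
            Files.write(input.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return Jsoup.parse(input, "UTF-8", baseUri);
        } catch (IOException e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>";
        Document doc = loadFromTempFile(html, "http://test.com/");
        System.out.println(doc.title());
        System.out.println(doc.select("p").text());
    }
}
```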

III. Data extraction

(i) Traversing documents using DOM methods

Section II showed how to obtain a Document object. We can use that object to traverse the document, for example:

    Document doc = Jsoup.parse(input, "UTF-8", "http://test.com/");
    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }

As shown here, the methods of the Document object make it easy to get at the content. Common methods are as follows:

Finding elements

    • getElementById(String id)
    • getElementsByTag(String tag)
    • getElementsByClass(String className)
    • getElementsByAttribute(String key) (and related methods)
    • Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
    • Tree navigation: parent(), children(), child(int index)

Element data

    • attr(String key) gets an attribute; attr(String key, String value) sets an attribute
    • attributes() gets all attributes
    • id(), className(), and classNames()
    • text() gets the text content; text(String value) sets the text content
    • html() gets the HTML inside the element; html(String value) sets the HTML inside the element
    • outerHtml() gets the HTML of the element including the element itself
    • data() gets data content (for example, of script and style tags)
    • tag() and tagName()

Manipulating HTML and text

    • append(String html), prepend(String html)
    • appendText(String text), prependText(String text)
    • appendElement(String tagName), prependElement(String tagName)
    • html(String value)
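
A few of the methods listed above can be combined as in the sketch below, which works on an in-memory string rather than a file. It assumes jsoup is on the classpath; DomTraversalSketch, its HTML constant, and its method names are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
class DomTraversalSketch {

    static final String HTML =
            "<div id=\"content\"><a href=\"/a\">First</a> <a href=\"/b\">Second</a></div>";

    // Collects the href of every <a> inside the element with the given id.
    static List<String> linkHrefs(String html, String containerId) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById(containerId);
        List<String> hrefs = new ArrayList<>();
        if (content == null) {
            return hrefs; // no such container
        }
        for (Element link : content.getElementsByTag("a")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Text of the element that follows the first <a> among its siblings.
    static String textAfterFirstLink(String html) {
        Element first = Jsoup.parse(html).getElementsByTag("a").first();
        return first.nextElementSibling().text();
    }

    // Appends an empty <p> to the container and returns the new child count.
    static int childCountAfterAppend(String html, String containerId) {
        Element content = Jsoup.parse(html).getElementById(containerId);
        content.appendElement("p");
        return content.children().size();
    }

    public static void main(String[] args) {
        System.out.println(linkHrefs(HTML, "content"));
        System.out.println(textAfterFirstLink(HTML));
        System.out.println(childCountAfterAppend(HTML, "content"));
    }
}
```
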

(ii) Using selectors to find elements

Anyone who has used jQuery will have been impressed by its powerful selectors. jsoup has equally powerful selectors, which make it easy to work with our documents. The sample code is as follows:

    Elements links = doc.select("a[href]");                  // a elements with an href attribute
    Elements pngs = doc.select("img[src$=.png]");            // images whose src ends in .png
    Element masthead = doc.select("div.masthead").first();   // the first div with class "masthead"
    Elements resultLinks = doc.select("h3.r > a");           // a elements that are direct children of an h3 with class "r"

jsoup supports a selector syntax similar to CSS (or jQuery), which enables very powerful and flexible searches.

The select method is available on Document, Element, and Elements objects. It is context-sensitive, so you can filter within a specific element or chain select calls.

The select method returns an Elements collection, which provides a set of methods for extracting and manipulating the results.
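
The selectors from the listing above can be tried against a small in-memory page, as in this sketch. It assumes jsoup is on the classpath; the sample HTML, the SelectorSketch class, and its methods are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical helper class for illustration only.
class SelectorSketch {

    static final String HTML =
            "<div class=\"masthead\"><img src=\"logo.png\"><img src=\"photo.jpg\"></div>"
          + "<h3 class=\"r\"><a href=\"/result\">Result</a></h3>"
          + "<a name=\"top\">No href</a>";

    // Number of elements in the sample page matching the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    // select is context-sensitive: the second call only searches inside the first result.
    static int imagesInsideMasthead() {
        Elements masthead = Jsoup.parse(HTML).select("div.masthead");
        return masthead.select("img").size();
    }

    public static void main(String[] args) {
        System.out.println(count("a[href]"));
        System.out.println(count("img[src$=.png]"));
        System.out.println(count("h3.r > a"));
        System.out.println(imagesInsideMasthead());
    }
}
```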

(iii) Extracting attributes, text, and HTML from elements

The usual ways to extract data from an element with jsoup are:

    • To get the value of an attribute, use the Node.attr(String key) method
    • To get the text in an element, use the Element.text() method
    • To get the HTML of an element, use the Element.html() or Node.outerHtml() method

Examples are as follows:

    String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(html);            // parse the HTML string into a Document
    Element link = doc.select("a").first();      // find the first a element

    String text = doc.body().text();             // "An example link." (the text of the document)
    String linkHref = link.attr("href");         // "http://example.com/" (the link address)
    String linkText = link.text();               // "example" (the text inside the link)

    String linkOuterH = link.outerHtml();        // "<a href="http://example.com/"><b>example</b></a>"
    String linkInnerH = link.html();             // "<b>example</b>" (the HTML inside the link)

(iv) URL processing

When working with HTML content, we often need to convert the link addresses on a page from relative to absolute. jsoup has a way to do this, and it is what the base URI mentioned earlier is for. The sample code is as follows:

    Document doc = Jsoup.connect("http://www.baidu.com/").get();
    Element link = doc.select("a").first();
    String relHref = link.attr("href");       // the href exactly as written in the page (possibly relative)
    String absHref = link.attr("abs:href");   // the resolved absolute URL, e.g. "http://www.baidu.com/gaoji/preferences.html"

In HTML elements, URLs are often written relative to the location of the document, for example <a href="/download">...</a>. When you use the Node.attr(String key) method to read the href attribute of an a element, it returns the value exactly as it appears in the HTML source.

If you need the absolute URL, add the abs: prefix to the attribute name. For example, attr("abs:href") returns the URL resolved against the base URI.

For this reason it is important to define a base URI when parsing an HTML document. If you do not want to use the abs: prefix, the Node.absUrl(String key) method does the same thing.
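
The effect of the base URI can be seen without any network access by passing one to Jsoup.parse(String html, String baseUri). The sketch below is only an illustration under that assumption; the base URI http://example.com/docs/ and the names AbsoluteUrlSketch and firstLink are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
class AbsoluteUrlSketch {

    static final String HTML = "<a href=\"/download\">Download</a>";

    // Parses the snippet with the given base URI and returns its first link.
    static Element firstLink(String baseUri) {
        return Jsoup.parse(HTML, baseUri).select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink("http://example.com/docs/");
        System.out.println(link.attr("href"));      // value as written in the source
        System.out.println(link.attr("abs:href"));  // resolved against the base URI
        System.out.println(link.absUrl("href"));    // same resolution via absUrl

        // Without a base URI there is nothing to resolve against.
        Element noBase = Jsoup.parse(HTML).select("a").first();
        System.out.println(noBase.absUrl("href").isEmpty());
    }
}
```

With the base URI given here, the relative href /download should resolve to http://example.com/download.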

IV. Data modification

(i) Setting attribute values

When working with HTML, we sometimes need to modify attribute values, image addresses, class names, and so on.

To set an attribute, use Element.attr(String key, String value) or Elements.attr(String key, String value).

To modify the class attribute of an element, use the Element.addClass(String className) and Element.removeClass(String className) methods.

Elements provides methods for manipulating attributes and classes in bulk. For example, to add rel="nofollow" to every a element inside a div, you can write:

    doc.select("div.comments a").attr("rel", "nofollow");

These jsoup methods also support method chaining, as follows:

    doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
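
To check what a bulk update actually touched, the result can be queried again with a selector. The sketch below does this on a small in-memory page; it assumes jsoup is on the classpath, and the sample HTML, the "external" class name, and the BulkAttributeSketch class are invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class for illustration only.
class BulkAttributeSketch {

    static final String HTML =
            "<div class=\"comments\"><a href=\"/x\">x</a><a href=\"/y\">y</a></div>"
          + "<a href=\"/z\">z</a>";

    // Adds rel="nofollow" and a class to every link inside div.comments, in one chain.
    static Document markCommentLinks(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("div.comments a").attr("rel", "nofollow").addClass("external");
        return doc;
    }

    public static void main(String[] args) {
        Document doc = markCommentLinks(HTML);
        System.out.println(doc.select("a[rel=nofollow]").size()); // links that were updated
        System.out.println(doc.select("a").size());               // all links in the page
    }
}
```
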

(ii) Setting the HTML content of an element

To add content such as an HTML fragment to an element, you can do the following:

    Element div = doc.select("div").first();   // <div></div>
    div.html("<p>lorem ipsum</p>");            // <div><p>lorem ipsum</p></div>
    div.prepend("<p>First</p>");               // add HTML at the start of the div's content
    div.append("<p>Last</p>");                 // add HTML at the end of the div's content
    // result: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

    Element span = doc.select("span").first(); // <span>One</span>
    span.wrap("<li><a href='http://example.com/'></a></li>");   // wrap the element in outer HTML
    // result: <li><a href="http://example.com/"><span>One</span></a></li>

(iii) Setting the text content of an element

To modify the text content of an element, you can do the following:

    Element div = doc.select("div").first();   // <div></div>
    div.text("five > four");                   // <div>five &gt; four</div>
    div.prepend("First ");
    div.append(" Last");
    // now: <div>First five &gt; four Last</div>

Notes

The text setter methods mirror the HTML setter methods:

Element.text(String text) clears the inner HTML of the element and replaces it with the supplied text.

Element.prepend(String first) and Element.append(String last) add text nodes at the start and end of the element's inner HTML, respectively.

The text should be supplied unencoded: characters such as < and > are treated as literal text, not as HTML.
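
The difference between setting text and setting HTML is easiest to see by passing the same string to both. The sketch below is one way to compare them, assuming jsoup is on the classpath; TextVersusHtmlSketch and its methods are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only.
class TextVersusHtmlSketch {

    // Creates an empty <div> inside a fresh shell document.
    static Element emptyDiv() {
        return Jsoup.parseBodyFragment("<div></div>").select("div").first();
    }

    // Sets the value as literal text: markup characters are not interpreted.
    static Element divWithText(String value) {
        return emptyDiv().text(value);
    }

    // Sets the value as HTML: markup is parsed into child elements.
    static Element divWithHtml(String value) {
        return emptyDiv().html(value);
    }

    public static void main(String[] args) {
        Element asText = divWithText("<b>bold</b>");
        Element asHtml = divWithHtml("<b>bold</b>");
        System.out.println(asText.children().size()); // no child elements, only a text node
        System.out.println(asHtml.children().size()); // one <b> child element
        System.out.println(asText.html());            // shows how the text was escaped
    }
}
```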
