Java web page scraping, a simple example (2): using Jsoup's select method in detail


http://www.cnblogs.com/xiaoMzjm/p/3899366.html

Background

The previous post, "Java web page scraping, a simple example (1): using regular expressions", showed how to parse web page content with regular expressions. Regular expressions are general-purpose, but they are cumbersome and take a lot of code, and for people without a solid regex background (me, for example, T_T) coming up with even a simple one in practice is hard. This post uses Jsoup, a powerful HTML parsing tool, to parse the HTML instead, and you will find that everything becomes easy.

Preparation

Download: Jsoup-1.6.1.jar

The result first

Target site: China weather

Purpose: Get today's weather

Target HTML code:

(code listing not included in this copy)

Parsed Java code:

(1) Inspecting the page's elements shows that the content we want is in the target HTML code above, and all of it sits inside the <li> with class="dn on" and data-dn="7d1".

(2) The word "Today" is in

(3) The words "8th" in

(4) The word "thundershowers" is in the element with class="wea"

(5) "33" is in the first <span>

(6) "25" is in the second <span>

(7) The word "Breeze" is in the third <i>

With that analysis, getting the weather data is easy. The Java code is as follows:

(code listing not included in this copy)

Printed result:

Today
8th
Thundershowers
33°C
25°C
Breeze
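The original listing is not reproduced here, so the following is only a minimal sketch of what such an extraction could look like. It parses a hypothetical HTML string shaped after steps (1) to (7) rather than fetching the live page. The <h1>, <h2>, and <p> tags, the class names tem and win, and the single <i> are assumptions made for illustration, not the site's confirmed markup, and the class name WeatherSketch is likewise made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class WeatherSketch {
    // Hypothetical stand-in for the target HTML. Only class="dn on",
    // data-dn="7d1" and class="wea" come from the article; the rest is assumed.
    static final String HTML =
        "<ul><li class=\"dn on\" data-dn=\"7d1\">"
        + "<h1>Today</h1><h2>8th</h2>"
        + "<p class=\"wea\">Thundershowers</p>"
        + "<p class=\"tem\"><span>33</span></p>"
        + "<p class=\"tem\"><span>25</span></p>"
        + "<p class=\"win\"><i>Breeze</i></p>"
        + "</li></ul>";

    // Returns day label, date, weather, high, low and wind, in that order.
    static String[] extract(String html) {
        Document doc = Jsoup.parse(html);
        // Narrow the search to the <li> that holds today's forecast.
        Element day = doc.select("li[data-dn=7d1]").first();
        Elements spans = day.select("span");
        return new String[] {
            day.select("h1").text(),
            day.select("h2").text(),
            day.select(".wea").text(),
            spans.get(0).text() + "\u00B0C",
            spans.get(1).text() + "\u00B0C",
            day.select("i").text()
        };
    }

    public static void main(String[] args) {
        for (String field : extract(HTML)) {
            System.out.println(field);
        }
    }
}
```

Against the real site the Document would come from Jsoup.connect(url).get() instead of Jsoup.parse(html), and the selectors would need to match the page's actual structure.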

Detailed explanation

For reference:

    • Jsoup's official Chinese documentation: http://www.open-open.com/
    • The API documentation: http://jsoup.org/apidocs/

Line 13 of the Java code:

According to the documentation, there are three ways to obtain the data source:

(1) From a string of HTML: Document doc = Jsoup.parse(html);

(2) From a URL: Document doc = Jsoup.connect("http://example.com/").get();

(3) From an HTML file: File input = new File("/tmp/input.html"); Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Here we use the second method and load the document from a URL.
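As a rough illustration of the three entry points side by side, here is a small sketch. The class name LoadSketch, the method names, and the sample HTML are made up for the example; the Jsoup calls are the ones quoted above. The URL variant is defined but not called in main, because it needs network access.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LoadSketch {
    // (1) Parse markup that is already in memory.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // (2) Fetch and parse a page over HTTP; requires network access.
    static Document fromUrl(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    // (3) Parse a local file; the base URI is used to resolve relative links.
    static Document fromFile(File file, String baseUri) throws IOException {
        return Jsoup.parse(file, "UTF-8", baseUri);
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Demo</title></head>"
                    + "<body><p>Hello</p></body></html>";
        System.out.println(fromString(html).title());

        // Write the same markup to a temporary file and parse it from disk.
        File tmp = File.createTempFile("jsoup-demo", ".html");
        Files.write(tmp.toPath(), html.getBytes(StandardCharsets.UTF_8));
        System.out.println(fromFile(tmp, "http://example.com/").select("p").text());
        tmp.delete();
    }
}
```

All three return the same Document type, so the rest of the code does not need to know where the HTML came from.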

Lines 24, 26, 30, 34, and 38 of the Java code:

Document extends the Element class, and Element has a very handy method called select that can do almost anything. When you need to pull a particular piece out of a pile of HTML quickly, I find select the most convenient tool. Let's look at how to search with it.

Note: the results listed below were printed with the following statement:

for (Element e : elements) { System.out.println(e.text()); }

select in detail

Each entry gives a description, the test HTML, the select statement, a note on the syntax, and the result.

Find by tag name
    HTML: <span>33</span><span>25</span>
    Select: Elements elements = doc.select("span");
    Note: write the tag name directly; no angle brackets are needed.
    Result: 33, 25

Find by id
    HTML: <span id="mySpan">36</span><span>20</span>
    Select: Elements elements = doc.select("#mySpan");
    Note: look up by id with a leading #, the same way CSS targets an element.
    Result: 36

Find by class name
    HTML: <span class="myClass">36</span><span>20</span>
    Select: Elements elements = doc.select(".myClass");
    Note: look up by class name with a leading dot, the same way CSS targets an element.
    Result: 36

Find by attribute name and value
    HTML: <span class="class1" id="id1">36</span><span class="class2" id="id2">36</span>
    Select: Elements elements = doc.select("span[class=class1]span[id=id1]");
    Note: the pattern is tagName[attributeName=attributeValue]. The tag name is optional, and each additional attribute gets its own [...] group, as above.
    Result: 36

Find by attribute name prefix
    HTML: <span class="class1">36</span><span class="class2">22</span>
    Select: Elements elements = doc.select("span[^cl]");
    Note: the pattern is tagName[^attributeNamePrefix]. The tag name is optional, and each additional attribute gets its own [...] group.
    Result: 36, 22

Find by attribute value matching a regular expression
    HTML: <span class="abc">36</span><span class="ade">22</span>
    Select: Elements elements = doc.select("span[class~=^ab]");
    Note: the pattern is tagName[attributeName~=regularExpression]. The expression above matches tags whose class value starts with "ab". The tag name is optional, and each additional attribute gets its own [...] group.
    Result: 36

Find by text the tag contains
    HTML: <span>36</span><span>22</span>
    Select: Elements elements = doc.select("span:contains(3)");
    Note: the pattern is tagName:contains(text).
    Result: 36

Find by own text matching a regular expression
    HTML: <span>36</span><span>22</span>
    Select: Elements elements = doc.select("span:matchesOwn(^3)");
    Note: the pattern is tagName:matchesOwn(regularExpression). The expression above matches tags whose text starts with 3.
    Result: 36

select supports several other lookup forms; only the most useful and commonly used syntax is listed here.
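To try the selectors above without fetching a page, a small harness like the following may help. It is only a sketch: the class name SelectSketch and the helper texts are hypothetical, and the attribute example uses the span[class=class1][id=id1] form, which repeats only the bracket group rather than the tag name.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectSketch {
    // Runs one selector against one HTML fragment and
    // returns the text of every matching element, in document order.
    static List<String> texts(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element e : doc.select(cssQuery)) {
            out.add(e.text());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(texts("<span>33</span><span>25</span>", "span"));
        System.out.println(texts("<span id=\"mySpan\">36</span><span>20</span>", "#mySpan"));
        System.out.println(texts("<span class=\"myClass\">36</span><span>20</span>", ".myClass"));
        System.out.println(texts(
            "<span class=\"class1\" id=\"id1\">36</span><span class=\"class2\" id=\"id2\">36</span>",
            "span[class=class1][id=id1]"));
        System.out.println(texts(
            "<span class=\"class1\">36</span><span class=\"class2\">22</span>", "span[^cl]"));
        System.out.println(texts(
            "<span class=\"abc\">36</span><span class=\"ade\">22</span>", "span[class~=^ab]"));
        System.out.println(texts("<span>36</span><span>22</span>", "span:contains(3)"));
        System.out.println(texts("<span>36</span><span>22</span>", "span:matchesOwn(^3)"));
    }
}
```

Each call prints the texts of the elements that the selector matched, which makes it easy to compare one selector form against another on the same fragment.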

The select method returns an Elements object that contains all the nodes it found. You can iterate over the Elements, or call get(index) to pick out a specific node, and then read the node's text with its text() method.
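A minimal sketch of that traversal, assuming a two-span fragment like the one in the first select example (the class name ElementsSketch and the helper spans are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ElementsSketch {
    // Parses the fragment and returns every <span> in it.
    static Elements spans(String html) {
        return Jsoup.parse(html).select("span");
    }

    public static void main(String[] args) {
        Elements elements = spans("<span>33</span><span>25</span>");

        // Elements is a list of Element, so a for-each loop visits every match.
        for (Element e : elements) {
            System.out.println(e.text());
        }

        // get(index) returns a single node; indexes start at 0.
        Element second = elements.get(1);
        System.out.println(second.text());

        // size() tells you how many nodes matched.
        System.out.println(elements.size());
    }
}
```

Checking size() before calling get(index) avoids an IndexOutOfBoundsException when the selector matches fewer nodes than expected.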

For the node's other properties, see the API documentation.
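As a taste of what the API offers beyond text(), here is a short sketch that reads a few other properties of a node. The link markup, the class name NodeSketch, and the helper firstLink are hypothetical; the accessor methods are from the Element class.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class NodeSketch {
    // Returns the first <a> element in the fragment.
    static Element firstLink(String html) {
        return Jsoup.parse(html).select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink(
            "<a id=\"home\" class=\"nav\" href=\"/index.html\"><b>Home</b> page</a>");

        System.out.println(link.tagName());     // the tag name
        System.out.println(link.id());          // value of the id attribute
        System.out.println(link.className());   // value of the class attribute
        System.out.println(link.attr("href"));  // value of any attribute, by name
        System.out.println(link.text());        // text of the node and its children
        System.out.println(link.ownText());     // text of the node itself only
        System.out.println(link.html());        // inner HTML of the node
    }
}
```

attr(name) is the general-purpose accessor; id() and className() are shortcuts for the two most common attributes.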

Conclusion

Jsoup has other powerful features as well; this post only covers how to extract specific content from a web page. I hope it helps those who are new to Jsoup.
