http://www.cnblogs.com/xiaoMzjm/p/3899366.html
Background
The previous post, "Java: a simple example of crawling web page content (1), using regular expressions", described how to parse a web page with regular expressions. Regular expressions are general-purpose, but they are cumbersome and take a lot of code, and for anyone without a solid grounding in them (me, for example, T_T) writing even a simple pattern for a real-world page is hard. In this post we parse the HTML with Jsoup, a powerful HTML parsing tool, and you will find that everything becomes easy.
Preparation
Download: jsoup-1.6.1.jar
A first look at the result
Target site: China Weather
Goal: get today's weather
Target HTML code:
[HTML listing collapsed on the source page; not reproduced here]
Parsing with Java:
(1) Inspecting the page's elements shows that the content we want is in the target HTML code above, and that the whole block sits inside the <li> with class="dn on" data-dn="7d1".
(2) The word "Today" is in
(3) The words "8th" in
(4) The text "Thundershowers" is in the element with class="wea"
(5) "33" is in the first <span>
(6) "25" is in the second <span>
(7) The text "Breeze" is in the third <i>
With this analysis, getting the weather data is easy. The Java code is as follows:
[Java listing collapsed on the source page; not reproduced here]
Printed result:
Today
8th
Thundershowers
33°C
25°C
Breeze
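Steps (1) and (4) to (7) above might translate into jsoup calls roughly as follows. This is only a sketch: the class name WeatherSketch, the today helper, and the inline SAMPLE fragment are hypothetical stand-ins for the real page, whose markup is not shown here, and the elements for steps (2) and (3) are left out. It assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WeatherSketch {
    // Hypothetical fragment shaped like the description in steps (1), (4)-(7);
    // the markup of the real page may differ.
    static final String SAMPLE =
          "<ul><li class='dn on' data-dn='7d1'>"
        + "<p class='wea'>Thundershowers</p>"
        + "<p><span>33</span><i>\u00B0C</i></p>"
        + "<p><span>25</span><i>\u00B0C</i></p>"
        + "<p><i>Breeze</i></p>"
        + "</li></ul>";

    // Returns { weather, high, low, wind } for the "today" list item.
    static String[] today(String html) {
        Document doc = Jsoup.parse(html);
        // Step (1): narrow the search to the <li> that holds today's forecast.
        // first() returns null when nothing matches; a real crawler should check for that.
        Element li = doc.select("li[data-dn=7d1]").first();
        String weather = li.select(".wea").text();   // step (4)
        Elements spans = li.select("span");
        String high = spans.get(0).text();           // step (5): first <span>
        String low = spans.get(1).text();            // step (6): second <span>
        String wind = li.select("i").get(2).text();  // step (7): third <i>
        return new String[] { weather, high, low, wind };
    }

    public static void main(String[] args) {
        for (String value : today(SAMPLE)) {
            System.out.println(value);
        }
    }
}
```

To run it against the live site instead, the document could be loaded with Jsoup.connect(url).get() in place of Jsoup.parse(html).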
Details
References:
- Jsoup's official documentation in Chinese: http://www.open-open.com/
- The API reference: http://jsoup.org/apidocs/
Line 13 of the Java code:
As the documentation shows, there are three ways to get a data source:
(1) From a string of HTML:
    Document doc = Jsoup.parse(html);
(2) From a URL:
    Document doc = Jsoup.connect("http://example.com/").get();
(3) From an HTML file:
    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Here we use the second way and load the page from a URL.
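The three ways of loading a document can be tried side by side in a small program such as the one below. It is a minimal sketch assuming the jsoup jar is on the classpath; the class name LoadDemo and its helper methods are made up for illustration. The URL variant needs network access, so main only exercises the string and file variants.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LoadDemo {
    // (1) Parse HTML that is already in memory as a string.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // (2) Fetch and parse a page over HTTP (requires network access).
    static Document fromUrl(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    // (3) Parse a local file; the base URI is used to resolve relative links.
    static Document fromFile(File file, String baseUri) throws IOException {
        return Jsoup.parse(file, "UTF-8", baseUri);
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Demo</title></head>"
                    + "<body><p>Hello</p></body></html>";
        System.out.println(fromString(html).title());    // prints "Demo"

        // Write the same HTML to a temporary file and parse it from disk.
        File tmp = File.createTempFile("jsoup-demo", ".html");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), html.getBytes(StandardCharsets.UTF_8));
        System.out.println(fromFile(tmp, "http://example.com/").select("p").text()); // prints "Hello"
    }
}
```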
Lines 24, 26, 30, 34, and 38 of the Java code:
Document extends the Element class, and Element has a very handy method called select that can do almost anything. For quickly pulling the piece we want out of a pile of HTML, I find select the most convenient option. Let's look at how to search with it.
Note: the results in the table below were printed with the following statement:
for (Element e : elements) { System.out.println(e.text()); }
select in detail

| Description | Test HTML | select call | Result |
| --- | --- | --- | --- |
| Find by tag name | `<span>33</span><span>25</span>` | `Elements elements = doc.select("span");` Note: write the bare tag name; no angle brackets are needed. | 33, 25 |
| Find by id | `<span id="mySpan">36</span><span>20</span>` | `Elements elements = doc.select("#mySpan");` Note: as when targeting an element in CSS, prefix the id with #. | 36 |
| Find by class name | `<span class="myClass">36</span><span>20</span>` | `Elements elements = doc.select(".myClass");` Note: as when targeting an element in CSS, prefix the class name with a dot (.). | 36 |
| Find by attribute value | `<span class="class1" id="id1">36</span><span class="class2" id="id2">36</span>` | `Elements elements = doc.select("span[class=class1][id=id1]");` Note: the form is tag[attribute=value]. The tag name is optional, and each additional attribute gets its own []. | 36 |
| Find by attribute name prefix | `<span class="class1">36</span><span class="class2">22</span>` | `Elements elements = doc.select("span[^cl]");` Note: the form is tag[^prefix]. The tag name is optional, and each additional attribute gets its own []. | 36, 22 |
| Find by attribute value + regular expression | `<span class="abc">36</span><span class="ade">22</span>` | `Elements elements = doc.select("span[class~=^ab]");` Note: the form is tag[attribute~=regex]. The expression here matches elements whose class value starts with "ab". The tag name is optional, and each additional attribute gets its own []. | 36 |
| Find by text content | `<span>36</span><span>22</span>` | `Elements elements = doc.select("span:contains(3)");` Note: the form is tag:contains(text). | 36 |
| Find by text content + regular expression | `<span>36</span><span>22</span>` | `Elements elements = doc.select("span:matchesOwn(^3)");` Note: the form is tag:matchesOwn(regex). The expression here matches elements whose own text starts with 3. | 36 |
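The selectors in the table can be checked against the same snippets with a small harness like the one below. It is a sketch assuming the jsoup jar is on the classpath; the class name SelectDemo and the texts helper are made up for illustration, and the HTML attributes use single quotes only to avoid escaping inside Java strings.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectDemo {
    // Parse an HTML snippet, apply a selector, and collect the text of every match.
    static List<String> texts(String html, String selector) {
        List<String> out = new ArrayList<String>();
        for (Element e : Jsoup.parse(html).select(selector)) {
            out.add(e.text());
        }
        return out;
    }

    public static void main(String[] args) {
        // By tag name
        System.out.println(texts("<span>33</span><span>25</span>", "span"));                      // [33, 25]
        // By id
        System.out.println(texts("<span id='mySpan'>36</span><span>20</span>", "#mySpan"));       // [36]
        // By class name
        System.out.println(texts("<span class='myClass'>36</span><span>20</span>", ".myClass"));  // [36]
        // By attribute values (two attributes, two pairs of brackets)
        System.out.println(texts(
            "<span class='class1' id='id1'>36</span><span class='class2' id='id2'>36</span>",
            "span[class=class1][id=id1]"));                                                       // [36]
        // By attribute name prefix
        System.out.println(texts(
            "<span class='class1'>36</span><span class='class2'>22</span>", "span[^cl]"));        // [36, 22]
        // By attribute value matching a regular expression
        System.out.println(texts(
            "<span class='abc'>36</span><span class='ade'>22</span>", "span[class~=^ab]"));       // [36]
        // By contained text
        System.out.println(texts("<span>36</span><span>22</span>", "span:contains(3)"));          // [36]
        // By own text matching a regular expression
        System.out.println(texts("<span>36</span><span>22</span>", "span:matchesOwn(^3)"));       // [36]
    }
}
```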
select supports a few other selector forms as well; only the most useful and commonly used syntax is listed here.
The select method returns an Elements object that contains all the matching nodes. You can iterate over the Elements, or pick out a specific node with get(index). A node's text value is available through its text() method.
For the other properties of a node, see the API documentation.
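As a small illustration of working with the returned Elements, the sketch below iterates over the matches, picks one by index, and reads a few of its properties. It assumes the jsoup jar is on the classpath; the class name ElementsDemo, the spans helper, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ElementsDemo {
    // Parse a snippet and return every <span> in it.
    static Elements spans(String html) {
        return Jsoup.parse(html).select("span");
    }

    public static void main(String[] args) {
        Elements found = spans("<span class='hi'>33</span><span class='lo'>25</span>");

        System.out.println(found.size());          // 2

        // Iterate over all matches.
        for (Element e : found) {
            System.out.println(e.text());          // 33, then 25
        }

        // Pick a specific node by index (0-based) and read its properties.
        Element second = found.get(1);
        System.out.println(second.text());         // 25
        System.out.println(second.attr("class"));  // lo
        System.out.println(second.tagName());      // span
    }
}
```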
Conclusion
Jsoup has other powerful features too; this post only covers how to extract specific content from a web page. I hope it helps those who are just getting started with Jsoup.