This article covers the following topics:
- Objective
- Introduction to Jsoup
- Configuring Jsoup
- Using Jsoup
- Conclusion
What's the biggest worry for Android beginners when they want to do a project? Without doubt it is the lack of a data source. You can of course use a third-party API that provides data, but you can also use a web crawler to fetch the data yourself, so you no longer depend on third-party data. I originally intended to crawl some shopping sites, but their anti-crawling measures are well done, so in the end I could only crawl Qiushibaike (糗事百科, the "embarrassing stories encyclopedia"). Perhaps the clever among you will think of building a Qiushibaike clone as a practice project; doing that with Jsoup is no problem at all.
Learning Jsoup goes together with basic front-end knowledge, since what you crawl is front-end data. If you already know JavaScript, you can use the library almost without reading the documentation, because its API is designed to feel much like JS/jQuery. Enough talk, let's get started.
In the project's own words: Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
Project address: https://github.com/jhy/jsoup
Chinese documentation: http://www.open-open.com/jsoup/
Configuring Jsoup is simple: you only need to add the following dependency to your module's Gradle file.
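For example, in the module-level build.gradle (the version number here is only an example; use whatever is current):

```groovy
dependencies {
    // jsoup HTML parser; the version shown is an example, not a requirement
    implementation 'org.jsoup:jsoup:1.15.3'
}
```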
Since Jsoup needs to access the network, remember to add the network permission as well.
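That is, declare the standard INTERNET permission in AndroidManifest.xml:

```xml
<!-- AndroidManifest.xml, outside the <application> element -->
<uses-permission android:name="android.permission.INTERNET" />
```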
First, get the HTML
Jsoup supports two kinds of network requests, GET and POST, and the code is extremely simple. Let's first crawl the HTML of the Qiushibaike home page. Note: because this is a network operation, it must run on a background thread, otherwise recent Android versions will throw an error (NetworkOnMainThreadException).
① GET request
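A minimal sketch of the GET request, run on a background thread (the URL is an assumption for illustration; any page works):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

// ...inside your Activity or any class:
new Thread(new Runnable() {
    @Override
    public void run() {
        try {
            // get() downloads the page and parses it into a Document
            Document doc = Jsoup.connect("https://www.qiushibaike.com/").get();
            android.util.Log.d("Jsoup", doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}).start();
```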
② POST request
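A sketch of the POST version: the same idea, chained with the parameters described below (the URL, form field, and cookie here are made-up examples):

```java
// Also run on a background thread
try {
    Document doc = Jsoup.connect("https://example.com/login") // hypothetical URL
            .data("username", "test")  // POST key/value pair
            .userAgent("Mozilla/5.0")  // request header identifying the client
            .cookie("auth", "token")   // optional cookie
            .timeout(8000)             // timeout in milliseconds
            .post();                   // send the POST request
    android.util.Log.d("Jsoup", doc.title());
} catch (IOException e) {
    e.printStackTrace();
}
```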
Here is a description of the POST parameters:
- connect: sets the URL to connect to
- data: sets the POST key/value pairs
- userAgent: sets the user agent (a request header; the server can use it to tell whether you are on a PC or a mobile device)
- cookie: sets cookies for the request
- timeout: sets the request timeout
- post: sends the POST request
Now that we have the Document object for the HTML, it's time to parse the HTML elements.
Second, get the HTML element
① Web page side
Taking Qiushibaike as the example, let's see which HTML elements hold the home-page data; we can find the corresponding elements with F12 (the browser's developer tools).
You can see that an <a> tag holds the article content. We can use class="contentHerf" as the unique identifier of that <a> tag to get its link, and then crawl the detail page it points to; in other words, we reach an article's detail page by crawling the link of that <a> tag.
Of course, some detail pages also contain pictures. We can use class="thumb" as the unique identifier of the image block and crawl the link of the image inside it.
Because Qiushibaike loads its content page by page, we need to crawl the first page's content, then the second page's, and so on. The URL rule for its pages is very simple, so we can handle it with a loop.
Well, now that we've analyzed the web page, let's implement the steps above in code on the Android side.
② Android side
From the analysis above, the steps we need to implement are:
- Crawl the detail-page URLs from the list page
- Go to each detail page and crawl its content and pictures
- Loop to crawl the second page, the third page, ...
And the clever among you may already be thinking of the fourth and fifth steps...
- Encapsulate a bean object
- Populate a ListView with the content
- Crawl the date, author, comments, etc. to finish the project
1) Crawl the detail-page URLs from the list page
The detail-page URLs on the home page can be obtained through the <a> tags with class="contentHerf". We implement this with Jsoup's selectors, which rely on CSS selector knowledge; the Jsoup Chinese documentation covers this in detail. A short sketch follows the lists below.
Here is an introduction to the objects involved:
- Document: equivalent to the whole HTML document
- Elements: equivalent to a collection of tags
- Element: equivalent to a single tag
Note the difference between the toString() and text() methods of Elements and Element:
- toString(): prints the tag's HTML content
- text(): prints the tag's text content
CSS selectors:
- select(): gets the tags that match an arbitrary CSS selector (class, attribute, and so on)
- getElementById(): gets the tag that matches an ID selector
- getElementsByTag(): gets the tags that match a tag selector
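A minimal sketch of step 1, using the class="contentHerf" identifier found in the page analysis above (the home-page URL is an assumption):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// On a background thread
Document doc = Jsoup.connect("https://www.qiushibaike.com/").get();
// Class selector: every <a> tag whose class is "contentHerf"
Elements links = doc.select("a.contentHerf");
for (Element link : links) {
    String detailUrl = link.absUrl("href"); // absolute URL of the article's detail page
    android.util.Log.d("Jsoup", detailUrl);
}
```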
2) Go to the detail page and crawl the content and pictures
This code is quite simple, and there's not much to explain.
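A minimal sketch, assuming the article text sits in an element with class="content" (an assumption; verify it with F12) and the picture sits inside the class="thumb" block from the analysis above:

```java
// detailUrl comes from step 1; run on a background thread
Document detail = Jsoup.connect(detailUrl).get();
String content = detail.select("div.content").text();   // article text (class name assumed)
Elements imgs = detail.select("div.thumb img");          // picture, if the page has one
String imageUrl = imgs.isEmpty() ? null : imgs.first().absUrl("src");
```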
3) Loop to crawl the second page, the third page, ...
Here you only need to wrap the code above in a loop; a sketch of the flow is below.
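The paged URL pattern in this sketch is an assumption; substitute whatever your own look at the address bar shows:

```java
// BASE_URL and the "/page/N/" pattern are assumptions for illustration
String BASE_URL = "https://www.qiushibaike.com";
for (int page = 1; page <= 5; page++) {
    Document listDoc = Jsoup.connect(BASE_URL + "/page/" + page + "/").get();
    for (Element link : listDoc.select("a.contentHerf")) {
        String detailUrl = link.absUrl("href");
        // ...fetch and parse the detail page exactly as in step 2
    }
}
```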
4) Of course, after we crawl the content, the next step is without doubt to encapsulate it in an object and store it in an ArrayList; that solves the data-source problem.
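For example, a simple bean plus an ArrayList that later feeds the ListView adapter (the class and field names are my own choice):

```java
import java.util.ArrayList;
import java.util.List;

// A plain bean holding one crawled item; names are assumptions
public class Joke {
    private final String content;
    private final String imageUrl;

    public Joke(String content, String imageUrl) {
        this.content = content;
        this.imageUrl = imageUrl;
    }

    public String getContent() { return content; }
    public String getImageUrl() { return imageUrl; }
}

// While crawling, collect the items; this list is the adapter's data source
List<Joke> jokes = new ArrayList<>();
jokes.add(new Joke(content, imageUrl));
```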
5) Crawling the author, date, comments, and other information is left for you to practice; then imitate the original UI, and the project is done.
Third, crawl results
Although web crawlers solve a lot of data-source problems, many sites already use techniques to block crawlers, so treat this mainly as a way to learn Jsoup. Whether on the Android side or the web side, Jsoup is very useful, so it is well worth mastering. I've heard that Douban and Zhihu can both be crawled; students who want to do a project can go and give it a try.