This article covers the following topics:
- Objective
- Introduction to Jsoup
- Configuring Jsoup
- Using Jsoup
- Conclusion
What's the biggest worry for Android beginners when they want to do a project? Without doubt it is the lack of a data source. You can of course use a third-party API that provides data, but you can also use a web crawler to fetch the data yourself, so you no longer depend on third-party data. I originally intended to crawl some shopping sites, but their anti-crawling measures are well done, so in the end I could only crawl Qiushibaike (糗事百科, the "embarrassing stories encyclopedia"). Perhaps the clever among you will think of building a Qiushibaike clone as a practice project; doing that with Jsoup is no problem at all.
Learning Jsoup goes together with basic front-end knowledge, since what you crawl is front-end data. If you already know JavaScript, you can use the library almost without reading the documentation, because its API is designed to feel much like JS/jQuery. Enough talk, let's get started.
In the project's own words: Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
Project address: https://github.com/jhy/jsoup
Chinese documentation: http://www.open-open.com/jsoup/
Configuring Jsoup is simple: you only need to add the following dependency to your module's Gradle file.
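For example, in the module-level build.gradle (the version number here is only an example; use whatever is current):

```groovy
dependencies {
    // jsoup HTML parser; the version shown is an example, not a requirement
    implementation 'org.jsoup:jsoup:1.15.3'
}
```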
Since Jsoup needs to access the network, remember to add the network permission as well.
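That is, declare the standard INTERNET permission in AndroidManifest.xml:

```xml
<!-- AndroidManifest.xml, outside the <application> element -->
<uses-permission android:name="android.permission.INTERNET" />
```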
First, get the HTML
Jsoup supports two kinds of network requests, GET and POST, and the code is extremely simple. Let's first crawl the HTML of the Qiushibaike home page. Note: because this is a network operation, it must run on a background thread, otherwise recent Android versions will throw an error (NetworkOnMainThreadException).
① GET request
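A minimal sketch of the GET request, run on a background thread (the URL is an assumption for illustration; any page works):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

// ...inside your Activity or any class:
new Thread(new Runnable() {
    @Override
    public void run() {
        try {
            // get() downloads the page and parses it into a Document
            Document doc = Jsoup.connect("https://www.qiushibaike.com/").get();
            android.util.Log.d("Jsoup", doc.title());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}).start();
```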
② POST request
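A sketch of the POST version: the same idea, chained with the parameters described below (the URL, form field, and cookie here are made-up examples):

```java
// Also run on a background thread
try {
    Document doc = Jsoup.connect("https://example.com/login") // hypothetical URL
            .data("username", "test")  // POST key/value pair
            .userAgent("Mozilla/5.0")  // request header identifying the client
            .cookie("auth", "token")   // optional cookie
            .timeout(8000)             // timeout in milliseconds
            .post();                   // send the POST request
    android.util.Log.d("Jsoup", doc.title());
} catch (IOException e) {
    e.printStackTrace();
}
```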
Here is a description of the POST parameters:
- connect: sets the URL to connect to
- data: sets the POST key/value pairs
- userAgent: sets the user agent (a request header; the server can use it to tell whether you are on a PC or a mobile device)
- cookie: sets cookies for the request
- timeout: sets the request timeout
- post: sends the POST request
Now that we have the Document object for the HTML, it's time to parse the HTML elements.
Second, get the HTML element
① Web page side
Taking Qiushibaike as the example, let's see which HTML elements hold the home-page data; we can find the corresponding elements with F12 (the browser's developer tools).
You can see that an <a> tag holds the article content. We can use class="contentHerf" as the unique identifier of that <a> tag to get its link, and then crawl the detail page it points to; in other words, we reach an article's detail page by crawling the link of that <a> tag.
Of course, some detail pages also contain pictures. We can use class="thumb" as the unique identifier of the image block and crawl the link of the image inside it.
Because Qiushibaike loads its content page by page, we need to crawl the first page's content, then the second page's, and so on. The URL rule for its pages is very simple, so we can handle it with a loop.
Well, now that we've analyzed the web page, let's implement the steps above in code on the Android side.
② Android side
From the analysis above, the steps we need to implement are:
- Crawl the detail-page URLs from the list page
- Go to each detail page and crawl its content and pictures
- Loop to crawl the second page, the third page, ...
And the clever among you may already be thinking of the fourth and fifth steps...
- Encapsulate a bean object
- Populate a ListView with the content
- Crawl the date, author, comments, etc. to finish the project
1) Crawl the detail-page URLs from the list page
The detail-page URLs on the home page can be obtained through the <a> tags with class="contentHerf". We implement this with Jsoup's selectors, which rely on CSS selector knowledge; the Jsoup Chinese documentation covers this in detail. A short sketch follows the lists below.
Here is an introduction to the objects involved:
- Document: equivalent to the whole HTML document
- Elements: equivalent to a collection of tags
- Element: equivalent to a single tag
Note the difference between the toString() and text() methods of Elements and Element:
- toString(): prints the tag's HTML content
- text(): prints the tag's text content
CSS selectors:
- select(): gets the tags that match an arbitrary CSS selector (class, attribute, and so on)
- getElementById(): gets the tag that matches an ID selector
- getElementsByTag(): gets the tags that match a tag selector
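A minimal sketch of step 1, using the class="contentHerf" identifier found in the page analysis above (the home-page URL is an assumption):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// On a background thread
Document doc = Jsoup.connect("https://www.qiushibaike.com/").get();
// Class selector: every <a> tag whose class is "contentHerf"
Elements links = doc.select("a.contentHerf");
for (Element link : links) {
    String detailUrl = link.absUrl("href"); // absolute URL of the article's detail page
    android.util.Log.d("Jsoup", detailUrl);
}
```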
2) Go to the detail page and crawl the content and pictures
This code is quite simple, and there's not much to explain.
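A minimal sketch, assuming the article text sits in an element with class="content" (an assumption; verify it with F12) and the picture sits inside the class="thumb" block from the analysis above:

```java
// detailUrl comes from step 1; run on a background thread
Document detail = Jsoup.connect(detailUrl).get();
String content = detail.select("div.content").text();   // article text (class name assumed)
Elements imgs = detail.select("div.thumb img");          // picture, if the page has one
String imageUrl = imgs.isEmpty() ? null : imgs.first().absUrl("src");
```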
3) Loop to crawl the second page, the third page, ...
Here you only need to wrap the code above in a loop; a sketch of the flow is below.
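The paged URL pattern in this sketch is an assumption; substitute whatever your own look at the address bar shows:

```java
// BASE_URL and the "/page/N/" pattern are assumptions for illustration
String BASE_URL = "https://www.qiushibaike.com";
for (int page = 1; page <= 5; page++) {
    Document listDoc = Jsoup.connect(BASE_URL + "/page/" + page + "/").get();
    for (Element link : listDoc.select("a.contentHerf")) {
        String detailUrl = link.absUrl("href");
        // ...fetch and parse the detail page exactly as in step 2
    }
}
```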
4) Of course, after we crawl the content, the next step is without doubt to encapsulate it in an object and store it in an ArrayList; that solves the data-source problem.
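For example, a simple bean plus an ArrayList that later feeds the ListView adapter (the class and field names are my own choice):

```java
import java.util.ArrayList;
import java.util.List;

// A plain bean holding one crawled item; names are assumptions
public class Joke {
    private final String content;
    private final String imageUrl;

    public Joke(String content, String imageUrl) {
        this.content = content;
        this.imageUrl = imageUrl;
    }

    public String getContent() { return content; }
    public String getImageUrl() { return imageUrl; }
}

// While crawling, collect the items; this list is the adapter's data source
List<Joke> jokes = new ArrayList<>();
jokes.add(new Joke(content, imageUrl));
```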
5) Crawling the author, date, comments, and other information is left for you to practice; then imitate the original UI, and the project is done.
Third, crawl results
Although web crawlers solve a lot of data-source problems, many sites already use techniques to block crawlers, so treat this mainly as a way to learn Jsoup. Whether on the Android side or the web side, Jsoup is very useful, so it is well worth mastering. I've heard that Douban and Zhihu can both be crawled; students who want to do a project can go and give it a try.