Android in Practice: Building a Web Crawler with Jsoup, Starting a Qiushibaike ("Embarrassing Things Encyclopedia") Project

Source: Internet
Author: User

This article covers the following topics:

  • Objective
  • Introduction to Jsoup
  • Configuring Jsoup
  • Using Jsoup
  • Conclusion

What worries Android beginners most when they want to build a practice project? Without doubt, the lack of a data source. You can of course rely on a third-party API to supply data, but you can also fetch the data yourself with a web crawler, so no third-party data service is needed. I originally intended to crawl a shopping site, but its anti-crawling measures are effective and I could not get any data, so I crawled Qiushibaike (the "Embarrassing Things Encyclopedia" joke site) instead. You may already be thinking of building a high-fidelity Qiushibaike clone as your practice project; Jsoup is entirely up to that task.

Learning Jsoup goes hand in hand with basic front-end knowledge, because what you crawl is front-end data. If you have learned JavaScript, you can use the library almost without reading its documentation, since its design closely mirrors how the DOM is queried in JavaScript. Enough talk, let's get started.

In the project's own words: jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Project address: https://github.com/jhy/jsoup
Chinese documentation: http://www.open-open.com/jsoup/

Configuring Jsoup is simple: add the org.jsoup:jsoup dependency to the module's Gradle build file.

Because Jsoup fetches data over the network, remember to declare the android.permission.INTERNET permission in AndroidManifest.xml.

First, get the HTML

Jsoup supports two kinds of network request, GET and POST, and the code for both is extremely simple. We start by fetching the HTML of the Qiushibaike home page. Note: because this is a network operation, it must run on a background thread. Otherwise Android throws NetworkOnMainThreadException (the check applies to apps targeting Android 3.0 and later, so it certainly affects 4.4 and above).
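A minimal sketch of moving a blocking request off the calling thread, using only java.util.concurrent. The class name BackgroundFetch and the thread name are placeholders chosen for illustration, not part of the original project; on Android the result would still have to be handed back to the UI thread, which is not shown here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: runs a blocking task (such as a Jsoup request)
// on a single worker thread instead of the caller's thread.
final class BackgroundFetch {
    private static final ExecutorService WORKER =
            Executors.newSingleThreadExecutor(runnable -> {
                Thread thread = new Thread(runnable, "crawler-worker");
                thread.setDaemon(true); // do not keep the process alive for the worker
                return thread;
            });

    private BackgroundFetch() {}

    // Queues the task and returns immediately; the Future completes on the worker.
    static <T> Future<T> submit(Callable<T> task) {
        return WORKER.submit(task);
    }
}
```

A call such as BackgroundFetch.submit(() -> Jsoup.connect(url).get()) would then run the request on the worker. Calling get() on the returned Future blocks, so that should not be done on the main thread either.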

① The GET method
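A GET request with Jsoup might look roughly like the sketch below. The class name GetExample and the 5000 ms timeout are assumptions for illustration; the URL is supplied by the caller because the exact address used in the original post is not preserved.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class GetExample {
    private GetExample() {}

    // Builds the request without sending it; the timeout is in milliseconds.
    static Connection prepare(String url) {
        return Jsoup.connect(url).timeout(5000);
    }

    // Sends the GET request and parses the response body into a Document.
    // This is a blocking network call: on Android, invoke it from a background thread.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }
}
```

Splitting prepare from fetch is only a convenience here: it lets the request settings be inspected without touching the network.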

② The POST method

The methods used in a POST request are described below:

    • connect: sets the URL to connect to
    • data: sets a key-value pair of the POST body
    • userAgent: sets the User-Agent request header, which the server can use to tell whether the client is a PC or a mobile device
    • cookie: sets a cookie to send with the request
    • timeout: sets the request timeout
    • post: sends the POST request
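The methods above can be chained as in the following sketch. Every argument value (the form field, the User-Agent string, the cookie, the timeout) is a placeholder, and the class name PostExample is hypothetical; none of them come from the original post.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class PostExample {
    private PostExample() {}

    // Configures the request; nothing is sent until post() is called.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .data("query", "Java")     // form field sent in the POST body
                .userAgent("Mozilla/5.0")  // User-Agent request header
                .cookie("auth", "token")   // cookie sent with the request
                .timeout(3000);            // timeout in milliseconds
    }

    // Sends the POST request and parses the response into a Document.
    // Blocking network call: run it on a background thread on Android.
    static Document send(String url) throws IOException {
        return prepare(url).post();
    }
}
```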

Now that you have the Document object for the HTML, it is time to parse its elements.

Second, get the HTML element

① On the web page

Taking Qiushibaike as the example, we first look at which HTML elements hold the data on its home page. Pressing F12 in the browser opens the developer tools, where the corresponding elements can be found.

You can see that an a tag wraps the content of each article. Its class="contentHerf" attribute identifies it uniquely, so we can select that a tag, read its link, and follow the link to the article's detail page, where we continue crawling.

Some detail pages also contain pictures. We can use class="thumb" as the unique identifier to locate the image and crawl its link.

Because Qiushibaike loads its content page by page, after crawling the first page we need to crawl the second page, and so on. The page URLs follow a very simple rule, so a loop is all we need to walk through them.

Now that the web page has been analyzed, we implement the steps above in code on the Android side.

② On the Android side

From the analysis above, the steps we need to implement are:

    1. Crawl the detail-page URLs from the home page
    2. Open each detail page and crawl its content and pictures
    3. Loop to crawl the second page, the third page, and so on

You may already have thought of further steps:

    1. Encapsulate the data in bean objects
    2. Populate a ListView with the content
    3. Crawl dates, authors, comments, and so on to complete the project

1) Crawl the detail-page URLs from the home page

The detail-page URLs on the home page can be reached through the a tags with class="contentHerf". We select them with a Jsoup attribute selector, which draws on CSS knowledge; the Jsoup Chinese documentation also covers selectors in detail.
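A sketch of that selection step, assuming the page has already been fetched into a Document. The class name DetailLinks is hypothetical, and the selector relies on the class="contentHerf" markup described above.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

final class DetailLinks {
    private DetailLinks() {}

    // Returns the absolute detail-page URLs of all <a class="contentHerf"> tags.
    static List<String> extract(Document page) {
        // Attribute selector; the class selector "a.contentHerf" would also work.
        Elements links = page.select("a[class=contentHerf]");
        List<String> urls = new ArrayList<>();
        for (Element link : links) {
            // absUrl resolves a relative href against the document's base URI.
            urls.add(link.absUrl("href"));
        }
        return urls;
    }
}
```

Documents returned by Jsoup.connect(...).get() carry their base URI automatically; when parsing a string, pass the base URI as the second argument of Jsoup.parse so that relative links can be resolved.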

The objects used here are:

    • Document: represents an entire HTML document
    • Elements: represents a collection of tags
    • Element: represents a single tag

Note the difference between the toString() and text() methods of Elements and Element:

    • toString(): returns the HTML of the tag, markup included
    • text(): returns only the text content of the tag
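The difference between the two methods can be seen on a small hard-coded snippet. The class name TextVsHtml and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class TextVsHtml {
    private TextVsHtml() {}

    // Parses a small in-memory snippet and returns its <div class="content">.
    static Element sample() {
        Document doc = Jsoup.parse("<div class=\"content\">Hello <b>Jsoup</b></div>");
        return doc.select("div.content").first();
    }

    // toString() returns the element's outer HTML, tags included.
    static String asHtml(Element element) {
        return element.toString();
    }

    // text() returns only the combined text of the element and its children.
    static String asText(Element element) {
        return element.text();
    }
}
```

Here asText(sample()) yields "Hello Jsoup", while asHtml(sample()) still contains the div and b tags.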

CSS Selector

    • select(): gets the elements that match a CSS selector, such as an attribute selector
    • getElementById(): gets the element with the given id
    • getElementsByTag(): gets the elements with the given tag name
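The three lookups side by side, again on made-up markup; the class name Lookups and the HTML constant are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

final class Lookups {
    // Hypothetical sample markup, not taken from the real site.
    static final String HTML =
            "<div id=\"main\">"
            + "<a class=\"contentHerf\" href=\"/a/1\">one</a>"
            + "<p>two</p><p>three</p>"
            + "</div>";

    private Lookups() {}

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // CSS (attribute) selector: every <a> whose class attribute is contentHerf.
    static Elements byCss(Document doc) {
        return doc.select("a[class=contentHerf]");
    }

    // Id lookup: returns a single Element, or null if no element has that id.
    static Element byId(Document doc) {
        return doc.getElementById("main");
    }

    // Tag lookup: every <p> element in the document.
    static Elements byTag(Document doc) {
        return doc.getElementsByTag("p");
    }
}
```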

2) Go to the detail page and crawl its content and pictures

This code is quite simple, and there's not much to explain.
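A sketch of this step under stated assumptions: the article text is taken to live in an element with class "content", and pictures in img tags inside an element with class "thumb". The article only names the thumb class, so both selectors and the class name DetailPage should be checked against the real markup before use.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class DetailPage {
    private DetailPage() {}

    // Text of the first element with class "content" (assumed container),
    // or an empty string when the page has no such element.
    static String content(Document detail) {
        Element body = detail.select(".content").first();
        return body == null ? "" : body.text();
    }

    // Absolute URLs of all images found inside elements with class "thumb".
    static List<String> images(Document detail) {
        List<String> urls = new ArrayList<>();
        for (Element img : detail.select(".thumb img[src]")) {
            urls.add(img.absUrl("src"));
        }
        return urls;
    }
}
```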

3) Loop to crawl the second page, third page ...

Here you only need to wrap the previous steps in an outer loop over the page numbers.
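A minimal sketch of that outer loop, limited to generating the page URLs. The base address and the /page/N/ pattern are hypothetical stand-ins, since the site's actual URL rule is not preserved in this text; each generated URL would then go through the fetch and parse steps described earlier.

```java
import java.util.ArrayList;
import java.util.List;

final class PageUrls {
    // Hypothetical paging pattern; substitute the site's real rule.
    static final String BASE = "https://example.com/page/";

    private PageUrls() {}

    // Builds the URLs of pages 1..count in order.
    static List<String> firstPages(int count) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= count; page++) {
            urls.add(BASE + page + "/");
        }
        return urls;
    }
}
```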

4) After crawling the content, the natural next step is to encapsulate it in objects and store them in an ArrayList, which solves your data-source problem.
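One possible shape for such a bean is sketched below. The class name ArticleBean and its two fields are assumptions based on the content and pictures crawled so far; author, date, and comments could be added in the same way.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical bean holding what was crawled from one detail page.
final class ArticleBean {
    private final String content;
    private final List<String> imageUrls;

    ArticleBean(String content, List<String> imageUrls) {
        this.content = content;
        // Defensive copy so later changes to the caller's list do not leak in.
        this.imageUrls = Collections.unmodifiableList(new ArrayList<>(imageUrls));
    }

    String getContent() {
        return content;
    }

    List<String> getImageUrls() {
        return imageUrls;
    }
}
```

The crawled items can then be collected in a List<ArticleBean> backed by an ArrayList and handed to the adapter of a ListView.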

5) Crawling the author, date, comments, and other information is left for you to practice. After that, imitate the interface and the project is done.

Third, crawl results

Although web crawlers solve many data-source problems, many sites use various techniques to block crawlers, so our focus here remains on learning Jsoup itself. Jsoup is useful both on Android and on the web side, so it is well worth mastering. I have heard that Douban and Zhihu can be crawled as well, so anyone looking for a project may want to give them a try.
