No movie VIP top-up, no trip to the cinema: the Python crawler does as it pleases

Source: Internet
Author: User

This post uses a multi-threaded Python crawler to collect Thunder (Xunlei) download addresses. You could also write a Thunder download step into the program, but that is not recommended because it takes up too much memory.

Some readers may not be very familiar with Python crawlers and would get little out of this post as it stands, so I will first introduce how a crawler works.

Countless web addresses (URLs) link to one another and are woven into a web. The crawler carefully selects some URLs as starting points, downloads and parses those pages, extracts the information it needs from each page, and inserts any newly discovered URLs into a queue as starting points for the next round of crawling. This loop continues until you have all the information you want.
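The loop described above can be sketched roughly as follows. This is a minimal illustration, not the original crawler: the `crawl` function, its `fetch` and `extract_links` parameters, and the in-memory `FAKE_SITE` are all assumptions made so the example runs without network access.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: visit the seeds, then the URLs found on each page."""
    queue = deque(seeds)   # URLs waiting to be crawled
    seen = set(seeds)      # URLs already queued, so the loop cannot revisit them
    pages = {}             # url -> downloaded content

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        content = fetch(url)
        if content is None:  # download failed or URL unknown; skip it
            continue
        pages[url] = content
        for link in extract_links(content):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# A tiny in-memory "site" standing in for real HTTP requests (hypothetical data).
FAKE_SITE = {
    "/home": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/home"],
    "/c": [],
}

pages = crawl(
    seeds=["/home"],
    fetch=FAKE_SITE.get,                # returns None for unknown URLs
    extract_links=lambda links: links,  # each fake "page" is already a list of links
)
print(list(pages))  # ['/home', '/a', '/b', '/c']
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other, and `max_pages` gives the loop a hard stop even on a very large site.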

The first step in implementing this Python crawler is to analyze the structure of the Movie Paradise site's home page.

Parse the home page and extract the category information

This function first downloads the HTML source of the page, then uses XPath to parse out the menu categories, and finally creates a directory for each category.
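As a rough sketch of this step, the example below parses a menu with an XPath expression and creates one directory per category. The markup in `SAMPLE_HOME`, the `id="menu"` selector, the `example.com` URLs, and both function names are hypothetical. It also uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed input; a real page would probably call for a tolerant HTML parser such as lxml.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified home page; the real site's markup will differ.
SAMPLE_HOME = """
<html><body>
  <div id="menu">
    <ul>
      <li><a href="/category/action/index.html">Action</a></li>
      <li><a href="/category/comedy/index.html">Comedy</a></li>
    </ul>
  </div>
</body></html>
"""


def parse_categories(html_text, base_url):
    """Return (name, absolute_url) pairs for each link inside the menu."""
    root = ET.fromstring(html_text.strip())
    categories = []
    for a in root.findall(".//div[@id='menu']//a"):
        name = (a.text or "").strip()
        href = a.get("href")
        if name and href:
            categories.append((name, urljoin(base_url, href)))
    return categories


def make_category_dirs(categories, root_dir):
    """Create one directory per category under root_dir and return the paths."""
    paths = []
    for name, _url in categories:
        path = os.path.join(root_dir, name)
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths


categories = parse_categories(SAMPLE_HOME, "http://example.com/")
out_root = tempfile.mkdtemp()  # temporary directory so the example leaves no clutter
created = make_category_dirs(categories, out_root)
print(categories)
```

In a real crawler the HTML would come from an HTTP download rather than a string constant, and category names may need sanitizing before they are used as directory names.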

Parse the main page of each category

If you open the first page of each category, you can see that they all share the same structure. First parse out the nodes that contain the resource URLs, then extract the name and URL from each node.
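The two steps, selecting the nodes and then extracting the name and URL, might look like the sketch below. The `table class="item"` structure, the sample page, and the `parse_resources` name are assumptions for illustration; the actual node to target has to be read off the real page source.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical category page: each resource sits in its own table with class "item".
SAMPLE_CATEGORY_PAGE = """
<html><body>
  <table class="item"><tr><td><a href="/movie/1.html">Movie One</a></td></tr></table>
  <table class="item"><tr><td><a href="/movie/2.html">Movie Two</a></td></tr></table>
  <table class="footer"><tr><td><a href="/about.html">About</a></td></tr></table>
</body></html>
"""


def parse_resources(html_text, page_url):
    """Return (name, absolute_url) pairs for every resource node on the page."""
    root = ET.fromstring(html_text.strip())
    resources = []
    # Step 1: select only the nodes that contain a resource URL.
    for node in root.findall(".//table[@class='item']"):
        # Step 2: pull the name and URL out of each node.
        link = node.find(".//a")
        if link is None or not link.get("href"):
            continue
        name = "".join(link.itertext()).strip()
        resources.append((name, urljoin(page_url, link.get("href"))))
    return resources


resources = parse_resources(
    SAMPLE_CATEGORY_PAGE, "http://example.com/category/action/index.html"
)
print(resources)
```

Because every category page shares one structure, a single function like this can be reused for all of them; only the page URL changes.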

Save the resource addresses to files

The extracted information is saved into the corresponding folder. To make the crawler more efficient, it uses Python multithreading: one thread is started for the main page of each category, which greatly speeds up the crawl.
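A minimal sketch of the one-thread-per-category idea is shown below. `FAKE_RESULTS` stands in for the download-and-parse step, and the function names and the `resources.txt` file name are assumptions, not the original code.

```python
import os
import tempfile
import threading

# Hypothetical stand-in for "download and parse one category page".
FAKE_RESULTS = {
    "Action": [("Movie One", "http://example.com/movie/1.html")],
    "Comedy": [
        ("Movie Two", "http://example.com/movie/2.html"),
        ("Movie Three", "http://example.com/movie/3.html"),
    ],
}


def crawl_category(name, root_dir):
    """Worker: collect one category's resources and save them to a text file."""
    resources = FAKE_RESULTS[name]  # a real crawler would fetch and parse here
    folder = os.path.join(root_dir, name)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "resources.txt"), "w", encoding="utf-8") as f:
        for title, url in resources:
            f.write(f"{title}\t{url}\n")


def crawl_all(category_names, root_dir):
    """Start one thread per category and wait for all of them to finish."""
    threads = [
        threading.Thread(target=crawl_category, args=(name, root_dir))
        for name in category_names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


output_dir = tempfile.mkdtemp()
crawl_all(FAKE_RESULTS, output_dir)
print(sorted(os.listdir(output_dir)))  # ['Action', 'Comedy']
```

Threads help here because a crawler spends most of its time waiting on the network, and each worker writes only to its own file, so no lock is needed. With a large number of categories, a bounded pool such as `concurrent.futures.ThreadPoolExecutor` may be a safer choice than one unbounded thread per category.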

Crawl results

Folders by category

The address text files and the corresponding movie names

The addresses obtained on opening a text file

Complete Python code

That said, the basic rule of a crawler is that it can only crawl what is visible; anything that has not been published cannot be seen. VIP movies can only be crawled with a paid VIP account, and there is no way around that. All you can do is borrow an account, crawl everything in one go, and save it.

