With Python multi-threaded crawling you could even write your own Thunder (Xunlei)-style download program, though I don't recommend it: the memory footprint is too large.
Some of you may not be familiar with Python crawlers yet, and I'd hate for you to leave this post empty-handed, so let me first explain how a crawler works.
Countless web addresses (URLs) woven together form a web, which is what we call the network. A crawler carefully selects some URLs as starting points, crawls and parses the pages at those URLs, extracts the information it needs, and pushes any newly discovered URLs into a queue as starting points for the next round of crawling. This loop continues until all the desired information has been collected.
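The loop described above can be sketched in a few lines. This is a minimal, generic illustration rather than the crawler from this post: `fetch_links` is a hypothetical stand-in for the download-and-parse step, and the tiny in-memory "site" below only exists to make the example self-contained.

```python
# A minimal sketch of the crawl loop: start from seed URLs, parse each
# page, push newly found URLs onto a queue, and repeat until done.
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: returns the set of URLs visited."""
    queue = deque(seeds)
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):   # parse the page, extract new URLs
            if link not in visited:
                queue.append(link)
    return visited

# A tiny in-memory "site" standing in for real downloaded pages.
site = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": []}
print(sorted(crawl(["/"], lambda u: site.get(u, []))))
# prints ['/', '/a', '/b', '/c']
```

A real crawler replaces `fetch_links` with code that downloads the page and extracts `<a href>` values; the queue-plus-visited-set structure stays the same.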
The first step in building this Python crawler is to analyze the structure of the Movie Paradise homepage.
Parse the homepage and extract category information
In this step, we first download the HTML source of the page, then use XPath to parse out the menu categories, and finally create a directory for each category.
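A hedged sketch of this step, assuming `requests` and `lxml` are available. The XPath expression and the `div.menu` markup below are assumptions for illustration; the real site's menu HTML must be inspected to get the exact path. The static snippet stands in for the downloaded homepage source.

```python
# Sketch: parse category names/links out of the homepage menu with XPath,
# then create one directory per category. The menu markup is assumed.
import os
from lxml import html  # pip install lxml

def parse_categories(page_source):
    """Return (name, url) pairs for each menu entry in the HTML."""
    tree = html.fromstring(page_source)
    names = tree.xpath('//div[@class="menu"]//a/text()')
    urls = tree.xpath('//div[@class="menu"]//a/@href')
    return list(zip(names, urls))

def make_category_dirs(categories, root="downloads"):
    """Create one folder per category under the root directory."""
    for name, _ in categories:
        os.makedirs(os.path.join(root, name), exist_ok=True)

# Static snippet standing in for the real downloaded homepage:
sample = '''<div class="menu">
  <a href="/html/gndy/">Latest Movies</a>
  <a href="/html/tv/">TV Series</a>
</div>'''
print(parse_categories(sample))
```

In the real crawler the `sample` string would come from something like `requests.get(url).text`.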
Parse the first page of each category
Open the first page of any category and you will see that they all share the same structure: first parse out the nodes that contain the resource URLs, then extract each resource's name and URL.
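This step might look like the sketch below. The `table.tbspan` selector and the base URL are assumptions modelled on how such list pages are commonly laid out, not verified details of the live site; the static snippet stands in for a downloaded category page.

```python
# Sketch: extract (movie name, detail-page URL) pairs from one category
# list page. The table/class names here are assumed, not verified.
from lxml import html  # pip install lxml

def parse_resource_list(page_source, base_url="https://www.dytt8.net"):
    """Return (name, absolute url) pairs for each listed resource."""
    tree = html.fromstring(page_source)
    entries = []
    for a in tree.xpath('//table[@class="tbspan"]//a[@href]'):
        name = a.text_content().strip()
        url = base_url + a.get("href")   # hrefs on the page are relative
        entries.append((name, url))
    return entries

# Static snippet standing in for a real category page:
sample_page = '''<table class="tbspan"><tr><td>
  <a href="/html/gndy/dyzz/1.html">Movie One</a>
</td></tr></table>
<table class="tbspan"><tr><td>
  <a href="/html/gndy/dyzz/2.html">Movie Two</a>
</td></tr></table>'''
print(parse_resource_list(sample_page))
```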
Parse resource addresses and save them to files
Save the extracted information into the category folders. To improve efficiency, the crawler uses Python multi-threading: a separate thread is opened for each category's pages, which greatly speeds up the crawl.
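The one-thread-per-category idea can be sketched like this. `crawl_category` is a hypothetical stand-in for the real per-category page crawl, and the file layout (one `name<TAB>url` line per resource) is illustrative, not taken from the original code.

```python
# Sketch: spawn one worker thread per category; each thread crawls its
# category and writes its results to that category's own text file.
import os
import threading

def save_category(name, entries, root="downloads"):
    """Write one 'movie name<TAB>download url' line per entry."""
    os.makedirs(root, exist_ok=True)
    path = os.path.join(root, name + ".txt")
    with open(path, "w", encoding="utf-8") as f:
        for title, url in entries:
            f.write(f"{title}\t{url}\n")

def crawl_all(categories, crawl_category, root="downloads"):
    """One thread per category; wait for all of them to finish."""
    threads = [
        threading.Thread(
            target=lambda c=cat: save_category(c, crawl_category(c), root)
        )
        for cat in categories
    ]
    for t in threads:
        t.start()
    for t in threads:   # join so the program exits only when all are done
        t.join()

# Example with a fake per-category crawler in place of real fetching:
fake = {"movies": [("Movie One", "ftp://example/1")], "tv": []}
crawl_all(fake, fake.get, root="demo_out")
```

Note that threads mainly help here because the work is I/O-bound (waiting on network responses); the GIL does not serialize that waiting.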
Crawl results
Folders organized by category
Text files with each movie name and its download address
The download addresses shown when a text file is opened
Full Python code
But I have to say: at its core, a crawler can only fetch what is already visible; content that hasn't been published cannot be crawled. VIP movies can only be crawled by an account that has paid for VIP, and there is no way around that; at best you can borrow an account, then crawl and save everything in one go.
In other words, never paying for a movie VIP and never going to the cinema: is that really what Python crawler writers are like?