Python crawler: downloading the Douban Top250 movie pictures with multithreading

Source: Internet
Author: User

Crawler Project Introduction

This crawler project downloads the pictures of the Douban Top250 movies. Its URL is: https://movie.douban.com/top250.

The project will be implemented both without and with multithreading; comparing the two versions shows the huge advantage that multithreading brings to a crawler. The multithreading in this article uses the concurrent.futures module, which is the most widely used concurrency library in Python and makes it easy to parallelize tasks. The concurrent.futures module provides two kinds of executors:

    • Multithreading: ThreadPoolExecutor, suited to I/O-bound tasks;
    • Multiprocessing: ProcessPoolExecutor, suited to compute-bound tasks.

Details about the module can be found in its official documentation: https://docs.python.org/3/library/concurrent.futures.html.
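As a quick illustration, here is a minimal sketch of how ThreadPoolExecutor is typically used. The fetch function and the URL list below are placeholders invented for this example, not part of the crawler:

import time
from concurrent.futures import ThreadPoolExecutor

# Placeholder standing in for real I/O work (hypothetical example)
def fetch(url):
    time.sleep(0.1)  # pretend to wait on the network
    return len(url)

urls = ["https://example.com/page%d" % i for i in range(5)]

# map() runs fetch on each URL across the pool of worker threads
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(fetch, urls))
print(results)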
This crawler project will use the ThreadPoolExecutor class from the concurrent.futures module to download the Douban Top250 movie pictures with multiple threads. Below, the crawler is shown first without multithreading and then with it, to demonstrate how big an advantage multithreading gives a crawler.

Without multithreading

First, we download the Douban Top250 movie pictures without multithreading. The complete Python code is as follows:

import time
import requests
import urllib.request
from bs4 import BeautifulSoup

# This function downloads the movie pictures on one page
# Parameter: the URL of the web page
def download_picture(url):
    # Get the source code of the web page
    r = requests.get(url)
    # Use BeautifulSoup to parse the retrieved text as HTML
    soup = BeautifulSoup(r.text, "lxml")
    # Get the movie pictures on the page
    content = soup.find('div', class_='article')
    images = content.find_all('img')
    # Get the names and links of the movie pictures
    picture_name_list = [image['alt'] for image in images]
    picture_link_list = [image['src'] for image in images]
    # Use urllib.request.urlretrieve to download the pictures
    for picture_name, picture_link in zip(picture_name_list, picture_link_list):
        urllib.request.urlretrieve(picture_link, 'e://douban/%s.jpg' % picture_name)

def main():
    # All 10 pages
    start_urls = ["https://movie.douban.com/top250"]
    for i in range(1, 10):
        start_urls.append("https://movie.douban.com/top250?start=%d&filter=" % (25 * i))
    # Time the crawler
    t1 = time.time()
    print('*' * 50)
    for url in start_urls:
        download_picture(url)
    t2 = time.time()
    print('Without multithreading, total time: %s' % (t2 - t1))
    print('*' * 50)

main()

The output results are as follows:

**************************************************
Without multithreading, total time: 79.93260931968689
**************************************************

Open the douban folder on the E: drive to confirm that the pictures were downloaded.

We can see that, without multithreading, the crawler took about 80 s in total to download the Douban Top250 movie pictures.
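One practical caveat, an assumption on my part rather than something the original article states: both scripts write to e://douban/ and will fail if that folder does not exist, so it is worth creating it up front:

import os

# Hedged safeguard: create the save folder used by the scripts if it is missing
os.makedirs('e://douban', exist_ok=True)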

With multithreading

Next, we download the Douban Top250 movie pictures with multithreading. The complete Python code is as follows:

import time
import requests
import urllib.request
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

# This function downloads the movie pictures on one page
# Parameter: the URL of the web page
def download_picture(url):
    # Get the source code of the web page
    r = requests.get(url)
    # Use BeautifulSoup to parse the retrieved text as HTML
    soup = BeautifulSoup(r.text, "lxml")
    # Get the movie pictures on the page
    content = soup.find('div', class_='article')
    images = content.find_all('img')
    # Get the names and links of the movie pictures
    picture_name_list = [image['alt'] for image in images]
    picture_link_list = [image['src'] for image in images]
    # Use urllib.request.urlretrieve to download the pictures
    for picture_name, picture_link in zip(picture_name_list, picture_link_list):
        urllib.request.urlretrieve(picture_link, 'e://douban/%s.jpg' % picture_name)

def main():
    # All 10 pages
    start_urls = ["https://movie.douban.com/top250"]
    for i in range(1, 10):
        start_urls.append("https://movie.douban.com/top250?start=%d&filter=" % (25 * i))
    # Time the crawler
    print('*' * 50)
    t3 = time.time()
    # Download the movie pictures concurrently
    executor = ThreadPoolExecutor(max_workers=10)  # adjust max_workers, i.e. the number of threads
    # submit() arguments: first a function, then the parameters passed to that function (more than one is allowed)
    future_tasks = [executor.submit(download_picture, url) for url in start_urls]
    # Wait for all threads to finish before continuing
    wait(future_tasks, return_when=ALL_COMPLETED)
    t4 = time.time()
    print('With multithreading, total time: %s' % (t4 - t3))
    print('*' * 50)

main()

The output results are as follows:

**************************************************
With multithreading, total time: 9.361606121063232
**************************************************

Then open the douban folder on the E: drive again; all 250 movie pictures were downloaded here as well.
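If you want to check the count without opening the folder, a small hedged snippet (assuming the same e://douban save path used in the scripts) will do it:

import os

# Count the downloaded pictures; we expect 250
jpg_files = [f for f in os.listdir('e://douban') if f.endswith('.jpg')]
print('Downloaded %d pictures' % len(jpg_files))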

Summary

Comparing the two crawlers, it is easy to see that downloading the same 10 pages of Douban Top250 movie pictures took about 80 s without multithreading and about 9.5 s with multithreading (10 threads), roughly an 8x speedup (79.93 s / 9.36 s ≈ 8.5). Efficiency gains like this are undoubtedly exciting in a crawler.
I hope that after reading this post you will try multithreading in your own crawlers; you may be pleasantly surprised. After all, even Scrapy, the famous Python crawler framework, relies on concurrency (asynchronous networking) to speed up crawling.
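As a starting pattern for your own crawlers, here is a sketch of an alternative to the submit() + wait() combination used above, reusing the download_picture function and start_urls list from the script (so it is not standalone). The context manager joins all worker threads on exit, and as_completed() surfaces any exception raised inside a thread:

from concurrent.futures import ThreadPoolExecutor, as_completed

# Sketch: same download step, written with a context manager instead of wait()
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(download_picture, url) for url in start_urls]
    for future in as_completed(futures):
        future.result()  # re-raises any exception from the worker thread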

Note: I now run two WeChat official accounts, one on Python and mathematics (ID: Python_math) and one on learning Python web scraping the easy way (ID: Easy_web_scrape). You are welcome to follow them!
