Python crawler: multi-threaded download of Douban Top250 movie pictures


Crawler Project Introduction

This crawler project downloads the pictures of the Douban Top250 movies from https://movie.douban.com/top250.

The crawler will be implemented both without and with multithreading, and comparing the two versions will show the huge advantage that multithreading brings to a crawler. The multithreading in this article uses the concurrent.futures module, the most widely used concurrency library in Python, which makes it easy to parallelize tasks. The concurrent.futures module provides two kinds of executors:

    • Multithreading: ThreadPoolExecutor, suitable for IO-intensive tasks;
    • Multiprocessing: ProcessPoolExecutor, suitable for compute-intensive tasks.

Detailed information about the module can be found in its official documentation: https://docs.python.org/3/library/concurrent.futures.html.
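To get a feel for the shared Executor interface before diving into the crawler, here is a minimal, self-contained sketch (the square task and the worker count are illustrative only; they are not part of the crawler project):

import concurrent.futures

def square(n):
    return n * n

# ThreadPoolExecutor and ProcessPoolExecutor expose the same interface,
# so they can be swapped; map() distributes the calls across the pool
# and returns the results in input order.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    print(list(executor.map(square, range(5))))  # [0, 1, 4, 9, 16]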
This crawler project uses the ThreadPoolExecutor class from the concurrent.futures module to download the Douban Top250 movie pictures with multiple threads. Below, the project is first implemented without multithreading and then with it, to demonstrate the huge advantage of multithreading for a crawler.

Without multithreading

First, we download the Douban Top250 movie pictures without multithreading. The complete Python code is as follows:

import time
import requests
import urllib.request
from bs4 import BeautifulSoup

# This function downloads the movie pictures on one page
# Argument: the URL of the web page
def download_picture(url):
    # Get the source code of the web page
    r = requests.get(url)
    # Use BeautifulSoup to parse the retrieved text as HTML
    soup = BeautifulSoup(r.text, "lxml")
    # Find the movie pictures on the page
    content = soup.find('div', class_='article')
    images = content.find_all('img')
    # Collect the names and links of the movie pictures
    picture_name_list = [image['alt'] for image in images]
    picture_link_list = [image['src'] for image in images]
    # Use urllib.request.urlretrieve to download the pictures
    for picture_name, picture_link in zip(picture_name_list, picture_link_list):
        urllib.request.urlretrieve(picture_link, 'e://douban/%s.jpg' % picture_name)

def main():
    # All 10 pages
    start_urls = ["https://movie.douban.com/top250"]
    for i in range(1, 10):
        start_urls.append("https://movie.douban.com/top250?start=%d&filter=" % (25 * i))
    # Time the crawler
    t1 = time.time()
    print('*' * 50)
    for url in start_urls:
        download_picture(url)
    t2 = time.time()
    print('Without multi-threading, total time: %s' % (t2 - t1))
    print('*' * 50)

main()
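As a quick sanity check on the pagination logic above (a standalone sketch, not part of the crawler itself): Douban's Top250 list shows 25 movies per page, so main() advances the start parameter in steps of 25. The snippet below just prints the 10 start URLs that main() builds:

start_urls = ["https://movie.douban.com/top250"]
for i in range(1, 10):
    start_urls.append("https://movie.douban.com/top250?start=%d&filter=" % (25 * i))

# Prints the first page, then ?start=25, ?start=50, ..., ?start=225
for url in start_urls:
    print(url)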

The output results are as follows:

**************************************************
Without multi-threading, total time: 79.93260931968689
**************************************************

The downloaded pictures can be found in the douban folder on the E drive.

We can see that, without multithreading, the crawler takes about 80 seconds in total to download the Douban Top250 movie pictures.

Using multithreading

Next, we use multithreading to download the Douban Top250 movie pictures. The complete Python code is as follows:

import time
import requests
import urllib.request
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

# This function downloads the movie pictures on one page
# Argument: the URL of the web page
def download_picture(url):
    # Get the source code of the web page
    r = requests.get(url)
    # Use BeautifulSoup to parse the retrieved text as HTML
    soup = BeautifulSoup(r.text, "lxml")
    # Find the movie pictures on the page
    content = soup.find('div', class_='article')
    images = content.find_all('img')
    # Collect the names and links of the movie pictures
    picture_name_list = [image['alt'] for image in images]
    picture_link_list = [image['src'] for image in images]
    # Use urllib.request.urlretrieve to download the pictures
    for picture_name, picture_link in zip(picture_name_list, picture_link_list):
        urllib.request.urlretrieve(picture_link, 'e://douban/%s.jpg' % picture_name)

def main():
    # All 10 pages
    start_urls = ["https://movie.douban.com/top250"]
    for i in range(1, 10):
        start_urls.append("https://movie.douban.com/top250?start=%d&filter=" % (25 * i))
    # Time the crawler
    print('*' * 50)
    t3 = time.time()
    # Download the movie pictures concurrently
    executor = ThreadPoolExecutor(max_workers=10)  # max_workers is the number of threads; adjust as needed
    # submit() takes the function first, then the arguments to pass to it
    future_tasks = [executor.submit(download_picture, url) for url in start_urls]
    # Wait for all threads to finish before continuing
    wait(future_tasks, return_when=ALL_COMPLETED)
    t4 = time.time()
    print('With multi-threading, total time: %s' % (t4 - t3))
    print('*' * 50)

main()
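The thread pool can also be used as a context manager together with as_completed(), which yields each future as soon as it finishes. The sketch below is one possible variation (it reuses the download_picture() function defined above) that prints progress as each page is done:

from concurrent.futures import ThreadPoolExecutor, as_completed

def main():
    start_urls = ["https://movie.douban.com/top250"]
    for i in range(1, 10):
        start_urls.append("https://movie.douban.com/top250?start=%d&filter=" % (25 * i))
    # The with-block shuts the pool down automatically once all tasks finish
    with ThreadPoolExecutor(max_workers=10) as executor:
        future_to_url = {executor.submit(download_picture, url): url for url in start_urls}
        # as_completed() yields each future as soon as its page is done
        for future in as_completed(future_to_url):
            print('Finished:', future_to_url[future])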

The output results are as follows:

**************************************************
With multi-threading, total time: 9.361606121063232
**************************************************

Opening the douban folder on the E drive again, we find that all 250 movie pictures have been downloaded here as well.

Summary

Comparing the two crawlers, it is easy to see that for the same task of downloading the pictures from all 10 pages of the Douban Top250, the version without multithreading takes about 80 seconds in total, while the version with multithreading (10 threads) takes about 9.4 seconds: a speedup of roughly 8.5 times (79.93 / 9.36 ≈ 8.5). Such an efficiency gain is undoubtedly exciting for a crawler.
I hope that after reading this post, readers will try multithreading in their own crawlers; there may be pleasant surprises. After all, the well-known Python crawler framework Scrapy also relies on concurrency (asynchronous I/O built on Twisted, rather than threads) to speed up crawling!

Note: I now run two WeChat public accounts, Python_math and easy_web_scrape; you are welcome to follow them!
