A brief analysis of multi-process and multi-threading usage in Python

Last Update:2016-06-10 Source: Internet

Author: User



Discussions of Python often claim that multithreading in Python is hard to get right. Critics point to the global interpreter lock (affectionately known as the "GIL"), which prevents multiple threads of Python code from running at the same time. As a result, if you come from another language such as C++ or Java, the Python threading module does not behave quite the way you would expect. It is important to note, however, that we can still write Python code that runs concurrently or in parallel and delivers a significant performance improvement, as long as a few things are taken into account. If you haven't read it yet, I suggest you take a look at Eqbal Quran's article "Concurrency and Parallelism in Ruby."

In this article, we will write a small Python script to download the most popular images from Imgur. We will start with a version that downloads the images sequentially, that is, one at a time. Before that, you have to register an application on Imgur. If you do not have an Imgur account, please create one first.

The scripts in this article were tested with Python 3.4.2. With slight changes they should also run on Python 2; urllib is the part that differs most between the two versions.
Getting started

Let's start by creating a Python module called "download.py". This file contains all the functions needed to fetch the list of images and download them. We split this functionality into three separate functions:

  get_links
  download_link
  setup_download_dir

The third function, "setup_download_dir", creates the download target directory if it does not already exist.

The Imgur API requires HTTP requests to carry an "Authorization" header containing the client ID. You can find the client ID on the dashboard of the application you registered on Imgur. The response is JSON encoded, so we can decode it with Python's standard json library. Downloading an image is even simpler: fetch it by its URL and write it to a file.

The code is as follows:

    import json
    import logging
    import os
    from pathlib import Path
    from urllib.request import urlopen, Request

    logger = logging.getLogger(__name__)

    def get_links(client_id):
        headers = {'Authorization': 'Client-ID {}'.format(client_id)}
        req = Request('https://api.imgur.com/3/gallery/', headers=headers, method='GET')
        with urlopen(req) as resp:
            # read() is used instead of readall(), which newer Python 3 releases no longer provide
            data = json.loads(resp.read().decode('utf-8'))
        return map(lambda item: item['link'], data['data'])

    def download_link(directory, link):
        logger.info('Downloading %s', link)
        download_path = directory / os.path.basename(link)
        with urlopen(link) as image, download_path.open('wb') as f:
            f.write(image.read())

    def setup_download_dir():
        download_dir = Path('images')
        if not download_dir.exists():
            download_dir.mkdir()
        return download_dir
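
The two protocol details mentioned above, the "Authorization" header and the JSON decoding, can be looked at in isolation without sending anything over the network. The following is only a sketch: the client ID "abc123" is a placeholder, and the sample response body is an assumption that merely mirrors the "data" and "link" fields that "get_links" reads.

```python
import json
from urllib.request import Request

# Placeholder client ID; a real one comes from your registered Imgur application.
client_id = 'abc123'

# Build the request object without sending it, so the header can be inspected.
req = Request(
    'https://api.imgur.com/3/gallery/',
    headers={'Authorization': 'Client-ID {}'.format(client_id)},
    method='GET',
)
auth_header = req.get_header('Authorization')

# Assumed sample body: only the fields that get_links() reads are modelled here.
sample_body = b'{"data": [{"link": "http://i.imgur.com/aaa.jpg"}, {"link": "http://i.imgur.com/bbb.gif"}]}'
payload = json.loads(sample_body.decode('utf-8'))
links = [item['link'] for item in payload['data']]

print(auth_header)
print(links)
```

Building the request separately from sending it makes it easy to check the header before any real API call is made.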

Next, we need a module that uses these functions to download the images one by one. We will call it "single.py". It contains the main function of our first, naive version of the Imgur image downloader. The module reads the Imgur client ID from the environment variable "IMGUR_CLIENT_ID". It calls "setup_download_dir" to create the download directory. Finally, it uses "get_links" to fetch the list of images, filters out all GIF and album URLs, and then uses "download_link" to download each image and save it to disk. Here is the code for "single.py":

    import logging
    import os
    from time import time

    from download import setup_download_dir, get_links, download_link

    logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    logging.getLogger('requests').setLevel(logging.CRITICAL)
    logger = logging.getLogger(__name__)

    def main():
        ts = time()
        client_id = os.getenv('IMGUR_CLIENT_ID')
        if not client_id:
            raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
        download_dir = setup_download_dir()
        links = [l for l in get_links(client_id) if l.endswith('.jpg')]
        for link in links:
            download_link(download_dir, link)
        print('Took {}s'.format(time() - ts))

    if __name__ == '__main__':
        main()
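
To make the filtering step concrete, here is a small sketch of what the ".jpg" check keeps and where each image would end up on disk. The URLs are made-up examples, not real Imgur links.

```python
import os
from pathlib import Path

# Hypothetical gallery links: two JPEGs, one GIF and one album page.
links = [
    'http://i.imgur.com/a1b2c3.jpg',
    'http://i.imgur.com/d4e5f6.gif',
    'http://imgur.com/a/xyz12',
    'http://i.imgur.com/g7h8i9.jpg',
]

# Same filter as in main(): GIFs and album URLs do not end in '.jpg'.
jpg_links = [l for l in links if l.endswith('.jpg')]

# Same path rule as in download_link(): directory / last URL component.
download_dir = Path('images')
targets = [download_dir / os.path.basename(l) for l in jpg_links]

print(jpg_links)
print([t.name for t in targets])
```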

On my laptop, this script took 19.4 seconds to download 91 images. Note that these numbers will vary with the network you are on. 19.4 seconds is not terribly long, but what if we wanted to download more images, say 900 instead of 90? At an average of about 0.2 seconds per image, 900 images would take roughly 3 minutes, and 9,000 images roughly 30 minutes. The good news is that with concurrency or parallelism we can speed this up significantly.
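
The estimate is simple proportional scaling of the measured run. As a rough sketch, assuming the per-image time stays constant as the number of images grows (which a real network will not guarantee):

```python
# Measured values from the sequential run described above.
measured_seconds = 19.4
measured_images = 91

# Average time per image, assumed constant for the extrapolation.
per_image = measured_seconds / measured_images

def estimate_minutes(image_count):
    """Extrapolate the sequential download time, in minutes."""
    return image_count * per_image / 60

print('per image: {:.2f}s'.format(per_image))
print('900 images: {:.1f} min'.format(estimate_minutes(900)))
print('9000 images: {:.1f} min'.format(estimate_minutes(9000)))
```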

The code examples that follow show only the import statements that are new and specific to each example. All of the Python scripts can be found in the article's GitHub repository.
Using threads

Threading is one of the best-known approaches to concurrency and parallelism. Threads are usually a feature provided by the operating system. They are lighter than processes, and all threads of a process share the same memory space.

Here, we will write a new module to replace "single.py". It creates a pool of eight worker threads, which makes nine threads in total including the main thread. I chose eight because my computer has 8 CPU cores, and one worker thread per core seemed like a reasonable number. In practice, the number of threads should be chosen carefully, taking into account other factors such as the other applications and services running on the same machine.

The following script is almost the same as before, except that we now have a new class, DownloadWorker, which is a subclass of the Thread class. Its run method has been overridden to run an infinite loop. On each iteration, it calls "self.queue.get()" to fetch a URL from a thread-safe queue. This call blocks until there is an item in the queue for the worker to process. Once the worker receives an item from the queue, it calls the same "download_link" function used in the previous script to download the image into the images directory. After the download is finished, the worker signals the queue that the task is done. This is very important, because the queue keeps track of how many tasks were enqueued. If the workers did not signal that a task was completed, the call to "queue.join()" would block the main thread forever.

    from queue import Queue
    from threading import Thread

    class DownloadWorker(Thread):
        def __init__(self, queue):
            Thread.__init__(self)
            self.queue = queue

        def run(self):
            while True:
                # Get the work from the queue and expand the tuple
                directory, link = self.queue.get()
                download_link(directory, link)
                self.queue.task_done()

    def main():
        ts = time()
        client_id = os.getenv('IMGUR_CLIENT_ID')
        if not client_id:
            raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
        download_dir = setup_download_dir()
        links = [l for l in get_links(client_id) if l.endswith('.jpg')]
        # Create a queue to communicate with the worker threads
        queue = Queue()
        # Create 8 worker threads
        for x in range(8):
            worker = DownloadWorker(queue)
            # Setting daemon to True will let the main thread exit even though the workers are blocking
            worker.daemon = True
            worker.start()
        # Put the tasks into the queue as a tuple
        for link in links:
            logger.info('Queueing {}'.format(link))
            queue.put((download_dir, link))
        # Causes the main thread to wait for the queue to finish processing all the tasks
        queue.join()
        print('Took {}'.format(time() - ts))
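
The queue protocol can be tried out without any network access. The following is a minimal sketch, not the article's downloader: "DoublingWorker" is a hypothetical class, and a short "time.sleep" stands in for the download. It also wraps the work in try/finally, one possible way to make sure "task_done()" is called even if the work raises, so that "queue.join()" cannot hang on a failed task.

```python
import time
from queue import Queue
from threading import Lock, Thread

results = []
results_lock = Lock()

class DoublingWorker(Thread):
    """Hypothetical worker: doubles each number it takes from the queue."""

    def __init__(self, queue):
        # daemon=True lets the program exit while workers block on get()
        super().__init__(daemon=True)
        self.queue = queue

    def run(self):
        while True:
            item = self.queue.get()  # blocks until an item is available
            try:
                time.sleep(0.01)  # stand-in for waiting on the network
                with results_lock:
                    results.append(item * 2)
            finally:
                # Always report completion, otherwise join() waits forever
                self.queue.task_done()

queue = Queue()
for _ in range(4):
    DoublingWorker(queue).start()
for n in range(10):
    queue.put(n)

# Returns only after task_done() has been called once per put()
queue.join()
print(sorted(results))
```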

Running this script on the same machine brings the download time down to 4.1 seconds! That is 4.7 times faster than the previous example. Although this is much faster, it is worth mentioning that, because of the GIL, only one thread was executing at any moment during the whole run. The code is therefore concurrent but not parallel. It is still faster because this is an IO-bound task: the processor barely breaks a sweat downloading these images, and most of the time is spent waiting on the network. That is why threading can provide such a large speedup; the processor can switch between threads whenever one of them is ready to do some work. For CPU-bound work, however, using the threading module in Python, or in any other interpreted language with a GIL, can actually reduce performance. If your code performs a CPU-intensive task such as decompressing gzip files, using the threading module will make the execution time longer. For CPU-bound tasks and true parallel execution, we can use the multiprocessing module.
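
The effect is easy to reproduce with a simulated IO-bound task. In the sketch below, "fake_download" is a hypothetical stand-in that only sleeps, which, like waiting on a socket, releases the GIL. The absolute timings depend on the machine, so none are quoted here.

```python
import threading
import time

def fake_download(delay):
    """Stand-in for a network download: the thread just waits."""
    time.sleep(delay)

def run_sequential(count, delay):
    start = time.perf_counter()
    for _ in range(count):
        fake_download(delay)
    return time.perf_counter() - start

def run_threaded(count, delay):
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download, args=(delay,))
               for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

sequential = run_sequential(8, 0.05)
threaded = run_threaded(8, 0.05)
print('sequential: {:.2f}s, threaded: {:.2f}s'.format(sequential, threaded))
```

The sequential run pays for every wait in turn, while the threaded run overlaps the waits. Replacing the sleep with a pure-Python computation would remove that advantage on an interpreter with a GIL.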

The official Python implementation, CPython, has a GIL, but not all Python implementations do. For example, IronPython, an implementation of Python on the .NET Framework, has no GIL, and neither does Jython, the Java-based implementation. Other Python implementations exist as well.
Spawning multiple processes

The multiprocessing module is easier to drop in than the threading module, because we do not need to add a class as in the threading example. The only changes we need to make are in the main function.

To use multiple processes, we create a multiprocessing Pool. The pool starts eight worker processes, and with the map method it provides we pass the list of URLs to them, so the images are downloaded in parallel. This is true parallelism, but it comes at a cost: the entire memory of the script is copied into each child process. In this simple example that is no big deal, but it can easily become a serious overhead for larger programs.

    from functools import partial
    from multiprocessing.pool import Pool

    def main():
        ts = time()
        client_id = os.getenv('IMGUR_CLIENT_ID')
        if not client_id:
            raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
        download_dir = setup_download_dir()
        links = [l for l in get_links(client_id) if l.endswith('.jpg')]
        download = partial(download_link, download_dir)
        with Pool(8) as p:
            p.map(download, links)
        print('Took {}s'.format(time() - ts))
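
The role of "partial" deserves a closer look: "Pool.map" passes exactly one argument to the function, so the directory has to be bound in advance. Here is a network-free sketch, where "fake_download" is a hypothetical stand-in that only returns the path it would write to.

```python
import os
from functools import partial
from multiprocessing import Pool

def fake_download(directory, link):
    """Stand-in for download_link(): returns the target path as a string."""
    return '{}/{}'.format(directory, os.path.basename(link))

# Bind the first argument; the result is a callable that takes only the link.
download = partial(fake_download, 'images')

# Made-up links for illustration.
links = ['http://i.imgur.com/a.jpg', 'http://i.imgur.com/b.jpg']

if __name__ == '__main__':
    # The guard keeps child processes from re-running this block on import.
    with Pool(2) as pool:
        print(pool.map(download, links))
```

The function handed to the pool has to be defined at module level so that it can be pickled and sent to the worker processes.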

Distributed tasks

You now know that the threading and multiprocessing modules are great for scripts running on your own computer, but what should you do when you want the work done on a different machine, or when you need to scale beyond what a single machine can handle? A good use case is long-running background tasks for web applications. If you have some long-running tasks, you do not want to spin up a bunch of child processes or threads on the same machine that needs to run the rest of your application code. That would degrade the performance of the application for all of your users. It would be much better to run these tasks on another machine, or on many other machines.

The Python library RQ is a good fit for this kind of task. It is a simple yet powerful library. You first enqueue a function together with its arguments. RQ pickles a representation of the function call and appends it to a Redis list. Enqueueing the job is only the first step and does not do any work yet. We also need at least one worker listening on that job queue.
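
The idea of pickling a function call can be illustrated with the standard library alone. This sketch is not RQ's actual storage format; "serialize_call" and "run_serialized_call" are hypothetical helpers, and "os.path.basename" stands in for the job function.

```python
import os.path
import pickle

def serialize_call(func, *args, **kwargs):
    """Turn a function call into bytes that could be stored in a list."""
    return pickle.dumps((func, args, kwargs))

def run_serialized_call(blob):
    """What a worker would do: unpickle the call and execute it."""
    func, args, kwargs = pickle.loads(blob)
    return func(*args, **kwargs)

# "Enqueue": nothing is executed yet, only a byte string is produced.
blob = serialize_call(os.path.basename, 'http://i.imgur.com/abc123.jpg')

# "Worker": possibly another process, later in time.
result = run_serialized_call(blob)
print(result)
```

Pickle stores a function by reference, that is, by module and name, rather than by copying its code. This is why a worker must be able to import the same module that defines the enqueued function.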

The first step is to install and run a Redis server on your computer, or to have access to a running Redis server. After that, only a few small changes to the existing code are needed. We first create an instance of an RQ Queue and pass it a connection to a Redis server created with the redis-py library. Then, instead of just calling "download_link", we call "q.enqueue(download_link, download_dir, link)". The enqueue method takes a function as its first argument; any other positional or keyword arguments are passed on to that function when the job is actually executed.

The last step is to start some workers. RQ provides a handy script to run a worker on the default queue. Just run "rqworker" in a terminal window and it starts a worker listening on the default queue. Make sure your current working directory is the one the scripts live in. If you want to listen on a different queue, run "rqworker queue_name" and the worker will process the queue named queue_name. The nice thing about RQ is that, as long as you can connect to Redis, you can run as many workers as you like on as many machines as you like, which makes it very easy to scale as your application grows. Here is the code for the RQ version:

    from redis import Redis
    from rq import Queue

    def main():
        client_id = os.getenv('IMGUR_CLIENT_ID')
        if not client_id:
            raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
        download_dir = setup_download_dir()
        links = [l for l in get_links(client_id) if l.endswith('.jpg')]
        q = Queue(connection=Redis(host='localhost', port=6379))
        for link in links:
            q.enqueue(download_link, download_dir, link)

However, RQ is not the only Python job queue solution. RQ is easy to use and covers simple use cases very well, but if you need more advanced features, other solutions such as Celery are available.
Summary

If your code is IO bound, both multithreading and multiprocessing in Python will work for you. Multiprocessing is easier to drop in than threading but has a higher memory overhead. If your code is CPU bound, multiprocessing is most likely the better choice, especially if the target machine has multiple cores or CPUs. For web applications, and whenever you need to scale the work across multiple machines, RQ is the better choice.



This article is an English version of an article originally published in Chinese on aliyun.com.
