In criticism of Python, it is often said that Python's multithreading is hard to use. Others point to the global interpreter lock (affectionately known as the "GIL"), saying it prevents Python's multithreaded routines from ever running simultaneously. Because of this, if you are coming from another language (such as C++ or Java), the Python threading module may not behave the way you expect. It is important to note that we can still write concurrent or parallel code in Python and get a significant performance improvement, as long as a few things are taken into account. If you haven't read it yet, I suggest you take a look at Eqbal Quran's article, "Concurrency and Parallelism in Ruby."
In this article, we'll write a small Python script to download the most popular images from Imgur. We will start with a version that downloads the images sequentially, i.e. one at a time. Before that, you have to register an Imgur application. If you do not have an Imgur account yet, please sign up for one first.
The scripts in this article were tested with Python 3.4.2. With slight changes, they should also run on Python 2; urllib is the part that differs most between the two versions.
Getting Started
Let's start by creating a Python module called "download.py". This file contains all the functions needed to fetch the list of images and download them. We will split this functionality into three separate functions:
get_links
download_link
setup_download_dir
The third function, "setup_download_dir", is used to create the target download directory (if it does not already exist).
The Imgur API requires HTTP requests to carry an "Authorization" header with a client ID. You can find this client ID in the dashboard of your registered Imgur application. The response will be JSON-encoded, and we can use Python's standard json library to decode it. Downloading a picture is even simpler: you just need to fetch the image by its URL and write it to a file.
The code is as follows:
import json
import logging
import os
from pathlib import Path
from urllib.request import urlopen, Request

logger = logging.getLogger(__name__)

def get_links(client_id):
    headers = {'Authorization': 'Client-ID {}'.format(client_id)}
    req = Request('https://api.imgur.com/3/gallery/', headers=headers, method='GET')
    with urlopen(req) as resp:
        data = json.loads(resp.read().decode('utf-8'))
    return map(lambda item: item['link'], data['data'])

def download_link(directory, link):
    logger.info('Downloading %s', link)
    download_path = directory / os.path.basename(link)
    with urlopen(link) as image, download_path.open('wb') as f:
        f.write(image.read())

def setup_download_dir():
    download_dir = Path('images')
    if not download_dir.exists():
        download_dir.mkdir()
    return download_dir
Next, we need to write a module that uses these functions to download the pictures one at a time. We will name it "single.py". It contains the main function of the first, naive version of our Imgur image downloader. The module retrieves the Imgur client ID from the environment variable "IMGUR_CLIENT_ID". It calls "setup_download_dir" to create the download directory. Finally, it fetches a list of images using the get_links function, filters out all GIF and album URLs, and then uses "download_link" to download each image and save it to disk. Here is the code for "single.py":
import logging
import os
from time import time

from download import setup_download_dir, get_links, download_link

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logging.getLogger('requests').setLevel(logging.CRITICAL)
logger = logging.getLogger(__name__)

def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = [l for l in get_links(client_id) if l.endswith('.jpg')]
    for link in links:
        download_link(download_dir, link)
    print('Took {}s'.format(time() - ts))

if __name__ == '__main__':
    main()
On my laptop, this script took 19.4 seconds to download 91 images. Please note that these numbers will vary from network to network. 19.4 seconds is not terribly long, but what if we wanted to download more pictures? Perhaps 900 instead of 90. At an average of 0.2 seconds per picture, 900 images would take about 3 minutes, and 9,000 images would take 30 minutes. The good news is that by introducing concurrency or parallelism, we can speed this up dramatically.
The following code examples will only show the import statements that are new or specific to each module. For convenience, all of the Python scripts can be found in this GitHub repository.
Using Threads
Threading is one of the best-known approaches to concurrency and parallelism. Threading support is usually provided by the operating system. Threads are lighter-weight than processes and share the same memory space.
Here, we will write a new module to replace "single.py". It will create a pool of eight threads, for a total of nine threads including the main thread. I chose eight threads because my computer has 8 CPU cores, and one worker thread per core seemed like a reasonable choice. In practice, the number of threads should be tuned carefully, taking into account other factors such as the other applications and services running on the same machine.
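As a side note, if you would rather not hard-code the number of workers, Python can tell you how many CPU cores the machine has. The snippet below is not part of the original script, just a small illustration of how you might derive the pool size:

import os

# One worker per CPU core is a reasonable starting point for this kind of task.
# os.cpu_count() can return None on some platforms, so fall back to a default.
worker_count = os.cpu_count() or 4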
The following script is almost the same as before, except that we now have a new class, DownloadWorker, a subclass of Thread. The run method, which executes an infinite loop, has been overridden. On every iteration it calls "self.queue.get()" to try to fetch a URL from a thread-safe queue. It blocks until the queue has an item available to process. Once the worker thread gets an item from the queue, it calls the same "download_link" method used in the previous script to download the image to the download directory. After the download finishes, the worker signals to the queue that the task is done. This is important, because the queue keeps track of how many tasks were enqueued. If the worker threads never signalled that a task was complete, the call to "queue.join()" would block the main thread forever.
from queue import Queue
from threading import Thread

class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            directory, link = self.queue.get()
            download_link(directory, link)
            self.queue.task_done()

def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = [l for l in get_links(client_id) if l.endswith('.jpg')]
    # Create a queue to communicate with the worker threads
    queue = Queue()
    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        # Setting daemon to True will let the main thread exit even though the workers are blocking
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as a tuple
    for link in links:
        logger.info('Queueing {}'.format(link))
        queue.put((download_dir, link))
    # Causes the main thread to wait for the queue to finish processing all the tasks
    queue.join()
    print('Took {}'.format(time() - ts))
Running this script on the same machine, the download time dropped to 4.1 seconds! That is 4.7 times faster than the previous example. While this is much quicker, it is worth mentioning that, because of the GIL, only one thread was executing at any given moment within this process. Therefore, this code is concurrent but not parallel. The reason it is still faster is that this is an IO-bound task. The processor barely breaks a sweat while downloading these images; most of the time is spent waiting on the network. This is why threading can provide such a large speed increase: the processor can switch between the threads whenever one of them is ready to do some work. Using the threading module in Python, or in any other interpreted language with a GIL, can actually reduce performance. If your code performs a CPU-bound task, such as decompressing gzip files, using the threading module will result in a longer execution time. For CPU-bound tasks and truly parallel execution, we can use the multiprocessing module.
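To make the CPU-bound case concrete, here is a minimal sketch (not part of the downloader) that counts down a large number once sequentially and once split across two threads. Because of the GIL, the threaded version is usually no faster, and often a little slower:

from threading import Thread
from time import time

def countdown(n):
    # A pure Python loop: CPU-bound work that holds the GIL while it runs.
    while n > 0:
        n -= 1

COUNT = 20000000

ts = time()
countdown(COUNT)
print('Sequential took {:.2f}s'.format(time() - ts))

ts = time()
t1 = Thread(target=countdown, args=(COUNT // 2,))
t2 = Thread(target=countdown, args=(COUNT // 2,))
t1.start()
t2.start()
t1.join()
t2.join()
print('Two threads took {:.2f}s'.format(time() - ts))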
The official Python implementation, CPython, has the GIL, but not all Python implementations do. For example, IronPython, a Python implementation built on the .NET Framework, has no GIL, and neither does Jython, the Java-based implementation. You can click here to see a list of existing Python implementations.
Spawning Multiple Processes
The multiprocessing module is easier to drop in than the threading module, because we don't need to add a class as in the threading example. The only thing we need to change is the main function.
To use multiple processes, we create a multiprocessing pool. Using the map method it provides, we pass the list of URLs to the pool, which in turn spawns 8 new processes that download the images in parallel. This is true parallelism, but it comes at a cost: the entire memory of the script is copied into each child process. In our simple example this is not a big deal, but it can easily become a serious problem in large programs.
from functools import partial
from multiprocessing.pool import Pool

def main():
    ts = time()
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = [l for l in get_links(client_id) if l.endswith('.jpg')]
    download = partial(download_link, download_dir)
    with Pool(8) as p:
        p.map(download, links)
    print('Took {}s'.format(time() - ts))
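One practical detail worth noting: on platforms that start child processes by spawning a fresh interpreter (Windows, and macOS in newer Python versions), the multiprocessing code must only run when the module is executed directly, otherwise each child re-imports the module and tries to create its own pool. Just like in "single.py", the script should therefore end with the usual guard:

if __name__ == '__main__':
    main()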
Distributed Tasks
You already know that the threading and multiprocessing modules are a great help when running scripts on your own computer, but what should you do when you want to perform tasks on a different machine, or when you need to scale beyond what a single machine can handle? A good use case for this is long-running back-end tasks for web applications. If you have some time-consuming tasks, you don't want to spin up a bunch of sub-processes or threads on the same machine that also has to run the rest of your application code. That would degrade the performance of your application and affect your users. It would be much better to run these tasks on another machine, or on many other machines.
The Python library RQ is very useful for this kind of task. It is a simple yet powerful library. To enqueue work, a function and its arguments are pushed onto a queue. RQ serializes (pickles) a representation of the function call and then appends it to a Redis list. Enqueueing the task is only the first step; nothing has happened yet. We also need at least one worker listening on that task queue.
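To illustrate the idea (RQ's real bookkeeping is more involved than this), a queued job is conceptually just a pickled function call sitting in a Redis list, which a worker pops and executes. A toy sketch, assuming a Redis server on localhost and the redis-py client:

import pickle
from redis import Redis

def say_hello(name):
    print('Hello, {}!'.format(name))

redis_conn = Redis(host='localhost', port=6379)

# "Enqueue": serialize a function call and append it to a Redis list.
redis_conn.rpush('toy_queue', pickle.dumps((say_hello, ('world',))))

# "Worker": block until a job arrives, deserialize it, and call it.
_, raw = redis_conn.blpop('toy_queue')
func, args = pickle.loads(raw)
func(*args)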
The first step is to install and run a Redis server on your computer, or to have access to a running Redis server. After that, only a few small changes to the existing code are needed. First we create an instance of an RQ queue and pass it an instance of a Redis server from the redis-py library. Then, instead of just calling "download_link", we call "q.enqueue(download_link, download_dir, link)". The first argument to the enqueue method is the function to run; any further positional or keyword arguments are passed to that function when the task is actually executed.
The final step is to start some workers. RQ provides a handy script to run workers on the default queue. Just run "rqworker" in a terminal window and it will start a worker listening on the default queue. Please make sure your current working directory is the same as the one the scripts live in. If you want to listen to a different queue, you can run "rqworker queue_name" and it will listen to that named queue. The great thing about RQ is that as long as you can connect to Redis, you can run any number of workers on as many machines as you like, so it is very easy to scale your application out. Here is the RQ version of the code:
from redis import Redis
from rq import Queue

def main():
    client_id = os.getenv('IMGUR_CLIENT_ID')
    if not client_id:
        raise Exception("Couldn't find IMGUR_CLIENT_ID environment variable!")
    download_dir = setup_download_dir()
    links = [l for l in get_links(client_id) if l.endswith('.jpg')]
    q = Queue(connection=Redis(host='localhost', port=6379))
    for link in links:
        q.enqueue(download_link, download_dir, link)
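The enqueue call also returns a job handle that can be inspected later, which is handy when a web request needs to report on the progress of a background task. A small sketch building on the snippet above (the exact attribute names may vary slightly between RQ versions, so check the RQ documentation):

job = q.enqueue(download_link, download_dir, links[0])
print(job.id)            # unique identifier assigned by RQ
print(job.get_status())  # e.g. 'queued', 'started', 'finished' or 'failed'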
However, RQ is not the only Python task queue solution. RQ is easy to use and covers simple use cases extremely well, but if more advanced requirements come up, other solutions (such as Celery) can be used.
Summary
If your code is IO-bound, both multithreading and multiprocessing will help you. Multiprocessing is easier to drop in than threading but has a higher memory overhead. If your code is CPU-bound, multiprocessing is most likely the better choice, especially if the target machine has multiple cores or CPUs. For web applications, and when you need to scale the work out across multiple machines, RQ will be the better option.