This article is a getting-started tutorial on network programming with threads in Python, adapted from technical documentation on the official IBM website.
Introduction
For Python, there is no lack of concurrency options: its standard library includes support for threads, processes, and asynchronous I/O. In many cases, Python simplifies the use of these concurrency mechanisms through high-level modules for asynchronous I/O, threading, and subprocesses. Beyond the standard library, there are also third-party solutions such as Twisted, Stackless, and the processing module. This article focuses on using Python threads, with practical examples. Many good online resources already cover the thread APIs in detail, so this article concentrates on practical examples that illustrate some of the more common patterns of thread usage.
The Global Interpreter Lock (GIL) means that the Python interpreter is not fully thread-safe: a thread must hold the global lock before it can safely access Python objects. Because only one thread at a time can hold the lock and use the Python/C API, the interpreter regularly releases and reacquires the lock, by default after every 100 bytecode instructions. The thread-switching frequency can be adjusted with the sys.setcheckinterval() function.
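As a small illustration (assuming CPython 2.x, where these functions exist), you can inspect and adjust the check interval like this:

import sys

# inspect the current check interval: the number of bytecode instructions
# executed between interpreter switch checks
print "check interval:", sys.getcheckinterval()

# raise it so thread switches happen less often; 1000 is an arbitrary
# value used only for illustration
sys.setcheckinterval(1000)
print "new check interval:", sys.getcheckinterval()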
In addition, the lock is released and reacquired around potentially blocking I/O operations. For more information, see "GIL and Threading State" and "Threading the Global Interpreter Lock" in the references section.
Note that, because of the GIL, CPU-bound applications will not benefit from using threads. When working in Python, it is recommended to use processes, or a mix of processes and threads, for that kind of workload.
First, it is important to be clear about the difference between processes and threads: threads share state, memory, and resources, while processes do not. This simple difference is both the strength and the weakness of threads. On the one hand, threads are lightweight and easy to communicate with; on the other hand, they bring a whole set of problems, including deadlocks, race conditions, and sheer complexity. Fortunately, because of the GIL and the Queue module, threaded programming in Python is much less complex than in many other languages.
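To make the race-condition point concrete, here is a small sketch (not from the original article) in which several threads increment a shared counter without any synchronization; because the read-modify-write is not atomic, updates can be lost and the final total can come out lower than expected:

import threading

counter = 0

def unsafe_increment():
    # counter += 1 compiles to separate load, add, and store steps,
    # so two threads can interleave here and lose updates
    global counter
    for _ in range(100000):
        counter += 1

threads = [threading.Thread(target=unsafe_increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print "expected 400000, got", counter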
Use Python threads
To follow along with this article, you should have Python 2.5 or later installed, because many of the examples use newer features of the language that appeared only in Python 2.5. To start working with Python threads, we will begin with a simple "Hello World" example:
Hello_threads_example
import threading
import datetime

class ThreadClass(threading.Thread):
    def run(self):
        now = datetime.datetime.now()
        print "%s says Hello World at time: %s" % (self.getName(), now)

for i in range(2):
    t = ThreadClass()
    t.start()
If you run this example, you will get the following output:
# python hello_threads.py
Thread-1 says Hello World at time: 2008-05-13 13:22:50.252069
Thread-2 says Hello World at time: 2008-05-13 13:22:50.252576
Look at the output carefully: the Hello World statement is printed by both threads, each with a timestamp. Looking at the code itself, there are two import statements: one imports the datetime module, the other imports the threading module. The class ThreadClass inherits from threading.Thread, and because of this you need to define a run method that contains the code you want to execute inside the thread. The only other thing to note is that self.getName() is a method that returns the name of the thread.
The last three lines of code actually call the class and start the threads. If you look closely, the thread does not actually start until t.start() is called. The threading module was designed with inheritance in mind and is in fact built on top of the lower-level thread module. For most situations, inheriting from threading.Thread is considered a best practice, because it creates a consistent API for threaded programming.
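For comparison, the threading module also lets you pass a plain callable as the thread's target instead of subclassing. The following sketch (not part of the original listing) produces the same kind of output:

import threading
import datetime

def say_hello():
    # threading.currentThread() returns the Thread object running this code
    now = datetime.datetime.now()
    name = threading.currentThread().getName()
    print "%s says Hello World at time: %s" % (name, now)

for i in range(2):
    t = threading.Thread(target=say_hello)
    t.start()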
Use thread queue
As mentioned above, threaded programming can become complicated when threads need to share data or resources. The threading module provides many synchronization primitives, including semaphores, condition variables, events, and locks. While all of those options exist, it is considered a best practice to focus on using queues instead. Queues are much easier to deal with and make threaded programming considerably safer, because they effectively funnel all access to a resource through a single thread, and they support a cleaner, more readable design pattern.
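For contrast, here is a rough sketch of what using one of those lower-level primitives looks like: the shared counter from the earlier sketch, now serialized with a Lock. The queue-based pattern that follows lets you avoid hand-writing this kind of code:

import threading

counter = 0
lock = threading.Lock()

def safe_increment():
    global counter
    for _ in range(100000):
        # acquire/release serializes every update to the shared counter
        lock.acquire()
        counter += 1
        lock.release()

threads = [threading.Thread(target=safe_increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print "counter:", counter   # reliably 400000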
In the next example, you will first create a program that runs serially, fetching the URLs of several websites and printing the first 1024 bytes of each page. This is a typical task that threads can speed up. First, use the urllib2 module to grab the pages one at a time, and time how long the code takes to run:
URL fetch, serial
import urllib2
import time

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]

start = time.time()
#grabs urls of hosts and prints first 1024 bytes of page
for host in hosts:
    url = urllib2.urlopen(host)
    print url.read(1024)

print "Elapsed Time: %s" % (time.time() - start)
When you run this example, you get a large amount of output on standard out; at the end, though, you will see something like the following:
Elapsed Time: 2.40353488922
Let's look at this code carefully. Only two modules are imported. The urllib2 module does the heavy lifting of grabbing the Web pages. By calling time.time() you record a start value, call it again at the end, and subtract the start value to determine how long the program took to execute. Finally, looking at the speed of the program, a result of roughly 2.5 seconds is not terrible, but if you had hundreds of Web pages to retrieve, at this average rate it would take around 50 seconds. The next step is to build a threaded version that runs faster:
URL fetch, threaded
#!/usr/bin/env python
import Queue
import threading
import urllib2
import time

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()

            #grabs urls of hosts and prints first 1024 bytes of page
            url = urllib2.urlopen(host)
            print url.read(1024)

            #signals to queue job is done
            self.queue.task_done()

start = time.time()
def main():
    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    #wait on the queue until everything has been processed
    queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
There is more code to explain in this example, but it is not much more complicated than the first threading example, thanks to the use of the Queue module. This pattern is a common and recommended way to use threads in Python. The steps are as follows:
- Create an instance of Queue.Queue() and fill it with data.
- Pass that populated instance to the thread class you created by inheriting from threading.Thread.
- Spawn a pool of daemon threads.
- Pull one item off the queue at a time, and use that data inside the thread, in the run method, to do the work.
- When the work is done, send a signal to the queue with the queue.task_done() function, so the task is counted as completed.
- Join on the queue, which really means waiting until the queue is empty, and then exit the main program.
When using this pattern, note that by setting the threads as daemon threads you allow the main thread, or program, to exit when only daemon threads are left alive. This creates a simple way to control the flow of the program, because you can then join on the queue, that is, wait until it is empty, before exiting. The exact behavior is best described by the Queue module documentation (see the references):
join()
Blocks until all items in the queue have been gotten and processed. The count of unfinished tasks goes up whenever an item is added to the queue, and goes down whenever a consumer thread calls task_done() to indicate that the item was retrieved and all work on it is complete. When the count of unfinished tasks drops to zero, join() unblocks.
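The following stripped-down sketch (with a trivial print as a stand-in for real work) shows just those mechanics: daemon workers pull items off a queue, call task_done() for each one, and the main program blocks in join() until the unfinished count reaches zero:

import Queue
import threading

q = Queue.Queue()

def worker():
    while True:
        item = q.get()              # blocks until an item is available
        print "processing", item    # stand-in for the real work
        q.task_done()               # decrement the unfinished-task count

# daemon threads: the program is allowed to exit once only these remain
for i in range(3):
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

for item in range(10):
    q.put(item)                     # each put increments the unfinished count

q.join()                            # unblocks when the count reaches zero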
Use multiple queues
Because the pattern described above is so effective, it is relatively simple to extend it by chaining on additional thread pools and queues. In the previous example you only printed the beginning of each Web page. This next example instead has each thread return the full Web page it grabbed and place it into a second queue. A second pool of threads, attached to that second queue, then does further work on each page. The work in this example is to parse each page with a third-party Python module called Beautiful Soup. With just a couple of lines of code, you extract the title tag of each visited page and print it out.
Multi-queue data mining websites
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()

            #grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()

            #place chunk into out queue
            self.out_queue.put(chunk)

            #signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            chunk = self.out_queue.get()

            #parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])

            #signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():
    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()

    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
If you run this version of the script, you will get the following output:
# python url_fetch_threaded_part2.py
[<title>Google</title>]
[<title>Yahoo!</title>]
[<title>Apple</title>]
[<title>IBM United States</title>]
[<title>Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more</title>]
Elapsed Time: 3.75387597084
Looking at this code, you can see that we added another queue instance and then passed that queue into the first thread pool class, ThreadUrl. Next, for the second thread pool class, DatamineThread, the structure is almost exactly the same. In its run method each thread grabs a chunk, the full Web page, from its queue and then processes that chunk with Beautiful Soup. In this case, Beautiful Soup is used to extract the title tag of each page and print it out. This example could easily be turned into something more useful, because you now have the core of a basic search engine or data-mining tool. One idea is to use Beautiful Soup to extract the links from each page and then follow them.
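As a rough sketch of that link-extraction idea, using the same Beautiful Soup 3 import style as the listing above (the URL here is only an example), you might do something like this:

import urllib2
from BeautifulSoup import BeautifulSoup

# fetch one page and print every link found on it
page = urllib2.urlopen("http://ibm.com").read()
soup = BeautifulSoup(page)
for anchor in soup.findAll('a'):
    href = anchor.get('href')       # some anchors may have no href attribute
    if href:
        print href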
Summary
This article has examined Python threads and demonstrated the best practice of using queues to reduce complexity, avoid subtle errors, and improve readability. Although this basic pattern is fairly simple, it can be used to solve a wide range of problems by chaining queues and thread pools together. In the final section, you started to look at how to build a more complex processing pipeline that can serve as a model for future projects. The references section contains many excellent resources on concurrency in general and on threads in particular.
Finally, it is important to point out that threads do not solve every problem, and for many situations processes may be more appropriate. In particular, if you only need to spawn many subprocesses and listen for their responses, the standard library subprocess module may be easier to use. See the references for the official documentation.
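As a minimal illustration of that suggestion (assuming a Unix-like system where the echo command is available; the command itself is just an example), spawning a child process and reading its response with subprocess might look like this:

import subprocess

# run a child process and capture its standard output
proc = subprocess.Popen(["echo", "hello from a child process"],
                        stdout=subprocess.PIPE)
output, _ = proc.communicate()
print "child said:", output.strip()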