Traditional multithreaded programs use a "create on demand, destroy when done" strategy. Creating a thread is much cheaper than creating a process, but if the tasks handed to threads are short and arrive very frequently, the server ends up constantly creating and destroying threads.
The lifetime of a thread can be divided into 3 parts: the time to start the thread, the time to run the thread body, and the time to destroy the thread. If threads cannot be reused, every task pays for all 3 phases: start, run, and destroy. This inevitably increases the system's response time and reduces efficiency.
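A rough way to see the start and destroy cost in isolation is to time threads whose body does nothing. The sketch below is only an illustration; the function names and the count of 1000 are assumptions, and the measured time will vary by machine.

```python
import threading
import time


def noop():
    """Empty thread body, so only start/destroy overhead is measured."""
    pass


def time_fresh_threads(count):
    """Create, start, and join `count` threads one after another."""
    start = time.perf_counter()
    for _ in range(count):
        t = threading.Thread(target=noop)
        t.start()
        t.join()
    return time.perf_counter() - start


elapsed = time_fresh_threads(1000)
print("1000 create/destroy cycles took {:.4f} seconds".format(elapsed))
```

Because the thread body is empty, everything measured here is overhead that a reused thread would not pay again.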
Using the thread pool:
Threads are created in advance and kept in the thread pool. When a thread finishes its current task it is not destroyed; it is scheduled to process the next task instead. This avoids creating threads over and over, saves the overhead of thread creation and destruction, and gives better performance and system stability.
Implementing a crawler with a thread pool
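The reuse described above can be sketched with a queue and a fixed set of worker threads. This is a minimal illustration, not how the threadpool library is implemented; `TinyPool`, `square`, and the pool size of 3 are hypothetical names and values chosen for the example.

```python
import queue
import threading


class TinyPool:
    """Minimal illustrative pool: workers are created once and reused."""

    def __init__(self, size):
        self.tasks = queue.Queue()
        self.workers = [
            threading.Thread(target=self._worker, daemon=True)
            for _ in range(size)
        ]
        for worker in self.workers:
            worker.start()

    def _worker(self):
        # Each worker keeps pulling the next task instead of exiting.
        while True:
            func, args = self.tasks.get()
            try:
                func(*args)
            finally:
                self.tasks.task_done()

    def submit(self, func, *args):
        self.tasks.put((func, args))

    def wait(self):
        # Block until every queued task has been processed.
        self.tasks.join()


seen_threads = set()
results = []
lock = threading.Lock()


def square(n):
    with lock:
        seen_threads.add(threading.get_ident())
        results.append(n * n)


pool = TinyPool(3)
for n in range(20):
    pool.submit(square, n)
pool.wait()
print(len(results), "tasks ran on", len(seen_threads), "threads")
```

All 20 tasks are handled by at most 3 threads, so the start and destroy cost is paid once per worker rather than once per task.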
The threadpool library needs to be installed before use:
pip install threadpool
#!/usr/bin/env python
# coding: utf-8
# @Time: 2018/4/19 16:06
# @Author: Chenjisheng
# @File: 17zwd_sample.py
# @Mail: [email protected]
import datetime
import threading

import requests
import threadpool
from bs4 import BeautifulSoup

baseurl = "http://hz.17zwd.com/sks.htm?cateid=0&page="


# Crawler function: fetch one listing page and extract its image URLs
def getResponse(url):
    target = baseurl + url
    content = requests.get(target).text
    soup = BeautifulSoup(content, 'lxml')
    tags = soup.find_all('div', attrs={"class": "huohao-img-container"})
    for tag in tags:
        imgurl = tag.find('img').get('data-original')
        # print(imgurl)


# Define a thread pool with 10 threads
starttime = datetime.datetime.now()
pool = threadpool.ThreadPool(10)
# Define the tasks for the thread pool
tasks = threadpool.makeRequests(getResponse, [str(x) for x in range(1, 11)])
# Start the tasks using the thread pool
for task in tasks:
    pool.putRequest(task)
pool.wait()
endtime = datetime.datetime.now()
alltime = (endtime - starttime).seconds
print("Total thread pool time: {} seconds".format(alltime))

# Traditional threading
starttime1 = datetime.datetime.now()
# Pass the callable and its arguments separately. Writing
# target=getResponse(str(x)) would run each request immediately
# in the main thread, one after another.
tasklist = [threading.Thread(target=getResponse, args=(str(x),)) for x in range(1, 11)]
for i in tasklist:
    i.start()
for i in tasklist:
    i.join()
endtime1 = datetime.datetime.now()
alltime1 = (endtime1 - starttime1).seconds
print("Total traditional thread time: {} seconds".format(alltime1))

if __name__ == "__main__":
    pass
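Final execution result: in the author's run, the thread pool took 3 seconds and the traditional threads took 9 seconds.
The difference is quite large. Part of that gap comes from how the traditional threads were created: writing target=getResponse(str(x)) calls the function immediately in the main thread, so those ten requests ran one after another. Passing target=getResponse with args=(str(x),) lets them run concurrently.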
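The same pooled crawl can also be written with the standard library's concurrent.futures.ThreadPoolExecutor, which needs no extra install. The sketch below is an assumption-laden illustration rather than the author's code: `crawl` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real download so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://hz.17zwd.com/sks.htm?cateid=0&page="


def crawl(pages, fetch, max_workers=10):
    """Run fetch(url) for every page on a pool of reusable worker threads."""
    urls = [BASE_URL + str(page) for page in pages]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input URLs.
        return list(executor.map(fetch, urls))


def fake_fetch(url):
    """Hypothetical stand-in for a real download: returns the page number."""
    return url.rsplit("=", 1)[-1]


pages_seen = crawl(range(1, 11), fake_fetch)
print(pages_seen)
```

For a real crawl, fetch could be a function that calls requests.get(url).text and parses the result with BeautifulSoup, as getResponse does.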
Python thread pool usage