This article walks through a multi-threaded Python crawler that scrapes Qiushibaike (the "Embarrassing Encyclopedia" joke site). It should be a useful reference for anyone interested in Python.
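Multi-threaded crawler: a crawler in which some parts of the program run in parallel. A crawler spends most of its time waiting for network responses, so setting up multiple threads appropriately lets those waits overlap and makes the crawler more efficient.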
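To make the efficiency claim concrete, here is a minimal sketch that replaces the network request with a short sleep and times the same work done sequentially and with one thread per page. The names fake_fetch, run_sequential and run_threaded are made up for this illustration and are not part of the article's code.

```python
import threading
import time


def fake_fetch(page, delay=0.1):
    """Stand-in for a network request (an assumption): it only sleeps."""
    time.sleep(delay)
    return page


def run_sequential(pages):
    """Fetch the pages one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for page in pages:
        fake_fetch(page)
    return time.perf_counter() - start


def run_threaded(pages):
    """Fetch every page on its own thread and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_fetch, args=(page,)) for page in pages]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has finished
    return time.perf_counter() - start


if __name__ == "__main__":
    pages = [1, 2, 3, 4]
    print("sequential: %.2fs" % run_sequential(pages))
    print("threaded:   %.2fs" % run_threaded(pages))
```

The sequential run has to sit through every wait in turn, while the threaded run waits for all pages at once, so it should finish in roughly the time of a single request.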
Qiushibaike: an ordinary crawler and a multi-threaded crawler
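Analyzing the site's links shows that every listing page follows the same pattern, in which the final "page" segment stands for the page number: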
https://www.qiushibaike.com/8hr/page/page/
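As a small sketch of that pattern, the helper below builds the URL for a given page number. The names BASE_URL, page_url and page_urls are hypothetical and chosen only for this example.

```python
BASE_URL = "https://www.qiushibaike.com/8hr/page/"


def page_url(page):
    """Return the listing URL for one page number."""
    return BASE_URL + str(page) + "/"


def page_urls(first, last):
    """Return the URLs for pages first..last, inclusive."""
    return [page_url(p) for p in range(first, last + 1)]


if __name__ == "__main__":
    for url in page_urls(1, 3):
        print(url)
```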
Multithreading in Python works much like multithreading in Java: subclass the thread class and put the work in run(). Here is the code.

Ordinary (single-threaded) crawler:

import re
import urllib.error
import urllib.request

headers = ("User-Agent",
           "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
           "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)

# The page range was lost from the source text; 1 to 11 is assumed here
# to match the multi-threaded version below.
for i in range(1, 12):
    url = "https://www.qiushibaike.com/8hr/page/" + str(i) + "/"
    pagedata = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    pattern = '<p class="content">.*?<span>(.*?)</span>(.*?)</p>'
    datalist = re.compile(pattern, re.S).findall(pagedata)
    for j in range(0, len(datalist)):
        print("Page " + str(i) + ", joke " + str(j) + ":")
        print(datalist[j])

Introduction to threads:

import threading  # import the threading module

class A(threading.Thread):  # define thread class A
    def __init__(self):  # initialize the thread
        threading.Thread.__init__(self)

    def run(self):  # the work the thread performs once started
        for i in range(0, 11):
            print("I am thread A")

class B(threading.Thread):  # define thread class B
    def __init__(self):  # initialize the thread
        threading.Thread.__init__(self)

    def run(self):  # the work the thread performs once started
        for i in range(0, 11):
            print("I am thread B")

t1 = A()    # instantiate the thread
t1.start()  # start the thread
t2 = B()
t2.start()

Multi-threaded crawler (odd and even pages are crawled on separate threads):

import re
import threading
import urllib.error
import urllib.request

headers = ("User-Agent",
           "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
           "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)

class one(threading.Thread):  # crawls the odd-numbered pages
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        for i in range(1, 12, 2):
            url = "https://www.qiushibaike.com/8hr/page/" + str(i) + "/"
            pagedata = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            pattern = '<p class="content">.*?<span>(.*?)</span>(.*?)</p>'
            datalist = re.compile(pattern, re.S).findall(pagedata)
            for j in range(0, len(datalist)):
                print("Page " + str(i) + ", joke " + str(j) + ":")
                print(datalist[j])

class two(threading.Thread):  # crawls the even-numbered pages
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        for i in range(2, 12, 2):
            url = "https://www.qiushibaike.com/8hr/page/" + str(i) + "/"
            pagedata = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            pattern = '<p class="content">.*?<span>(.*?)</span>(.*?)</p>'
            datalist = re.compile(pattern, re.S).findall(pagedata)
            for j in range(0, len(datalist)):
                print("Page " + str(i) + ", joke " + str(j) + ":")
                print(datalist[j])

t1 = one()
t2 = two()  # the second thread must use the even-page class
t1.start()
t2.start()
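The two crawler classes above differ only in the page they start from, so one possible refactoring is a single thread class that takes the page range as parameters. The sketch below is an illustration under stated assumptions, not the author's code: PageCrawler, extract_jokes and fake_fetch are hypothetical names, and fake_fetch returns made-up HTML so the example runs without network access. The regular expression is the one used in the article.

```python
import re
import threading

# Same pattern as in the article, compiled once and shared by all threads.
PATTERN = re.compile(r'<p class="content">.*?<span>(.*?)</span>(.*?)</p>', re.S)


def extract_jokes(html):
    """Return the regex matches (two-group tuples) found in one page."""
    return PATTERN.findall(html)


class PageCrawler(threading.Thread):
    """Hypothetical helper: crawl pages first, first + step, ... below stop."""

    def __init__(self, fetch, first, stop, step):
        super().__init__()
        self.fetch = fetch                  # callable: page number -> HTML text
        self.pages = range(first, stop, step)
        self.results = {}                   # page number -> list of matches

    def run(self):
        for page in self.pages:
            self.results[page] = extract_jokes(self.fetch(page))


def fake_fetch(page):
    """Offline stand-in for the urlopen call; the HTML is invented."""
    return '<p class="content">\n<span>joke from page %d</span>\n</p>' % page


if __name__ == "__main__":
    odd = PageCrawler(fake_fetch, 1, 12, 2)
    even = PageCrawler(fake_fetch, 2, 12, 2)
    for t in (odd, even):
        t.start()
    for t in (odd, even):
        t.join()  # wait for both threads before reading their results
    for page in sorted({**odd.results, **even.results}):
        print("Page", page)
```

To crawl the real site, fake_fetch could be swapped for a function that wraps urllib.request.urlopen as in the listings above. Calling join() before reading results ensures both threads have finished, and storing matches instead of printing them inside run() keeps output from the two threads from interleaving.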
That concludes this walkthrough of a multi-threaded Python crawler for Qiushibaike. I hope it serves as a useful reference.