Python Multi-Threaded Crawler Walkthrough: Crawling Qiushibaike (the "Embarrassing Encyclopedia")

Source: Internet
Author: User
This article shares a worked example of a multi-threaded Python crawler that scrapes Qiushibaike (the "Embarrassing Encyclopedia" joke site). It should be a useful reference for anyone interested in Python.

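Multi-threaded crawler: a crawler in which some sections of the program run concurrently. Choosing a sensible number of threads can make the crawler more efficient, mainly because one thread can download a page while another is still waiting on the network.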
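As a rough illustration of why threads help, the sketch below times the same simulated downloads run one after another and then with one thread per page. The names `fake_fetch`, `run_sequential`, and `run_threaded` and the 0.2-second `time.sleep` are invented for this example and are not part of the original crawler. The sleep stands in for waiting on the network, during which Python can switch to another thread.

```python
import threading
import time


def fake_fetch(page, results):
    """Hypothetical stand-in for a page download: wait, then record the page."""
    time.sleep(0.2)  # simulated network latency (assumed value)
    results.append(page)


def run_sequential(pages):
    """Fetch pages one after another; return (results, elapsed seconds)."""
    results = []
    start = time.perf_counter()
    for page in pages:
        fake_fetch(page, results)
    return results, time.perf_counter() - start


def run_threaded(pages):
    """Fetch each page on its own thread; return (results, elapsed seconds)."""
    results = []
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_fetch, args=(p, results)) for p in pages]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread before stopping the clock
    return results, time.perf_counter() - start


if __name__ == "__main__":
    pages = [1, 2, 3, 4]
    _, seq_time = run_sequential(pages)
    _, thr_time = run_threaded(pages)
    print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

The threaded run should finish in roughly the time of a single simulated download, because the waits overlap. The order in which threads append their results is not guaranteed.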

Qiushibaike: an ordinary crawler versus a multi-threaded crawler

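Analyzing the site's URLs shows that the listing pages follow this pattern, where the final path segment is the page number:

https://www.qiushibaike.com/8hr/page/<page number>/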
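The pattern above can be wrapped in a small helper. This is only a minimal sketch: `BASE_URL` and `page_url` are names chosen for this example, not taken from the original code.

```python
BASE_URL = "https://www.qiushibaike.com/8hr/page/"


def page_url(page):
    """Build the listing URL for the given page number."""
    return f"{BASE_URL}{page}/"


if __name__ == "__main__":
    for n in range(1, 4):
        print(page_url(n))
```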

Multi-threading in Python works much like multi-threading in Java: you subclass the thread class and put the work in run(). Here is the code:


    '''
    # Ordinary (single-threaded) crawler
    import urllib.request
    import urllib.error
    import re

    headers = ("User-Agent",
               "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)
    # NOTE: the page range was lost from the source text; pages 1 to 11 are
    # assumed here to match the multi-threaded version below.
    for i in range(1, 12):
        url = "https://www.qiushibaike.com/8hr/page/" + str(i) + "/"
        pagedata = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        pattern = '<p class="content">.*?<span>(.*?)</span>(.*?)</p>'
        datalist = re.compile(pattern, re.S).findall(pagedata)
        for j in range(0, len(datalist)):
            print("Page " + str(i) + ", joke " + str(j) + ":")
            print(datalist[j])
    '''
    '''
    # Introductory multi-threading example
    import threading  # import the threading module

    class A(threading.Thread):  # define thread class A
        def __init__(self):  # initialize the thread via the parent class
            threading.Thread.__init__(self)
        def run(self):  # run() holds the code the thread executes
            for i in range(0, 11):
                print("I am thread A")

    class B(threading.Thread):  # define thread class B
        def __init__(self):
            threading.Thread.__init__(self)
        def run(self):
            for i in range(0, 11):
                print("I am thread B")

    t1 = A()    # instantiate the thread
    t1.start()  # start the thread
    t2 = B()
    t2.start()
    '''
    # Revised multi-threaded crawler:
    # odd and even pages are crawled by two separate threads
    import urllib.request
    import urllib.error
    import re
    import threading

    headers = ("User-Agent",
               "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)

    class one(threading.Thread):  # crawls the odd-numbered pages
        def __init__(self):
            threading.Thread.__init__(self)
        def run(self):
            for i in range(1, 12, 2):
                url = "https://www.qiushibaike.com/8hr/page/" + str(i) + "/"
                pagedata = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
                pattern = '<p class="content">.*?<span>(.*?)</span>(.*?)</p>'
                datalist = re.compile(pattern, re.S).findall(pagedata)
                for j in range(0, len(datalist)):
                    print("Page " + str(i) + ", joke " + str(j) + ":")
                    print(datalist[j])

    # The second class name is garbled in the source; "two" is assumed.
    class two(threading.Thread):  # crawls the even-numbered pages
        def __init__(self):
            threading.Thread.__init__(self)
        def run(self):
            for i in range(2, 12, 2):
                url = "https://www.qiushibaike.com/8hr/page/" + str(i) + "/"
                pagedata = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
                pattern = '<p class="content">.*?<span>(.*?)</span>(.*?)</p>'
                datalist = re.compile(pattern, re.S).findall(pagedata)
                for j in range(0, len(datalist)):
                    print("Page " + str(i) + ", joke " + str(j) + ":")
                    print(datalist[j])

    t1 = one()
    t2 = two()  # the source shows one() here, which would crawl the odd pages twice
    t1.start()
    t2.start()
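The odd/even split above hard-codes two thread classes, and both threads print as they go, so their output can interleave. One possible alternative, sketched here under stated assumptions, is to hand the page numbers to a thread pool from the standard library's `concurrent.futures` and collect the results per page. `extract_jokes`, `crawl`, `fake_fetch`, and the `fetch` parameter are hypothetical names for this example. The demo uses a fake fetcher with made-up HTML so that it runs without network access, and the regular expression is the one from the code above, which may not match the site's real markup.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Same pattern as the crawler above; re.S lets "." match newlines.
PATTERN = re.compile(r'<p class="content">.*?<span>(.*?)</span>(.*?)</p>', re.S)


def extract_jokes(html):
    """Return the text captured inside each content <span>."""
    return [span.strip() for span, _rest in PATTERN.findall(html)]


def crawl(pages, fetch, max_workers=2):
    """Fetch pages concurrently and return {page: [jokes]}.

    `fetch` is any callable that takes a page number and returns HTML text.
    """
    pages = list(pages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so zip pairs page with HTML.
        htmls = pool.map(fetch, pages)
        return {page: extract_jokes(html) for page, html in zip(pages, htmls)}


def fake_fetch(page):
    """Made-up HTML for demonstration only (not the site's real markup)."""
    return (
        f'<p class="content">\n<span>joke A on page {page}</span>\n</p>'
        f'<p class="content">\n<span>joke B on page {page}</span>\n</p>'
    )


if __name__ == "__main__":
    for page, jokes in crawl(range(1, 4), fake_fetch).items():
        for j, joke in enumerate(jokes):
            print(f"Page {page}, joke {j}: {joke}")
```

To crawl the real site, `fake_fetch` could be swapped for a function that calls `urllib.request.urlopen` on the page URL, as the original code does. Because parsing and printing happen after the pool returns, the output stays grouped by page.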


That concludes this walkthrough of a multi-threaded Python crawler for Qiushibaike. I hope it serves as a useful reference.
