import os
import re
import threading  # multithreading
import urllib.request
from urllib.error import URLError  # urllib's exception module


# Spider that fetches each page's source and downloads the images it links to
class QsSpider:
    # __init__ is the constructor; self is the instance itself
    def __init__(self):
        self.user_agent = ('Mozilla/5.0 (Windows NT 10.0; WOW64) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/63.0.3239.84 Safari/537.36')
        self.header = {'User-Agent': self.user_agent}
        self.url = 'http://www.qiubaichengren.net/%s.html'
        self.save_dir = './img'
        self.page_num = 20  # number of pages to crawl

    # Get the site's source code
    def load_html(self, page):
        try:
            web_path = self.url % page
            request = urllib.request.Request(web_path, headers=self.header)
            with urllib.request.urlopen(request) as f:
                html_content = f.read().decode('gbk')
                # print(html_content)
                self.pick_pic(html_content)
        except URLError as e:
            print(e.reason)  # cause of the exception

    # Download one image
    def save_pic(self, img):
        save_path = self.save_dir + "/" + img.replace(':', '@').replace('/', '_')
        if not os.path.exists(self.save_dir):
            os.makedirs(self.save_dir)
        print(save_path)
        urllib.request.urlretrieve(img, save_path)

    # Filter image URLs out of the page source
    def pick_pic(self, html_content):
        patren = re.compile(r'src="(http:.*?\.(?:jpg|png|gif))')
        pic_path_list = patren.findall(html_content)
        for i in pic_path_list:
            # print(i)
            self.save_pic(str(i))

    # Crawl each page in its own thread
    def start(self):
        for i in range(1, self.page_num):
            # note: args must be a tuple, hence the trailing comma
            thread = threading.Thread(target=self.load_html, args=(str(i),))
            thread.start()


# main
spider = QsSpider()
spider.start()
First, the crawler process:
1. Initiating the request
Use an HTTP library to send a request to the target site.
A request includes request headers, a request body, etc.
Limitation of this stage: a plain HTTP request cannot execute JS or apply CSS the way a browser does.
2. Get Response Content
If the server responds normally, you receive a response.
A response may contain HTML, JSON, pictures, videos, etc.
3. Parsing content
Parsing HTML data: regular expressions (the re module), or third-party parsing libraries such as BeautifulSoup and PyQuery
Parsing JSON data: the json module
Parsing binary data: write it to a file in 'wb' (binary write) mode
4. Save data
Database (MySQL, MongoDB, Redis)
File
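The four steps above can be sketched with the standard library alone; the URL and HTML snippet below are illustrative placeholders, not taken from the crawled site:

```python
import json
import re
import urllib.request

# Step 1: build the request with a browser-like User-Agent header.
# (urllib.request.urlopen(request) would actually send it over the network.)
request = urllib.request.Request(
    "http://example.com/1.html",
    headers={"User-Agent": "Mozilla/5.0"},
)
print(request.full_url)

# Steps 2-3: suppose the response body were this HTML; extract image URLs
# with a regular expression, as the spider above does.
html = '<img src="http://example.com/a.jpg"><img src="http://example.com/b.png">'
pics = re.findall(r'src="(http:.*?\.(?:jpg|png|gif))"', html)
print(pics)  # ['http://example.com/a.jpg', 'http://example.com/b.png']

# Step 3 (JSON responses): parse with the json module.
data = json.loads('{"status": 200, "msg": "ok"}')
print(data["status"])  # 200

# Step 4: binary data (e.g. an image body) is saved in 'wb' mode.
with open("demo.bin", "wb") as f:
    f.write(b"\x89PNG")
```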
Second, the response
1. Response Status Code
200: success
301: permanent redirect
404: file does not exist
403: access forbidden
502: server error (bad gateway)
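A minimal sketch of reading the status code with urllib (the URL passed in is whatever you are fetching; error codes such as 404, 403, and 502 arrive as an HTTPError, while redirects like 301 are followed automatically by urlopen):

```python
import urllib.request
from urllib.error import HTTPError, URLError

def fetch_status(url):
    """Return the HTTP status code for url, or None on a network failure."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status        # 200 on success
    except HTTPError as e:
        return e.code                 # 404, 403, 502, ...
    except URLError as e:
        print(e.reason)               # network-level failure, no status code
        return None
```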
Third, HTTP protocol: request and response
Request: the user's browser (a socket client) sends the user's information to the server (a socket server)
Response: the server receives the request, parses the user's request information, and returns data (the returned data may contain links to other resources, such as pictures, JS, and CSS)
PS: after receiving the response, a browser parses its contents and renders them for the user; a crawler instead sends the request, receives the response, and extracts the useful data from it programmatically
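The client/server exchange described above can be sketched with raw sockets; here a toy in-process server stands in for the web server, and the client plays the browser/crawler role:

```python
import socket
import threading

# Toy server: accept one connection, read the request, send one response.
def serve_once(server_sock):
    conn, _ = server_sock.accept()
    conn.recv(1024)  # read the client's HTTP request
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello")
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))      # bind to any free local port
server.listen(1)
threading.Thread(target=serve_once, args=(server,)).start()

# Client (the "browser"): send a request, receive the response.
client = socket.socket()
client.connect(("127.0.0.1", server.getsockname()[1]))
client.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
response = client.recv(1024).decode()
print(response.splitlines()[0])    # the status line of the response
client.close()
server.close()
```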
Fourth, results
The script above uses Python to crawl images from the qiubaichengren site, exercising multithreading, urllib.request, and re.