Recently I needed to work on an anti-crawler / bot-detection problem and felt a bit lost about where to start. So I decided to approach it from the other side: first write a crawler in Python myself, understand how one works from the inside, and then come back to blocking them. So, without further delay, here is a crawler program.
How to use:
Create a new file called bugbaidu.py, paste the code below into it, and double-click it to run.
Program function:
Saves the text of every post made by the thread starter (the OP) in a Baidu Tieba thread to a local .txt file.
OK, enough talk; straight to the code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import string
import urllib2
import re

# ----------- handle the various tags on the page -----------
class HTML_Tool:
    # non-greedy match for \t, \n, spaces, hyperlinks or images
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")
    # non-greedy match for any <> tag
    EndCharToNoneRex = re.compile("<.*?>")
    # non-greedy match for any <p> tag
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")
    # convert some HTML character entities back to their original symbols
    replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
                  ("&quot;", "\""), ("&nbsp;", " ")]

    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("", x)
        x = self.BgnPartRex.sub("\n    ", x)
        x = self.CharToNewLineRex.sub("\n", x)
        x = self.CharToNextTabRex.sub("\t", x)
        x = self.EndCharToNoneRex.sub("", x)
        for t in self.replaceTab:
            x = x.replace(t[0], t[1])
        return x


class yzw_spider:
    # declare the relevant attributes
    def __init__(self, url):
        # see_lz=1 filters the thread down to the OP's posts only
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'The spider has started, off we go...'

    # load the first page, decode it, and drive the whole scrape
    def yzw_tieba(self):
        # read the raw page and decode it from utf-8
        myPage = urllib2.urlopen(self.myUrl).read().decode("utf-8")
        # work out how many pages of content the OP posted
        endPage = self.page_counter(myPage)
        # get the title of the thread
        title = self.find_title(myPage)
        print u'Thread title: ' + title
        # fetch and save the final data
        self.save_data(self.myUrl, title, endPage)

    # work out the total number of pages
    def page_counter(self, myPage):
        # match 'total <span class="red">12</span> pages' to get the page count
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Spider report: found %d pages of original content by the OP' % endPage
        else:
            endPage = 0
            print u'Spider report: unable to work out how many pages the OP posted!'
        return endPage

    # find the title of the thread
    # NOTE: the listing in the post breaks off here; the rest of the class is
    # one plausible completion matching the behaviour described in the post.
    def find_title(self, myPage):
        # match <h1 ...>title</h1> to pull out the thread title
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'Untitled'
        if myMatch:
            title = myMatch.group(1)
        else:
            print u'Spider report: unable to load the thread title!'
        # a file name may not contain any of \ / : * ? " < > |
        for ch in u'\\/:*?"<>|':
            title = title.replace(ch, '')
        return title

    # save the OP's posts to a local txt file
    def save_data(self, url, title, endPage):
        # load every page's cleaned-up content into self.datas
        self.get_data(url, endPage)
        # write it all out to a local file named after the thread
        f = open(title + '.txt', 'w+')
        f.writelines([d.encode('utf-8') for d in self.datas])
        f.close()
        print u'Spider report: the content has been saved locally as a txt file'

    # fetch each page of the thread in turn
    def get_data(self, url, endPage):
        url = url + '&pn='
        for i in range(1, endPage + 1):
            print u'Spider report: loading page %d...' % i
            myPage = urllib2.urlopen(url + str(i)).read().decode("utf-8")
            self.deal_data(myPage)

    # cut the OP's posts out of the page source and clean them up
    def deal_data(self, myPage):
        myItems = re.findall('id="post_content.*?>(.*?)</div>', myPage, re.S)
        for item in myItems:
            data = self.myTool.Replace_Char(item.replace("\n", ""))
            self.datas.append(data + '\n')


# -------------- entry point --------------
bdurl = str(raw_input('Please enter the thread url: '))
spider = yzw_spider(bdurl)
spider.yzw_tieba()

In the end the script produces a .txt file, and inside it is exactly the content I needed to crawl:
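The script above is Python 2 (urllib2, print u'' strings). To make the two core regex tricks easy to try out today, here is a minimal Python 3 sketch of the same tag-stripping and page-counting logic; the cleanup rules and the class="red" pattern are taken from the script above, while the sample inputs are made up for illustration:

```python
import re

# Same cleanup rules as HTML_Tool above, as plain module-level code.
BgnCharToNoneRex = re.compile(r"(\t|\n| |<a.*?>|<img.*?>)")  # tabs, newlines, spaces, links, images
EndCharToNoneRex = re.compile(r"<.*?>")                       # any leftover tag
BgnPartRex = re.compile(r"<p.*?>")                            # opening <p> tags
CharToNewLineRex = re.compile(r"(<br/>|</p>|<tr>|<div>|</div>)")
CharToNextTabRex = re.compile(r"<td>")
replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
              ("&quot;", "\""), ("&nbsp;", " ")]

def replace_char(x):
    """Strip tags and decode a few HTML entities, like HTML_Tool.Replace_Char."""
    x = BgnCharToNoneRex.sub("", x)
    x = BgnPartRex.sub("\n    ", x)
    x = CharToNewLineRex.sub("\n", x)
    x = CharToNextTabRex.sub("\t", x)
    x = EndCharToNoneRex.sub("", x)
    for old, new in replaceTab:
        x = x.replace(old, new)
    return x

def page_counter(page_html):
    """Pull the page count out of markup like '<span class="red">12</span>'."""
    m = re.search(r'class="red">(\d+?)</span>', page_html, re.S)
    return int(m.group(1)) if m else 0

print(repr(replace_char("a<br/>&lt;b&gt;")))          # prints 'a\n<b>'
print(page_counter('<span class="red">12</span>'))    # prints 12
```

Note that the space in the first pattern also eats spaces inside tag attributes (e.g. `<p class="x">` becomes `<pclass="x">` before `BgnPartRex` removes it), which is crude but works for this cleanup pipeline because every tag is stripped in the end anyway.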
(Screenshot: Qq20160624131837.png)
And just like that, the crawler program is done.
This article is from the "Microsoft" blog; please keep this source link: http://1238306.blog.51cto.com/1228306/1792536
Using Python to write a Baidu Tieba crawler