Recently I needed to work on an anti-crawler / bot-detection problem and felt a bit lost about where to start. So I decided to approach it from the other side: first write a crawler in Python myself, understand how one works from the inside, and then come back to blocking them. So, without further delay, here is a crawler program.
How to use:
Create a new file called bugbaidu.py, paste the code below into it, and double-click it to run.
Program function:
Saves the text of every post made by the thread starter (the OP) in a Baidu Tieba thread to a local .txt file.
OK, enough talk; straight to the code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import string
import urllib2
import re

# ----------- handle the various tags on the page -----------
class HTML_Tool:
    # non-greedy match for \t, \n, spaces, hyperlinks or images
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")
    # non-greedy match for any <> tag
    EndCharToNoneRex = re.compile("<.*?>")
    # non-greedy match for any <p> tag
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")
    # convert some HTML character entities back to their original symbols
    replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
                  ("&quot;", "\""), ("&nbsp;", " ")]

    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("", x)
        x = self.BgnPartRex.sub("\n    ", x)
        x = self.CharToNewLineRex.sub("\n", x)
        x = self.CharToNextTabRex.sub("\t", x)
        x = self.EndCharToNoneRex.sub("", x)
        for t in self.replaceTab:
            x = x.replace(t[0], t[1])
        return x


class yzw_spider:
    # declare the relevant attributes
    def __init__(self, url):
        # see_lz=1 filters the thread down to the OP's posts only
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'The spider has started, off we go...'

    # load the first page, decode it, and drive the whole scrape
    def yzw_tieba(self):
        # read the raw page and decode it from utf-8
        myPage = urllib2.urlopen(self.myUrl).read().decode("utf-8")
        # work out how many pages of content the OP posted
        endPage = self.page_counter(myPage)
        # get the title of the thread
        title = self.find_title(myPage)
        print u'Thread title: ' + title
        # fetch and save the final data
        self.save_data(self.myUrl, title, endPage)

    # work out the total number of pages
    def page_counter(self, myPage):
        # match 'total <span class="red">12</span> pages' to get the page count
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Spider report: found %d pages of original content by the OP' % endPage
        else:
            endPage = 0
            print u'Spider report: unable to work out how many pages the OP posted!'
        return endPage

    # find the title of the thread
    # NOTE: the listing in the post breaks off here; the rest of the class is
    # one plausible completion matching the behaviour described in the post.
    def find_title(self, myPage):
        # match <h1 ...>title</h1> to pull out the thread title
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'Untitled'
        if myMatch:
            title = myMatch.group(1)
        else:
            print u'Spider report: unable to load the thread title!'
        # a file name may not contain any of \ / : * ? " < > |
        for ch in u'\\/:*?"<>|':
            title = title.replace(ch, '')
        return title

    # save the OP's posts to a local txt file
    def save_data(self, url, title, endPage):
        # load every page's cleaned-up content into self.datas
        self.get_data(url, endPage)
        # write it all out to a local file named after the thread
        f = open(title + '.txt', 'w+')
        f.writelines([d.encode('utf-8') for d in self.datas])
        f.close()
        print u'Spider report: the content has been saved locally as a txt file'

    # fetch each page of the thread in turn
    def get_data(self, url, endPage):
        url = url + '&pn='
        for i in range(1, endPage + 1):
            print u'Spider report: loading page %d...' % i
            myPage = urllib2.urlopen(url + str(i)).read().decode("utf-8")
            self.deal_data(myPage)

    # cut the OP's posts out of the page source and clean them up
    def deal_data(self, myPage):
        myItems = re.findall('id="post_content.*?>(.*?)</div>', myPage, re.S)
        for item in myItems:
            data = self.myTool.Replace_Char(item.replace("\n", ""))
            self.datas.append(data + '\n')


# -------------- entry point --------------
bdurl = str(raw_input('Please enter the thread url: '))
spider = yzw_spider(bdurl)
spider.yzw_tieba()

In the end the script produces a .txt file, and inside it is exactly the content I needed to crawl:
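The script above is Python 2 (urllib2, print u'' strings). To make the two core regex tricks easy to try out today, here is a minimal Python 3 sketch of the same tag-stripping and page-counting logic; the cleanup rules and the class="red" pattern are taken from the script above, while the sample inputs are made up for illustration:

```python
import re

# Same cleanup rules as HTML_Tool above, as plain module-level code.
BgnCharToNoneRex = re.compile(r"(\t|\n| |<a.*?>|<img.*?>)")  # tabs, newlines, spaces, links, images
EndCharToNoneRex = re.compile(r"<.*?>")                       # any leftover tag
BgnPartRex = re.compile(r"<p.*?>")                            # opening <p> tags
CharToNewLineRex = re.compile(r"(<br/>|</p>|<tr>|<div>|</div>)")
CharToNextTabRex = re.compile(r"<td>")
replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
              ("&quot;", "\""), ("&nbsp;", " ")]

def replace_char(x):
    """Strip tags and decode a few HTML entities, like HTML_Tool.Replace_Char."""
    x = BgnCharToNoneRex.sub("", x)
    x = BgnPartRex.sub("\n    ", x)
    x = CharToNewLineRex.sub("\n", x)
    x = CharToNextTabRex.sub("\t", x)
    x = EndCharToNoneRex.sub("", x)
    for old, new in replaceTab:
        x = x.replace(old, new)
    return x

def page_counter(page_html):
    """Pull the page count out of markup like '<span class="red">12</span>'."""
    m = re.search(r'class="red">(\d+?)</span>', page_html, re.S)
    return int(m.group(1)) if m else 0

print(repr(replace_char("a<br/>&lt;b&gt;")))          # prints 'a\n<b>'
print(page_counter('<span class="red">12</span>'))    # prints 12
```

Note that the space in the first pattern also eats spaces inside tag attributes (e.g. `<p class="x">` becomes `<pclass="x">` before `BgnPartRex` removes it), which is crude but works for this cleanup pipeline because every tag is stripped in the end anyway.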
(Screenshot: Qq20160624131837.png)
And just like that, the crawler program is done.
This article is from the "Microsoft" blog; please keep this source link: http://1238306.blog.51cto.com/1228306/1792536
Using Python to write a Baidu Tieba crawler