Writing a Baidu Tieba (forum) crawler in Python

Source: Internet
Author: User

Recently I needed to work on an anti-crawler / bot-detection puzzle and felt a bit lost, with no idea where to start. Rather than stay stuck, I figured the best way to understand the various principles involved was to write a spider in Python myself, so I sat down right away and wrote a crawler program.

How to use:

Create a new file named bugbaidu.py, copy the code below into it, and double-click the file to run it.

Program function:

Downloads the posts written by the thread starter (the "landlord", i.e., the original poster) of a Tieba thread and saves them to a local .txt file.
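Concretely, the spider relies on two Tieba URL query parameters: see_lz=1 (show only the original poster's replies) and pn=N (the page number). A minimal helper illustrating how the page URLs are built — the function name is mine, not from the listing below:

```python
def build_page_url(thread_url, page):
    # see_lz=1 restricts the thread to the original poster's ("landlord's") posts;
    # pn selects the page number within the thread.
    return thread_url + '?see_lz=1' + '&pn=%d' % page

print(build_page_url('http://tieba.baidu.com/p/123456', 2))
```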

OK, without further ado, straight to the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import urllib2
import re

# ----------- handles the various tags on the page -----------
class HTML_Tool:
    # non-greedy match for \t, \n, &nbsp;, hyperlinks or images
    BgnCharToNoneRex = re.compile("(\t|\n|&nbsp;|<a.*?>|<img.*?>)")
    # non-greedy match for any <> tag
    EndCharToNoneRex = re.compile("<.*?>")
    # non-greedy match for any <p> tag
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")
    # convert some HTML character entities back to the original symbols
    replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
                  ("&quot;", '"'), ("&nbsp;", " ")]

    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("", x)
        x = self.BgnPartRex.sub("\n    ", x)
        x = self.CharToNewLineRex.sub("\n", x)
        x = self.CharToNextTabRex.sub("\t", x)
        x = self.EndCharToNoneRex.sub("", x)
        for t in self.replaceTab:
            x = x.replace(t[0], t[1])
        return x


class YZW_Spider:
    # declare the relevant attributes
    def __init__(self, url):
        # see_lz=1 asks Tieba for only the original poster's replies
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'The spider has started, crunch crunch...'

    # load the first page, transcode it, and save the content
    def yzw_tieba(self):
        # read the raw page and decode it from UTF-8
        myPage = urllib2.urlopen(self.myUrl).read().decode("utf-8")
        # work out how many pages of content the original poster wrote
        endPage = self.page_counter(myPage)
        # get the title of the post
        title = self.find_title(myPage)
        print u'Article title: ' + title
        # fetch and save the final data
        self.save_data(self.myUrl, title, endPage)

    # compute the total number of pages
    def page_counter(self, myPage):
        # match 'total <span class="red">12</span> pages' to get the page count
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Spider report: found %d pages of original content' % endPage
        else:
            endPage = 0
            print u'Spider report: could not work out how many pages the OP wrote!'
        return endPage

    # find the title of the post
    def find_title(self, myPage):
        # match <h1 ...>xxx</h1> to get the title
        # NOTE: the source listing breaks off at this point; the remainder below
        # is a minimal reconstruction and the selectors are assumptions.
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'Untitled'
        if myMatch:
            title = myMatch.group(1)
        else:
            print u'Spider report: could not find the title!'
        # strip characters that are illegal in file names
        for ch in u'\\/:*?"<>|':
            title = title.replace(ch, '')
        return title

    # fetch every page and write the cleaned content to a .txt file
    def save_data(self, url, title, endPage):
        f = open(title.encode('utf-8') + '.txt', 'w+')
        for i in range(1, endPage + 1):
            print u'Spider report: downloading page %d...' % i
            myPage = urllib2.urlopen(url + '&pn=%d' % i).read().decode('utf-8')
            # pull out each post body and clean the HTML tags
            myItems = re.findall('id="post_content.*?>(.*?)</div>', myPage, re.S)
            for item in myItems:
                data = self.myTool.Replace_Char(item.replace("\n", ""))
                f.write(data.encode('utf-8') + '\n\n')
        f.close()
        print u'Spider report: done, the content has been saved locally.'


# ------------ program entry point ------------
bdurl = str(raw_input('Enter the thread URL, without the ?see_lz=1 part: '))
mySpider = YZW_Spider(bdurl)
mySpider.yzw_tieba()
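The listing above targets Python 2 (urllib2, print statements). The core tag-stripping idea ports directly to Python 3; the sketch below reproduces the cleaning step with names of my own choosing, and can be exercised offline on a small HTML fragment:

```python
import re

# Python 3 sketch of the HTML cleaning step from the listing above.
# The patterns mirror the Python 2 original; the names here are mine.
BGN_NONE = re.compile(r"(\t|\n|&nbsp;|<a.*?>|<img.*?>)")
BGN_PARA = re.compile(r"<p.*?>")
TO_NEWLINE = re.compile(r"(<br/>|</p>|<tr>|<div>|</div>)")
TO_TAB = re.compile(r"<td>")
ANY_TAG = re.compile(r"<.*?>")
ENTITIES = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
            ("&quot;", '"'), ("&nbsp;", " ")]

def clean_html(text):
    text = BGN_NONE.sub("", text)          # drop whitespace runs, links, images
    text = BGN_PARA.sub("\n    ", text)    # paragraph start -> indented new line
    text = TO_NEWLINE.sub("\n", text)      # block-level tags -> newline
    text = TO_TAB.sub("\t", text)          # table cells -> tab
    text = ANY_TAG.sub("", text)           # strip any remaining tags
    for src, dst in ENTITIES:              # decode the common entities
        text = text.replace(src, dst)
    return text

print(clean_html('<div>hello<br/>world &amp; friends</div>'))
```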

In the end a .txt document is produced containing exactly the content I needed to crawl:

[Screenshot: Qq20160624131837.png, showing the resulting .txt output]


And just like that, the crawler program is done.

This article is from the "Microsoft" blog; please keep this source link: http://1238306.blog.51cto.com/1228306/1792536
