Building a crawler for Baidu Post Bar (Baidu Tieba) works in basically the same way as the Qiushibaike crawler from the previous article: the key data is extracted from the page source with regular expressions and stored in a local txt file.
Download source code:
http://download.csdn.net/detail/wxg694175346/6925583
Project content:
A web crawler for Baidu Post Bar (Tieba), written in Python.
Usage:
Create a new BugBaidu.py file, copy the code into it, and double-click it to run.
Program functions:
Grab the content posted by the original poster (OP) in a thread, package it, and store it locally as a txt file.
Principles:
First, let's look at a typical thread. After switching to "view the original poster only" and then clicking to the second page, the URL changes slightly and becomes:
http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1
As you can see, see_lz=1 stands for "original poster (OP) only", and pn is the page number. Remember these two parameters; they will be needed later.
This is the URL format we will use, as the sketch below shows.
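As a minimal sketch (Python 2.7, matching the program later in the article), the per-page URL can be built like this; the helper name page_url is ours for illustration, not part of the original program:

# -*- coding: utf-8 -*-
# Build the "OP only" URL for each page of the example thread above.
base_url = 'http://tieba.baidu.com/p/2296712428'

def page_url(pn):
    # see_lz=1 restricts the thread to the original poster's posts;
    # pn selects the page number.
    return base_url + '?see_lz=1&pn=%d' % pn

print page_url(1)  # http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1
print page_url(2)  # http://tieba.baidu.com/p/2296712428?see_lz=1&pn=2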
The next step is to look at the page source code.
The first thing to pull out is the title, which will be used later as the name of the saved file.
We can see that Baidu serves the page in gbk encoding and that the title is wrapped in an <h1> tag:
[Original] Fashion Director (about fashion, fame and fortune, career, love, inspirational)
Similarly, each post body sits in a tag marked with a class/id attribute (the program below matches div blocks whose id starts with post_content). The next step is to match the body with a regular expression, as sketched below.
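Before the full program, here is a minimal sketch of that extraction step. It uses the same patterns the program below relies on; Tieba's markup has changed over the years, so treat the tag and id names as illustrative assumptions:

# -*- coding: utf-8 -*-
import urllib2
import re

# Fetch one OP-only page, decode it from gbk, and pull out
# the title and the post bodies with regular expressions.
page = urllib2.urlopen('http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1').read()
page = page.decode('gbk')  # Baidu serves these pages in gbk

# the title sits inside an <h1> tag
title_match = re.search(r'<h1.*?>(.*?)</h1>', page, re.S)
if title_match:
    print title_match.group(1)

# each post body sits in a <div> whose id starts with "post_content"
bodies = re.findall(r'id="post_content.*?>(.*?)</div>', page, re.S)
print 'found %d OP posts on this page' % len(bodies)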
The complete code:
# -*- coding: utf-8 -*-
#---------------------------------------
#   Program:  Baidu Tieba crawler
#   Version:  0.5
#   Author:   why
#   Date:     2013-05-16
#   Language: Python 2.7
#   Usage:    enter a thread URL; the crawler fetches the
#             "original poster only" view and saves it locally.
#   Function: package the content posted by the OP into a txt file.
#---------------------------------------

import string
import urllib2
import re

#----------- handle the various tags on the page -----------
class HTML_Tool:
    # non-greedy match for \t, \n, space, hyperlink or image tags
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")

    # non-greedy match for any <> tag
    EndCharToNoneRex = re.compile("<.*?>")

    # non-greedy match for any <p> tag
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")

    # convert some html character entities back into plain symbols
    replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
                  ("&quot;", '"'), ("&nbsp;", " ")]

    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("", x)
        x = self.BgnPartRex.sub("\n", x)
        x = self.CharToNewLineRex.sub("\n", x)
        x = self.CharToNextTabRex.sub("\t", x)
        x = self.EndCharToNoneRex.sub("", x)
        for t in self.replaceTab:
            x = x.replace(t[0], t[1])
        return x

class Baidu_Spider:
    # declare the related attributes
    def __init__(self, url):
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'The Baidu Tieba crawler has started, click click'

    # load the first page, transcode it and kick off the crawl
    def baidu_tieba(self):
        # read the raw page and decode it from gbk
        myPage = urllib2.urlopen(self.myUrl).read().decode("gbk")
        # work out how many pages of OP content there are in total
        endPage = self.page_counter(myPage)
        # get the title of this thread
        title = self.find_title(myPage)
        print u'Thread title: ' + title
        # fetch and save the final data
        self.save_data(self.myUrl, title, endPage)

    # used to work out the total number of pages
    def page_counter(self, myPage):
        # match something like <span class="red">12</span> to get the page count
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Crawler report: found %d pages of original content by the OP' % endPage
        else:
            endPage = 0
            print u'Crawler report: cannot work out how many pages the OP posted!'
        return endPage

    # used to find the title of this thread
    def find_title(self, myPage):
        # match <h1 ...>xxxxxxxxxx</h1> to pick out the title
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'Untitled'
        if myMatch:
            title = myMatch.group(1)
        else:
            print u'Crawler report: cannot load the thread title!'
        # a file name cannot contain these characters: \ / : * ? " < > |
        title = title.replace('\\', '').replace('/', '').replace(':', '') \
                     .replace('*', '').replace('?', '').replace('"', '') \
                     .replace('>', '').replace('<', '').replace('|', '')
        return title

    # used to store the content posted by the OP
    def save_data(self, url, title, endPage):
        # load the page data into the array
        self.get_data(url, endPage)
        # open the local file
        f = open(title + '.txt', 'w+')
        f.writelines(self.datas)
        f.close()
        print u'Crawler report: the content has been downloaded and packaged into a txt file'
        print u'Press any key to exit...'
        raw_input()

    # fetch the source of each page and store it in the array
    def get_data(self, url, endPage):
        url = url + '&pn='
        for i in range(1, endPage + 1):
            print u'Crawler report: crawler #%d is loading...' % i
            myPage = urllib2.urlopen(url + str(i)).read()
            # clean the html code in myPage and store it in datas
            self.deal_data(myPage.decode('gbk'))

    # pick the post content out of the page source
    def deal_data(self, myPage):
        myItems = re.findall('id="post_content.*?>(.*?)</div>', myPage, re.S)
        for item in myItems:
            data = self.myTool.Replace_Char(item.replace("\n", "").encode('gbk'))
            self.datas.append(data + '\n')

#-------- program entrance --------------------
print u"""#---------------------------------------
#   Program:  Baidu Tieba crawler
#   Version:  0.5
#   Author:   why
#   Date:     2013-05-16
#   Language: Python 2.7
#   Usage:    enter a thread URL; the crawler fetches the
#             "original poster only" view and saves it locally.
#---------------------------------------
"""

# take a novel thread as an example
# bdurl = 'http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1'

print u'Enter the numeric string at the end of the thread address:'
bdurl = 'http://tieba.baidu.com/p/' + str(raw_input(u'http://tieba.baidu.com/p/'))

# start the crawl
mySpider = Baidu_Spider(bdurl)
mySpider.baidu_tieba()
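A typical run looks like this (the prompts come from the print statements above; the thread number is the example thread from earlier, and the actual title and page count depend on the thread):

$ python BugBaidu.py
Enter the numeric string at the end of the thread address:
http://tieba.baidu.com/p/2296712428
The Baidu Tieba crawler has started, click click
Thread title: ...
Crawler report: found ... pages of original content by the OP

The crawler then loads each page in turn and writes the cleaned posts to '<thread title>.txt' in the current directory.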
The above is the source code and walkthrough for [Python] web crawler (9): a web crawler for Baidu Post Bar (v0.4).