Baidu Tieba (Post Bar) web crawler example in Python
This article describes a web crawler for Baidu Tieba (Post Bar) written in Python, shared here for your reference. The details are as follows:
The complete example code is available for download from the original post.
Project content:
A web crawler for Baidu Tieba (Post Bar), written in Python.
Usage:
Create a new file named BugBaidu.py, copy the code into it, and double-click the file to run it.
Program functions:
Fetch the posts written by the original poster (OP) in a thread and save them to a local txt file.
Principles:
First, let's look at a thread. After clicking "view the original poster only" and switching pages, the URL changes slightly and becomes:
http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1
As you can see, see_lz=1 restricts the page to the original poster's posts, and pn=1 is the corresponding page number. Remember this for later.
This is the URL format we need to use.
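The per-page URL can be assembled from the thread's base URL and these two query parameters. A minimal sketch (the helper name build_page_url is mine, not from the original code):

```python
# Build the URL for one page of OP-only posts in a Tieba thread.
# see_lz=1 restricts the page to the original poster; pn is the page number.
def build_page_url(base_url, page):
    return "%s?see_lz=1&pn=%d" % (base_url, page)

url = build_page_url("http://tieba.baidu.com/p/2296712428", 1)
print(url)  # http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1
```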
The next step is to view the page source code.
First, extract the thread title; it will be used to name the file we save.
We can see that Baidu uses gbk encoding, and that the title is wrapped in an h1 tag:
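A small sketch of this title-extraction step. The HTML snippet and the class name here are stand-ins for illustration, not the real Tieba markup:

```python
import re

# Stand-in for a gbk-encoded Tieba page already decoded to a unicode string
page = u'<html><h1 class="core_title_txt">Test thread title</h1>...</html>'

# Non-greedy match on the <h1> tag, as the article describes
m = re.search(r'<h1.*?>(.*?)</h1>', page, re.S)
title = m.group(1) if m else u'Untitled'
print(title)  # Test thread title
```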
Similarly, each post body is wrapped in a div with a class attribute. The next step is to match both with regular expressions.
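As a hedged illustration of matching the body divs, assuming the posts sit in div elements whose class is something like d_post_content (the actual class name on Tieba may differ):

```python
import re

# Stand-in markup; the class name "d_post_content" is an assumption
page = (u'<div id="p1" class="d_post_content">First floor text</div>'
        u'<div id="p2" class="d_post_content">Second floor text</div>')

# Non-greedy regex pulling the inner text of each post body div
posts = re.findall(r'<div[^>]*class="d_post_content"[^>]*>(.*?)</div>', page, re.S)
print(posts)  # [u'First floor text', u'Second floor text']
```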
Run:
Generated txt file:
# -*- coding: utf-8 -*-
# -------------------------------------
# Program:  Baidu Tieba crawler
# Version:  0.5
# Author:   why
# Date:     2013-05-16
# Language: Python 2.7
# Usage:    enter a thread URL; only the OP's posts are fetched and saved locally.
# Function: save the content posted by the OP as a txt file.
# -------------------------------------

import string
import urllib2
import re

# ----------- Handle the various tags on the page -----------
class HTML_Tool:
    # Non-greedy match for \t, \n, &nbsp;, hyperlinks or images
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")
    # Non-greedy match for any <> tag
    EndCharToNoneRex = re.compile("<.*?>")
    # Non-greedy match for any <p> tag
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")
    # Convert some HTML character entities back to the original symbols
    replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
                  ("&quot;", "\""), ("&nbsp;", " ")]

    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("", x)
        x = self.BgnPartRex.sub("\n", x)
        x = self.CharToNewLineRex.sub("\n", x)
        x = self.CharToNextTabRex.sub("\t", x)
        x = self.EndCharToNoneRex.sub("", x)
        for t in self.replaceTab:
            x = x.replace(t[0], t[1])
        return x

class Baidu_Spider:
    # Declare the attributes
    def __init__(self, url):
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'The Baidu Tieba crawler has started'

    # Initialize, load the page, transcode it and save it
    def baidu_tieba(self):
        # Read the raw page and transcode it from gbk
        myPage = urllib2.urlopen(self.myUrl).read().decode("gbk")
        # Count the total number of pages of the OP's posts
        endPage = self.page_counter(myPage)
        # Get the title of this thread
        title = self.find_title(myPage)
        print u'Thread title: ' + title
        # Fetch the final data
        self.save_data(self.myUrl, title, endPage)

    # Count the total number of pages
    def page_counter(self, myPage):
        # Match 'Total <span class="red">12</span> pages' to get the page count
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Crawler report: found %d pages of OP-only content' % endPage
        else:
            endPage = 0
            print u'Crawler report: could not determine the number of pages of OP content!'
        return endPage

    # Find the title of this thread
    def find_title(self, myPage):
        # Match the <h1> tag that wraps the title
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'Untitled'
        if myMatch:
            title = myMatch.group(1)
        else:
            print u'Crawler report: could not extract the thread title!'
        return title
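The HTML_Tool class above boils down to a chain of regex substitutions. Here is a small, self-contained sketch of the same idea that also runs under Python 3 (the function name clean_html and the sample input are mine, not from the original):

```python
import re

# Same idea as HTML_Tool.Replace_Char: drop inline tags, map structural
# tags to newlines/tabs, strip the rest, then decode a few HTML entities.
def clean_html(x):
    x = re.sub(r'(\t|\n|&nbsp;|<a.*?>|<img.*?>)', '', x)   # noise and inline markup
    x = re.sub(r'<p.*?>', '\n', x)                          # opening <p> -> newline
    x = re.sub(r'(<br/>|</p>|<tr>|<div>|</div>)', '\n', x)  # block breaks -> newline
    x = re.sub(r'<td>', '\t', x)                            # table cells -> tab
    x = re.sub(r'<.*?>', '', x)                             # any remaining tag
    for old, new in [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"), ("&quot;", '"')]:
        x = x.replace(old, new)
    return x

print(clean_html('<div><p>Hello <b>world</b></p></div>').strip())  # Hello world
```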
I hope this article will help you with Python programming.