[Python] Web Crawler (9): Source Code and Analysis of the Baidu Post Bar Web Crawler (v0.4)

Source: Internet
Author: User

Building a crawler for Baidu Post Bar works on basically the same principle as the Qiubai (Qiushibaike) crawler: the key data is picked out of the page source and stored in a local TXT file.

Project content:

A web crawler for Baidu Post Bar, written in Python.

Usage:

Create a new file named bugbaidu.py, copy the code into it, and double-click it to run.

Program functions:

Collect the content published by the original poster of a thread and save it locally as a TXT file.

Principles:

First, open a thread in the browser. After clicking "view original poster only" and then going to the second page, the URL changes slightly and becomes:

http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1

As you can see, see_lz=1 selects the original poster's posts only, and pn=1 is the page number. Keep this in mind for later.

This is the URL we need to use.
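As a minimal sketch of how such URLs could be assembled, the snippet below builds the query string from the two parameters described above. It targets Python 3 (the original program is Python 2.7), and the helper name build_page_url is made up for illustration; it does not appear in the original program.

```python
from urllib.parse import urlencode


def build_page_url(thread_url, page):
    """Return the 'original poster only' URL for one page of a thread.

    build_page_url is a hypothetical helper, not part of the original crawler.
    """
    # see_lz=1 restricts the thread to the original poster's posts;
    # pn selects the page number.
    query = urlencode({"see_lz": 1, "pn": page})
    return thread_url + "?" + query


print(build_page_url("http://tieba.baidu.com/p/2296712428", 1))
# prints http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1
```

Looping page from 1 up to the total page count would then yield every page of the thread in turn.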

The next step is to view the page source code.

First, extract the title, which will be used to name the saved file.

We can see that Baidu uses GBK encoding and that the title is marked with an h1 tag:

<h1 class="core_title_txt" title="[original] Head of fashion (about fashion, fame and fortune, career, love, inspirational)">[original] Head of fashion (about fashion, fame and fortune, career, love, inspirational)</h1>
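The title lookup can be sketched with a regular expression along these lines. This is a Python 3 illustration under assumptions, not the author's implementation: the sample markup is a shortened copy of the tag above, and safe_filename is an extra hypothetical helper added here because the title ends up as a file name.

```python
import re

# Sample markup modelled on the <h1> tag shown above (title text shortened).
SAMPLE = ('<h1 class="core_title_txt" title="[original] Head of fashion">'
          '[original] Head of fashion</h1>')

TITLE_RE = re.compile(
    r'<h1[^>]*class="core_title_txt"[^>]*>(.*?)</h1>', re.S | re.I)


def find_title(page):
    """Return the text inside the core_title_txt <h1>, or None if absent."""
    match = TITLE_RE.search(page)
    return match.group(1).strip() if match else None


def safe_filename(title):
    """Hypothetical helper: replace characters Windows forbids in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title)


print(find_title(SAMPLE))
```

When fetching a real page, the raw bytes would first be decoded with .decode("gbk"), since the page is GBK-encoded.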


Similarly, the post body is marked by a div tag with a class attribute. The next step is to match it with a regular expression.
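A rough Python 3 sketch of that matching step follows. The article does not state the actual class name, so "post_content" below is a placeholder assumption, as is the sample markup; the real value would have to be read from the page source.

```python
import re

# Hypothetical markup: the class name "post_content" is a placeholder,
# the article does not state the real one.
SAMPLE = (
    '<div class="post_content">First post<br/>second line</div>'
    '<div class="other">not a post</div>'
    '<div class="post_content">Second post</div>'
)

# Non-greedy match of everything between the opening div and the next </div>.
POST_RE = re.compile(
    r'<div[^>]*class="post_content"[^>]*>(.*?)</div>', re.S)


def find_posts(page):
    """Return the raw inner HTML of every matching post body."""
    return POST_RE.findall(page)


print(find_posts(SAMPLE))
```

Because the match stops at the first closing div, a post body that itself contains nested div elements would be cut short; that is a limitation of the regular-expression approach.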

Run: [screenshot of the crawler running]

Generated TXT file: [screenshot of the output file]


Source code:

    # -*- coding: utf-8 -*-
    # ---------------------------------------
    #   program:   Baidu Post Bar crawler
    #   version:   0.5
    #   author:    why
    #   date:      2013-05-16
    #   language:  Python 2.7
    #   operation: enter a thread URL; the crawler switches to the
    #              "original poster only" view and saves it to a local file
    #   function:  save the content published by the original poster
    #              as a local TXT file
    # ---------------------------------------

    import string
    import urllib2
    import re

    # ----------- process the various tags on the page -----------
    class html_tool:
        # non-greedy match of \t, \n, space, hyperlink, or image
        bgnchartononerex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")

        # non-greedy match of any <> tag
        endchartononerex = re.compile("<.*?>")

        # non-greedy match of any <p> tag
        bgnpartrex = re.compile("<p.*?>")
        chartonewlinerex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
        chartonexttabrex = re.compile("<td>")

        # convert some HTML entities back into the original symbols
        # (&amp; is handled last so that it is not unescaped twice)
        replacetab = [("&lt;", "<"), ("&gt;", ">"), ("&quot;", "\""),
                      ("&nbsp;", " "), ("&amp;", "&")]

        def replace_char(self, x):
            x = self.bgnchartononerex.sub("", x)
            x = self.bgnpartrex.sub("\n", x)
            x = self.chartonewlinerex.sub("\n", x)
            x = self.chartonexttabrex.sub("\t", x)
            x = self.endchartononerex.sub("", x)

            for t in self.replacetab:
                x = x.replace(t[0], t[1])
            return x

    class baidu_spider:
        # declare the related attributes
        def __init__(self, url):
            self.myurl = url + '?see_lz=1'
            self.datas = []
            self.mytool = html_tool()
            print u'The Baidu Post Bar crawler has started.'

        # load the page and decode it
        def baidu_tieba(self):
            # read the raw page and decode it from GBK
            mypage = urllib2.urlopen(self.myurl).read().decode("gbk")
            # work out how many pages the original poster's content spans
            endpage = self.page_counter(mypage)
            # get the title of the thread
            title = self.find_title(mypage)
            print u'Article name: ' + title
            # fetch and save the final data
            self.save_data(self.myurl, title, endpage)

        # used to calculate the total number of pages
        def page_counter(self, mypage):
            # match 'Total <span class="red">12</span> pages'
            # to obtain the total number of pages
            mymatch = re.search(r'class="red">(\d+?)</span>', mypage, re.S)
            if mymatch:
                endpage = int(mymatch.group(1))
                print u'Crawler report: found %d pages of original content' % endpage
            else:
                endpage = 0
                print u'Crawler report: could not work out how many pages the original poster published!'
            return endpage

        # used to find the post title
        def find_title(self, mypage):
            # match <h1 class="core_title_txt" title="">xxxxxxxxxx
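For readers on Python 3, the tag-cleaning and page-counting helpers visible in the listing can be approximated as below. This is a sketch under stated assumptions rather than the author's code: the function names clean_html and count_pages are invented here, html.unescape stands in for the hand-written entity table, and ordinary spaces are kept instead of being stripped.

```python
import html
import re

# Patterns modelled on the ones in the listing (spaces are kept here).
DROP_RE = re.compile(r"\t|\n|<a.*?>|<img.*?>")             # removed outright
PARAGRAPH_RE = re.compile(r"<p.*?>")                        # paragraph start
NEWLINE_RE = re.compile(r"<br/>|</p>|<tr>|<div>|</div>")    # line-breaking tags
TAB_RE = re.compile(r"<td>")                                # table cell
ANY_TAG_RE = re.compile(r"<.*?>")                           # any remaining tag
PAGE_COUNT_RE = re.compile(r'class="red">(\d+)</span>')


def clean_html(fragment):
    """Turn an HTML fragment into plain text (hypothetical helper)."""
    text = DROP_RE.sub("", fragment)
    text = PARAGRAPH_RE.sub("\n", text)
    text = NEWLINE_RE.sub("\n", text)
    text = TAB_RE.sub("\t", text)
    text = ANY_TAG_RE.sub("", text)
    # html.unescape converts entities such as &amp; and &lt; in one pass.
    return html.unescape(text)


def count_pages(page):
    """Return the page total from the red <span>, or 0 if not found."""
    match = PAGE_COUNT_RE.search(page)
    return int(match.group(1)) if match else 0


print(repr(clean_html('Hello<br/><a href="x">world</a> &amp; &lt;all&gt;')))
print(count_pages('Total <span class="red">12</span> pages'))
```

The order of the substitutions matters: the line-breaking tags are converted to newlines before the catch-all pattern removes whatever tags remain.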