Example of a Baidu Tieba web crawler implemented in Python

Source: Internet
Author: User

This article describes a Python implementation of a web crawler for Baidu Tieba. It is shared here for your reference; the details are as follows:

The full example code can be downloaded from this site.

Project content:

Write a web crawler for Baidu Tieba in Python.

How to use:

Create a new file named bugbaidu.py, copy the code into it, and then double-click the file to run it.

Program function:

Package the posts made by the thread's original poster into a txt file and save it locally.

Principle Explanation:

First, take a look at a thread: after clicking "view the original poster only" and clicking through the pages, the URL changes slightly and becomes:
http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1

As you can see, see_lz=1 means "view the original poster only", and pn=1 is the corresponding page number; keep this in mind for later.
This is the URL we need to take advantage of.
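
For illustration, here is a minimal sketch (Python 2.7, to match the program below) of how the per-page URL can be built from these two parameters; the helper name build_page_url is mine, not the author's.

# -*- coding: utf-8 -*-
# Hypothetical helper: build the "original poster only" URL for a given page.
def build_page_url(thread_id, page):
    # see_lz=1 -> show only the original poster's posts, pn -> page number
    return 'http://tieba.baidu.com/p/%s?see_lz=1&pn=%d' % (thread_id, page)

print build_page_url('2296712428', 1)
# -> http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1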
The next step is to view the page source.
First of all, dig out the thread title; it will be needed later when the file is saved.
You can see that Baidu uses GBK encoding, and that the title sits in an h1 tag:

The markup looks like this:
<h1 class="core_title_txt" title="[Original] Fashion Chief (about fashion, fame, career, love, inspiration)">[Original] Fashion Chief (about fashion, fame, career, love, inspiration)</h1>

Similarly, the body of each post is wrapped in div tags with composite class attributes, so the next thing to do is to match those with regular expressions; a sketch follows.
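
As a rough illustration of both steps, here is a minimal sketch (Python 2.7, like the program below) that decodes a page from GBK and pulls out the title and the post bodies with regular expressions. The "post_content" id prefix used for the body divs is an assumption about Tieba's markup at the time; check the actual page source before relying on it.

# -*- coding: utf-8 -*-
import urllib2
import re

# Fetch one page of the thread and decode it from GBK (sketch only).
url = 'http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1'
html = urllib2.urlopen(url).read().decode('gbk')

# The title sits in <h1 class="core_title_txt" ...>...</h1>.
title_match = re.search(r'<h1 class="core_title_txt".*?>(.*?)</h1>', html, re.S)
if title_match:
    print u'Title: ' + title_match.group(1)

# Assumed markup: each post body lives in a <div> whose id starts with
# "post_content"; the non-greedy pattern stops at the first closing </div>.
posts = re.findall(r'<div id="post_content.*?>(.*?)</div>', html, re.S)
print u'Found %d post blocks on this page' % len(posts)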

Screenshot of a run:

Generated TXT file:

# -*- coding: utf-8 -*-
# ---------------------------------------
#  Program:  Baidu Tieba crawler
#  Version:  0.5
#  Author:   why
#  Date:     2013-05-16
#  Language: Python 2.7
#  Usage:    enter a thread URL; the crawler automatically switches to
#            "original poster only" view and saves the posts locally
#  Function: package the original poster's posts into a txt file
#            stored locally
# ---------------------------------------

import string
import urllib2
import re

# ----------- processing the various tags on the page -----------
class HTML_Tool:
    # Non-greedy match for \t, \n, a space, a hyperlink or an image tag
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")
    # Non-greedy match for any <...> tag
    EndCharToNoneRex = re.compile("<.*?>")
    # Non-greedy match for any <p ...> tag
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")
    # Convert some HTML character entities back into the original symbols
    replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
                  ("&quot;", "\""), ("&nbsp;", " ")]

    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("", x)
        x = self.BgnPartRex.sub("\n", x)
        x = self.CharToNewLineRex.sub("\n", x)
        x = self.CharToNextTabRex.sub("\t", x)
        x = self.EndCharToNoneRex.sub("", x)
        for t in self.replaceTab:
            x = x.replace(t[0], t[1])
        return x

class Baidu_Spider:
    # Declare the related attributes
    def __init__(self, url):
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'The Baidu Tieba crawler has started, crunching away...'

    # Load the first page and kick off the crawl
    def baidu_tieba(self):
        # Read the raw page and decode it from GBK
        myPage = urllib2.urlopen(self.myUrl).read().decode("gbk")
        # Work out how many pages of content the original poster has published
        endPage = self.page_counter(myPage)
        # Get the title of the thread
        title = self.find_title(myPage)
        print u'Thread title: ' + title
        # Fetch and save the final data
        self.save_data(self.myUrl, title, endPage)

    # Used to work out how many pages there are in total
    def page_counter(self, myPage):
        # Match the '<span class="red">12</span>' page counter to obtain the total
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Crawler report: the original poster has %d pages of original content' % endPage
        else:
            endPage = 0
            print u'Crawler report: unable to work out how many pages the original poster published!'
        return endPage

    # Used to find the title of the thread
    def find_title(self, myPage):
        # Match <h1 class="core_title_txt" ...>...</h1> to pull out the title
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'Untitled'
        if myMatch:
            title = myMatch.group(1)
        else:
            print u'Crawler report: unable to load the thread title!'
        return title

# (The listing in this article stops here; save_data and the remaining
#  methods are part of the full example code mentioned above.)
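
The listing above ends before the script's entry point. As a hedged sketch of how the Baidu_Spider class would typically be driven (the prompt wording is mine, and it assumes the full source with save_data is present), something like the following would go at the bottom of bugbaidu.py:

# -*- coding: utf-8 -*-
# Hypothetical entry point (Python 2.7): ask for the thread address without
# the query string, then start the crawl.
bdurl = str(raw_input(u'Enter the thread address, leaving out the part after "?":\n'))
mySpider = Baidu_Spider(bdurl)
mySpider.baidu_tieba()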

I hope this article will help you with your Python programming.
