Example of a Baidu Tieba web crawler implemented in Python

Source: Internet
Author: User

This article describes a Python implementation of a web crawler for Baidu Tieba. It is shared here for your reference; the details are as follows:

The full example code can be downloaded from this site.

Project content:

Write a web crawler for Baidu Tieba in Python.

How to use:

Create a new file named bugbaidu.py, copy the code into it, and then double-click the file to run it.

Program function:

Package the posts made by the thread's original poster into a txt file and save it locally.

Principle Explanation:

First, take a look at a thread: after clicking "view the original poster only" and clicking through the pages, the URL changes slightly and becomes:
http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1

As you can see, see_lz=1 means "view the original poster only", and pn=1 is the corresponding page number; keep this in mind for later.
This is the URL we need to take advantage of.
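
For illustration, here is a minimal sketch (Python 2.7, to match the program below) of how the per-page URL can be built from these two parameters; the helper name build_page_url is mine, not the author's.

# -*- coding: utf-8 -*-
# Hypothetical helper: build the "original poster only" URL for a given page.
def build_page_url(thread_id, page):
    # see_lz=1 -> show only the original poster's posts, pn -> page number
    return 'http://tieba.baidu.com/p/%s?see_lz=1&pn=%d' % (thread_id, page)

print build_page_url('2296712428', 1)
# -> http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1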
The next step is to view the page source.
First of all, dig out the thread title; it will be needed later when the file is saved.
You can see that Baidu uses GBK encoding, and that the title sits in an h1 tag:

The markup looks like this:
<h1 class="core_title_txt" title="[Original] Fashion Chief (about fashion, fame, career, love, inspiration)">[Original] Fashion Chief (about fashion, fame, career, love, inspiration)</h1>

Similarly, the body of each post is wrapped in div tags with composite class attributes, so the next thing to do is to match those with regular expressions; a sketch follows.
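
As a rough illustration of both steps, here is a minimal sketch (Python 2.7, like the program below) that decodes a page from GBK and pulls out the title and the post bodies with regular expressions. The "post_content" id prefix used for the body divs is an assumption about Tieba's markup at the time; check the actual page source before relying on it.

# -*- coding: utf-8 -*-
import urllib2
import re

# Fetch one page of the thread and decode it from GBK (sketch only).
url = 'http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1'
html = urllib2.urlopen(url).read().decode('gbk')

# The title sits in <h1 class="core_title_txt" ...>...</h1>.
title_match = re.search(r'<h1 class="core_title_txt".*?>(.*?)</h1>', html, re.S)
if title_match:
    print u'Title: ' + title_match.group(1)

# Assumed markup: each post body lives in a <div> whose id starts with
# "post_content"; the non-greedy pattern stops at the first closing </div>.
posts = re.findall(r'<div id="post_content.*?>(.*?)</div>', html, re.S)
print u'Found %d post blocks on this page' % len(posts)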

Screenshot of a run:

Generated TXT file:

# -*- coding: utf-8 -*-
# ---------------------------------------
#  Program:  Baidu Tieba crawler
#  Version:  0.5
#  Author:   why
#  Date:     2013-05-16
#  Language: Python 2.7
#  Usage:    enter a thread URL; the crawler automatically switches to
#            "original poster only" view and saves the posts locally
#  Function: package the original poster's posts into a txt file
#            stored locally
# ---------------------------------------

import string
import urllib2
import re

# ----------- processing the various tags on the page -----------
class HTML_Tool:
    # Non-greedy match for \t, \n, a space, a hyperlink or an image tag
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")
    # Non-greedy match for any <...> tag
    EndCharToNoneRex = re.compile("<.*?>")
    # Non-greedy match for any <p ...> tag
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")
    # Convert some HTML character entities back into the original symbols
    replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
                  ("&quot;", "\""), ("&nbsp;", " ")]

    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("", x)
        x = self.BgnPartRex.sub("\n", x)
        x = self.CharToNewLineRex.sub("\n", x)
        x = self.CharToNextTabRex.sub("\t", x)
        x = self.EndCharToNoneRex.sub("", x)
        for t in self.replaceTab:
            x = x.replace(t[0], t[1])
        return x

class Baidu_Spider:
    # Declare the related attributes
    def __init__(self, url):
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'The Baidu Tieba crawler has started, crunching away...'

    # Load the first page and kick off the crawl
    def baidu_tieba(self):
        # Read the raw page and decode it from GBK
        myPage = urllib2.urlopen(self.myUrl).read().decode("gbk")
        # Work out how many pages of content the original poster has published
        endPage = self.page_counter(myPage)
        # Get the title of the thread
        title = self.find_title(myPage)
        print u'Thread title: ' + title
        # Fetch and save the final data
        self.save_data(self.myUrl, title, endPage)

    # Used to work out how many pages there are in total
    def page_counter(self, myPage):
        # Match the '<span class="red">12</span>' page counter to obtain the total
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Crawler report: the original poster has %d pages of original content' % endPage
        else:
            endPage = 0
            print u'Crawler report: unable to work out how many pages the original poster published!'
        return endPage

    # Used to find the title of the thread
    def find_title(self, myPage):
        # Match <h1 class="core_title_txt" ...>...</h1> to pull out the title
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'Untitled'
        if myMatch:
            title = myMatch.group(1)
        else:
            print u'Crawler report: unable to load the thread title!'
        return title

# (The listing in this article stops here; save_data and the remaining
#  methods are part of the full example code mentioned above.)
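
The listing above ends before the script's entry point. As a hedged sketch of how the Baidu_Spider class would typically be driven (the prompt wording is mine, and it assumes the full source with save_data is present), something like the following would go at the bottom of bugbaidu.py:

# -*- coding: utf-8 -*-
# Hypothetical entry point (Python 2.7): ask for the thread address without
# the query string, then start the crawl.
bdurl = str(raw_input(u'Enter the thread address, leaving out the part after "?":\n'))
mySpider = Baidu_Spider(bdurl)
mySpider.baidu_tieba()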

I hope this article will help you with your Python programming.
