The Baidu Tieba crawler works on essentially the same principle as the Qiushibaike crawler: extract the key data from the page source, then store it in a local TXT file.
Source download:
http://download.csdn.net/detail/wxg694175346/6925583
Project content:
A web crawler for Baidu Tieba, written in Python.
How to use:
Create a new bugbaidu.py file, copy the code into it, and double-click to run.
Program function:
Saves the content posted by the thread's original poster (the "OP"; rendered as "landlord" by machine translation of 楼主) to a local TXT file.
Explanation of principle:
First, take a look at a thread. After clicking "only view the original poster" and then going to another page, the URL changes slightly, becoming:
http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1
As you can see, see_lz=1 means "only view the original poster", and pn=1 is the corresponding page number; keep this in mind for later.
This is the URL we need to use.
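The per-page URL can be assembled by appending those two query parameters; a minimal sketch (the helper name build_page_url is my own, not from the original script):

```python
def build_page_url(thread_url, page):
    # see_lz=1 restricts the thread to the original poster's posts;
    # pn selects the page number within that filtered view.
    return "%s?see_lz=1&pn=%d" % (thread_url, page)

# e.g. the thread from the example above, page 1
url = build_page_url("http://tieba.baidu.com/p/2296712428", 1)
```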
Next is to view the source of the page.
The first thing to do is extract the title and store it.
You can see that Baidu uses GBK encoding, and the title is in an h1 tag:
<h1 class="core_title_txt" title="[Original] Fashion chief (about fashion, fame, career, love, inspirational)">[Original] Fashion chief (about fashion, fame, career, love, inspirational)</h1>
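Because the page is served as GBK, the raw response bytes must be decoded before any matching is done; a minimal sketch in Python 3 syntax (the original script uses Python 2's urllib2 and does the same thing with .read().decode("gbk")):

```python
# Simulated raw response: a page served with GBK encoding arrives as bytes.
raw_bytes = '<h1 class="core_title_txt">【原创】时尚首席</h1>'.encode("gbk")

# Decode to a text string so regular expressions see real characters,
# mirroring urllib2.urlopen(url).read().decode("gbk") in the crawler.
page = raw_bytes.decode("gbk")
```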
Similarly, the body of each post is wrapped in div tags identified by class attributes, so the next step is to match them with regular expressions.
Run it: (screenshot omitted in this copy)
The generated TXT file: (screenshot omitted in this copy)
# -*- coding: utf-8 -*-
#---------------------------------------
#   Program: Baidu Tieba crawler
#   Version: 0.5
#   Author: why
#   Date: 2013-05-16
#   Language: Python 2.7
#   Operation: after entering the URL, automatically switch to
#              "only view the original poster" and save to a local file
#   Function: package the content posted by the OP into a local txt file.
#---------------------------------------

import string
import urllib2
import re

#----------- handle the various tags on the page -----------
class HTML_Tool:
    # non-greedy match for \t, \n, a space, a hyperlink or an image
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")

    # non-greedy match for any <> tag
    EndCharToNoneRex = re.compile("<.*?>")

    # non-greedy match for any <p> tag
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")

    # turn some HTML character entities back into the original symbols
    replaceTab = [("&lt;","<"),("&gt;",">"),("&amp;","&"),("&quot;","\""),("&nbsp;"," ")]

    def Replace_Char(self,x):
        x = self.BgnCharToNoneRex.sub("",x)
        x = self.BgnPartRex.sub("\n",x)
        x = self.CharToNewLineRex.sub("\n",x)
        x = self.CharToNextTabRex.sub("\t",x)
        x = self.EndCharToNoneRex.sub("",x)
        for t in self.replaceTab:
            x = x.replace(t[0],t[1])
        return x

class Baidu_Spider:
    # declare the relevant attributes
    def __init__(self,url):
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'The Baidu Tieba crawler has started, click'

    # initialize, load the page and transcode it for storage
    def baidu_tieba(self):
        # read the raw page and transcode it from GBK
        myPage = urllib2.urlopen(self.myUrl).read().decode("gbk")
        # count how many pages of content the OP has posted
        endPage = self.page_counter(myPage)
        # get the title of the post
        title = self.find_title(myPage)
        print u'Article title: ' + title
        # get the final data
        self.save_data(self.myUrl,title,endPage)

    # used to count how many pages there are in total
    def page_counter(self,myPage):
        # match '<span class="red">12</span>' to get the total page count
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Crawler report: found %d pages of original content by the OP' % endPage
        else:
            endPage = 0
            print u'Crawler report: unable to count how many pages the OP posted!'
        return endPage

    # used to find the title of the post
    def find_title(self,myPage):
        # match

(The listing breaks off here in this copy; the complete source is in the download linked above.)
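The tag-cleaning substitutions in the HTML_Tool class above port almost unchanged to Python 3; a minimal sketch of the same logic (names shortened by me, behavior kept; the entity table here is my reconstruction of the garbled original):

```python
import re

# The same cleanup steps as HTML_Tool.Replace_Char, in Python 3.
BGN_CHAR = re.compile(r"(\t|\n| |<a.*?>|<img.*?>)")   # whitespace, links, images
END_CHAR = re.compile(r"<.*?>")                        # any leftover tag
BGN_PART = re.compile(r"<p.*?>")                       # paragraph opens
TO_NEWLINE = re.compile(r"(<br/>|</p>|<tr>|<div>|</div>)")
TO_TAB = re.compile(r"<td>")
ENTITIES = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"),
            ("&quot;", '"'), ("&nbsp;", " ")]

def clean(x):
    # drop junk, turn structural tags into newlines/tabs,
    # strip everything else, then decode common entities
    x = BGN_CHAR.sub("", x)
    x = BGN_PART.sub("\n", x)
    x = TO_NEWLINE.sub("\n", x)
    x = TO_TAB.sub("\t", x)
    x = END_CHAR.sub("", x)
    for old, new in ENTITIES:
        x = x.replace(old, new)
    return x
```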
The above is [Python] Web Crawler (IX): the Baidu Tieba crawler (v0.4) source and analysis; for more related content please follow topic.alibabacloud.com (www.php.cn)!