Writing Python Crawlers from Scratch: Crawling Baidu Tieba Posts and Storing Them in a Local txt File (Ultimate Version)
Building a crawler for Baidu Tieba works basically the same way as the earlier crawlers in this series: the key data is picked out of the page source code and stored in a local txt file.
Project content:
A web crawler for Baidu Tieba, written in Python.
Usage:
Create a new file named BugBaidu.py, copy the code below into it, and double-click it to run.
Program function:
Package the content posted by the original poster (OP) of a thread and store it in a local txt file.
Principles:
First, let's look at a particular thread. After clicking "view only the OP" and turning to the second page, the URL changes slightly and becomes:
http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1
As you can see, see_lz=1 means "view only the original poster" and pn=1 is the corresponding page number. Remember these two parameters for later.
This is the URL we need to work with.
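As a quick aside, here is a minimal sketch (in the same Python 2.7 as the program below, reusing the example thread ID above) of how these two parameters combine into the URL for each page:

# Minimal sketch: building the per-page URL from see_lz and pn
# (the thread ID is just the example from above)
base_url = 'http://tieba.baidu.com/p/2296712428'

def page_url(pn):
    # see_lz=1 -> show only the original poster; pn -> page number
    return base_url + '?see_lz=1&pn=' + str(pn)

print page_url(2)  # http://tieba.baidu.com/p/2296712428?see_lz=1&pn=2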
The next step is to view the page source.
The title comes first, since it will be used to name the stored file.
We can see that Baidu uses gbk encoding and that the title is marked with an h1 tag:
The code is as follows:
<h1 class="core_title_txt" title="[original] Head of fashion (about fashion, fame and fortune, career, love, inspirational)">[original] Head of fashion (about fashion, fame and fortune, career, love, inspirational)</h1>
Similarly, the post body is marked with div and class attributes, so the next step is to match both with regular expressions.
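Before the full program, here is a minimal standalone sketch of that matching step, applied to the h1 snippet above (the pattern mirrors the one used in find_title further down):

# -*- coding: utf-8 -*-
import re

# The h1 snippet shown above; in a real run this text comes from the
# gbk-decoded page source
html = u'<h1 class="core_title_txt" title="...">[original] Head of fashion</h1>'

# Non-greedy match of everything between <h1 ...> and </h1>
myMatch = re.search(r'<h1.*?>(.*?)</h1>', html, re.S)
if myMatch:
    print myMatch.group(1)  # prints: [original] Head of fashion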
After running, the program generates a txt file locally. The complete code is as follows:
# -*- coding: utf-8 -*-
#---------------------------------------
#   Program:  Baidu Tieba crawler
#   Version:  0.5
#   Author:   why
#   Date:     2013-05-16
#   Language: Python 2.7
#   Usage:    after a URL is entered, the crawler automatically switches to
#             the OP-only view and saves the content to a local file
#   Function: package the content posted by the OP into a txt file
#---------------------------------------

import string
import urllib2
import re
# ----------- Handle the various tags on the page -----------
class HTML_Tool:
    # Non-greedy match for \t, \n, spaces, hyperlinks and images
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")

    # Non-greedy match for any <> tag
    EndCharToNoneRex = re.compile("<.*?>")

    # Non-greedy match for any <p> tag
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")

    # Convert some HTML character entities back into plain symbols
    replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&"), ("&quot;", "\""), ("&nbsp;", " ")]

    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("", x)
        x = self.BgnPartRex.sub("\n", x)
        x = self.CharToNewLineRex.sub("\n", x)
        x = self.CharToNextTabRex.sub("\t", x)
        x = self.EndCharToNoneRex.sub("", x)
        for t in self.replaceTab:
            x = x.replace(t[0], t[1])
        return x
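# A quick sanity check of Replace_Char (illustrative, not part of the program):
#   HTML_Tool().Replace_Char('hello&nbsp;world<br/>bye')
# CharToNewLineRex first turns <br/> into a newline, then the replaceTab pass
# maps &nbsp; back to a space, so the call returns 'hello world\nbye'.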
class Baidu_Spider:
    # Declare the related attributes
    def __init__(self, url):
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'The Baidu Tieba crawler has started'

    # Load the first page and transcode it for storage
    def baidu_tieba(self):
        # Read the raw page and decode it from gbk
        myPage = urllib2.urlopen(self.myUrl).read().decode("gbk")
        # Work out the total number of pages the OP has posted
        endPage = self.page_counter(myPage)
        # Get the title of the thread
        title = self.find_title(myPage)
        print u'Thread title: ' + title
        # Fetch the final data
        self.save_data(self.myUrl, title, endPage)

    # Work out the total number of pages
    def page_counter(self, myPage):
        # Match 'Total <span class="red">12</span> pages' to get the page count
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Crawler report: the OP posted %d pages of original content' % endPage
        else:
            endPage = 0
            print u'Crawler report: cannot work out how many pages the OP posted!'
        return endPage

    # Find the title of the thread
    def find_title(self, myPage):
        # Match <h1 class="core_title_txt" ...>xxx</h1> to extract the title
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'Untitled'
        if myMatch:
            title = myMatch.group(1)
        else:
            print u'Crawler report: unable to load the thread title!'
        # A file name may not contain any of these characters: \ / : * ? " < > |
        title = title.replace('\\', '').replace('/', '').replace(':', '').replace('*', '').replace('?', '').replace('"', '').replace('>', '').replace('<', '').replace('|', '')
        return title

    # Store the content posted by the OP
    def save_data(self, url, title, endPage):
        # Load the page data into the array
        self.get_data(url, endPage)
        # Open a local file
        f = open(title + '.txt', 'w+')
        f.writelines(self.datas)
        f.close()
        print u'Crawler report: the file has been saved locally and packaged into a txt file'
        print u'Press any key to exit...'
        raw_input()

    # Fetch the page source and store it in the array
    def get_data(self, url, endPage):
        url = url + '&pn='
        for i in range(1, endPage + 1):
            print u'Crawler report: crawler %d is loading...' % i
            myPage = urllib2.urlopen(url + str(i)).read()
            # Process the html in myPage and store it into datas
            self.deal_data(myPage.decode('gbk'))

    # Extract the desired content from the page source
    def deal_data(self, myPage):
        myItems = re.findall('id="post_content.*?>(.*?)</div>', myPage, re.S)
        for item in myItems:
            data = self.myTool.Replace_Char(item.replace("\n", "").encode('gbk'))
            self.datas.append(data + '\n')
# -------- Program entry point ------------------
print u"""#---------------------------------------
#   Program:  Baidu Tieba crawler
#   Version:  0.5
#   Author:   why
#   Date:     2013-05-16
#   Language: Python 2.7
#   Usage:    after a URL is entered, the crawler automatically switches to
#             the OP-only view and saves the content to a local file
#   Function: package the content posted by the OP into a txt file
#---------------------------------------
"""

# Take a novel thread as an example
# bdurl = 'http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1'

print u'Please enter the numeric string at the end of the thread URL:'
bdurl = 'http://tieba.baidu.com/p/' + str(raw_input(u'http://tieba.baidu.com/p/'))

# Call the crawler
mySpider = Baidu_Spider(bdurl)
mySpider.baidu_tieba()
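One note before closing: urllib2 exists only in Python 2. If you want to try the same fetch-and-decode step on Python 3, a minimal equivalent sketch (same logic, different module name) would be:

# Python 3 sketch of the fetch-and-decode step (urllib2 -> urllib.request)
from urllib.request import urlopen

url = 'http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1'
page = urlopen(url).read().decode('gbk')  # Tieba pages here are gbk-encoded
print(page[:200])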
The above is the complete code of the improved Baidu Tieba crawler. It is simple and practical; I hope it helps you.