Writing Python crawlers from scratch: crawl Baidu Post Bar posts and save them to a local txt file (final version)

Last Update:2014-11-07 Source: Internet

Author: User

The Baidu Post Bar (Tieba) crawler works on basically the same principle as the Qiubai (Qiushibaike) crawler: key data is picked out of the page source and stored in a local txt file.

Project content:

A web crawler for Baidu Post Bar, written in Python.

Usage:

Create a new file named BugBaidu.py, copy the code into it, and double-click it to run.

Program functions:

Collect everything the original poster published in a thread and save it locally as a txt file.

Principles:

First, look at any post. After choosing to view only the original poster and then clicking the second page, the URL changes slightly and becomes:
http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1
As you can see, see_lz=1 shows only the original poster's posts, and pn=1 is the corresponding page number. Keep this in mind for later.
This is the URL we need to use.
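The two query parameters can be assembled programmatically. The following is a minimal sketch, assuming Python 3 (the script in this article targets Python 2.7); the helper names build_page_url and page_urls are hypothetical and do not appear in the original script.

```python
from urllib.parse import urlencode

# Thread URL prefix quoted in the article.
BASE_URL = "http://tieba.baidu.com/p/"


def build_page_url(thread_id, page=1, poster_only=True):
    """Return the URL of one page of a thread.

    see_lz=1 restricts the view to the original poster's posts;
    pn is the page number.
    """
    params = [("pn", page)]
    if poster_only:
        params.insert(0, ("see_lz", 1))
    return BASE_URL + str(thread_id) + "?" + urlencode(params)


def page_urls(thread_id, page_count):
    """Return the poster-only URLs for pages 1..page_count."""
    return [build_page_url(thread_id, page) for page in range(1, page_count + 1)]


print(build_page_url("2296712428"))
# http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1
```

Passing the parameters as a list of pairs keeps see_lz ahead of pn, which matches the URL shown in the article.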
The next step is to view the page source.
First, pick out the title, which is used to name the saved file.
We can see that Baidu uses gbk encoding and that the title is marked with h1:

The code is as follows:
<h1 class="core_title_txt" title="[original] Head of fashion (about fashion, fame and fortune, career, love, inspirational)">[original] Head of fashion (about fashion, fame and fortune, career, love, inspirational)</h1>
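Decoding the page and pulling out the h1 title can be sketched as follows. This is a Python 3 illustration under stated assumptions, not the article's own code: the pattern simply captures whatever sits between the opening h1 tag and its closing tag, the sample markup is made up, and the helper names extract_title and to_file_name are hypothetical.

```python
import re

# Assumed pattern: capture the text between <h1 ...> and </h1>.
TITLE_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.S | re.I)

# Characters that cannot appear in a Windows file name.
FORBIDDEN = '\\/:*?"<>|'


def extract_title(raw_page, encoding="gbk", default="no title"):
    """Decode the raw page bytes and return the text of the first <h1>."""
    page = raw_page.decode(encoding, errors="replace")
    match = TITLE_RE.search(page)
    return match.group(1).strip() if match else default


def to_file_name(title):
    """Drop characters that are not allowed in file names and add .txt."""
    cleaned = "".join(ch for ch in title if ch not in FORBIDDEN)
    return cleaned.strip() + ".txt"


# Made-up sample page, encoded as gbk the way the article says Baidu serves it.
sample = (
    '<h1 class="core_title_txt" title="x">'
    "[original] Head of fashion: part 1?</h1>"
).encode("gbk")

title = extract_title(sample)
print(title)                # [original] Head of fashion: part 1?
print(to_file_name(title))  # [original] Head of fashion part 1.txt
```

Cleaning the title before using it as a file name matters because a thread title may contain characters such as ? or : that the file system rejects.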

Similarly, the body is marked with div and class. The next step is to use a regular expression for matching.
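The body matching can be illustrated with a small self-contained sketch. It assumes Python 3 and assumes that each post body sits in a div whose id starts with post_content, which is what the regular expression in the script looks for; the sample markup and the name extract_posts are hypothetical.

```python
import re

# Assumed markup: each post body is a <div> whose id starts with "post_content".
POST_RE = re.compile(r'<div[^>]*id="post_content[^"]*"[^>]*>(.*?)</div>', re.S)
BREAK_RE = re.compile(r"<br\s*/?>|</p>", re.I)
TAG_RE = re.compile(r"<[^>]+>")


def extract_posts(page):
    """Return the plain text of every post body found in the page."""
    posts = []
    for body in POST_RE.findall(page):
        text = BREAK_RE.sub("\n", body)  # turn line-break tags into newlines
        text = TAG_RE.sub("", text)      # drop links, images, and other tags
        posts.append(text.strip())
    return posts


# Made-up sample markup for illustration only.
sample = (
    '<div id="post_content_1">First line<br/>'
    'second <a href="#">link</a> line</div>'
    '<div id="post_content_2">Another post<img src="x.png"></div>'
)

for post in extract_posts(sample):
    print(repr(post))
# 'First line\nsecond link line'
# 'Another post'
```

A non-greedy match stops at the first closing div, so a post body that itself contains nested div elements would be cut short. For markup like that, an HTML parser is likely a safer choice than a regular expression.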
Run: [screenshot not preserved]

Generated txt file: [screenshot not preserved]

The code is as follows:
# -*- coding: utf-8 -*-
#---------------------------------------
# Program: Baidu Post Bar Crawler
# Version: 0.5
# Author: why
# Date: 2013-05-16
# Programming language: Python 2.7
# Operation: enter the thread URL; the script switches to the original poster's posts only and saves them to a local file.
# Function: save everything the original poster published to a local txt file.
#---------------------------------------

import string
import urllib2
import re

# ----------- Process the various tags on the page -----------
class HTML_Tool:
    # Non-greedy match for \t, \n, space, hyperlink, or image
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")

    # Non-greedy match for any <> tag
    EndCharToNoneRex = re.compile("<.*?>")

    # Non-greedy match for any <p> tag
    BgnPartRex = re.compile("<p.*?>")
    CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")

    # Convert some HTML character entities back into the original characters.
    # &amp; is handled last so that text such as &amp;lt; is not decoded twice.
    replaceTab = [("&lt;", "<"), ("&gt;", ">"), ("&quot;", "\""), ("&nbsp;", " "), ("&amp;", "&")]

    def Replace_Char(self, x):
        x = self.BgnCharToNoneRex.sub("", x)
        x = self.BgnPartRex.sub("\n", x)
        x = self.CharToNewLineRex.sub("\n", x)
        x = self.CharToNextTabRex.sub("\t", x)
        x = self.EndCharToNoneRex.sub("", x)

        for t in self.replaceTab:
            x = x.replace(t[0], t[1])
        return x

class Baidu_Spider:
    # Declare the related attributes
    def __init__(self, url):
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'Baidu Post Bar crawler started'

    # Load the first page, decode it, and start the crawl
    def baidu_tieba(self):
        # Read the raw page and decode it from gbk
        myPage = urllib2.urlopen(self.myUrl).read().decode("gbk")
        # Work out how many pages the original poster published
        endPage = self.page_counter(myPage)
        # Get the title of the post
        title = self.find_title(myPage)
        print u'Article name: ' + title
        # Fetch and save the final data
        self.save_data(self.myUrl, title, endPage)

    # Work out the total number of pages
    def page_counter(self, myPage):
        # Match 'Total <span class="red">12</span> pages' to get the total number of pages
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Crawler report: found %d page(s) of original content' % endPage
        else:
            endPage = 0
            print u'Crawler report: could not work out how many pages the original poster published!'
        return endPage

    # Find the title of the post
    def find_title(self, myPage):
        # Match <h1 class="core_title_txt" title="...">...</h1> to find the title.
        # Assumption: the original pattern was lost in extraction; this one is reconstructed
        # from the h1 markup shown earlier in the article.
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'no title'
        if myMatch:
            title = myMatch.group(1)
        else:
            print u'Crawler report: unable to load the article title!'
        # A file name cannot contain the following characters: \ / : * ? " < > |
        for ch in u'\\/:*?"<>|':
            title = title.replace(ch, u'')
        return title

    # Store the content published by the original poster
    def save_data(self, url, title, endPage):
        # Load the page data into the list
        self.get_data(url, endPage)
        # Open a local file
        f = open(title + '.txt', 'w+')
        f.writelines(self.datas)
        f.close()
        print u'Crawler report: the content has been downloaded and saved as a txt file'
        print u'Press Enter to exit...'
        raw_input()

    # Fetch the source of every page and store the content in the list
    def get_data(self, url, endPage):
        url = url + '&pn='
        for i in range(1, endPage + 1):
            print u'Crawler report: crawler %d loading...' % i
            myPage = urllib2.urlopen(url + str(i)).read()
            # Process the html code in myPage and store it in datas
            self.deal_data(myPage.decode('gbk'))

    # Extract the post content from the page source
    def deal_data(self, myPage):
        myItems = re.findall('id="post_content.*?>(.*?)</div>', myPage, re.S)
        for item in myItems:
            data = self.myTool.Replace_Char(item.replace("\n", "").encode('gbk'))
            self.datas.append(data + '\n')

# -------- Program entry point ------------------
print u"""#---------------------------------------
# Program: Baidu Post Bar Crawler
# Version: 0.5
# Author: why
# Date: 2013-05-16
# Programming language: Python 2.7
# Operation: enter the thread URL; the script switches to the original poster's posts only and saves them to a local file.
# Function: save everything the original poster published to a local txt file.
#---------------------------------------
"""
# Take a novel post as an example
# bdurl = 'http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1'

print u'Enter the numeric string at the end of the post address:'
bdurl = 'http://tieba.baidu.com/p/' + str(raw_input(u'http://tieba.baidu.com/p/'))

# Start the crawler
mySpider = Baidu_Spider(bdurl)
mySpider.baidu_tieba()
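As a side note on the page_counter step, the page-count match can be tried in isolation. The sketch below assumes Python 3 and reuses the span markup quoted in the script's comment; the function name count_pages is hypothetical.

```python
import re

# Same idea as the script's page_counter: the total page count sits in
# a span with class "red".
PAGE_COUNT_RE = re.compile(r'class="red">(\d+)</span>')


def count_pages(page):
    """Return the number of pages, or 0 when the counter cannot be found."""
    match = PAGE_COUNT_RE.search(page)
    return int(match.group(1)) if match else 0


print(count_pages('Total <span class="red">12</span> pages'))  # 12
print(count_pages("<p>no pager here</p>"))                     # 0
```

Returning 0 when nothing matches means the download loop simply does not run, so a change in the page markup shows up as an empty output file rather than an exception.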

The above is the complete improved code for the Baidu Post Bar crawler. It is simple and practical, and I hope it helps.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
