Python crawler: crawling a Baidu Tieba post

Source: Internet
Author: User

This time the main goal is to learn how to replace the various HTML tags in a page and normalize the format of the output. As before, this follows blogger Cui Qingcai's blog.

1. Get the URL

A post: https://tieba.baidu.com/p/3138733512?see_lz=1&pn=1

Here https://tieba.baidu.com/p/3138733512 is the base part, and everything after the ? is the parameter part.

https:// indicates that the resource is transferred over HTTPS, the secure form of the HTTP protocol. tieba.baidu.com is a subdomain of baidu.com that points to the Baidu Tieba server. /p/3138733512 is the resource on that server, that is, the address locator for this post. see_lz and pn are the two URL parameters: see_lz selects only the original poster's posts (a value of 1 means the condition is true), and pn is the page number of the post.
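The breakdown above can be checked with the standard library. The following is a minimal sketch for Python 3 (the article's own code targets Python 2); `build_url` is a hypothetical helper introduced here for illustration and is not part of the original code.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "https://tieba.baidu.com/p/3138733512?see_lz=1&pn=1"

# Split the URL into the parts discussed in the text.
parts = urlsplit(url)
params = parse_qs(parts.query)

print(parts.scheme)   # the protocol
print(parts.netloc)   # the host name
print(parts.path)     # the resource that identifies the post
print(params)         # the parameter part as a dict of lists


def build_url(base_url, see_lz, page_num):
    """Hypothetical helper: join the base part and the two parameters."""
    return base_url + "?" + urlencode({"see_lz": see_lz, "pn": page_num})


rebuilt = build_url("https://tieba.baidu.com/p/3138733512", 1, 1)
print(rebuilt == url)
```

Building the query string with `urlencode` rather than by hand keeps the parameters correctly escaped if their values ever contain special characters.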
The page is fetched by appending both parameters to the base URL:

    def getPage(self, pageNum):
        try:
            # Base part + "?see_lz=..." + "&pn=..."
            url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            # print response.read()
            # print url
            return response.read().decode('utf-8')
        except urllib2.URLError as e:
            if hasattr(e, 'reason'):
                print 'wrong!', e.reason
            return None

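For readers on Python 3, where `urllib2` no longer exists, the same request can be sketched with `urllib.request`. This is an adaptation under stated assumptions, not the author's code: the function name `get_page` and the `opener` parameter are introduced here so that the function can be exercised without network access.

```python
import urllib.error
import urllib.request


def get_page(base_url, see_lz, page_num, opener=urllib.request.urlopen):
    """Fetch one page of a post; return its text, or None on failure.

    `opener` is an illustrative parameter (not in the original code): any
    callable that takes a Request and returns a response usable as a
    context manager with a read() method.
    """
    url = base_url + "?see_lz=" + str(see_lz) + "&pn=" + str(page_num)
    try:
        request = urllib.request.Request(url)
        with opener(request) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError, so it is handled here too.
        print("wrong!", getattr(e, "reason", e))
        return None
```

Calling `get_page("https://tieba.baidu.com/p/3138733512", 1, 1)` with the default opener performs a real HTTP request, so the result depends on the network and on the site's current behavior.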
2. Get the title

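The title is wrapped in its own HTML tag, so it can be captured with a regular expression. The tag and the pattern did not survive in this text, so the pattern is left as '...' below.

    def getTitle(self):
        page = self.getPage(1)
        # The title pattern is missing from the source text.
        pattern = re.compile('...', re.S)
        result = re.search(pattern, page)
        if result:
            print result.group(1)
        else:
            return None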

3. Get the number of pages in the post

Similarly, the page count is extracted with a regular expression:

    def getPageNum(self):
        page = self.getPage(1)
        pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
        result = re.search(pattern, page)
        if result:
            print result.group(1)
        else:
            return None
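To see what this pattern captures, it can be run against a small fragment. The sketch below is written for Python 3, and `sample_html` is a made-up fragment shaped to fit the pattern; the real markup on tieba.baidu.com may differ. The function name `get_page_num` is likewise only illustrative.

```python
import re

# Hypothetical fragment imitating the reply/page counter of a post page.
sample_html = (
    '<li class="l_reply_num" style="margin-left:8px">'
    '<span class="red" style="margin-right:3px">141</span> replies, '
    '<span class="red">5</span> pages</li>'
)

# Skip the first <span> (the reply count) and capture the second (the page count).
pattern = re.compile(
    r'<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S
)


def get_page_num(page):
    """Return the page count as an int, or None if the pattern is absent."""
    if not page:
        return None
    result = pattern.search(page)
    return int(result.group(1).strip()) if result else None


print(get_page_num(sample_html))
```

The non-greedy `.*?` keeps each step as short as possible, and `re.S` lets `.` match newlines, which matters because real pages spread these tags over several lines.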

4. Get the content of the original poster's posts

    def getContent(self):
        page = self.getPage(1)
        pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
        items = re.findall(pattern, page)
        for item in items:
            print self.tool.replace(item)
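The effect of `re.findall` with this pattern can be sketched on a small example. The code is Python 3, and the markup in `sample_html` is hypothetical, written only to have the shape the pattern expects; `get_contents` is an illustrative name.

```python
import re

# Hypothetical markup: two post bodies and one unrelated <div>.
sample_html = (
    '<div id="post_content_1" class="d_post_content">  First floor<br>line two</div>'
    '<div class="other">noise</div>'
    '<div id="post_content_2" class="d_post_content">Second floor</div>'
)

content_pattern = re.compile(r'<div id="post_content_.*?>(.*?)</div>', re.S)


def get_contents(page):
    """Return the raw inner HTML of every post body found in the page."""
    return content_pattern.findall(page) if page else []


for item in get_contents(sample_html):
    print(repr(item))
```

Each captured item is still raw HTML (note the `<br>` in the first one). Because the capture stops at the first `</div>`, a post body that itself contains a nested `<div>` would be cut short by this approach.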

The text is mainly contained in <div id="post_content_...">...</div>, but it is clearly interspersed with line breaks, links, images, paragraph tags and so on, so these tags need to be removed or replaced.

The replacement code is as follows:

    class Tool:
        # Remove image tags and runs of 7 spaces
        removeImg = re.compile('<img.*?>| {7}')
        # Remove link tags
        removeAddr = re.compile('<a.*?>|</a>')
        # Replace tags that act as line breaks with \n
        replaceLine = re.compile('<tr>|<div>|</div>|</p>')
        # Replace table cells with \t
        replaceTD = re.compile('<td>')
        # Replace paragraph openings with \n plus two spaces
        replacePara = re.compile('<p.*?>')
        # Replace a double or single <br> with \n (the double form must come first)
        replaceBR = re.compile('<br><br>|<br>')
        # Remove any remaining tags
        removeExtraTag = re.compile('<.*?>')

        def replace(self, x):
            x = re.sub(self.removeImg, "", x)
            x = re.sub(self.removeAddr, "", x)
            x = re.sub(self.replaceLine, '\n', x)
            x = re.sub(self.replaceTD, '\t', x)
            x = re.sub(self.replacePara, "\n  ", x)
            x = re.sub(self.replaceBR, '\n', x)
            x = re.sub(self.removeExtraTag, "", x)
            # strip() removes leading and trailing whitespace
            return x.strip()
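The replacement steps can be tried on a small fragment. This is a minimal Python 3 sketch that expresses the same steps as an ordered list of rules; the names `RULES` and `clean` and the sample fragment are made up for illustration, and the image pattern `<img.*?>` is an assumption inferred from the "remove images" description.

```python
import re

# Ordered (pattern, replacement) pairs; the order matters because the last
# rule removes every tag that the earlier rules did not handle.
RULES = [
    (re.compile(r'<img.*?>| {7}'), ''),             # images, 7-space runs
    (re.compile(r'<a.*?>|</a>'), ''),               # link tags (text is kept)
    (re.compile(r'<tr>|<div>|</div>|</p>'), '\n'),  # line-breaking tags
    (re.compile(r'<td>'), '\t'),                    # table cells
    (re.compile(r'<p.*?>'), '\n  '),                # paragraph openings
    (re.compile(r'<br><br>|<br>'), '\n'),           # double form listed first
    (re.compile(r'<.*?>'), ''),                     # any remaining tag
]


def clean(fragment):
    """Apply every rule in order and trim surrounding whitespace."""
    for pattern, replacement in RULES:
        fragment = pattern.sub(replacement, fragment)
    return fragment.strip()


sample = ('<a href="#">Hello</a><br><br>world<img src="x.png">'
          '<br>bye<p class="c">para</p>')
print(repr(clean(sample)))  # 'Hello\nworld\nbye\n  para'
```

Listing `<br><br>` before `<br>` in the alternation is what collapses a double break into a single newline; with the alternatives in the opposite order, the single form would always match first and a double break would produce two newlines.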

5. Complete code and results

    # coding: utf-8
    import urllib
    import urllib2
    import re


    class Tool:
        removeImg = re.compile('<img.*?>| {7}')
        removeAddr = re.compile('<a.*?>|</a>')
        replaceLine = re.compile('<tr>|<div>|</div>|</p>')
        replaceTD = re.compile('<td>')
        replacePara = re.compile('<p.*?>')
        replaceBR = re.compile('<br><br>|<br>')
        removeExtraTag = re.compile('<.*?>')

        def replace(self, x):
            x = re.sub(self.removeImg, "", x)
            x = re.sub(self.removeAddr, "", x)
            x = re.sub(self.replaceLine, '\n', x)
            x = re.sub(self.replaceTD, '\t', x)
            x = re.sub(self.replacePara, "\n  ", x)
            x = re.sub(self.replaceBR, '\n', x)
            x = re.sub(self.removeExtraTag, "", x)
            return x.strip()


    class Tieba:
        def __init__(self, baseURL, seeLZ):
            self.baseURL = baseURL
            self.seeLZ = '?see_lz=' + str(seeLZ)
            self.tool = Tool()

        def getPage(self, pageNum):
            try:
                url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
                request = urllib2.Request(url)
                response = urllib2.urlopen(request)
                # print response.read()
                # print url
                return response.read().decode('utf-8')
            except urllib2.URLError as e:
                if hasattr(e, 'reason'):
                    print 'wrong!', e.reason
                return None

        def getTitle(self):
            page = self.getPage(1)
            # The title pattern is missing from the source text.
            pattern = re.compile('...', re.S)
            result = re.search(pattern, page)
            if result:
                print result.group(1)
            else:
                return None

        def getPageNum(self):
            page = self.getPage(1)
            pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
            result = re.search(pattern, page)
            if result:
                print result.group(1)
            else:
                return None

        def getContent(self):
            page = self.getPage(1)
            pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
            items = re.findall(pattern, page)
            for item in items:
                print self.tool.replace(item)


    baseURL = 'https://tieba.baidu.com/p/3138733512'
    bdtb = Tieba(baseURL, 1)
    # bdtb.getPage(1)
    bdtb.getTitle()
    bdtb.getPageNum()
    bdtb.getContent()
