This time the main goal is to learn how to replace various HTML tags and normalize the output format. As before, this follows blogger Cia Qingcai's blog.
1. Get the URL
An example post: https://tieba.baidu.com/p/3138733512?see_lz=1&pn=1
Here https://tieba.baidu.com/p/3138733512 is the base part, and everything after the ? is the parameter part.
The scheme (https:// here) means the resource is transferred using the HTTP protocol, in this case over TLS. tieba.baidu.com is a second-level domain of Baidu that points to the Baidu Tieba server. /p/3138733512 is a resource on that server, i.e. the address locator for this post. see_lz and pn are the two URL parameters: see_lz selects "original poster only" and pn is the page number of the post. see_lz=1 means the condition is true, and pn=1 requests the first page.
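The URL breakdown above can be checked with the standard library. This is a minimal sketch in Python 3 (the post's own code targets Python 2); `build_page_url` is a hypothetical helper introduced here for illustration and is not part of the original crawler.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "https://tieba.baidu.com/p/3138733512?see_lz=1&pn=1"
parts = urlsplit(url)

scheme = parts.scheme          # protocol part of the URL
host = parts.netloc            # server the request is sent to
path = parts.path              # resource locator for this post
params = parse_qs(parts.query) # query parameters as a dict of lists


def build_page_url(base_url, see_lz, page_num):
    """Hypothetical helper: rebuild the URL for a given page number."""
    return base_url + "?" + urlencode({"see_lz": see_lz, "pn": page_num})


page2 = build_page_url("https://tieba.baidu.com/p/3138733512", 1, 2)
print(scheme, host, path, params)
print(page2)
```

Only the `pn` value changes from page to page, which is why the crawler keeps the base part and `see_lz` fixed and appends the page number on each request.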
def getPage(self, pageNum):
    try:
        # Build the full URL from the base part, the see_lz parameter and the page number
        url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
        # print response.read()
        # print url
        return response.read().decode('utf-8')
    except urllib2.URLError, e:
        if hasattr(e, 'reason'):
            print 'wrong!', e.reason
            return None
2. Get the title
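Because the title is surrounded by a distinctive tag, it can be matched with a regular expression. The tag and the pattern text did not survive in this copy of the post, so the pattern below is left empty.
def getTitle(self):
    page = self.getPage(1)
    # The pattern text is missing here; it needs one capture group around the title
    pattern = re.compile('', re.S)
    result = re.search(pattern, page)
    if result:
        print result.group(1)
    else:
        return None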
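3. Get the number of pages in the post
The page count can be extracted the same way, using the following regular expression:
def getPageNum(self):
    page = self.getPage(1)
    # re.S lets . match newlines, so the pattern can span several lines of HTML
    pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
    result = re.search(pattern, page)
    if result:
        print result.group(1)
    else:
        return None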
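To see why the pattern uses non-greedy `.*?` together with `re.S`, here is a small Python 3 sketch. The `sample` markup is hypothetical and shaped only to fit the pattern; the real Tieba page may look different.

```python
import re

# Hypothetical markup: the first span holds the reply count, the second the page count.
sample = '''<li class="l_reply_num" style="margin-left:8px">
<span class="red">141</span>replies, total
<span class="red">5</span>pages</li>'''

regex = r'<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>'

# With re.S the dot also matches newlines, so the match can cross line breaks.
result = re.search(regex, sample, re.S)
page_count = int(result.group(1)) if result else None

# Without re.S the dot stops at the first newline and nothing matches.
no_dotall = re.search(regex, sample)

print(page_count, no_dotall)
```

The first `.*?</span>` skips past the first span, so the capture group lands on the second one. Converting the captured text with `int` is an addition made here; the original method only prints the string.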
4. Get the content posted by the original poster
def getContent(self):
    page = self.getPage(1)
    pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
    # findall returns the captured body of every matching post
    items = re.findall(pattern, page)
    for item in items:
        print self.tool.replace(item)
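The following Python 3 sketch shows what `re.findall` returns for this pattern. The `sample` string is hypothetical markup written to fit the pattern, not a copy of the real page.

```python
import re

# Hypothetical markup with two posts; ids and class names are illustrative only.
sample = '''<div id="post_content_1" class="d_post_content">First floor<br>text</div>
<div id="post_content_2" class="d_post_content">Second floor</div>'''

pattern = re.compile(r'<div id="post_content_.*?>(.*?)</div>', re.S)

# With a single capture group, findall returns a list of the captured strings.
items = pattern.findall(sample)
for item in items:
    print(repr(item))
```

Because the capture is non-greedy, it stops at the first `</div>`. A post body that itself contained a nested `<div>` would be cut short, which is a known limit of this regex-based approach.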
The post text sits mainly inside <div id="post_content_...">...</div>, but it is interspersed with line breaks, links, images, paragraph tags and so on, so these tags need to be removed or replaced.
The replacement code is as follows:
class Tool:
    # Remove images and runs of 7 spaces
    removeImg = re.compile('<img.*?>| {7}')
    # Remove links
    removeAddr = re.compile('<a.*?>|</a>')
    # Replace line-breaking tags with \n
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # Replace table cells with \t
    replaceTD = re.compile('<td>')
    # Replace paragraph openings with \n plus two spaces
    replacePara = re.compile('<p.*?>')
    # Replace a single or double <br> with \n (the double form must come first to match)
    replaceBR = re.compile('<br><br>|<br>')
    # Remove any remaining tags
    removeExtraTag = re.compile('<.*?>')

    def replace(self, x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.replacePara, "\n  ", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.removeExtraTag, "", x)
        return x.strip()
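As a rough check of the cleaning order, here is a Python 3 sketch of the same idea applied to a made-up fragment. The snake_case names and the `raw` string are assumptions for illustration, and the patterns reflect one reading of the post's comments rather than a verified copy of the author's code.

```python
import re


class Tool:
    remove_img = re.compile(r'<img.*?>| {7}')            # images and runs of 7 spaces
    remove_addr = re.compile(r'<a.*?>|</a>')             # link tags
    replace_line = re.compile(r'<tr>|<div>|</div>|</p>') # line-breaking tags
    replace_td = re.compile(r'<td>')                     # table cells
    replace_para = re.compile(r'<p.*?>')                 # paragraph openings
    replace_br = re.compile(r'<br><br>|<br>')            # longer alternative first
    remove_extra_tag = re.compile(r'<.*?>')              # anything still tagged

    def replace(self, x):
        x = self.remove_img.sub('', x)
        x = self.remove_addr.sub('', x)
        x = self.replace_line.sub('\n', x)
        x = self.replace_td.sub('\t', x)
        x = self.replace_para.sub('\n  ', x)
        x = self.replace_br.sub('\n', x)
        x = self.remove_extra_tag.sub('', x)
        return x.strip()


# Made-up fragment that exercises most of the rules.
raw = ('<a href="#">Link</a> text<br><br>next line<img src="x.png">'
       '<p class="c">para</p><b>bold</b>')
cleaned = Tool().replace(raw)
print(repr(cleaned))

# Alternation is tried left to right, so '<br>|<br><br>' never uses its second branch.
short_first = re.sub(r'<br>|<br><br>', '\n', 'a<br><br>b')
long_first = re.sub(r'<br><br>|<br>', '\n', 'a<br><br>b')
```

The last two lines show why the order inside an alternation matters: with the short form first, a double `<br>` turns into two newlines instead of one.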
5. Complete code and results
# coding: utf-8
import urllib
import urllib2
import re


class Tool:
    removeImg = re.compile('<img.*?>| {7}')
    removeAddr = re.compile('<a.*?>|</a>')
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    replaceTD = re.compile('<td>')
    replacePara = re.compile('<p.*?>')
    replaceBR = re.compile('<br><br>|<br>')
    removeExtraTag = re.compile('<.*?>')

    def replace(self, x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.replacePara, "\n  ", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.removeExtraTag, "", x)
        return x.strip()


class Tieba:
    def __init__(self, baseURL, seeLZ):
        self.baseURL = baseURL
        self.seeLZ = '?see_lz=' + str(seeLZ)
        self.tool = Tool()

    def getPage(self, pageNum):
        try:
            url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            # print response.read()
            # print url
            return response.read().decode('utf-8')
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print 'wrong!', e.reason
                return None

    def getTitle(self):
        page = self.getPage(1)
        # The pattern text is missing here; it needs one capture group around the title
        pattern = re.compile('', re.S)
        result = re.search(pattern, page)
        if result:
            print result.group(1)
        else:
            return None

    def getPageNum(self):
        page = self.getPage(1)
        pattern = re.compile('<li class="l_reply_num.*?</span>.*?<span.*?>(.*?)</span>', re.S)
        result = re.search(pattern, page)
        if result:
            print result.group(1)
        else:
            return None

    def getContent(self):
        page = self.getPage(1)
        pattern = re.compile('<div id="post_content_.*?>(.*?)</div>', re.S)
        items = re.findall(pattern, page)
        for item in items:
            print self.tool.replace(item)


baseURL = 'https://tieba.baidu.com/p/3138733512'
bdtb = Tieba(baseURL, 1)
# bdtb.getPage(1)
bdtb.getTitle()
bdtb.getPageNum()
bdtb.getContent()
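The code above relies on `urllib2` and the `print` statement, which exist only in Python 2. For readers on Python 3, the page-fetching step could look roughly like the sketch below. The function names, the `opener` parameter and the two stand-in openers are assumptions added here so the example can run without network access; they are not part of the original post.

```python
import io
import urllib.error
import urllib.request


def build_url(base_url, see_lz, page_num):
    """Join the base part with the see_lz and pn parameters."""
    return base_url + '?see_lz=' + str(see_lz) + '&pn=' + str(page_num)


def get_page(base_url, see_lz, page_num, opener=urllib.request.urlopen):
    """Fetch one page and return its decoded HTML, or None on a URL error."""
    url = build_url(base_url, see_lz, page_num)
    try:
        with opener(url) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:
        print('wrong!', e.reason)
        return None


# Offline stand-ins for urlopen, used only to demonstrate both code paths.
def fake_opener(url):
    return io.BytesIO('<html>ok</html>'.encode('utf-8'))


def failing_opener(url):
    raise urllib.error.URLError('no network')


base = 'https://tieba.baidu.com/p/3138733512'
url = build_url(base, 1, 1)
html = get_page(base, 1, 1, opener=fake_opener)
missing = get_page(base, 1, 1, opener=failing_opener)
```

Calling `get_page(base, 1, 1)` without the `opener` argument performs a real request through `urllib.request.urlopen`. Whether the site still serves the same markup to a plain request is not something this sketch can guarantee.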
Python crawler: crawling a Baidu Tieba post