This comes from a great blogger's tutorial, "Python crawler part two: crawling a Baidu Tieba post", which is very good; follow it step by step and the results show quickly. This is my first real little crawler program, so writing it up on CSDN is also a kind of encouragement to myself. Please don't flame, and I welcome pointers from the experts.
Since the original blog is already very detailed (really detailed), I won't go over the steps again.
First, my own code (mostly the same as the original):
#!/usr/bin/env python
# coding=utf-8
import urllib2
import urllib
import re

class Tool(object):
    removeImg = re.compile('<img.*?>| {7}|')
    removeAddr = re.compile('<a.*?>|</a>')
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    replaceTD = re.compile('<td>')
    replacePara = re.compile('<p.*?>')
    replaceBR = re.compile('<br><br>|<br>')
    removeExtraTag = re.compile('<.*?>')

    def replace(self, x):
        x = re.sub(self.removeImg, '', x)
        x = re.sub(self.removeAddr, '', x)
        x = re.sub(self.replaceLine, '\n', x)
        x = re.sub(self.replaceTD, '\t', x)
        x = re.sub(self.replacePara, '\n', x)
        x = re.sub(self.replaceBR, '\n', x)
        x = re.sub(self.removeExtraTag, '', x)
        return x.strip()

class BDTB(object):
    def __init__(self, baseURL, seeLZ):
        self.baseURL = baseURL
        self.seeLZ = '?see_lz=' + str(seeLZ)
        self.tool = Tool()
        self.defaultTitle = u'Baidu Tieba'
        self.floor = 1
        self.file = None

    def getPage(self, pageNum):
        try:
            url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            # print response.read()
            return response.read().decode('utf-8')
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print u'Failed to connect to Baidu Tieba, reason:', e.reason
            return None

    def getTitle(self):
        page = self.getPage(1)
        pattern = re.compile(r'.*?

(the rest of the code is cut off here in the original post)
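The code above is Python 2 (urllib2, print statements). As a hedged aside, here is a minimal Python 3 sketch of just the Tool class's tag-stripping, which can be tried standalone; the class and pattern names mirror the ones above but are otherwise my own port, not the author's code:

```python
import re

class Tool:
    """Python 3 sketch of the Tool class above: strip HTML noise from a post."""
    remove_img = re.compile(r'<img.*?>| {7}')       # images and runs of spaces
    remove_addr = re.compile(r'<a.*?>|</a>')        # links
    replace_line = re.compile(r'<tr>|<div>|</div>|</p>')  # tags that act as line breaks
    replace_td = re.compile(r'<td>')                # table cells become tabs
    replace_para = re.compile(r'<p.*?>')            # paragraph openings become newlines
    replace_br = re.compile(r'<br><br>|<br>')       # explicit line breaks
    remove_extra_tag = re.compile(r'<.*?>')         # anything left over

    def replace(self, x):
        x = self.remove_img.sub('', x)
        x = self.remove_addr.sub('', x)
        x = self.replace_line.sub('\n', x)
        x = self.replace_td.sub('\t', x)
        x = self.replace_para.sub('\n', x)
        x = self.replace_br.sub('\n', x)
        x = self.remove_extra_tag.sub('', x)
        return x.strip()
```

For example, Tool().replace('&lt;p&gt;hello&lt;br&gt;world&lt;/p&gt;') yields 'hello\nworld'.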
Here is what I took away about how this little program fits together. The overall flow of the crawler is: fix the URL -> fetch the page at that URL -> find the data to extract in the page -> work out the regular expression for processing the page content -> do a first pass extracting the needed data -> add interaction and polish on top.
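The flow above can be sketched as two tiny Python 3 functions; the URL and pattern in the usage comment are placeholders, not the real post's values:

```python
import re
from urllib import request

def fetch(url):
    """Step 2: fetch the page for a fixed URL (needs network access)."""
    with request.urlopen(request.Request(url)) as resp:
        return resp.read().decode('utf-8')

def extract(page, pattern):
    """Steps 3-5: apply a regular expression to pull the data out of the page."""
    return re.findall(pattern, page, re.S)

# Usage (placeholder URL and pattern -- substitute the real post):
#   page = fetch('http://tieba.baidu.com/p/1234567890?see_lz=1&pn=1')
#   for item in extract(page, r'<div class="content">(.*?)</div>'):
#       print(item.strip())
```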
The first step, fetching the page: getting a page is not difficult, and the code follows a fixed pattern:
request = urllib2.Request(url)        # build a Request object for the URL
response = urllib2.urlopen(request)   # this fetches the page; response holds the result
# print response.read()               # to print the page, read it with the built-in read()
return response.read().decode('utf-8')
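In Python 3, urllib2 was merged into urllib.request. A sketch of the same fetch, under the assumption that the query string is built the same way as in the code above (the helper names are mine):

```python
from urllib import request as urlrequest

def build_request(base_url, see_lz, page_num):
    # Assemble the same query string the Python 2 code builds:
    # baseURL + '?see_lz=' + seeLZ + '&pn=' + pageNum
    url = '%s?see_lz=%d&pn=%d' % (base_url, see_lz, page_num)
    return urlrequest.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

def get_page(base_url, see_lz, page_num):
    req = build_request(base_url, see_lz, page_num)
    with urlrequest.urlopen(req) as resp:          # network access happens here
        return resp.read().decode('utf-8')
```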
The second step is extracting the required data from the page: this is the core of crawling, and personally I found it the hard part.
Extracting the data requires regular expressions. I used to be unclear about what regular expressions were, but I have gotten to know them through the URL-routing part of the Django documentation and through Core Python Programming. A regular expression is a way of processing text, and a very powerful, almost magical one: with regular expressions we can process text quickly and get exactly what we want out of it. There are plenty of regular-expression tutorials online, so I won't introduce them here.
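One detail worth a tiny example, since it shows up throughout the crawler's patterns: the non-greedy qualifier `.*?` stops at the first possible match, while greedy `.*` swallows as much as it can:

```python
import re

html = '<td>first</td><td>second</td>'

greedy = re.findall(r'<td>(.*)</td>', html)   # .* runs to the LAST </td>
lazy = re.findall(r'<td>(.*?)</td>', html)    # .*? stops at the FIRST </td>

print(greedy)  # ['first</td><td>second']
print(lazy)    # ['first', 'second']
```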
What I learned in this small program is how to extract data from a web page. The pages we crawl are usually HTML, so they are flooded with all kinds of tags, but those tags follow certain patterns. For example, open http://www.qiushibaike.com/hot/page.
View the source code (I did not use the Baidu Tieba example here but one from Qiushibaike, because the pattern in the Qiushibaike case is especially obvious ~_~); you will find that every joke appears in the following format:
<div class="content">Went to worship today, touched the longevity stone at Badachu; may the whole family live long! <!--1440408912--></div>
And each joke is separated from the next by a div of this kind.
As long as we set up the regular-expression pattern that matches this structure, we can extract the data easily.
So my experience with extraction boils down to: carefully analyze the page source, then write the regular expression that matches it.
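To make that concrete, here is a small sketch applying such a pattern to sample HTML modeled on the snippet above (the sample text and the exact pattern are my illustration, not fetched from the live site):

```python
import re

# Two joke blocks in the shape shown above: content followed by a numeric comment
html = '''<div class="content">Went to worship today; may the whole family live long! <!--1440408912--></div>
<div class="content">Second joke here<!--1440408913--></div>'''

# re.S lets . match newlines, so jokes spanning lines are still captured
pattern = re.compile(r'<div class="content">(.*?)<!--\d+--></div>', re.S)
jokes = [j.strip() for j in pattern.findall(html)]

print(jokes)
```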
The last step is the extra interaction and polish; that part I leave to you!
Copyright notice: this is the blogger's original article; please do not reproduce it without the blogger's permission.
Python learning notes: crawling a Baidu Tieba post with a crawler.