Python Learning Notes: A Crawler for Baidu Tieba Posts

Source: Internet
Author: User
Tags: text processing

This follows an expert's excellent tutorial, "Python Crawler Part 2: Crawling Baidu Tieba Posts"; work through it step by step and the results are obvious. It is my first real attempt at a small crawler program, so writing it up on CSDN is also a way to spur myself on. Please don't flame me, and I welcome pointers from more experienced readers.

Because the original blog is very detailed (really detailed), I will not repeat the steps here.

First, here is my own code (mostly the same as the original's):

#!/usr/bin/env python
# coding=utf-8
import urllib2
import urllib
import re

class Tool(object):
    # strip images and runs of seven spaces
    removeImg = re.compile('<img.*?>| {7}')
    # strip link tags
    removeAddr = re.compile('<a.*?>|</a>')
    # turn table rows, divs and closing paragraphs into newlines
    replaceLine = re.compile('<tr>|<div>|</div>|</p>')
    # turn table cells into tabs
    replaceTD = re.compile('<td>')
    # turn opening paragraph tags into newlines
    replacePara = re.compile('<p.*?>')
    # turn double or single <br> into a single newline
    replaceBR = re.compile('<br><br>|<br>')
    # strip whatever tags remain
    removeExtraTag = re.compile('<.*?>')

    def replace(self, x):
        x = re.sub(self.removeImg, "", x)
        x = re.sub(self.removeAddr, "", x)
        x = re.sub(self.replaceLine, "\n", x)
        x = re.sub(self.replaceTD, "\t", x)
        x = re.sub(self.replacePara, "\n", x)
        x = re.sub(self.replaceBR, "\n", x)
        x = re.sub(self.removeExtraTag, "", x)
        return x.strip()

class BDTB(object):
    def __init__(self, baseUrl, seeLZ):
        self.baseURL = baseUrl
        self.seeLZ = '?see_lz=' + str(seeLZ)
        self.tool = Tool()
        self.defaultTitle = u'Baidu Tieba'
        self.floor = 1
        self.file = None

    def getPage(self, pageNum):
        try:
            url = self.baseURL + self.seeLZ + '&pn=' + str(pageNum)
            request = urllib2.Request(url)
            response = urllib2.urlopen(request)
            # print response.read()
            return response.read().decode('utf-8')
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print u'Failed to reach Baidu Tieba, reason:', e.reason
            return None

    def getTitle(self):
        page = self.getPage(1)
        # the post is cut off mid-pattern here; the regex below and the rest
        # of this method are an assumed reconstruction, not verbatim source
        pattern = re.compile(r'<h3 class="core_title_txt.*?>(.*?)</h3>', re.S)
        result = re.search(pattern, page)
        if result:
            return result.group(1).strip()
        return None
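As a quick check that the classes above hang together, here is a minimal usage sketch; the post ID in the URL and the see_lz flag value are just illustrative assumptions, not part of the original post:

# Hypothetical usage sketch: the post ID below is a placeholder example,
# and passing 1 for seeLZ appends ?see_lz=1 (original poster only).
baseURL = 'http://tieba.baidu.com/p/3138733512'
bdtb = BDTB(baseURL, 1)
page = bdtb.getPage(1)
if page:
    print bdtb.getTitle()
    print bdtb.tool.replace(page)[:200]   # first 200 chars of the cleaned text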
Below are some notes on how this little program was put together:

The overall flow of the crawler is: fix the URL -> fetch the page at that URL -> identify the data to extract from the page -> work out the regular expressions for processing the page content -> do a first pass at extracting the data -> add interaction and polish.

First, fetching the page: getting the page is not difficult, and it follows a fixed pattern:

request = urllib2.Request(url)          # build a Request object for the URL
response = urllib2.urlopen(request)     # fetch the page; this is the response we want
# print response.read()                 # to print the page, read it with the built-in read()
return response.read().decode('utf-8')

The second step is extracting the required data from the page: this is the core of crawling, and personally I found it the hardest part.

Extracting the data requires regular expressions. I used to be unclear about what a regular expression actually was, but I have since learned them from the Django documentation and from Core Python Programming. A regular expression is simply a way of processing text, and a very powerful, almost magical one: with regular expressions we can process text quickly and get exactly what we are after. There are plenty of regular-expression references online, so I will not introduce them here.
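As a tiny, made-up illustration of the two operations this program leans on, re.sub for rewriting text and re.findall for pulling data out of it (the strings here are invented for demonstration):

# coding=utf-8
# Made-up demonstration: re.sub rewrites text, re.findall extracts from it.
import re

text = '<p>first joke</p><br><p>second joke</p>'
print re.sub('<br>', '\n', text)        # rewrite: swap the tag for a newline
print re.findall('<p>(.*?)</p>', text)  # extract: ['first joke', 'second joke']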

What I learned from this small program is how to extract data from a web page. The pages we crawl are usually HTML, flooded with all kinds of tags, but those tags appear in certain patterns. For example, open http://www.qiushibaike.com/hot/page.

Viewing the source code (I am not using the Baidu Tieba example here but one from Qiushibaike, because the pattern is especially obvious there ~_~), we find that every joke appears in the following format:

<div class="content">Went to pray today and touched the longevity stone at Badachu; wishing the whole family long life! <!--1440408912--></div>
and each joke is separated from the next by its own wrapper div.

As long as we write a regular expression pattern that matches this structure, we can extract the data easily.
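For instance, here is a minimal sketch for the snippet above; the pattern is my own guess at the Qiushibaike markup of the time, and re.S lets . match newlines in case a joke spans several lines:

# coding=utf-8
# Sketch: pull the joke text out of the <div class="content"> block shown
# above. The trailing <!--...--> timestamp comment is matched and discarded.
import re

html = '<div class="content">Went to pray today and touched the longevity stone at Badachu; wishing the whole family long life! <!--1440408912--></div>'
pattern = re.compile(r'<div class="content">(.*?)<!--\d+--></div>', re.S)
match = re.search(pattern, html)
if match:
    print match.group(1).strip()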

So the lesson for extracting data is: carefully analyze the page's source code, then write the corresponding regular expression.

The last part is the extra interaction, which you can handle however you like!

Copyright notice: this is an original post by the blogger and may not be reproduced without permission.
