A pyspider crawler application.
1. To save the crawled data locally, first create a MySQL database (named `test`, matching the `SQL` class below) and then create a table in it, for example:
```sql
DROP TABLE IF EXISTS `douban_db`;
CREATE TABLE `douban_db` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(20) NOT NULL,
  `direct` varchar(30),
  `performer` date,
  `type` varchar(30),
  `district` varchar(20) NOT NULL,
  `language` varchar(30),
  `date` varchar(30),
  `time` varchar(30),
  `alias` varchar(20) NOT NULL,
  `score` varchar(30),
  `comments` varchar(300),
  `scenario` varchar(300),
  `IMDb` varchar(30),
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
```
2. If you crawl with the open-source framework pyspider, the results are stored by default in a local SQLite database (`result.db`). To make the data easier to work with, we store it in MySQL instead. The next step is therefore to override the `on_result` method so that it instantiates and calls our own `SQL` class.
Example:
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-03-20 09:46:20
# Project: fly_spider

import re

from pyspider.database.mysql.mysqldb import SQL
from pyspider.libs.base_handler import *


class Handler(BaseHandler):

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
    }

    crawl_config = {
        "headers": headers,
        "timeout": 100
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://movie.douban.com/tag/\w+", each.attr.href, re.U):
                self.crawl(each.attr.href, callback=self.list_page)

    @config(age=10 * 24 * 60 * 60, priority=2)
    def list_page(self, response):
        for each in response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div > table tr.item > td > div.pl2 > a').items():
            self.crawl(each.attr.href, priority=9, callback=self.detail_page)

    @config(priority=3)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('html > body > #wrapper > #content > h1 > span').text(),
            "direct": ",".join(x.text() for x in response.doc('a[rel="v:directedBy"]').items()),
            "performer": ",".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "type": ",".join(x.text() for x in response.doc('span[property="v:genre"]').items()),
            # "district": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            # "language": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "date": ",".join(x.text() for x in response.doc('span[property="v:initialReleaseDate"]').items()),
            "time": ",".join(x.text() for x in response.doc('span[property="v:runtime"]').items()),
            # "alias": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "score": response.doc('.rating_num').text(),
            "comments": response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div#comments-section > div.mod-hd > h2 > i').text(),
            "scenario": response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div.related-info > div#link-report.indent').text(),
            "IMDb": "".join(x.text() for x in response.doc('span[href]').items()),
        }

    def on_result(self, result):
        if not result or not result['title']:
            return
        sql = SQL()
        sql.replace('douban_db', **result)
```
The code above has the following points to note:
A. To prevent the server from detecting that the client is a crawler and blocking its IP (typically with a 403 Forbidden response), we add HTTP headers to each request so that it looks like a normal browser visit. The usage is as follows:
```python
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
}

crawl_config = {
    "headers": headers,
    "timeout": 100
}
```
B. `@every(minutes=24 * 60)` means the task is executed once a day. `@config(age=10 * 24 * 60 * 60)` means a crawled result is considered valid for 10 days; within that window the same page will not be re-crawled.
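As a rough illustration only (not pyspider's actual implementation), the two decorators can be thought of as attaching scheduling metadata to the handler methods, which the scheduler then reads to decide when to run them:

```python
# Simplified stand-ins for pyspider's @every and @config decorators.
# They merely attach attributes that a scheduler could inspect; the real
# framework does more, but the metadata idea is the same.
def every(minutes=0):
    def wrapper(func):
        func.every = minutes * 60  # interval between runs, in seconds
        return func
    return wrapper

def config(**kwargs):
    def wrapper(func):
        func.config = kwargs  # e.g. result validity age, priority
        return func
    return wrapper

@every(minutes=24 * 60)          # run once a day
def on_start():
    pass

@config(age=10 * 24 * 60 * 60)   # result stays valid for 10 days
def index_page():
    pass

print(on_start.every)       # 86400 seconds = one day
print(index_page.config)    # {'age': 864000}
```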
C. Next, the important part is overriding the `on_result` method, which works like polymorphism: the framework calls `on_result` whenever a result is returned. By default `on_result` writes the data into SQLite, but since we need to insert it into MySQL, we override it. The specific usage is as follows:
```python
def on_result(self, result):
    if not result or not result['title']:
        return
    sql = SQL()
    sql.replace('douban_db', **result)
```
Note that the guard `if not result or not result['title']:` is essential. Without it, `on_result` is also invoked with an empty result (for pages that return nothing), and the code raises an error complaining that the result is undefined.
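A minimal standalone sketch of why the guard matters (the function name here is hypothetical, not a pyspider API):

```python
def on_result_guarded(result):
    # pyspider also calls on_result with None or an empty result
    # (e.g. for index pages that return nothing), so subscripting
    # result['title'] blindly would raise an error.
    if not result or not result.get('title'):
        return None
    return result['title']

print(on_result_guarded(None))                 # None, no crash
print(on_result_guarded({'title': 'Seven'}))   # Seven
```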
3. The preceding code instantiates and calls our own `SQL` class via `from pyspider.database.mysql.mysqldb import SQL`, so that library file must exist at that path. Save the following content as `mysqldb.py` in the `pyspider/database/mysql/` directory:
```python
from six import itervalues
import mysql.connector
from mysql.connector import errorcode


class SQL:
    username = 'root'       # database username
    password = 'root'       # database password
    database = 'test'       # database name
    host = '123.30.25.231'  # database host address
    connection = ''
    connect = True
    placeholder = '%s'

    def __init__(self):
        if self.connect:
            SQL.connect(self)

    def escape(self, string):
        return '`%s`' % string

    def connect(self):
        config = {
            'user': SQL.username,
            'password': SQL.password,
            'host': SQL.host
        }
        if SQL.database is not None:
            config['database'] = SQL.database
        try:
            cnx = mysql.connector.connect(**config)
            SQL.connection = cnx
            return True
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("The credentials you provided are not correct.")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("The database you provided does not exist.")
            else:
                print("Something went wrong:", err)
            return False

    def replace(self, tablename=None, **values):
        if SQL.connection == '':
            print("Please connect first")
            return False
        tablename = self.escape(tablename)
        if values:
            _keys = ",".join(self.escape(k) for k in values)
            _values = ",".join([self.placeholder] * len(values))
            sql_query = "REPLACE INTO %s (%s) VALUES (%s)" % (tablename, _keys, _values)
        else:
            sql_query = "REPLACE INTO %s DEFAULT VALUES" % tablename
        cur = SQL.connection.cursor()
        try:
            if values:
                cur.execute(sql_query, list(itervalues(values)))
            else:
                cur.execute(sql_query)
            SQL.connection.commit()
            return True
        except mysql.connector.Error as err:
            print("An error occurred: {}".format(err))
            return False
```
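To see what `replace()` actually sends to MySQL, here is its query-construction step in isolation (`build_replace_query` is a hypothetical helper for illustration; it mirrors the logic but needs no database connection):

```python
def build_replace_query(tablename, **values):
    # Mirrors SQL.replace() above: backtick-escape the identifiers and
    # emit one %s placeholder per column, so the driver binds the values
    # safely instead of interpolating them into the SQL string.
    keys = ",".join("`%s`" % k for k in values)
    placeholders = ",".join(["%s"] * len(values))
    query = "REPLACE INTO `%s` (%s) VALUES (%s)" % (tablename, keys, placeholders)
    return query, list(values.values())

query, params = build_replace_query("douban_db",
                                    url="http://movie.douban.com/x",
                                    title="Seven")
print(query)   # REPLACE INTO `douban_db` (`url`,`title`) VALUES (%s,%s)
print(params)  # ['http://movie.douban.com/x', 'Seven']
```

`REPLACE INTO` is used rather than `INSERT INTO` so that re-crawling the same page overwrites the old row instead of failing on a duplicate key.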
Learning documents: http://blog.binux.me/2015/01/pyspider-tutorial-level-1-html-and-css-selector/
Test environment: http://demo.pyspider.org/