A pyspider crawler application.

1. To save the crawled data to a local database, first create a MySQL database locally (named `test` here) and then create a table in it (named `douban_db` here), for example (a short Python sketch for creating the database itself is given after the SQL below):

DROP TABLE IF EXISTS `douban_db`;
CREATE TABLE `douban_db` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(20) NOT NULL,
  `direct` varchar(30),
  `performer` date,
  `type` varchar(30),
  `district` varchar(20) NOT NULL,
  `language` varchar(30),
  `date` varchar(30),
  `time` varchar(30),
  `alias` varchar(20) NOT NULL,
  `score` varchar(30),
  `comments` varchar(300),
  `scenario` varchar(300),
  `IMDb` varchar(30),
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
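The SQL above only creates the table, so the `test` database must already exist. Below is a minimal sketch for creating it with mysql-connector, assuming a local MySQL server and the root/root credentials used by the SQL class later in this article; adjust the host and credentials to your own setup.

# create_db.py -- hypothetical helper, not part of the original article.
import mysql.connector

# Connect to the MySQL server itself (no database selected yet).
cnx = mysql.connector.connect(user='root', password='root', host='127.0.0.1')
cur = cnx.cursor()
# Create the database the crawler will write into, if it does not exist yet.
cur.execute("CREATE DATABASE IF NOT EXISTS `test` DEFAULT CHARACTER SET utf8")
cur.close()
cnx.close()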

2. If you use the open-source framework pyspider for crawling, the crawled results are stored by default in an SQLite result database (result.db). To make the data easier to work with, we store it in MySQL instead. The next step is therefore to override the on_result method so that it instantiates and calls our own SQL class.
Example:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-03-20 09:46:20
# Project: fly_spider

import re

from pyspider.database.mysql.mysqldb import SQL
from pyspider.libs.base_handler import *


class Handler(BaseHandler):

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
    }

    crawl_config = {
        "headers": headers,
        "timeout": 100
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://movie.douban.com/tag/\w+", each.attr.href, re.U):
                self.crawl(each.attr.href, callback=self.list_page)

    @config(age=10 * 24 * 60 * 60, priority=2)
    def list_page(self, response):
        for each in response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div > table tr.item > td > div.pl2 > a').items():
            self.crawl(each.attr.href, priority=9, callback=self.detail_page)

    @config(priority=3)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('html > body > #wrapper > #content > h1 > span').text(),
            "direct": ",".join(x.text() for x in response.doc('a[rel="v:directedBy"]').items()),
            "performer": ",".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "type": ",".join(x.text() for x in response.doc('span[property="v:genre"]').items()),
#            "district": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
#            "language": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "date": ",".join(x.text() for x in response.doc('span[property="v:initialReleaseDate"]').items()),
            "time": ",".join(x.text() for x in response.doc('span[property="v:runtime"]').items()),
#            "alias": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "score": response.doc('.rating_num').text(),
            "comments": response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div#comments-section > div.mod-hd > h2 > i').text(),
            "scenario": response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div.related-info > div#link-report.indent').text(),
            "IMDb": "".join(x.text() for x in response.doc('span[href]').items()),
        }

    def on_result(self, result):
        if not result or not result['title']:
            return
        sql = SQL()
        sql.replace('douban_db', **result)

The code above has the following points to note:
A. To prevent the server from deciding that the client is running a crawler and blocking its IP (specifically, returning 403 Forbidden), we need to add HTTP headers to each request so that it is disguised as an ordinary browser visit. The usage is as follows:

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
    }

    crawl_config = {
        "headers": headers,
        "timeout": 100
    }

B. @every(minutes=24 * 60) means the task is executed once a day, and @config(age=10 * 24 * 60 * 60) means a crawled page is considered valid for 10 days and will not be re-crawled within that period (a short illustration follows).
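A minimal sketch of how the two decorators combine; the handler name and comments are illustrative only and not taken from the article.

# schedule_demo.py -- hypothetical sketch of pyspider scheduling decorators.
from pyspider.libs.base_handler import *

class ScheduleDemo(BaseHandler):

    @every(minutes=24 * 60)          # re-run on_start once every 24 hours
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)   # pages fetched by this callback stay "fresh" for 10 days
    def index_page(self, response):
        # within the 10-day age window pyspider will not re-crawl the same URL,
        # even though on_start itself is triggered daily
        pass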

C. Next, and most importantly, override the on_result method; this relies on polymorphism, since the framework calls on_result whenever a callback returns a result. By default on_result writes the data into SQLite, but because we want to insert the data into MySQL, we override the method. The specific usage is as follows:

    def on_result(self, result):
        if not result or not result['title']:
            return
        sql = SQL()
        sql.replace('douban_db', **result)

Note that the line `if not result or not result['title']:` is very important. Without it an error is reported complaining that result is undefined, because on_result is also called for callbacks such as index_page and list_page that return nothing, so result can be None.

3. In the preceding code we mentioned instantiating and calling our own SQL class; the import `from pyspider.database.mysql.mysqldb import SQL` only works if the library file actually exists at that path. Put the following content in the pyspider/database/mysql/ directory and name it mysqldb.py (a small standalone usage sketch follows the code):

# pyspider/database/mysql/mysqldb.py
from six import itervalues
import mysql.connector
from mysql.connector import errorcode  # needed for the error-code checks below
from datetime import date, datetime, timedelta


class SQL:

    username = 'root'        # database username
    password = 'root'        # database password
    database = 'test'        # database name
    host = '123.30.25.231'   # database host address
    connection = ''
    connect = True
    placeholder = '%s'

    def __init__(self):
        if self.connect:
            SQL.connect(self)

    def escape(self, string):
        return '`%s`' % string

    def connect(self):
        config = {
            'user': SQL.username,
            'password': SQL.password,
            'host': SQL.host
        }
        if SQL.database is not None:
            config['database'] = SQL.database
        try:
            cnx = mysql.connector.connect(**config)
            SQL.connection = cnx
            return True
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("The credentials you provided are not correct.")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("The database you provided does not exist.")
            else:
                print("Something went wrong: {}".format(err))
            return False

    def replace(self, tablename=None, **values):
        if SQL.connection == '':
            print("Please connect first")
            return False
        tablename = self.escape(tablename)
        if values:
            _keys = ", ".join(self.escape(k) for k in values)
            _values = ", ".join([self.placeholder] * len(values))
            sql_query = "REPLACE INTO %s (%s) VALUES (%s)" % (tablename, _keys, _values)
        else:
            sql_query = "REPLACE INTO %s DEFAULT VALUES" % tablename
        cur = SQL.connection.cursor()
        try:
            if values:
                cur.execute(sql_query, list(itervalues(values)))
            else:
                cur.execute(sql_query)
            SQL.connection.commit()
            return True
        except mysql.connector.Error as err:
            print("An error occurred: {}".format(err))
            return False
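Below is a minimal standalone smoke test for the class above, run outside pyspider; the row values are hypothetical and only meant to confirm that the connection and the REPLACE statement work against the douban_db table created in step 1.

# test_mysqldb.py -- hypothetical smoke test, not part of the original article.
from pyspider.database.mysql.mysqldb import SQL

sql = SQL()  # __init__ connects to MySQL automatically

# Insert (or overwrite) one row using made-up values for a few existing columns.
ok = sql.replace('douban_db',
                 url='test-url-1',   # hypothetical value, kept short to fit varchar(20)
                 district='N/A',
                 alias='N/A',
                 score='9.0')
print("replace succeeded: %s" % ok)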

Tutorial: http://blog.binux.me/2015/01/pyspider-tutorial-level-1-html-and-css-selector/
Test environment: http://demo.pyspider.org/
