A pyspider crawler application.
1. To save the crawled data locally, first create a MySQL database (named `test`, matching the `SQL` class below) and then create a table in it, for example:
```sql
DROP TABLE IF EXISTS `douban_db`;
CREATE TABLE `douban_db` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(20) NOT NULL,
  `direct` varchar(30),
  `performer` date,
  `type` varchar(30),
  `district` varchar(20) NOT NULL,
  `language` varchar(30),
  `date` varchar(30),
  `time` varchar(30),
  `alias` varchar(20) NOT NULL,
  `score` varchar(30),
  `comments` varchar(300),
  `scenario` varchar(300),
  `IMDb` varchar(30),
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
```
2. If you crawl with the open-source framework pyspider, the results are stored by default in a local SQLite database (`result.db`). To make the data easier to work with, we store it in MySQL instead. The next step is therefore to override the `on_result` method so that it instantiates and calls our own `SQL` class.
Example:
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-03-20 09:46:20
# Project: fly_spider

import re

from pyspider.database.mysql.mysqldb import SQL
from pyspider.libs.base_handler import *


class Handler(BaseHandler):

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
    }

    crawl_config = {
        "headers": headers,
        "timeout": 100
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://movie.douban.com/tag/\w+", each.attr.href, re.U):
                self.crawl(each.attr.href, callback=self.list_page)

    @config(age=10 * 24 * 60 * 60, priority=2)
    def list_page(self, response):
        for each in response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div > table tr.item > td > div.pl2 > a').items():
            self.crawl(each.attr.href, priority=9, callback=self.detail_page)

    @config(priority=3)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('html > body > #wrapper > #content > h1 > span').text(),
            "direct": ",".join(x.text() for x in response.doc('a[rel="v:directedBy"]').items()),
            "performer": ",".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "type": ",".join(x.text() for x in response.doc('span[property="v:genre"]').items()),
            # "district": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            # "language": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "date": ",".join(x.text() for x in response.doc('span[property="v:initialReleaseDate"]').items()),
            "time": ",".join(x.text() for x in response.doc('span[property="v:runtime"]').items()),
            # "alias": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "score": response.doc('.rating_num').text(),
            "comments": response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div#comments-section > div.mod-hd > h2 > i').text(),
            "scenario": response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div.related-info > div#link-report.indent').text(),
            "IMDb": "".join(x.text() for x in response.doc('span[href]').items()),
        }

    def on_result(self, result):
        if not result or not result['title']:
            return
        sql = SQL()
        sql.replace('douban_db', **result)
```
The code above has the following points to note:
A. To prevent the server from detecting that the client is a crawler and blocking its IP (typically with a 403 Forbidden response), we add HTTP headers to each request so that it looks like a normal browser visit. The usage is as follows:
```python
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
}

crawl_config = {
    "headers": headers,
    "timeout": 100
}
```
B. `@every(minutes=24 * 60)` means the task is executed once a day. `@config(age=10 * 24 * 60 * 60)` means a crawled result is considered valid for 10 days; within that window the same page will not be re-crawled.
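As a rough illustration only (not pyspider's actual implementation), the two decorators can be thought of as attaching scheduling metadata to the handler methods, which the scheduler then reads to decide when to run them:

```python
# Simplified stand-ins for pyspider's @every and @config decorators.
# They merely attach attributes that a scheduler could inspect; the real
# framework does more, but the metadata idea is the same.
def every(minutes=0):
    def wrapper(func):
        func.every = minutes * 60  # interval between runs, in seconds
        return func
    return wrapper

def config(**kwargs):
    def wrapper(func):
        func.config = kwargs  # e.g. result validity age, priority
        return func
    return wrapper

@every(minutes=24 * 60)          # run once a day
def on_start():
    pass

@config(age=10 * 24 * 60 * 60)   # result stays valid for 10 days
def index_page():
    pass

print(on_start.every)       # 86400 seconds = one day
print(index_page.config)    # {'age': 864000}
```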
C. Next, the important part is overriding the `on_result` method, which works like polymorphism: the framework calls `on_result` whenever a result is returned. By default `on_result` writes the data into SQLite, but since we need to insert it into MySQL, we override it. The specific usage is as follows:
```python
def on_result(self, result):
    if not result or not result['title']:
        return
    sql = SQL()
    sql.replace('douban_db', **result)
```
Note that the guard `if not result or not result['title']:` is essential. Without it, `on_result` is also invoked with an empty result (for pages that return nothing), and the code raises an error complaining that the result is undefined.
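A minimal standalone sketch of why the guard matters (the function name here is hypothetical, not a pyspider API):

```python
def on_result_guarded(result):
    # pyspider also calls on_result with None or an empty result
    # (e.g. for index pages that return nothing), so subscripting
    # result['title'] blindly would raise an error.
    if not result or not result.get('title'):
        return None
    return result['title']

print(on_result_guarded(None))                 # None, no crash
print(on_result_guarded({'title': 'Seven'}))   # Seven
```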
3. The preceding code instantiates and calls our own `SQL` class via `from pyspider.database.mysql.mysqldb import SQL`, so that library file must exist at that path. Save the following content as `mysqldb.py` in the `pyspider/database/mysql/` directory:
```python
from six import itervalues
import mysql.connector
from mysql.connector import errorcode


class SQL:
    username = 'root'       # database username
    password = 'root'       # database password
    database = 'test'       # database name
    host = '123.30.25.231'  # database host address
    connection = ''
    connect = True
    placeholder = '%s'

    def __init__(self):
        if self.connect:
            SQL.connect(self)

    def escape(self, string):
        return '`%s`' % string

    def connect(self):
        config = {
            'user': SQL.username,
            'password': SQL.password,
            'host': SQL.host
        }
        if SQL.database is not None:
            config['database'] = SQL.database
        try:
            cnx = mysql.connector.connect(**config)
            SQL.connection = cnx
            return True
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("The credentials you provided are not correct.")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("The database you provided does not exist.")
            else:
                print("Something went wrong:", err)
            return False

    def replace(self, tablename=None, **values):
        if SQL.connection == '':
            print("Please connect first")
            return False
        tablename = self.escape(tablename)
        if values:
            _keys = ",".join(self.escape(k) for k in values)
            _values = ",".join([self.placeholder] * len(values))
            sql_query = "REPLACE INTO %s (%s) VALUES (%s)" % (tablename, _keys, _values)
        else:
            sql_query = "REPLACE INTO %s DEFAULT VALUES" % tablename
        cur = SQL.connection.cursor()
        try:
            if values:
                cur.execute(sql_query, list(itervalues(values)))
            else:
                cur.execute(sql_query)
            SQL.connection.commit()
            return True
        except mysql.connector.Error as err:
            print("An error occurred: {}".format(err))
            return False
```
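To see what `replace()` actually sends to MySQL, here is its query-construction step in isolation (`build_replace_query` is a hypothetical helper for illustration; it mirrors the logic but needs no database connection):

```python
def build_replace_query(tablename, **values):
    # Mirrors SQL.replace() above: backtick-escape the identifiers and
    # emit one %s placeholder per column, so the driver binds the values
    # safely instead of interpolating them into the SQL string.
    keys = ",".join("`%s`" % k for k in values)
    placeholders = ",".join(["%s"] * len(values))
    query = "REPLACE INTO `%s` (%s) VALUES (%s)" % (tablename, keys, placeholders)
    return query, list(values.values())

query, params = build_replace_query("douban_db",
                                    url="http://movie.douban.com/x",
                                    title="Seven")
print(query)   # REPLACE INTO `douban_db` (`url`,`title`) VALUES (%s,%s)
print(params)  # ['http://movie.douban.com/x', 'Seven']
```

`REPLACE INTO` is used rather than `INSERT INTO` so that re-crawling the same page overwrites the old row instead of failing on a duplicate key.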
Learning documents: http://blog.binux.me/2015/01/pyspider-tutorial-level-1-html-and-css-selector/
Test environment: http://demo.pyspider.org/