1. To save the crawled data to a local database, first create a MySQL database named `test` locally, and then create a table in that database. An example schema:
```sql
DROP TABLE IF EXISTS `test`;
CREATE TABLE `test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(20) NOT NULL,
  `title` varchar(100) NOT NULL,
  `direct` varchar(30),
  `performer` varchar(300),
  `type` varchar(30),
  `district` varchar(20) NOT NULL,
  `language` varchar(30),
  `date` varchar(30),
  `time` varchar(30),
  `alias` varchar(20) NOT NULL,
  `score` varchar(30),
  `comments` varchar(300),
  `scenario` varchar(300),
  `IMDb` varchar(30),
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
```
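Because on_result (shown later) feeds the dict returned by detail_page straight into this table, every key of that dict needs a matching column. A quick sanity check, with both name lists transcribed by hand (an assumption to verify against your actual schema):

```python
# Columns in the table above, besides the auto-increment `id`
# (assumption: mirrors the schema, including `title`; compare
# against your own CREATE TABLE statement).
table_columns = {
    "url", "title", "direct", "performer", "type", "district", "language",
    "date", "time", "alias", "score", "comments", "scenario", "IMDb",
}

# Keys produced by the crawler's detail_page callback (shown later).
result_keys = {
    "url", "title", "direct", "performer", "type",
    "date", "time", "score", "comments", "scenario", "IMDb",
}

# Any key without a matching column would make REPLACE INTO fail
# with an "Unknown column" error once on_result runs.
missing = result_keys - table_columns
```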
2. If you use the open-source framework pyspider for the crawler, the crawled results are stored by default in a SQLite database (result.db). For convenience, we will store the results in MySQL instead. To do that, we override the on_result method and have it call an SQL helper class we implement ourselves. For example:
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-03-20 09:46:20
# Project: fly_spider

import re
from pyspider.database.mysql.mysqldb import SQL
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
    }

    crawl_config = {
        "headers": headers,
        "timeout": 100
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://movie.douban.com/tag/\w+", each.attr.href, re.U):
                self.crawl(each.attr.href, callback=self.list_page)

    @config(age=10 * 24 * 60 * 60, priority=2)
    def list_page(self, response):
        for each in response.doc('html > body > div#wrapper > div#content > '
                                 'div.grid-16-8.clearfix > div.article > div > '
                                 'table tr.item > td > div.pl2 > a').items():
            self.crawl(each.attr.href, priority=9, callback=self.detail_page)

    @config(priority=3)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('html > body > #wrapper > #content > h1 > span').text(),
            "direct": ",".join(x.text() for x in response.doc('a[rel="v:directedBy"]').items()),
            "performer": ",".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "type": ",".join(x.text() for x in response.doc('span[property="v:genre"]').items()),
            # "district": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            # "language": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "date": ",".join(x.text() for x in response.doc('span[property="v:initialReleaseDate"]').items()),
            "time": ",".join(x.text() for x in response.doc('span[property="v:runtime"]').items()),
            # "alias": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "score": response.doc('.rating_num').text(),
            "comments": response.doc('html > body > div#wrapper > div#content > '
                                     'div.grid-16-8.clearfix > div.article > '
                                     'div#comments-section > div.mod-hd > h2 > i').text(),
            "scenario": response.doc('html > body > div#wrapper > div#content > '
                                     'div.grid-16-8.clearfix > div.article > '
                                     'div.related-info > div#link-report.indent').text(),
            "IMDb": "".join(x.text() for x in response.doc('span[href]').items()),
        }

    def on_result(self, result):
        if not result or not result['title']:
            return
        sql = SQL()
        sql.replace('test', **result)
```
Several points in the code above need explanation:

a. To keep the server from detecting that the client is a crawler and banning its IP (which shows up as 403 Forbidden responses), we add HTTP headers to every request so the traffic looks like an ordinary browser. The usage is as follows:
```python
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
}

crawl_config = {
    "headers": headers,
    "timeout": 100
}
```
b. @every(minutes=24 * 60) means on_start is executed once a day; @config(age=10 * 24 * 60 * 60) means a crawled result is considered valid for 10 days, after which the page is re-crawled.
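The two schedule values are just arithmetic on minutes and seconds, spelled out:

```python
# @every takes minutes: 24 * 60 minutes is one full day between runs
# of on_start.
every_minutes = 24 * 60

# @config takes age in seconds: a page crawled less than this many
# seconds ago is considered fresh and is not re-fetched.
age_seconds = 10 * 24 * 60 * 60
```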
c. Next is a more important point: overriding the on_result method (in effect, using polymorphism). Whenever a callback returns, pyspider invokes on_result on the return value. By default, on_result flushes the data into SQLite, but since we need to insert the data into MySQL instead, we rewrite on_result as follows:
```python
def on_result(self, result):
    if not result or not result['title']:
        return
    sql = SQL()
    sql.replace('test', **result)
```
Note that the check `if not result or not result['title']:` is important; without it the program errors out complaining that result is None (of undefined type).
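A minimal standalone sketch (outside pyspider) of why the guard matters: callbacks such as on_start and index_page return nothing, so on_result is also invoked with None, and indexing None raises a TypeError:

```python
def on_result_without_guard(result):
    # Indexing None raises TypeError: 'NoneType' object is not subscriptable.
    return result['title']

def on_result_with_guard(result):
    # The same guard used in the handler above: skip empty results
    # instead of crashing.
    if not result or not result['title']:
        return None
    return result['title']
```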
3. The code above instantiates and calls an SQL helper class of our own, imported via `from pyspider.database.mysql.mysqldb import SQL`, so you must implement that module at the corresponding path. Save the following code as mysqldb.py in the pyspider/pyspider/database/mysql/ directory:
```python
from six import itervalues
import mysql.connector
from mysql.connector import errorcode


class SQL:
    username = 'root'          # database username
    password = 'root'          # database password
    database = 'test'          # database name
    host = '172.30.25.231'     # database host address
    connection = ''
    connect = True
    placeholder = '%s'

    def __init__(self):
        if self.connect:
            SQL.connect(self)

    def escape(self, string):
        return '`%s`' % string

    def connect(self):
        config = {
            'user': SQL.username,
            'password': SQL.password,
            'host': SQL.host
        }
        if SQL.database is not None:
            config['database'] = SQL.database

        try:
            cnx = mysql.connector.connect(**config)
            SQL.connection = cnx
            return True
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("The credentials you provided are not correct.")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("The database you provided does not exist.")
            else:
                print("Something went wrong: ", err)
            return False

    def replace(self, tablename=None, **values):
        if SQL.connection == '':
            print("Please connect first")
            return False

        tablename = self.escape(tablename)

        if values:
            _keys = ", ".join(self.escape(k) for k in values)
            _values = ", ".join([self.placeholder] * len(values))
            sql_query = "REPLACE INTO %s (%s) VALUES (%s)" % (tablename, _keys, _values)
        else:
            sql_query = "REPLACE INTO %s DEFAULT VALUES" % tablename

        cur = SQL.connection.cursor()
        try:
            if values:
                cur.execute(sql_query, list(itervalues(values)))
            else:
                cur.execute(sql_query)
            SQL.connection.commit()
            return True
        except mysql.connector.Error as err:
            print("An error occurred: {}".format(err))
            return False
```
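To see what SQL.replace sends to the server without a live MySQL connection, the query construction can be reproduced on its own (build_replace_query is a hypothetical helper mirroring the body of replace above):

```python
def build_replace_query(tablename, **values):
    # Same backtick quoting as SQL.escape above.
    escape = lambda s: '`%s`' % s
    keys = ", ".join(escape(k) for k in values)
    # One %s placeholder per value; actual values are passed
    # separately to cursor.execute for safe parameter binding.
    placeholders = ", ".join(['%s'] * len(values))
    return "REPLACE INTO %s (%s) VALUES (%s)" % (escape(tablename), keys, placeholders)
```

For example, build_replace_query('test', url='...', title='...') yields a statement of the form REPLACE INTO \`test\` (\`url\`, \`title\`) VALUES (%s, %s).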
Learning Document: http://blog.binux.me/2015/01/pyspider-tutorial-level-1-html-and-css-selector/
Test environment: http://demo.pyspider.org/
Python incremental crawler Pyspider