Python incremental crawler Pyspider


1. To be able to save the crawled data to a local database, first create a MySQL database locally, and then create a table named test in it, for example:

DROP TABLE IF EXISTS `test`;
CREATE TABLE `test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(20) NOT NULL,
  `direct` varchar(30),
  `performer` date,
  `type` varchar(30),
  `district` varchar(20) NOT NULL,
  `language` varchar(30),
  `date` varchar(30),
  `time` varchar(30),
  `alias` varchar(20) NOT NULL,
  `score` varchar(30),
  `comments` varchar(300),
  `scenario` varchar(300),
  `IMDb` varchar(30),
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
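If you want to confirm the database and table are reachable before wiring up the crawler, a quick check with mysql.connector (the same driver used by the SQL helper later in this article) might look like the sketch below; the user, password and host are placeholders for your local setup:

# Optional sanity check (not part of the crawler): confirm that the test
# database is reachable and the table exists. Credentials are placeholders.
import mysql.connector

cnx = mysql.connector.connect(user='root', password='root',
                              host='127.0.0.1', database='test')
cur = cnx.cursor()
cur.execute("SHOW TABLES")
print([t[0] for t in cur.fetchall()])  # should include the table created above
cnx.close()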

2. If you use the open-source framework Pyspider to build the crawler, the crawled results are by default stored in the result.db SQLite database; for convenience of later processing we will store the results in MySQL instead. The first thing to do is to override the on_result method so that it instantiates and calls an SQL class of our own implementation. A complete example:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-03-20 09:46:20
# Project: fly_spider

import re

from pyspider.database.mysql.mysqldb import SQL
from pyspider.libs.base_handler import *


class Handler(BaseHandler):

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
    }

    crawl_config = {
        "headers": headers,
        "timeout": 100
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://movie.douban.com/tag/\w+", each.attr.href, re.U):
                self.crawl(each.attr.href, callback=self.list_page)

    @config(age=10 * 24 * 60 * 60, priority=2)
    def list_page(self, response):
        for each in response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div > table tr.item > td > div.pl2 > a').items():
            self.crawl(each.attr.href, priority=9, callback=self.detail_page)

    @config(priority=3)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('html > body > #wrapper > #content > h1 > span').text(),
            "direct": ",".join(x.text() for x in response.doc('a[rel="v:directedBy"]').items()),
            "performer": ",".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "type": ",".join(x.text() for x in response.doc('span[property="v:genre"]').items()),
            # "district": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            # "language": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "date": ",".join(x.text() for x in response.doc('span[property="v:initialReleaseDate"]').items()),
            "time": ",".join(x.text() for x in response.doc('span[property="v:runtime"]').items()),
            # "alias": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "score": response.doc('.rating_num').text(),
            "comments": response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div#comments-section > div.mod-hd > h2 > i').text(),
            "scenario": response.doc('html > body > div#wrapper > div#content > div.grid-16-8.clearfix > div.article > div.related-info > div#link-report.indent').text(),
            "IMDb": "".join(x.text() for x in response.doc('span[href]').items()),
        }

    def on_result(self, result):
        if not result or not result['title']:
            return
        sql = SQL()
        sql.replace('test', **result)

A few points about the above code need explanation:
A. To keep the server from detecting that the client is a crawler and blocking its IP (which shows up as a 403 Forbidden response), we add HTTP headers to every request so that it looks like an ordinary browser visit. The usage is as follows:

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
}

crawl_config = {
    "headers": headers,
    "timeout": 100
}

B. @every(minutes=24 * 60) means on_start is executed once a day, and @config(age=10 * 24 * 60 * 60) means a crawled page is considered valid for 10 days, so it will not be re-fetched within that period. The short sketch below shows how the two decorators are attached.
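As a minimal illustration (the class name SchedulingDemo and the empty index_page body are just for this sketch; the real methods are in the full handler above):

from pyspider.libs.base_handler import *


class SchedulingDemo(BaseHandler):

    @every(minutes=24 * 60)            # on_start is triggered once every 24 hours
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)     # a fetched page is treated as fresh for 10 days
    def index_page(self, response):
        pass  # parsing logic lives here; see the full handler above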

C. The next point is the important one: overriding the on_result method, which works much like polymorphism. When a callback returns its result at the end, the framework calls on_result. By default on_result flushes the data into SQLite, but since we need to insert the data into MySQL, we override on_result as follows:

    def on_result(self, result):
        if not result or not result['title']:
            return
        sql = SQL()
        sql.replace('test', **result)

Note that the check if not result or not result['title']: is important; without it the method will raise an error reporting that result is of an undefined (None) type, because on_result is also called for callbacks that return nothing.

3. The code above instantiates and calls an SQL class of our own implementation, imported with from pyspider.database.mysql.mysqldb import SQL, so this library must be provided in that directory. Save the following code as mysqldb.py in the pyspider/pyspider/database/mysql/ directory:

from six import itervalues
import mysql.connector
from mysql.connector import errorcode
from datetime import date, datetime, timedelta


class SQL:

    username = 'root'          # database user name
    password = 'root'          # database password
    database = 'test'          # database name
    host = '172.30.25.231'     # database host address
    connection = ''
    connect = True
    placeholder = '%s'

    def __init__(self):
        if self.connect:
            SQL.connect(self)

    def escape(self, string):
        return '`%s`' % string

    def connect(self):
        config = {
            'user': SQL.username,
            'password': SQL.password,
            'host': SQL.host
        }
        if SQL.database != None:
            config['database'] = SQL.database
        try:
            cnx = mysql.connector.connect(**config)
            SQL.connection = cnx
            return True
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("The credentials you provided are not correct.")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("The database you provided does not exist.")
            else:
                print("Something went wrong: {}".format(err))
            return False

    def replace(self, tablename=None, **values):
        if SQL.connection == '':
            print("Please connect first")
            return False
        tablename = self.escape(tablename)
        if values:
            _keys = ", ".join(self.escape(k) for k in values)
            _values = ", ".join([self.placeholder] * len(values))
            sql_query = "REPLACE INTO %s (%s) VALUES (%s)" % (tablename, _keys, _values)
        else:
            sql_query = "REPLACE INTO %s DEFAULT VALUES" % tablename
        cur = SQL.connection.cursor()
        try:
            if values:
                cur.execute(sql_query, list(itervalues(values)))
            else:
                cur.execute(sql_query)
            SQL.connection.commit()
            return True
        except mysql.connector.Error as err:
            print("An error occurred: {}".format(err))
            return False
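With mysqldb.py in place, the helper can also be exercised on its own, outside pyspider, which is a convenient way to verify the MySQL credentials before running the crawler. A minimal sketch with made-up column values:

# Standalone sketch: insert one dummy row through the SQL helper defined above.
# The column names match the test table; the values are placeholders.
from pyspider.database.mysql.mysqldb import SQL

sql = SQL()  # connects using the username/password/host configured in the class
sql.replace('test',
            url='http://example.com',
            district='nowhere',
            alias='dummy',
            score='9.0')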

Learning Document: http://blog.binux.me/2015/01/pyspider-tutorial-level-1-html-and-css-selector/
Test environment: http://demo.pyspider.org/
