1. To save the crawled data to a local database, first create a MySQL database named `test` locally, and then create a table in that database. An example schema:
```sql
DROP TABLE IF EXISTS `test`;
CREATE TABLE `test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(20) NOT NULL,
  `title` varchar(100) NOT NULL,
  `direct` varchar(30),
  `performer` varchar(300),
  `type` varchar(30),
  `district` varchar(20) NOT NULL,
  `language` varchar(30),
  `date` varchar(30),
  `time` varchar(30),
  `alias` varchar(20) NOT NULL,
  `score` varchar(30),
  `comments` varchar(300),
  `scenario` varchar(300),
  `IMDb` varchar(30),
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
```
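Because on_result (shown later) feeds the dict returned by detail_page straight into this table, every key of that dict needs a matching column. A quick sanity check, with both name lists transcribed by hand (an assumption to verify against your actual schema):

```python
# Columns in the table above, besides the auto-increment `id`
# (assumption: mirrors the schema, including `title`; compare
# against your own CREATE TABLE statement).
table_columns = {
    "url", "title", "direct", "performer", "type", "district", "language",
    "date", "time", "alias", "score", "comments", "scenario", "IMDb",
}

# Keys produced by the crawler's detail_page callback (shown later).
result_keys = {
    "url", "title", "direct", "performer", "type",
    "date", "time", "score", "comments", "scenario", "IMDb",
}

# Any key without a matching column would make REPLACE INTO fail
# with an "Unknown column" error once on_result runs.
missing = result_keys - table_columns
```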
2. If you use the open-source framework pyspider for the crawler, the crawled results are stored by default in a SQLite database (result.db). For convenience, we will store the results in MySQL instead. To do that, we override the on_result method and have it call an SQL helper class we implement ourselves. For example:
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-03-20 09:46:20
# Project: fly_spider

import re
from pyspider.database.mysql.mysqldb import SQL
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
    }

    crawl_config = {
        "headers": headers,
        "timeout": 100
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://movie.douban.com/tag/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://movie.douban.com/tag/\w+", each.attr.href, re.U):
                self.crawl(each.attr.href, callback=self.list_page)

    @config(age=10 * 24 * 60 * 60, priority=2)
    def list_page(self, response):
        for each in response.doc('html > body > div#wrapper > div#content > '
                                 'div.grid-16-8.clearfix > div.article > div > '
                                 'table tr.item > td > div.pl2 > a').items():
            self.crawl(each.attr.href, priority=9, callback=self.detail_page)

    @config(priority=3)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('html > body > #wrapper > #content > h1 > span').text(),
            "direct": ",".join(x.text() for x in response.doc('a[rel="v:directedBy"]').items()),
            "performer": ",".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "type": ",".join(x.text() for x in response.doc('span[property="v:genre"]').items()),
            # "district": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            # "language": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "date": ",".join(x.text() for x in response.doc('span[property="v:initialReleaseDate"]').items()),
            "time": ",".join(x.text() for x in response.doc('span[property="v:runtime"]').items()),
            # "alias": "".join(x.text() for x in response.doc('a[rel="v:starring"]').items()),
            "score": response.doc('.rating_num').text(),
            "comments": response.doc('html > body > div#wrapper > div#content > '
                                     'div.grid-16-8.clearfix > div.article > '
                                     'div#comments-section > div.mod-hd > h2 > i').text(),
            "scenario": response.doc('html > body > div#wrapper > div#content > '
                                     'div.grid-16-8.clearfix > div.article > '
                                     'div.related-info > div#link-report.indent').text(),
            "IMDb": "".join(x.text() for x in response.doc('span[href]').items()),
        }

    def on_result(self, result):
        if not result or not result['title']:
            return
        sql = SQL()
        sql.replace('test', **result)
```
Several points in the code above need explanation:

a. To keep the server from detecting that the client is a crawler and banning its IP (which shows up as 403 Forbidden responses), we add HTTP headers to every request so the traffic looks like an ordinary browser. The usage is as follows:
```python
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"
}

crawl_config = {
    "headers": headers,
    "timeout": 100
}
```
b. @every(minutes=24 * 60) means on_start is executed once a day; @config(age=10 * 24 * 60 * 60) means a crawled result is considered valid for 10 days, after which the page is re-crawled.
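The two schedule values are just arithmetic on minutes and seconds, spelled out:

```python
# @every takes minutes: 24 * 60 minutes is one full day between runs
# of on_start.
every_minutes = 24 * 60

# @config takes age in seconds: a page crawled less than this many
# seconds ago is considered fresh and is not re-fetched.
age_seconds = 10 * 24 * 60 * 60
```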
c. Next is a more important point: overriding the on_result method (in effect, using polymorphism). Whenever a callback returns, pyspider invokes on_result on the return value. By default, on_result flushes the data into SQLite, but since we need to insert the data into MySQL instead, we rewrite on_result as follows:
```python
def on_result(self, result):
    if not result or not result['title']:
        return
    sql = SQL()
    sql.replace('test', **result)
```
Note that the check `if not result or not result['title']:` is important; without it the program errors out complaining that result is None (of undefined type).
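A minimal standalone sketch (outside pyspider) of why the guard matters: callbacks such as on_start and index_page return nothing, so on_result is also invoked with None, and indexing None raises a TypeError:

```python
def on_result_without_guard(result):
    # Indexing None raises TypeError: 'NoneType' object is not subscriptable.
    return result['title']

def on_result_with_guard(result):
    # The same guard used in the handler above: skip empty results
    # instead of crashing.
    if not result or not result['title']:
        return None
    return result['title']
```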
3. The code above instantiates and calls an SQL helper class of our own, imported via `from pyspider.database.mysql.mysqldb import SQL`, so you must implement that module at the corresponding path. Save the following code as mysqldb.py in the pyspider/pyspider/database/mysql/ directory:
```python
from six import itervalues
import mysql.connector
from mysql.connector import errorcode


class SQL:
    username = 'root'          # database username
    password = 'root'          # database password
    database = 'test'          # database name
    host = '172.30.25.231'     # database host address
    connection = ''
    connect = True
    placeholder = '%s'

    def __init__(self):
        if self.connect:
            SQL.connect(self)

    def escape(self, string):
        return '`%s`' % string

    def connect(self):
        config = {
            'user': SQL.username,
            'password': SQL.password,
            'host': SQL.host
        }
        if SQL.database is not None:
            config['database'] = SQL.database

        try:
            cnx = mysql.connector.connect(**config)
            SQL.connection = cnx
            return True
        except mysql.connector.Error as err:
            if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
                print("The credentials you provided are not correct.")
            elif err.errno == errorcode.ER_BAD_DB_ERROR:
                print("The database you provided does not exist.")
            else:
                print("Something went wrong: ", err)
            return False

    def replace(self, tablename=None, **values):
        if SQL.connection == '':
            print("Please connect first")
            return False

        tablename = self.escape(tablename)

        if values:
            _keys = ", ".join(self.escape(k) for k in values)
            _values = ", ".join([self.placeholder] * len(values))
            sql_query = "REPLACE INTO %s (%s) VALUES (%s)" % (tablename, _keys, _values)
        else:
            sql_query = "REPLACE INTO %s DEFAULT VALUES" % tablename

        cur = SQL.connection.cursor()
        try:
            if values:
                cur.execute(sql_query, list(itervalues(values)))
            else:
                cur.execute(sql_query)
            SQL.connection.commit()
            return True
        except mysql.connector.Error as err:
            print("An error occurred: {}".format(err))
            return False
```
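To see what SQL.replace sends to the server without a live MySQL connection, the query construction can be reproduced on its own (build_replace_query is a hypothetical helper mirroring the body of replace above):

```python
def build_replace_query(tablename, **values):
    # Same backtick quoting as SQL.escape above.
    escape = lambda s: '`%s`' % s
    keys = ", ".join(escape(k) for k in values)
    # One %s placeholder per value; actual values are passed
    # separately to cursor.execute for safe parameter binding.
    placeholders = ", ".join(['%s'] * len(values))
    return "REPLACE INTO %s (%s) VALUES (%s)" % (escape(tablename), keys, placeholders)
```

For example, build_replace_query('test', url='...', title='...') yields a statement of the form REPLACE INTO \`test\` (\`url\`, \`title\`) VALUES (%s, %s).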
Learning Document: http://blog.binux.me/2015/01/pyspider-tutorial-level-1-html-and-css-selector/
Test environment: http://demo.pyspider.org/
Python incremental crawler Pyspider