The previous post built a simple Scrapy spider that crawled a blog site (see "Scrapy Crawler Growth Diary: Creating a Project, Extracting Data, and Saving It in JSON Format"), but it saved the scraped data as JSON in a plain text file. That is clearly not enough for everyday applications, so this post looks at how to store the crawled content in a common MySQL database instead.
Note: everything here builds on "Scrapy Crawler Growth Diary: Extracting Data and Saving It in JSON Format"; if you missed that article, please read it first.
Environment: MySQL 5.1.67-log
Operation Steps:
1. Check whether Python supports MySQL
```shell
[[email protected] ~]# python
Python 2.7.10 (default, Jun 5, 17:56:24)
[GCC 4.4.4 20100726 (Red Hat 4.4.4-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named MySQLdb
```
If you see `ImportError: No module named MySQLdb`, Python does not yet have a MySQL driver and you need to install one manually; see step 2. If there is no error, skip ahead to step 3.
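The same availability check can be done programmatically. This is a generic sketch (nothing in it is specific to MySQL-python except the module name you pass in):

```python
# Generic importability check: returns True if the named module can be imported.
# 'MySQLdb' is the module provided by the MySQL-python package.
import importlib

def has_module(name):
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

# Will print False until the driver from step 2 is installed.
print(has_module('MySQLdb'))
```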
2. Install MySQL support for Python
```shell
[[email protected] ~]# pip install mysql-python
Collecting mysql-python
  Downloading MySQL-python-1.2.5.zip (108kB)
    100% |████████████████████████████████| 110kB 115kB/s
Building wheels for collected packages: mysql-python
  Running setup.py bdist_wheel for mysql-python
  Stored in directory: /root/.cache/pip/wheels/8c/0d/11/d654cad764b92636ce047897dd2b9e1b0cd76c22f813c5851a
Successfully built mysql-python
Installing collected packages: mysql-python
Successfully installed mysql-python-1.2.5
```
After the installation, run step 1 again to confirm that Python can now import MySQLdb.
If the installation fails, you can try:

```shell
LC_ALL=C pip install mysql-python
```

If you still get an error such as `error: Python.h: No such file or directory`, install python-devel first:

```shell
yum install python-devel
```
3. Create the database and table
```sql
CREATE DATABASE cnblogsdb DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

CREATE TABLE `cnblogsinfo` (
    `linkmd5id`   char(32) NOT NULL COMMENT 'URL MD5 encoded id',
    `title`       text COMMENT 'title',
    `description` text COMMENT 'description',
    `link`        text COMMENT 'url link',
    `listUrl`     text COMMENT 'paging url',
    `updated`     datetime DEFAULT NULL COMMENT 'last update time',
    PRIMARY KEY (`linkmd5id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
```
Note:
a) Create the database with DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci, otherwise you may end up with garbled (mojibake) text. This one cost me a long debugging session.
b) The table itself is also encoded as utf8.
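A side note on the key column: the pipeline builds linkmd5id from Python's `md5(...).hexdigest()`, which always yields a 32-character hex string, so a fixed-width char column is enough to hold it:

```python
# MD5 hex digests are always 32 characters long, regardless of input length,
# so linkmd5id fits a fixed-width char column. The URL here is just a sample.
from hashlib import md5

digest = md5('http://www.cnblogs.com/'.encode('utf-8')).hexdigest()
print(len(digest))  # 32
```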
4. Set up the MySQL configuration information
As the previous article (Scrapy Crawler Growth Diary: Creating a Project, Extracting Data, and Saving It in JSON Format) showed, Scrapy ultimately processes results through pipelines.py, so saving to MySQL inevitably means modifying that file. Operating on MySQL also requires connecting to the database, which raises the question of where the connection details should live. We could hard-code them directly in pipelines.py, but that hurts maintainability, so it is better to put the configuration in the project's settings.py file.
Add the following configuration items to settings.py:
```python
# start MySQL database configure setting
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'cnblogsdb'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'root'
# end of MySQL database configure setting
```
5. Modify pipelines.py
The modified file is shown below. Note that two classes are defined in pipelines.py: JsonWithEncodingCnblogsPipeline writes to the JSON file, and MySQLStoreCnblogsPipeline (remember this name, it will be used again later!) writes to the database.
The main responsibilities of the MySQLStoreCnblogsPipeline class are:
a) reading the database configuration and creating the connection pool, done in the from_settings class method;
b) inserting a row when the URL does not exist yet and updating it when it does, done in the custom _do_upinsert method;
c) guaranteeing URL uniqueness through the MD5 helper _get_linkmd5id.
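The insert-or-update logic from b) and c) can be sketched on its own. The following uses sqlite3 instead of MySQL purely so it runs without a database server; the table and column names follow the article's schema, but this is an illustration, not the pipeline itself:

```python
# Minimal sketch of "upsert keyed by MD5(url)": the URL's MD5 is the primary
# key, so re-crawling the same URL updates the row instead of duplicating it.
import sqlite3
from hashlib import md5

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE cnblogsinfo ("
             "linkmd5id CHAR(32) PRIMARY KEY, title TEXT, link TEXT)")

def upsert(conn, title, link):
    linkmd5id = md5(link.encode('utf-8')).hexdigest()  # 32-char hex key
    ret = conn.execute("SELECT 1 FROM cnblogsinfo WHERE linkmd5id = ?",
                       (linkmd5id,)).fetchone()
    if ret:
        conn.execute("UPDATE cnblogsinfo SET title = ?, link = ? "
                     "WHERE linkmd5id = ?", (title, link, linkmd5id))
    else:
        conn.execute("INSERT INTO cnblogsinfo(linkmd5id, title, link) "
                     "VALUES (?, ?, ?)", (linkmd5id, title, link))

upsert(conn, 'first title', 'http://example.com/post/1')
upsert(conn, 'updated title', 'http://example.com/post/1')  # same URL: update
rows = conn.execute("SELECT title FROM cnblogsinfo").fetchall()
print(rows)  # a single row carrying the updated title
```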
[[email protected] cnblogs]# more pipelines.py

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy import signals
from scrapy import log
import json
import codecs
from twisted.enterprise import adbapi
from datetime import datetime
from hashlib import md5
import MySQLdb
import MySQLdb.cursors

class JsonWithEncodingCnblogsPipeline(object):
    def __init__(self):
        self.file = codecs.open('cnblogs.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()

class MySQLStoreCnblogsPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        return cls(dbpool)

    # called by the pipeline framework for each item
    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._do_upinsert, item, spider)
        d.addErrback(self._handle_error, item, spider)
        d.addBoth(lambda _: item)
        return d

    # update or insert each row in the database
    def _do_upinsert(self, conn, item, spider):
        linkmd5id = self._get_linkmd5id(item)
        #print linkmd5id
        now = datetime.utcnow().replace(microsecond=0).isoformat(' ')
        conn.execute("""
            select 1 from cnblogsinfo where linkmd5id = %s
        """, (linkmd5id, ))
        ret = conn.fetchone()
        if ret:
            conn.execute("""
                update cnblogsinfo set title = %s, description = %s,
                    link = %s, listUrl = %s, updated = %s
                where linkmd5id = %s
            """, (item['title'], item['desc'], item['link'],
                  item['listUrl'], now, linkmd5id))
        else:
            conn.execute("""
                insert into cnblogsinfo(linkmd5id, title, description,
                    link, listUrl, updated)
                values(%s, %s, %s, %s, %s, %s)
            """, (linkmd5id, item['title'], item['desc'], item['link'],
                  item['listUrl'], now))

    # get the md5 of the url; hashing the url avoids storing duplicates
    def _get_linkmd5id(self, item):
        return md5(item['link']).hexdigest()

    # error handling
    def _handle_error(self, failure, item, spider):
        log.err(failure)
```
6. Enable the MySQLStoreCnblogsPipeline class so it takes effect
Modify the settings.py configuration file to register MySQLStoreCnblogsPipeline:
```python
ITEM_PIPELINES = {
    'cnblogs.pipelines.JsonWithEncodingCnblogsPipeline': 300,
    'cnblogs.pipelines.MySQLStoreCnblogsPipeline': 300,
}
```
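The numbers in ITEM_PIPELINES are priorities: Scrapy runs enabled pipelines in ascending order of this value (conventionally in the 0–1000 range). A tiny toy model, not Scrapy itself, showing the ordering rule with hypothetical priority values:

```python
# Toy model of ITEM_PIPELINES ordering: lower numbers run first.
# The class paths are just labels here; only the sort on the value matters.
def pipeline_order(item_pipelines):
    return [name for name, prio in
            sorted(item_pipelines.items(), key=lambda kv: kv[1])]

order = pipeline_order({
    'cnblogs.pipelines.MySQLStoreCnblogsPipeline': 800,        # hypothetical
    'cnblogs.pipelines.JsonWithEncodingCnblogsPipeline': 300,  # hypothetical
})
print(order[0])  # the JSON pipeline runs first, because 300 < 800
```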
At this point all the files that need changes have been changed. Let's test and see how it works.
7. Testing
```shell
[[email protected] cnblogs]# scrapy crawl CnblogsSpider
```
View the results in the database:
With that, writing Scrapy-crawled web content into the database is working. However, this crawler is still far too weak: it lacks even basics such as file downloads and distributed crawling, and many sites actively defend against crawlers, so what do we do when we run into one of those? Over the coming posts we will tackle these problems one by one. Imagine: if the crawler were strong enough and the content rich enough, could we build a vertical search engine of our own? Exciting just to think about!
The latest source code has been pushed to: https://github.com/jackgitgz/CnblogsSpider
Scrapy Crawler Growth Diary: Writing Crawled Content to a MySQL Database