The previous post built a simple Scrapy spider that crawled a blog site (see "Scrapy Crawler Growth Diary: Creating a Project, Extracting Data, and Saving It in JSON Format"), but it saved the scraped data as JSON in a plain text file. That is clearly not enough for everyday applications, so this post looks at how to store the crawled content in a common MySQL database instead.
Note: everything here builds on "Scrapy Crawler Growth Diary: Extracting Data and Saving It in JSON Format"; if you missed that article, please read it first.
Environment: MySQL 5.1.67-log
Operation Steps:
1. Check whether Python supports MySQL
```shell
[[email protected] ~]# python
Python 2.7.10 (default, Jun 5, 17:56:24)
[GCC 4.4.4 20100726 (Red Hat 4.4.4-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named MySQLdb
```
If you see `ImportError: No module named MySQLdb`, Python does not yet have a MySQL driver and you need to install one manually; see step 2. If there is no error, skip ahead to step 3.
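The same availability check can be done programmatically. This is a generic sketch (nothing in it is specific to MySQL-python except the module name you pass in):

```python
# Generic importability check: returns True if the named module can be imported.
# 'MySQLdb' is the module provided by the MySQL-python package.
import importlib

def has_module(name):
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

# Will print False until the driver from step 2 is installed.
print(has_module('MySQLdb'))
```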
2. Install MySQL support for Python
```shell
[[email protected] ~]# pip install mysql-python
Collecting mysql-python
  Downloading MySQL-python-1.2.5.zip (108kB)
    100% |████████████████████████████████| 110kB 115kB/s
Building wheels for collected packages: mysql-python
  Running setup.py bdist_wheel for mysql-python
  Stored in directory: /root/.cache/pip/wheels/8c/0d/11/d654cad764b92636ce047897dd2b9e1b0cd76c22f813c5851a
Successfully built mysql-python
Installing collected packages: mysql-python
Successfully installed mysql-python-1.2.5
```
After the installation, run step 1 again to confirm that Python can now import MySQLdb.
If the installation fails, you can try:

```shell
LC_ALL=C pip install mysql-python
```

If you still get an error such as `error: Python.h: No such file or directory`, install python-devel first:

```shell
yum install python-devel
```
3. Create the database and table
```sql
CREATE DATABASE cnblogsdb DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

CREATE TABLE `cnblogsinfo` (
    `linkmd5id`   char(32) NOT NULL COMMENT 'URL MD5 encoded id',
    `title`       text COMMENT 'title',
    `description` text COMMENT 'description',
    `link`        text COMMENT 'url link',
    `listUrl`     text COMMENT 'paging url',
    `updated`     datetime DEFAULT NULL COMMENT 'last update time',
    PRIMARY KEY (`linkmd5id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
```
Note:
a) Create the database with DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci, otherwise you may end up with garbled (mojibake) text. This one cost me a long debugging session.
b) The table itself is also encoded as utf8.
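A side note on the key column: the pipeline builds linkmd5id from Python's `md5(...).hexdigest()`, which always yields a 32-character hex string, so a fixed-width char column is enough to hold it:

```python
# MD5 hex digests are always 32 characters long, regardless of input length,
# so linkmd5id fits a fixed-width char column. The URL here is just a sample.
from hashlib import md5

digest = md5('http://www.cnblogs.com/'.encode('utf-8')).hexdigest()
print(len(digest))  # 32
```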
4. Set up the MySQL configuration information
As the previous article (Scrapy Crawler Growth Diary: Creating a Project, Extracting Data, and Saving It in JSON Format) showed, Scrapy ultimately processes results through pipelines.py, so saving to MySQL inevitably means modifying that file. Operating on MySQL also requires connecting to the database, which raises the question of where the connection details should live. We could hard-code them directly in pipelines.py, but that hurts maintainability, so it is better to put the configuration in the project's settings.py file.
Add the following configuration items to settings.py:
```python
# start MySQL database configure setting
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'cnblogsdb'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'root'
# end of MySQL database configure setting
```
5. Modify pipelines.py
The modified file is shown below. Note that two classes are defined in pipelines.py: JsonWithEncodingCnblogsPipeline writes to the JSON file, and MySQLStoreCnblogsPipeline (remember this name, it will be used again later!) writes to the database.
The main responsibilities of the MySQLStoreCnblogsPipeline class are:
a) reading the database configuration and creating the connection pool, done in the from_settings class method;
b) inserting a row when the URL does not exist yet and updating it when it does, done in the custom _do_upinsert method;
c) guaranteeing URL uniqueness through the MD5 helper _get_linkmd5id.
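The insert-or-update logic from b) and c) can be sketched on its own. The following uses sqlite3 instead of MySQL purely so it runs without a database server; the table and column names follow the article's schema, but this is an illustration, not the pipeline itself:

```python
# Minimal sketch of "upsert keyed by MD5(url)": the URL's MD5 is the primary
# key, so re-crawling the same URL updates the row instead of duplicating it.
import sqlite3
from hashlib import md5

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE cnblogsinfo ("
             "linkmd5id CHAR(32) PRIMARY KEY, title TEXT, link TEXT)")

def upsert(conn, title, link):
    linkmd5id = md5(link.encode('utf-8')).hexdigest()  # 32-char hex key
    ret = conn.execute("SELECT 1 FROM cnblogsinfo WHERE linkmd5id = ?",
                       (linkmd5id,)).fetchone()
    if ret:
        conn.execute("UPDATE cnblogsinfo SET title = ?, link = ? "
                     "WHERE linkmd5id = ?", (title, link, linkmd5id))
    else:
        conn.execute("INSERT INTO cnblogsinfo(linkmd5id, title, link) "
                     "VALUES (?, ?, ?)", (linkmd5id, title, link))

upsert(conn, 'first title', 'http://example.com/post/1')
upsert(conn, 'updated title', 'http://example.com/post/1')  # same URL: update
rows = conn.execute("SELECT title FROM cnblogsinfo").fetchall()
print(rows)  # a single row carrying the updated title
```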
[[email protected] cnblogs]# more pipelines.py

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy import signals
from scrapy import log
import json
import codecs
from twisted.enterprise import adbapi
from datetime import datetime
from hashlib import md5
import MySQLdb
import MySQLdb.cursors

class JsonWithEncodingCnblogsPipeline(object):
    def __init__(self):
        self.file = codecs.open('cnblogs.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()

class MySQLStoreCnblogsPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        return cls(dbpool)

    # called by the pipeline framework for each item
    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._do_upinsert, item, spider)
        d.addErrback(self._handle_error, item, spider)
        d.addBoth(lambda _: item)
        return d

    # update or insert each row in the database
    def _do_upinsert(self, conn, item, spider):
        linkmd5id = self._get_linkmd5id(item)
        #print linkmd5id
        now = datetime.utcnow().replace(microsecond=0).isoformat(' ')
        conn.execute("""
            select 1 from cnblogsinfo where linkmd5id = %s
        """, (linkmd5id, ))
        ret = conn.fetchone()
        if ret:
            conn.execute("""
                update cnblogsinfo set title = %s, description = %s,
                    link = %s, listUrl = %s, updated = %s
                where linkmd5id = %s
            """, (item['title'], item['desc'], item['link'],
                  item['listUrl'], now, linkmd5id))
        else:
            conn.execute("""
                insert into cnblogsinfo(linkmd5id, title, description,
                    link, listUrl, updated)
                values(%s, %s, %s, %s, %s, %s)
            """, (linkmd5id, item['title'], item['desc'], item['link'],
                  item['listUrl'], now))

    # get the md5 of the url; hashing the url avoids storing duplicates
    def _get_linkmd5id(self, item):
        return md5(item['link']).hexdigest()

    # error handling
    def _handle_error(self, failure, item, spider):
        log.err(failure)
```
6. Enable the MySQLStoreCnblogsPipeline class so it takes effect
Modify the settings.py configuration file to register MySQLStoreCnblogsPipeline:
```python
ITEM_PIPELINES = {
    'cnblogs.pipelines.JsonWithEncodingCnblogsPipeline': 300,
    'cnblogs.pipelines.MySQLStoreCnblogsPipeline': 300,
}
```
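The numbers in ITEM_PIPELINES are priorities: Scrapy runs enabled pipelines in ascending order of this value (conventionally in the 0–1000 range). A tiny toy model, not Scrapy itself, showing the ordering rule with hypothetical priority values:

```python
# Toy model of ITEM_PIPELINES ordering: lower numbers run first.
# The class paths are just labels here; only the sort on the value matters.
def pipeline_order(item_pipelines):
    return [name for name, prio in
            sorted(item_pipelines.items(), key=lambda kv: kv[1])]

order = pipeline_order({
    'cnblogs.pipelines.MySQLStoreCnblogsPipeline': 800,        # hypothetical
    'cnblogs.pipelines.JsonWithEncodingCnblogsPipeline': 300,  # hypothetical
})
print(order[0])  # the JSON pipeline runs first, because 300 < 800
```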
At this point all the files that need changes have been changed. Let's test and see how it works.
7. Testing
```shell
[[email protected] cnblogs]# scrapy crawl CnblogsSpider
```
View the results in the database:
With that, writing Scrapy-crawled web content into the database is working. However, this crawler is still far too weak: it lacks even basics such as file downloads and distributed crawling, and many sites actively defend against crawlers, so what do we do when we run into one of those? Over the coming posts we will tackle these problems one by one. Imagine: if the crawler were strong enough and the content rich enough, could we build a vertical search engine of our own? Exciting just to think about!
The latest source code has been pushed to: https://github.com/jackgitgz/CnblogsSpider
Scrapy Crawler Growth Diary: Writing Crawled Content to a MySQL Database