Scrapy Crawler Growth Diary: Writing Crawled Content to a MySQL Database


Previously we tried using Scrapy to crawl posts from a blog site (see "Scrapy Crawler Growth Diary: Creating the Project, Extracting Data, and Saving It as JSON"), but the data was saved as JSON in a plain text file. That is obviously not enough for everyday applications, so this time let's look at how to store the crawled content in a common MySQL database.

Note: all of the operations here build on "Scrapy Crawler Growth Diary: Creating the Project, Extracting Data, and Saving It as JSON". If you missed that article, please read it first.

Environment: MySQL 5.1.67-log

Operation Steps:

1. Check if Python supports MySQL

[[email protected] ~]# python
Python 2.7.10 (default, Jun 5 ...)
[GCC 4.4.4 20100726 (Red Hat 4.4.4-...)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named MySQLdb

If you see "ImportError: No module named MySQLdb", Python does not yet have MySQL support and the driver has to be installed manually; see step 2. If there is no error, go straight to step 3.

2. Install MySQL support for Python

[[email protected] ~]# pip install MySQL-python
Collecting MySQL-python
  Downloading MySQL-python-1.2.5.zip (108kB)
    100% |████████████████████████████████| 110kB 115kB/s
Building wheels for collected packages: MySQL-python
  Running setup.py bdist_wheel for MySQL-python
  Stored in directory: /root/.cache/pip/wheels/8c/0d/11/d654cad764b92636ce047897dd2b9e1b0cd76c22f813c5851a
Successfully built MySQL-python
Installing collected packages: MySQL-python
Successfully installed MySQL-python-1.2.5

After installation, run step 1 again to confirm that Python can now import MySQLdb.
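If you prefer a non-interactive check, a one-liner like the following should also work; it is just a convenience sketch, relying on the __version__ attribute that MySQL-python exposes:

    # quick non-interactive check; prints the driver version if the import succeeds
    python -c "import MySQLdb; print MySQLdb.__version__"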

If you run into problems, you can try: LC_ALL=C pip install MySQL-python
If you then get an error like: error: Python.h: No such file or directory
try installing python-devel first:

yum install python-devel

3. Create the database and table

CREATE DATABASE cnblogsdb DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

CREATE TABLE `cnblogsinfo` (
    `linkmd5id` char(32) NOT NULL COMMENT 'URL MD5 encoded ID',
    `title` text COMMENT 'title',
    `description` text COMMENT 'description',
    `link` text COMMENT 'URL link',
    `listURL` text COMMENT 'paging URL link',
    `updated` datetime DEFAULT NULL COMMENT 'last update time',
    PRIMARY KEY (`linkmd5id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Attention:

a) Create the database with DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci, otherwise you may end up with garbled characters. This one took me a long time to track down.

b) The table itself is also encoded as utf8.
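If you want to double-check from Python that the new database is reachable and really using utf8, a small sanity check along these lines can help. It is only a sketch and assumes the root/root credentials that we will put into settings.py in step 4:

    # minimal sanity check: connect to the new database and inspect the table definition
    import MySQLdb

    conn = MySQLdb.connect(host='localhost', db='cnblogsdb', user='root',
                           passwd='root', charset='utf8')
    cur = conn.cursor()
    cur.execute("SHOW CREATE TABLE cnblogsinfo")
    print cur.fetchone()[1]   # should mention ENGINE=MyISAM and DEFAULT CHARSET=utf8
    cur.close()
    conn.close()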

4. Setting up MySQL configuration information

As we know from the previous article (Scrapy Crawler Growth Diary: Creating the Project, Extracting Data, and Saving It as JSON), Scrapy ultimately processes the results through pipelines.py, so to save to a MySQL database we inevitably have to modify that file. Working with MySQL also means connecting to the database, which raises the question of where to keep the connection information. We could hard-code it in pipelines.py, but that would make the program harder to maintain, so instead we put the configuration in the project's configuration file, settings.py.

Add the following configuration items to settings.py:

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'cnblogsdb'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'root'
# End of MySQL database configuration
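For reference, these are ordinary Scrapy settings, so they can be read wherever the settings object is available. A minimal sketch, purely illustrative, using the key names above and Scrapy's standard get_project_settings helper in a standalone script run from the project directory:

    # sketch: reading the MySQL settings from the project configuration
    from scrapy.utils.project import get_project_settings
    import MySQLdb

    settings = get_project_settings()
    conn = MySQLdb.connect(host=settings['MYSQL_HOST'],
                           db=settings['MYSQL_DBNAME'],
                           user=settings['MYSQL_USER'],
                           passwd=settings['MYSQL_PASSWD'],
                           charset='utf8')
    print "connected to", settings['MYSQL_DBNAME']
    conn.close()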

5. Modify pipelines.py

The modified file is shown below. Note that two classes are defined in pipelines.py: JsonWithEncodingCnblogsPipeline writes the items to the JSON file, and MySQLStoreCnblogsPipeline (remember this name, it will be needed later!) writes them to the database.

The main functions of the MySQLStoreCnblogsPipeline class are:

a) Read the database configuration and create the database connection pool; this is done in the from_settings class method.

b) Insert the record if the URL does not exist yet, and update it if it does; this is handled by the custom _do_upinsert method.

c) Guarantee the uniqueness of each URL via its MD5 hash, computed by _get_linkmd5id (a short note on this follows below).
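As a side note on c): the reason linkmd5id was declared char(32) in step 3 is that an MD5 hex digest is always exactly 32 characters, however long the URL is. A tiny illustration (the URL is just an example):

    # an MD5 hex digest is always 32 hex characters, hence char(32) for the primary key
    from hashlib import md5
    digest = md5('http://www.cnblogs.com/').hexdigest()
    print digest, len(digest)   # a 32-character hex string, and 32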

[[email protected] cnblogs]# more pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy import signals
import json
import codecs
from twisted.enterprise import adbapi
from twisted.python import log          # needed for log.err() in _handle_error
from datetime import datetime
from hashlib import md5
import MySQLdb
import MySQLdb.cursors


class JsonWithEncodingCnblogsPipeline(object):
    def __init__(self):
        self.file = codecs.open('cnblogs.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.file.close()


class MySQLStoreCnblogsPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        return cls(dbpool)

    # called by Scrapy for every item
    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self._do_upinsert, item, spider)
        d.addErrback(self._handle_error, item, spider)
        d.addBoth(lambda _: item)
        return d

    # insert each row, or update it if the URL is already stored
    def _do_upinsert(self, conn, item, spider):
        linkmd5id = self._get_linkmd5id(item)
        now = datetime.utcnow().replace(microsecond=0).isoformat(' ')
        conn.execute("""
            select 1 from cnblogsinfo where linkmd5id = %s
        """, (linkmd5id, ))
        ret = conn.fetchone()
        if ret:
            conn.execute("""
                update cnblogsinfo set title = %s, description = %s,
                    link = %s, listURL = %s, updated = %s
                where linkmd5id = %s
            """, (item['title'], item['desc'], item['link'], item['listURL'], now, linkmd5id))
        else:
            conn.execute("""
                insert into cnblogsinfo(linkmd5id, title, description, link, listURL, updated)
                values(%s, %s, %s, %s, %s, %s)
            """, (linkmd5id, item['title'], item['desc'], item['link'], item['listURL'], now))

    # get the MD5 hash of the URL
    def _get_linkmd5id(self, item):
        # hash the URL with MD5 so duplicate URLs map to the same primary key
        return md5(item['link']).hexdigest()

    # error handling
    def _handle_error(self, failure, item, spider):
        log.err(failure)
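One design note: because linkmd5id is the table's primary key, the select-then-update-or-insert logic in _do_upinsert could also be collapsed into a single statement with MySQL's INSERT ... ON DUPLICATE KEY UPDATE. This is only an optional variation, not what the code above does; a standalone sketch of the idea, reusing the step 3 schema and step 4 credentials:

    # optional variation (sketch only): single-statement upsert keyed on linkmd5id
    from datetime import datetime
    from hashlib import md5
    import MySQLdb

    conn = MySQLdb.connect(host='localhost', db='cnblogsdb', user='root',
                           passwd='root', charset='utf8', use_unicode=True)
    cur = conn.cursor()
    link = 'http://www.cnblogs.com/'   # example URL only
    now = datetime.utcnow().replace(microsecond=0).isoformat(' ')
    cur.execute("""
        insert into cnblogsinfo(linkmd5id, title, description, link, listURL, updated)
        values(%s, %s, %s, %s, %s, %s)
        on duplicate key update
            title = values(title), description = values(description),
            link = values(link), listURL = values(listURL), updated = values(updated)
    """, (md5(link).hexdigest(), 'a title', 'a description', link, link, now))
    conn.commit()
    cur.close()
    conn.close()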

6. Enable the MySQLStoreCnblogsPipeline class so that it actually runs

Modify the settings.py configuration file and add MySQLStoreCnblogsPipeline to the item pipelines:

ITEM_PIPELINES = {
    'cnblogs.pipelines.JsonWithEncodingCnblogsPipeline': 300,
    'cnblogs.pipelines.MySQLStoreCnblogsPipeline': 300,
}

At this point, all the files that need changes have been changed. Let's run a test and see the results.

7. Testing

[[email protected] cnblogs]# scrapy crawl CnblogsSpider

View the results in the database:
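A quick way to inspect what the crawl actually wrote, reusing the credentials from step 4 (just a sketch; you can equally well run the same queries in the mysql client):

    # peek at what the spider stored
    import MySQLdb

    conn = MySQLdb.connect(host='localhost', db='cnblogsdb', user='root',
                           passwd='root', charset='utf8')
    cur = conn.cursor()
    cur.execute("select count(*) from cnblogsinfo")
    print "rows stored:", cur.fetchone()[0]
    cur.execute("select title, link from cnblogsinfo limit 5")
    for title, link in cur.fetchall():
        print title, link
    cur.close()
    conn.close()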

At this point, writing the content crawled by Scrapy into a database is working. However, this crawler is still very weak: it has none of the basics such as file downloads or distributed crawling, and many sites have anti-crawling measures in place. What do we do when we run into one of those? We will tackle these problems one by one over the coming period. Just imagine: if the crawler were powerful enough and we had enough content, could we build a small vertical search engine of our own? Exciting to think about!

The latest source code has been updated at: https://github.com/jackgitgz/CnblogsSpider
