Ninth: Data table design and saving items to a JSON file

Source: Internet
Author: User
Tags: sql, virtual environment

The last section explained that pipelines intercept items: depending on the priority assigned to each pipeline, the item passes through them in turn, so a pipeline can save items to a JSON file, a database, and so on, as sketched below.
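
For reference, which pipelines run and in what order is controlled by the ITEM_PIPELINES setting. A minimal sketch using the pipeline classes from this article (the module path ArticleSpider.pipelines is an assumption based on the project name mentioned later):

# settings.py -- lower numbers run first; each pipeline receives the item in turn
ITEM_PIPELINES = {
    # module path assumed; adjust it to your own project layout
    'ArticleSpider.pipelines.JsonWithEncodingPipeline': 200,
    'ArticleSpider.pipelines.MysqlTwistedPipeline': 300,
}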

The following is a custom pipeline that writes items to a JSON file:

# Store items to a JSON file
import codecs
import json


class JsonWithEncodingPipeline(object):
    def __init__(self):
        # Opening the file with the codecs module saves us a lot of encoding
        # trouble; start by opening a JSON file
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    # The process_item method performs the actual work on the item
    def process_item(self, item, spider):
        # ensure_ascii must be set to False, otherwise non-ASCII characters
        # will be stored incorrectly
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        # Remember to return the item, because later pipelines may still need it
        return item

    # Finally, close the file when the spider finishes
    def close_spider(self, spider):
        self.file.close()

Scrapy also comes with a built-in JSON exporter:

from scrapy.exporters import JsonItemExporter

In addition to JsonItemExporter, Scrapy offers several other types of exporter.
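
For example, the scrapy.exporters module also contains exporters for other output formats:

from scrapy.exporters import (
    CsvItemExporter,        # CSV files
    XmlItemExporter,        # XML files
    PickleItemExporter,     # Python pickle format
    PprintItemExporter,     # pretty-printed Python objects
    JsonLinesItemExporter,  # one JSON object per line
)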

class JsonExporterPipeline(object):
    # Export items to a JSON file using the JsonItemExporter provided by Scrapy
    def __init__(self):
        # Open a JSON file (binary mode, the exporter writes bytes)
        self.file = open('articleexport.json', 'wb')
        # Create an exporter instance; the three arguments are similar to the
        # custom JSON pipeline above
        self.exporter = JsonItemExporter(self.file, encoding='utf-8',
                                         ensure_ascii=False)
        # Start exporting
        self.exporter.start_exporting()

    def close_spider(self, spider):
        # Finish the export
        self.exporter.finish_exporting()
        # Close the file
        self.file.close()

    # process_item must still return the item at the end
    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Compared with the custom JSON pipeline, the exporter takes care of writing the file itself. Looking at the JsonItemExporter source, it writes all exported items as a single JSON array, whereas the custom pipeline above writes one JSON object per line.

Next, how to store the data in MySQL. My development environment is Ubuntu, which supports the usual MySQL client tools: MySQL Workbench is free to use, and Navicat is another option (paid).

First, create a table whose columns correspond one to one with the item fields in the ArticleSpider project, as sketched below.
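
As a rough sketch, here is one way to create such a table with pymysql. The column names follow the item fields used in the pipelines below, but the column types, lengths, and the choice of url_object_id as primary key are my assumptions and should be adapted to your own data:

import pymysql

# Column types below are assumptions, not the article's original schema
create_table_sql = """
CREATE TABLE IF NOT EXISTS jobbole_article (
    url_object_id    VARCHAR(50)  NOT NULL PRIMARY KEY,
    title            VARCHAR(200) NOT NULL,
    create_date      DATE,
    url              VARCHAR(300),
    front_image_url  VARCHAR(300),
    front_image_path VARCHAR(200),
    praise_num       INT NOT NULL DEFAULT 0,
    comment_num      INT NOT NULL DEFAULT 0,
    fav_num          INT NOT NULL DEFAULT 0,
    tags             VARCHAR(200),
    content          LONGTEXT
) DEFAULT CHARSET=utf8;
"""

conn = pymysql.connect(host='localhost', user='root', passwd='123456',
                       database='spider', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute(create_table_sql)
conn.commit()
conn.close()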

The next step is to configure the program to connect to MySQL.

Here I use the third-party library pymysql to connect to MySQL. Installation is simple: you can use PyCharm's built-in package management, or install it into the virtual environment with pip (pip install pymysql).

Then create the MySQL pipeline directly in pipelines.py:

import pymysql


class MysqlPipeline(object):
    def __init__(self):
        """Initialize: build the MySQL connection conn and create the cursor."""
        self.conn = pymysql.connect(host='localhost', database='spider',
                                    user='root', passwd='123456',
                                    charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # The SQL statement to execute
        insert_sql = """
            INSERT INTO jobbole_article (title, create_date, url, url_object_id,
                front_image_url, front_image_path, praise_num, comment_num,
                fav_num, tags, content)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        # Execute the SQL with the cursor's execute method
        self.cursor.execute(insert_sql, (item["title"], item["create_date"],
                                         item["url"], item["url_object_id"],
                                         item["front_image_url"], item["front_image_path"],
                                         item["praise_num"], item["comment_num"],
                                         item["fav_num"], item["tags"], item["content"]))
        # commit so the insert actually takes effect
        self.conn.commit()
        return item

The MySQL storage above is synchronous: until execute and commit have finished, no further items can be stored, and Scrapy obviously crawls faster than the data can be written to MySQL.

So there is another, asynchronous way to store the data: Twisted provides an asynchronous connection container (adbapi) that can be used together with pymysql.

First, write the MySQL connection information into the settings.py configuration file, so it is easy to change later:

" localhost "  'spider'"root"  123456"

Then import adbapi, the asynchronous database interface provided by Twisted, in the pipeline:

from twisted.enterprise import adbapi

The complete pipeline is as follows:

class MysqlTwistedPipeline(object):
    # The two methods below run when the spider starts; the dbpool is passed in
    def __init__(self, dbpool):
        self.dbpool = dbpool

    # Using from_settings makes it easy to read the configuration from settings.py
    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            charset='utf8',
            # cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        # Create the connection pool
        dbpool = adbapi.ConnectionPool("pymysql", **dbparms)
        return cls(dbpool)

    # Use twisted to run the MySQL insert asynchronously
    def process_item(self, item, spider):
        # Specify the method to run and the data it operates on
        query = self.dbpool.runInteraction(self.do_insert, item)
        # Handle possible exceptions; handle_error is a custom method
        query.addErrback(self.handle_error, item, spider)
        # Return the item so later pipelines can still use it
        return item

    def handle_error(self, failure, item, spider):
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert; build different SQL statements for
        # different items and insert them into MySQL
        insert_sql = """
            INSERT INTO jobbole_article (title, create_date, url, url_object_id,
                front_image_url, front_image_path, praise_num, comment_num,
                fav_num, tags, content)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        # Execute the SQL with the cursor's execute method
        cursor.execute(insert_sql, (item["title"], item["create_date"],
                                    item["url"], item["url_object_id"],
                                    item["front_image_url"], item["front_image_path"],
                                    item["praise_num"], item["comment_num"],
                                    item["fav_num"], item["tags"], item["content"]))

Note: if you use pymysql.cursors (for example the DictCursor above), import it explicitly in addition to pymysql:

import pymysql
import pymysql.cursors

In general, we only need to modify the content of the do_insert method; one possible pattern is sketched below.
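
For example, one common way to keep do_insert generic is to let each item class build its own SQL. This is only a sketch of that pattern, not part of the original project; JobboleArticleItem and get_insert_sql are hypothetical names here:

import scrapy

class JobboleArticleItem(scrapy.Item):    # hypothetical item class
    title = scrapy.Field()
    url_object_id = scrapy.Field()
    # ... remaining fields omitted ...

    def get_insert_sql(self):
        # Each item type knows its own table, columns and parameters
        insert_sql = """INSERT INTO jobbole_article (title, url_object_id)
                        VALUES (%s, %s)"""
        params = (self["title"], self["url_object_id"])
        return insert_sql, params

# do_insert in MysqlTwistedPipeline then stays the same for every item type:
def do_insert(self, cursor, item):
    insert_sql, params = item.get_insert_sql()
    cursor.execute(insert_sql, params)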

Also, every field referenced in the INSERT must actually be present on the item passed in; you cannot assume a missing field will automatically default to NULL in MySQL (whereas when storing to a JSON file, a missing field is simply omitted).
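
A minimal sketch to illustrate the point (the shortened column list and the default values are only for illustration): accessing a field the spider never set raises a KeyError, so fill explicit defaults before executing the INSERT:

def do_insert(self, cursor, item):
    # item["fav_num"] raises KeyError if the spider never filled it in,
    # so provide explicit defaults rather than relying on MySQL's NULL
    fav_num = item.get("fav_num", 0)    # default value assumed
    tags = item.get("tags", "")         # default value assumed
    insert_sql = """INSERT INTO jobbole_article (title, url_object_id, fav_num, tags)
                    VALUES (%s, %s, %s, %s)"""
    cursor.execute(insert_sql, (item["title"], item["url_object_id"],
                                fav_num, tags))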
