The preliminary configuration was covered in a previous blog post, so this post goes straight to the crawling itself.
I. Creating a Project
scrapy startproject patu
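For reference, scrapy startproject generates a project skeleton roughly like this (the exact files vary a little between Scrapy versions):

patu/
    scrapy.cfg
    patu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py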
II. Creating a Spider File
scrapy genspider patubole patubole.com
III. Writing the patubole.py file. Use the Chrome browser to work out the XPath expressions for the two fields, house price and title; the actual crawling of the site is done in this file.
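Before the expressions go into the spider, they can be checked interactively in the Scrapy shell; a quick sketch using the same two XPath expressions as the spider below:

scrapy shell 'http://xa.ganji.com/fang1/'
>>> response.xpath('//*[@class="f-list-item ershoufang-list"]/dl/dd[1]/a/text()').extract()
>>> response.xpath('//*[@class="f-list-item-wrap f-clear"]/dd[5]/div[1]/span[1]/text()').extract()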
The following is the final code. The generated patubole.py file must define three things: the name attribute, the start_urls list, and the parse method.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from urllib import parse
from patu.items import PatuItem


class PatuboleSpider(scrapy.Spider):
    name = 'patubole'
    # allowed_domains = ['python.jobbole.com']
    start_urls = ['http://xa.ganji.com/fang1/']

    def parse(self, response):
        zufang_title = response.xpath('//*[@class="f-list-item ershoufang-list"]/dl/dd[1]/a/text()').extract()
        zufang_money = response.xpath('//*[@class="f-list-item-wrap f-clear"]/dd[5]/div[1]/span[1]/text()').extract()
        for i, j in zip(zufang_title, zufang_money):
            print(i, ":", j)
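At this stage the spider can already be run from the project root with scrapy crawl patubole; it only prints each title/price pair, which is enough to confirm that the XPath expressions match the page.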
IV. Saving the crawled data to the database Zufang
(1) Create a new database, Zufang, in PyCharm.
The new database appears once this is done.
(2) Store the data in the table Sufang of the newly created database Zufang.
Data crawling is implemented in patubole.py and data storage in pipelines.py, while pipelines.py relies on the item definitions in items.py.
So items.py is written first:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class PatuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    zufang_title = scrapy.Field()
    zufang_money = scrapy.Field()
    pass
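A scrapy.Item works like a dictionary: each field declared with scrapy.Field() is read and written as item['zufang_title'], which is exactly how the modified spider below fills the item.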
Now go back and modify the patubole.py file that was written earlier for testing.
The code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from urllib import parse
from patu.items import PatuItem


class PatuboleSpider(scrapy.Spider):
    name = 'patubole'
    # allowed_domains = ['python.jobbole.com']
    start_urls = ['http://xa.ganji.com/fang1/']

    def parse(self, response):
        zufang_title = response.xpath('//*[@class="f-list-item ershoufang-list"]/dl/dd[1]/a/text()').extract()
        zufang_money = response.xpath('//*[@class="f-list-item-wrap f-clear"]/dd[5]/div[1]/span[1]/text()').extract()
        pipinstall = PatuItem()  # create a PatuItem instance for passing the data
        for i, j in zip(zufang_title, zufang_money):
            pipinstall['zufang_title'] = i
            pipinstall['zufang_money'] = j

            yield pipinstall  # this step is important
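Because parse yields one PatuItem per title/price pair, Scrapy hands every yielded item to the enabled item pipelines; this is what connects the spider to the database code configured next.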
(3) Configure PatuPipeline in settings.py:
ITEM_PIPELINES = {
    'patu.pipelines.PatuPipeline': 300,
}
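The number 300 is the pipeline's order: when several pipelines are enabled they run from the lowest number to the highest, and values are conventionally chosen between 0 and 1000.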
(5) Write the pipelines.py code that stores the data into the database.
This part involves some SQL knowledge.
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import sqlite3


class PatuPipeline(object):
    def open_spider(self, spider):
        # open the SQLite database when the spider starts
        self.con = sqlite3.connect('zufang.sqlite')
        self.cn = self.con.cursor()

    def process_item(self, item, spider):
        # print(item['zufang_title'], item['zufang_money'])
        insert_sql = 'insert into sufang (title, money) values ("{}", "{}")'.format(
            item['zufang_title'], item['zufang_money'])
        print(insert_sql)
        self.cn.execute(insert_sql)
        self.con.commit()
        return item

    def close_spider(self, spider):
        # close the database connection when the spider finishes
        self.con.close()
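The table itself was created in PyCharm, so its SQL does not appear above. For completeness, a minimal sketch of an equivalent one-off script, assuming the file name zufang.sqlite from the pipeline and two text columns matching the INSERT statement:

import sqlite3

# One-off helper: create the sufang table that PatuPipeline writes into.
# The column types are an assumption; the pipeline only needs title and money.
con = sqlite3.connect('zufang.sqlite')
con.execute('create table if not exists sufang (title text, money text)')
con.commit()
con.close()

In a real project it is also safer to use a parameterized query, e.g. self.cn.execute('insert into sufang (title, money) values (?, ?)', (item['zufang_title'], item['zufang_money'])), instead of building the SQL string with format().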
Final result
A main.py file is added for convenience, so the crawler can be started directly from the IDE with the relevant command instead of from the command line.
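The main.py file itself is not shown above; a minimal sketch of what it typically looks like, assuming the spider name 'patubole' defined earlier:

# main.py - start the crawler from the IDE instead of a terminal.
import os
import sys
from scrapy.cmdline import execute

# Make the project package importable no matter where the script is launched from.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Equivalent to running "scrapy crawl patubole" on the command line.
execute(['scrapy', 'crawl', 'patubole'])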
Python crawler: crawling Ganji rental data