Python crawler crawls rental housing data

Source: Internet
Author: User

The environment setup was covered in a previous blog post, so here we go straight to the crawling.

One. Creating a Project

scrapy startproject patu
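For reference, this generates roughly the following layout (the standard Scrapy project template; exact files vary slightly by Scrapy version):

patu/
    scrapy.cfg
    patu/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py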

Two. Creating a spider File

scrapy genspider patubole patubole.com
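This command creates spiders/patubole.py from the default template, containing roughly the following skeleton (details vary by Scrapy version):

# -*- coding: utf-8 -*-
import scrapy


class PatuboleSpider(scrapy.Spider):
    name = 'patubole'
    allowed_domains = ['patubole.com']
    start_urls = ['http://patubole.com/']

    def parse(self, response):
        pass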

Three. Write the patubole.py file. Use the Chrome browser to analyze the XPath expressions for the two fields, the house title and the price; the actual crawling of the site is done in this file.
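One way to work the expressions out (not shown in the original post) is to test candidates interactively in Scrapy's shell; the two selectors below are the ones used in parse later:

scrapy shell 'http://xa.ganji.com/fang1/'

>>> response.xpath('//*[@class="f-list-item ershoufang-list"]/dl/dd[1]/a/text()').extract()
>>> response.xpath('//*[@class="f-list-item-wrap f-clear"]/dd[5]/div[1]/span[1]/text()').extract()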

The following is the first version, used to test the XPath expressions:

The generated patubole.py file must define three things: the name attribute, the start_urls list, and the parse method.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from urllib import parse
from patu.items import PatuItem


class PatuboleSpider(scrapy.Spider):
    name = 'patubole'
    # allowed_domains = ['python.jobbole.com']
    start_urls = ['http://xa.ganji.com/fang1/']

    def parse(self, response):
        zufang_title = response.xpath('//*[@class="f-list-item ershoufang-list"]/dl/dd[1]/a/text()').extract()
        zufang_money = response.xpath('//*[@class="f-list-item-wrap f-clear"]/dd[5]/div[1]/span[1]/text()').extract()
        for i, j in zip(zufang_title, zufang_money):
            print(i, ":", j)
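To test this version, run the spider from the project root; the crawler name comes from the spider's name attribute:

scrapy crawl patubole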

Four. Save the crawled data to the table sufang in the database.

(1) Create a new database, zufang, in PyCharm. When it is finished, it appears in the Database tool window.

(2) Store the data in the table sufang of the newly created database zufang.
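The post does not show the table being created. A minimal sketch, assuming an SQLite database file named zufang.sqlite (the same file the pipeline below connects to) and the two-column layout the pipeline inserts into:

import sqlite3

# create the target table once, before the first crawl
con = sqlite3.connect('zufang.sqlite')
con.execute('CREATE TABLE IF NOT EXISTS sufang (title TEXT, money TEXT)')
con.commit()
con.close()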

Crawling is implemented in patubole.py and storage in pipelines.py; items.py defines the data fields that pipelines.py relies on.

So write items.py first:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class PatuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    zufang_title = scrapy.Field()
    zufang_money = scrapy.Field()
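Scrapy items behave like dicts, which is how the spider fills them in. A quick illustrative check (field names from items.py above):

from patu.items import PatuItem

item = PatuItem()
item['zufang_title'] = 'Example listing'
item['zufang_money'] = '1500'
print(dict(item))  # {'zufang_title': 'Example listing', 'zufang_money': '1500'}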

Now go back and modify the patubole.py file that was written for testing.

The final code is as follows:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from urllib import parse
from patu.items import PatuItem


class PatuboleSpider(scrapy.Spider):
    name = 'patubole'
    # allowed_domains = ['python.jobbole.com']
    start_urls = ['http://xa.ganji.com/fang1/']

    def parse(self, response):
        zufang_title = response.xpath('//*[@class="f-list-item ershoufang-list"]/dl/dd[1]/a/text()').extract()
        zufang_money = response.xpath('//*[@class="f-list-item-wrap f-clear"]/dd[5]/div[1]/span[1]/text()').extract()
        pipinstall = PatuItem()  # create a PatuItem instance to carry the data
        for i, j in zip(zufang_title, zufang_money):
            pipinstall['zufang_title'] = i
            pipinstall['zufang_money'] = j
            yield pipinstall  # this step is important: yield hands the item to the pipeline

(3) Configure PatuPipeline in settings.py; the number 300 is the pipeline's priority (0-1000, lower values run first):

ITEM_PIPELINES = {
    'patu.pipelines.PatuPipeline': 300,
}

(4) pipelines.py, the code that stores the data into the database.

This part involves a bit of SQL.

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import sqlite3


class PatuPipeline(object):
    def open_spider(self, spider):
        self.con = sqlite3.connect('zufang.sqlite')  # the database file created earlier
        self.cn = self.con.cursor()

    def process_item(self, item, spider):
        # a parameterized query avoids the quoting problems of raw string formatting
        insert_sql = 'INSERT INTO sufang (title, money) VALUES (?, ?)'
        self.cn.execute(insert_sql, (item['zufang_title'], item['zufang_money']))
        self.con.commit()
        return item

    def close_spider(self, spider):  # the hook Scrapy calls is close_spider
        self.con.close()
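After a crawl finishes, you can check that rows actually landed in the table; a minimal sketch, assuming the zufang.sqlite filename used above:

import sqlite3

con = sqlite3.connect('zufang.sqlite')
for title, money in con.execute('SELECT title, money FROM sufang LIMIT 5'):
    print(title, ':', money)
con.close()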

Final result

A main.py file is added for convenience, so the crawler can be started directly (for example from the IDE) instead of typing the command each time.
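The post does not show main.py itself; a common minimal version (assumed, not from the original) uses scrapy.cmdline so the crawl command runs from inside the project:

# main.py (assumed): launch the crawler without typing the command manually
from scrapy.cmdline import execute
import os
import sys

# make sure the project root is on the path, then run "scrapy crawl patubole"
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
execute(['scrapy', 'crawl', 'patubole'])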

