Scrapy Example: Crawling Home Rental Information

Last Update: 2018-10-22 Source: Internet

Author: User

This example crawls Anjuke (anjuke.com), a housing site, to collect rental listings for Changning District, Shanghai. Source: a WeChat public account.

The crawler is again built with the Scrapy framework. The steps are:

1. Analyze the web page
2. items.py
3. spider.py
4. pipelines.py
5. settings.py

Viewing the web page

Shanghai Changning rental listings: https://sh.zu.anjuke.com/fangyuan/changning/

items.py

Define the fields that hold the crawled information here.

    import scrapy


    class AnjukespiderItem(scrapy.Item):
        # Define the fields for your item here, like:
        # name = scrapy.Field()
        price = scrapy.Field()
        rent_type = scrapy.Field()
        house_type = scrapy.Field()
        area = scrapy.Field()
        towards = scrapy.Field()
        floor = scrapy.Field()
        decoration = scrapy.Field()
        building_type = scrapy.Field()
        community = scrapy.Field()

spider.py

Write the spider here. It tells Scrapy what to crawl and how to crawl it.

    import scrapy
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor
    from anjukeSpider.items import AnjukespiderItem


    # Define the spider
    class Anjuke(scrapy.spiders.CrawlSpider):
        # Spider name
        name = 'anjuke'
        # Start page
        start_urls = ['https://sh.zu.anjuke.com/fangyuan/changning/']
        # Crawl rules
        rules = (
            # The listing page has a "next page" button, so follow=True crawls all pages
            Rule(LinkExtractor(allow=r'fangyuan/p\d+/'), follow=True),
            # Detail pages also show "recommended" listings that are not necessarily
            # in Changning, so follow=False keeps the crawler from following them
            Rule(LinkExtractor(allow=r'https://sh.zu.anjuke.com/fangyuan/\d{10}'),
                 follow=False, callback='parse_item'),
        )

        # Callback: mostly XPath expressions, which an earlier example covered
        def parse_item(self, response):
            item = AnjukespiderItem()
            # Rent
            item['price'] = int(response.xpath("//ul[@class='house-info-zufang cf']/li[1]/span[1]/em/text()").extract_first())
            # Rental type
            item['rent_type'] = response.xpath("//ul[@class='title-label cf']/li[1]/text()").extract_first()
            # House type
            item['house_type'] = response.xpath("//ul[@class='house-info-zufang cf']/li[2]/span[2]/text()").extract_first()
            # Area: strip the unit suffix (shown as 'sqm' here; use the exact text that appears on the page)
            item['area'] = int(response.xpath("//ul[@class='house-info-zufang cf']/li[3]/span[2]/text()").extract_first().replace('sqm', ''))
            # Orientation
            item['towards'] = response.xpath("//ul[@class='house-info-zufang cf']/li[4]/span[2]/text()").extract_first()
            # Floor
            item['floor'] = response.xpath("//ul[@class='house-info-zufang cf']/li[5]/span[2]/text()").extract_first()
            # Decoration
            item['decoration'] = response.xpath("//ul[@class='house-info-zufang cf']/li[6]/span[2]/text()").extract_first()
            # Building type
            item['building_type'] = response.xpath("//ul[@class='house-info-zufang cf']/li[7]/span[2]/text()").extract_first()
            # Community
            item['community'] = response.xpath("//ul[@class='house-info-zufang cf']/li[8]/a[1]/text()").extract_first()
            yield item
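To see how the two `allow` patterns separate pagination links from detail pages, the sketch below applies the same regular expressions with Python's `re` module. It assumes, as Scrapy's link extractor does, that a pattern matches when it is found anywhere in the absolute URL. The `classify` helper and the sample URLs are made up for illustration and are not taken from the site.

```python
import re

# The two patterns from the crawl rules above.
PAGINATION = re.compile(r'fangyuan/p\d+/')
DETAIL = re.compile(r'https://sh.zu.anjuke.com/fangyuan/\d{10}')


def classify(url):
    """Return which rule (if any) would pick up this URL."""
    if PAGINATION.search(url):
        return 'pagination'   # followed, no callback
    if DETAIL.search(url):
        return 'detail'       # handled by parse_item, not followed
    return 'ignored'


# Hypothetical URLs, for illustration only.
samples = [
    'https://sh.zu.anjuke.com/fangyuan/p2/',
    'https://sh.zu.anjuke.com/fangyuan/1234567890',
    'https://sh.zu.anjuke.com/fangyuan/changning/p2/',
]
for url in samples:
    print(classify(url), url)
```

Note that the third sample is ignored: the first pattern requires `p<number>/` to come directly after `fangyuan/`, so a URL with a district segment in between does not match. The article does not say which form the site's pagination links take, so it is worth checking the pattern against the real links.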

pipelines.py

Save the crawled data here. This example saves it in JSON format only.

You can actually skip this part. Without writing a pipeline, add a few arguments when running the spider: scrapy crawl anjuke -o anjuke.json -t json

The general form is: scrapy crawl <spider name> -o <output file name> -t <output format>

    from scrapy.exporters import JsonItemExporter


    class AnjukespiderPipeline(object):
        def __init__(self):
            # Set the path of the output file
            self.file = open('zufang_shanghai.json', 'wb')
            self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
            self.exporter.start_exporting()

        def process_item(self, item, spider):
            print('Write')
            self.exporter.export_item(item)
            return item

        def close_spider(self, spider):
            print('Close')
            self.exporter.finish_exporting()
            self.file.close()
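The pipeline's three exporter calls map onto the shape of the output file: an opening bracket, one JSON object per item, and a closing bracket. The stand-in below imitates that lifecycle using only the standard library, so it runs without Scrapy. `MiniJsonExporter` and the sample records are hypothetical; this is a sketch of the idea, not Scrapy's implementation.

```python
import io
import json


class MiniJsonExporter:
    """Hypothetical stand-in that mimics the start/export/finish lifecycle."""

    def __init__(self, file, ensure_ascii=False):
        self.file = file                  # binary file-like object
        self.ensure_ascii = ensure_ascii
        self.first_item = True

    def start_exporting(self):
        self.file.write(b'[')

    def export_item(self, item):
        if not self.first_item:
            self.file.write(b',\n')       # separator between items
        self.first_item = False
        text = json.dumps(dict(item), ensure_ascii=self.ensure_ascii)
        self.file.write(text.encode('utf-8'))

    def finish_exporting(self):
        self.file.write(b']')


# Made-up records, for illustration only.
records = [
    {'price': 5000, 'community': '示例小区'},
    {'price': 6500, 'community': 'Example Court'},
]

buffer = io.BytesIO()
exporter = MiniJsonExporter(buffer)
exporter.start_exporting()
for record in records:
    exporter.export_item(record)
exporter.finish_exporting()

output = buffer.getvalue().decode('utf-8')
print(output)
```

With `ensure_ascii=False`, Chinese text is written as readable characters rather than `\uXXXX` escapes, and the text is encoded to UTF-8 before it goes into the binary file.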

settings.py

Modify the settings file to enable the pipeline.

Set a download delay so that requests are not sent too quickly and the crawler does not get blocked by the site.

    ITEM_PIPELINES = {
        'anjukeSpider.pipelines.AnjukespiderPipeline': 300,
    }
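The download delay mentioned above is not shown in the listing. A minimal sketch of the lines that could go in settings.py follows. `DOWNLOAD_DELAY` and `RANDOMIZE_DOWNLOAD_DELAY` are standard Scrapy settings, but the 2-second value is an assumption, not the author's.

```python
# settings.py (fragment)

# Seconds to wait between consecutive requests to the same site.
# The value 2 is an assumed example; tune it for the target site.
DOWNLOAD_DELAY = 2

# Vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY.
# This is already Scrapy's default; it is spelled out here for clarity.
RANDOMIZE_DOWNLOAD_DELAY = True
```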

Open a command line, change to the project root directory, and run:
```
scrapy crawl [spider name]
```

PS f:\scrapyproject\anjukespider\anjukespider> scrapy crawl anjuke

Execution complete

The crawl made 61 requests and scraped 60 listings, and the JSON file has been generated at the specified path.

    2018-10-22 09:02:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 40861,
     'downloader/request_count': 61,
     'downloader/request_method_count/GET': 61,
     'downloader/response_bytes': 1925879,
     'downloader/response_count': 61,
     'downloader/response_status_count/200': 61,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2018, 10, 22, 1, 2, 55, 245128),
     'item_scraped_count': 60,
     'log_count/DEBUG': 122,
     'log_count/INFO': 9,
     'request_depth_max': 1,
     'response_received_count': 61,
     'scheduler/dequeued': 61,
     'scheduler/dequeued/memory': 61,
     'scheduler/enqueued': 61,
     'scheduler/enqueued/memory': 61,
     'start_time': datetime.datetime(2018, 10, 22, 1, 0, 29, 555537)}
    2018-10-22 09:02:55 [scrapy.core.engine] INFO: Spider closed (finished)
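The stats can be used to work out how long the crawl took and the average request rate. The sketch below copies `start_time`, `finish_time`, and the request count from the log above; nothing else is assumed.

```python
import datetime

# Values copied from the stats dump above.
start_time = datetime.datetime(2018, 10, 22, 1, 0, 29, 555537)
finish_time = datetime.datetime(2018, 10, 22, 1, 2, 55, 245128)
request_count = 61

elapsed = (finish_time - start_time).total_seconds()
rate = request_count / elapsed

print(f'elapsed: {elapsed:.1f} s')             # elapsed: 145.7 s
print(f'requests per second: {rate:.2f}')      # requests per second: 0.42
```

The stats timestamps are eight hours behind the 09:02:55 log line, which is consistent with the stats being recorded in UTC and the log line in local (UTC+8) time.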

That completes the crawler. The crawled data is not very intuitive on its own, though, so it still needs to be visualized (with the pyecharts module). That part is covered in a separate write-up on using pyecharts.

pyecharts official documentation: http://pyecharts.org/#/zh-cn/
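Before charting, it can help to load the exported JSON and summarize it. The sketch below is one way to do that with the standard library, assuming records that carry the `price` and `community` fields defined in items.py. The sample records and the `average_price_by_community` helper are made up for illustration; in practice the list would come from `json.load` on the file written by the pipeline.

```python
import json
from collections import defaultdict

# Made-up sample in the shape the pipeline writes.
raw = '''[
    {"price": 5000, "community": "A"},
    {"price": 7000, "community": "A"},
    {"price": 6500, "community": "B"}
]'''
listings = json.loads(raw)


def average_price_by_community(items):
    """Return {community: mean price}, skipping records with no price."""
    totals = defaultdict(lambda: [0, 0])   # community -> [sum, count]
    for item in items:
        if item.get('price') is None:
            continue
        bucket = totals[item.get('community')]
        bucket[0] += item['price']
        bucket[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


averages = average_price_by_community(listings)
print(averages)   # {'A': 6000.0, 'B': 6500.0}
```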

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.