Scrapy Example: Crawl Home rental Information

Source: Internet
Author: User

This crawl home site, access to Shanghai Changning housing information, from: public number

The crawler is still built with the Scrapy framework, steps: 1. Analyzing Web pages

2.items.py

3.spiders.py

4. pipelines.py

5.settings.py

    • viewing Web pages

Shanghai changning Rental Information: https://sh.zu.anjuke.com/fangyuan/changning/

    • items.py

      Define the fields here to save the information to crawl

 import   scrapy 
class Anjukespideritem (scrapy. Item): # Define the fields for your item here L IKE: # name = Scrapy. Field ()
price = Scrapy. Field () rent_type = Scrapy. Field () house_type = Scrapy. Field () area = Scrapy. Field () towards = Scrapy. Field () floor = Scrapy. Field () decoration = Scrapy. Field () building_type = Scrapy. Field () community = Scrapy. Field ()
    • spider.py

Write a crawler here, tell the crawler what to crawl, how to crawl

Importscrapy fromScrapy.spidersImportRule fromScrapy.linkextractorsImportLinkextractor fromAnjukespider.itemsImportAnjukespideritem#Defining ReptilesclassAnjuke (scrapy.spiders.CrawlSpider):#Reptile NameName ='Anjuke'    #Crawler Start PageStart_urls = ['https://sh.zu.anjuke.com/fangyuan/changning/']    #Crawl RulesRules =(Rule (Linkextractor ( allow=r'fangyuan/p\d+/'), Follow=true),#The page contains the next page button, so set true here to crawl all pagesRule (Linkextractor (allow=r'https://sh.zu.anjuke.com/fangyuan/\d{10}'), Follow=false, callback='Parse_item'),#The page contains "recommended" listings but not necessarily the changning we want, so setting false does not follow            )    #callback function, the main is to write the XPath path, the previous instance said, here will not repeat the    defParse_item (Self, Response): Item=Anjukespideritem ()#Rentitem[' Price'] = Int (Response.xpath ("//ul[@class = ' House-info-zufang cf ']/li[1]/span[1]/em/text ()"). Extract_first ())#How to rentitem['Rent_type'] = Response.xpath ("//ul[@class = ' Title-label cf ']/li[1]/text ()"). Extract_first ()#typeitem['House_type'] = Response.xpath ("//ul[@class = ' House-info-zufang cf ']/li[2]/span[2]/text ()"). Extract_first ()#Areaitem[' Area'] = Int (Response.xpath ("//ul[@class = ' House-info-zufang cf ']/li[3]/span[2]/text ()"). Extract_first (). Replace ('sqm',"'))        #towardsitem['towards'] = Response.xpath ("//ul[@class = ' House-info-zufang cf ']/li[4]/span[2]/text ()"). Extract_first ()#Flooritem[' Floor'] = Response.xpath ("//ul[@class = ' House-info-zufang cf ']/li[5]/span[2]/text ()"). Extract_first ()#Decorationitem['Decoration'] = Response.xpath ("//ul[@class = ' House-info-zufang cf ']/li[6]/span[2]/text ()"). Extract_first ()#Housing Typeitem['Building_type'] = Response.xpath ("//ul[@class = ' House-info-zufang cf ']/li[7]/span[2]/text ()"). Extract_first ()#Communityitem['Community'] = Response.xpath ("//ul[@class = ' House-info-zufang cf ']/li[8]/a[1]/text ()"). Extract_first ()yieldItem
    • pipelines.py

Save the crawled data, which is saved in JSON format only

Actually can not write this part, do not write pipeline, run time add some parameters:scrapy crawl anjuke-o anjuke.json-t JSON

Scrapy Crawl crawler name-o destination file name-T save format

 fromScrapy.exportersImportJsonitemexporterclassAnjukespiderpipeline (object):def __init__(self): Self.file= Open ('Zufang_shanghai.json','WB') #设置文件存储路径self.exporter= Jsonitemexporter (Self.file, ensure_ascii=False) self.exporter.start_exporting ()defProcess_item (self, item, spider):Print('Write') Self.exporter.export_item (item)returnItemdefClose_spider (self, spider):Print("Close") self.exporter.finish_exporting () self.file.close ()
    • settings.py

Modify the settings file to make pipeline effective

Set the download delay to prevent access too quickly causing the site to be blocked

Item_pipelines = {    'anjukeSpider.pipelines.AnjukespiderPipeline': 300

    • Run the command line, enter the project root directory, type
      Scrapy Crawl [crawler name]
    • PS f:\scrapyproject\anjukespider\anjukespider> scrapy Crawl Anjuke

      Execution complete

Crawl to 61 information, JSON file has been generated in the specified path

  • 2018-10-22 09:02:55[scrapy.statscollectors] info:dumping scrapy stats:{'downloader/request_bytes': 40861, 'Downloader/request_count': 61, 'Downloader/request_method_count/get': 61, 'downloader/response_bytes': 1925879, 'Downloader/response_count': 61, 'downloader/response_status_count/200': 61, 'Finish_reason':'finished', 'Finish_time': Datetime.datetime (2018, 10, 22, 1, 2, 55, 245128), 'Item_scraped_count': 60, 'Log_count/debug': 122, 'Log_count/info': 9, 'Request_depth_max': 1, 'Response_received_count': 61, 'scheduler/dequeued': 61, 'scheduler/dequeued/memory': 61, 'scheduler/enqueued': 61, 'scheduler/enqueued/memory': 61, 'start_time': Datetime.datetime (2018, 10, 22, 1, 0, 29, 555537)}2018-10-22 09:02:55 [Scrapy.core.engine] Info:spider closed (finished)

    This is done by the crawler, but the data crawled is not intuitive, and it needs to be visualized (Pyecharts module), this part of another pyecharts use

  • Pyecharts Official Document: http://pyecharts.org/#/zh-cn/

Scrapy Example: Crawl Home rental Information

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.