This post crawls the Anjuke site to collect rental housing information for the Changning district of Shanghai. Source: a WeChat public account.
The crawler is again built with the Scrapy framework. Steps:
1. Analyze the web page
2. items.py
3. spiders.py
4. pipelines.py
5. settings.py
Shanghai Changning rental listings: https://sh.zu.anjuke.com/fangyuan/changning/
In items.py, define the fields that will hold the scraped information.
import scrapy


class AnjukespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    price = scrapy.Field()          # monthly rent
    rent_type = scrapy.Field()      # how the home is rented
    house_type = scrapy.Field()     # layout
    area = scrapy.Field()           # floor area
    towards = scrapy.Field()        # orientation
    floor = scrapy.Field()
    decoration = scrapy.Field()
    building_type = scrapy.Field()
    community = scrapy.Field()
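The `price` and `area` fields are meant to hold integers, but `extract_first()` returns `None` when an XPath matches nothing, and `int(None)` raises a `TypeError`. The sketch below shows one defensive way to convert such text. The helper name `to_int` and the sample strings are hypothetical and not part of the original project.

```python
import re


def to_int(text, default=None):
    """Return the first run of digits in `text` as an int, or `default`.

    Hypothetical helper: tolerates None and unit suffixes around the number.
    """
    if text is None:
        return default
    match = re.search(r'\d+', text)
    return int(match.group()) if match else default


# Made-up inputs that mimic what an XPath text() query might return.
print(to_int('4500'))
print(to_int('65sqm'))
print(to_int(None))
```

With a helper like this, a missing node yields `None` for that field instead of aborting the whole item.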
In spiders.py, write the spider itself: tell it what to crawl and how to crawl it.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from anjukeSpider.items import AnjukespiderItem


# Define the spider
class Anjuke(CrawlSpider):
    # Spider name
    name = 'anjuke'
    # Start page
    start_urls = ['https://sh.zu.anjuke.com/fangyuan/changning/']
    # Crawl rules
    rules = (
        # Listing pages contain a "next page" button, so follow=True crawls all pages
        Rule(LinkExtractor(allow=r'fangyuan/p\d+/'), follow=True),
        # Detail pages contain "recommended" listings that are not necessarily
        # in Changning, so follow=False keeps the crawler from following them
        Rule(LinkExtractor(allow=r'https://sh.zu.anjuke.com/fangyuan/\d{10}'),
             follow=False, callback='parse_item'),
    )

    # Callback: mostly XPath expressions, which were covered in the previous
    # example and are not explained again here
    def parse_item(self, response):
        item = AnjukespiderItem()
        # Rent
        item['price'] = int(response.xpath("//ul[@class='house-info-zufang cf']/li[1]/span[1]/em/text()").extract_first())
        # Rent type
        item['rent_type'] = response.xpath("//ul[@class='title-label cf']/li[1]/text()").extract_first()
        # House type
        item['house_type'] = response.xpath("//ul[@class='house-info-zufang cf']/li[2]/span[2]/text()").extract_first()
        # Area (the unit suffix appears as 'sqm' in this translated listing;
        # it has to match the text on the live page)
        item['area'] = int(response.xpath("//ul[@class='house-info-zufang cf']/li[3]/span[2]/text()").extract_first().replace('sqm', ''))
        # Orientation
        item['towards'] = response.xpath("//ul[@class='house-info-zufang cf']/li[4]/span[2]/text()").extract_first()
        # Floor
        item['floor'] = response.xpath("//ul[@class='house-info-zufang cf']/li[5]/span[2]/text()").extract_first()
        # Decoration
        item['decoration'] = response.xpath("//ul[@class='house-info-zufang cf']/li[6]/span[2]/text()").extract_first()
        # Building type
        item['building_type'] = response.xpath("//ul[@class='house-info-zufang cf']/li[7]/span[2]/text()").extract_first()
        # Community
        item['community'] = response.xpath("//ul[@class='house-info-zufang cf']/li[8]/a[1]/text()").extract_first()
        yield item
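The two `allow` patterns decide which links the spider follows, so it can help to try them on a few URLs before running the crawl. The sketch below uses only the standard `re` module and assumes that `LinkExtractor` applies each pattern as a regular-expression search on the URL. The function `classify` and all sample URLs are made up for illustration.

```python
import re

# The two `allow` patterns from the spider's rules.
LIST_PAGE = re.compile(r'fangyuan/p\d+/')
DETAIL_PAGE = re.compile(r'https://sh.zu.anjuke.com/fangyuan/\d{10}')


def classify(url):
    """Return which rule (if any) would pick up the URL (hypothetical helper)."""
    if DETAIL_PAGE.search(url):
        return 'detail'
    if LIST_PAGE.search(url):
        return 'list'
    return None


# Hypothetical URLs, invented for this check; real site URLs may differ.
samples = [
    'https://sh.zu.anjuke.com/fangyuan/p2/',
    'https://sh.zu.anjuke.com/fangyuan/1234567890',
    'https://sh.zu.anjuke.com/fangyuan/changning/',
    'https://sh.zu.anjuke.com/fangyuan/changning/p2/',
]
for url in samples:
    print(url, '->', classify(url))
```

Note that the pagination pattern expects `p<number>/` directly after `fangyuan/`. If the site's "next page" links carry the district segment, as in the last made-up URL, that rule would not match them and only the listings linked from the start page would be crawled.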
In pipelines.py, save the crawled data. Here it is saved in JSON format only.
This part is actually optional. Instead of writing a pipeline, you can pass a few parameters at run time: scrapy crawl anjuke -o anjuke.json -t json
The general form is: scrapy crawl <spider name> -o <output file name> -t <output format>
from scrapy.exporters import JsonItemExporter


class AnjukespiderPipeline(object):
    def __init__(self):
        # Set the output file path
        self.file = open('Zufang_shanghai.json', 'wb')
        self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        print('Write')
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        print('Close')
        self.exporter.finish_exporting()
        self.file.close()
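The pipeline passes `ensure_ascii=False` to the exporter. As far as I know, `JsonItemExporter` hands such keyword arguments to its underlying JSON encoder, so the effect can be illustrated with the standard `json` module alone. The record below is made up and is not taken from the crawl.

```python
import json

# Made-up record containing non-ASCII text, as Chinese listings would.
record = {'community': '长宁', 'price': 4500}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)
# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; the second is simply easier to read when the output file is opened in an editor.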
In settings.py, modify the settings so that the pipeline takes effect, and set a download delay so that requests are not sent so quickly that the site blocks the crawler.
ITEM_PIPELINES = {
    'anjukeSpider.pipelines.AnjukespiderPipeline': 300,
}
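The text mentions a download delay, but the value used in the original project is not shown. A minimal sketch of the setting follows. `DOWNLOAD_DELAY` is a standard Scrapy setting, while the value of 2 seconds is purely an assumption for illustration.

```python
# settings.py (fragment)

# Seconds to wait between consecutive requests to the same site.
# The value 2 is an assumed example, not the original project's value.
DOWNLOAD_DELAY = 2
```

A larger value is gentler on the site at the cost of a slower crawl.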
The crawl made 61 requests and scraped 60 listings, and the JSON file has been generated at the specified path.
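The stats that Scrapy dumps at the end of this run include `start_time`, `finish_time` and the request count, which is enough for a rough check of the crawl pace. The sketch below copies those three values from the log of this run; the variable names are my own.

```python
from datetime import datetime

# Values copied from the stats dump of this run.
start_time = datetime(2018, 10, 22, 1, 0, 29, 555537)
finish_time = datetime(2018, 10, 22, 1, 2, 55, 245128)
request_count = 61

# Total wall-clock duration of the crawl, in seconds.
elapsed = (finish_time - start_time).total_seconds()
# Average spacing between requests.
seconds_per_request = elapsed / request_count

print(f'elapsed: {elapsed:.1f} s')
print(f'average: {seconds_per_request:.2f} s per request')
```

The run took a little under two and a half minutes, or roughly 2.4 seconds per request on average, which includes download time as well as any configured delay.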
2018-10-22 09:02:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 40861,
 'downloader/request_count': 61,
 'downloader/request_method_count/GET': 61,
 'downloader/response_bytes': 1925879,
 'downloader/response_count': 61,
 'downloader/response_status_count/200': 61,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 22, 1, 2, 55, 245128),
 'item_scraped_count': 60,
 'log_count/DEBUG': 122,
 'log_count/INFO': 9,
 'request_depth_max': 1,
 'response_received_count': 61,
 'scheduler/dequeued': 61,
 'scheduler/dequeued/memory': 61,
 'scheduler/enqueued': 61,
 'scheduler/enqueued/memory': 61,
 'start_time': datetime.datetime(2018, 10, 22, 1, 0, 29, 555537)}
2018-10-22 09:02:55 [scrapy.core.engine] INFO: Spider closed (finished)
This completes the crawler. The crawled data is not very intuitive in raw form, however, and needs to be visualized (with the pyecharts module). That part is covered in a separate post on using pyecharts.
- pyecharts official documentation: http://pyecharts.org/#/zh-cn/
Scrapy example: crawling Anjuke rental information