While learning Scrapy I recently came across a very interesting site, Scrapinghub: it can host your spiders and can also schedule crawl jobs on a timer, which is quite convenient. I tried it out, and the more interesting features are worth sharing:
- grabbing an image and displaying it in the item.
Now, on to the main topic of this article:
scraping Lianjia's home-sale records and showing the house pictures.
1. Create a Scrapy project:
scrapy startproject lianjia_shub
This creates the following structure under the current folder:
│scrapy.cfg
│
└─lianjia_shub
│items.py
│pipelines.py
│settings.py
│__init__.py
│
└─spiders
__init__.py
2. Define the item:
import scrapy
from scrapy import Field

class LianjiaShubItem(scrapy.Item):
    id = Field()
    title = Field()
    price = Field()
    addr = Field()
    link = Field()
    # Pay attention to the image field here:
    # it stores the captured image URLs, so you can view the pictures
    # in Scrapinghub's item browser.
    # The field must be named image, or the pictures won't be shown.
    image = Field()
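As a quick sanity check, the item can be instantiated in a Python shell from the project root; a minimal sketch (the example URL here is made up):
from lianjia_shub.items import LianjiaShubItem
item = LianjiaShubItem()
item['image'] = ['http://example.com/house.jpg']  # hypothetical URL
print(dict(item))  # {'image': ['http://example.com/house.jpg']}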
3. Create the spider:
In cmd, run the following command:
scrapy genspider lianjia http://bj.lianjia.com/chengjiao
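genspider creates a skeleton spider under spiders/. Roughly, it looks like this (the exact template depends on your Scrapy version):
import scrapy

class LianjiaSpider(scrapy.Spider):
    name = 'lianjia'
    allowed_domains = ['bj.lianjia.com']
    start_urls = ['http://bj.lianjia.com/chengjiao/']

    def parse(self, response):
        pass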
Next, replace the skeleton with the full spider definition:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.init import InitSpider
from lianjia_shub.items import LianjiaShubItem

class LianjiaSpider(InitSpider):
    name = "lianjia"
    allowed_domains = ["bj.lianjia.com"]
    start_urls = []

    def init_request(self):
        # Fetch the first listing page before the normal crawl starts.
        return scrapy.Request('http://bj.lianjia.com/chengjiao/pg1/',
                              callback=self.parse_detail_links)

    def parse_detail_links(self, response):
        # Collect the detail-page link of every deal on the listing page.
        house_lis = response.css('.clinch-list li')
        for house_li in house_lis:
            link = house_li.css('.info-panel h2 a::attr("href")').extract_first().encode('utf-8')
            self.start_urls.append(link)
        return self.initialized()

    def parse(self, response):
        house = LianjiaShubItem()
        house['link'] = response.url
        house['id'] = response.url.split('/')[-1].split('.')[0]
        image_url = response.css('.pic-panel img::attr(src)').extract_first()
        # image is a list; every URL in the list is displayed in Scrapinghub.
        house['image'] = [image_url, image_url]
        house['title'] = response.css('.title-box h1::text').extract_first()
        house['addr'] = response.css('.info-item01 a::text').extract_first()
        house['price'] = response.css('.love-money::text').extract_first()
        return house
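Before deploying, the spider can be tested locally from the project root (a standard Scrapy invocation; the output file name here is arbitrary):
scrapy crawl lianjia -o houses.json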
4. Next, register an account with Scrapinghub (http://scrapinghub.com/platform/).
5. Install shub, the Scrapinghub command-line client:
pip install shub
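To confirm the client installed correctly, you can list its commands:
shub --help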
6. Create a project on Scrapinghub and find the corresponding API key:
API Key: click Account -- Account Settings -- API Key
7. Log in with shub using the API key:
shub login
Entering the API key at the prompt creates a Scrapinghub configuration file, scrapinghub.yml:
projects:
  default: lianjia_shub
8. Deploy the spider to the Scrapinghub:
Shub Deploy <projectid>
The Project ID can be found in the link: https://dash.scrapinghub.com/p/<projectid>/jobs/
9. Run the spider on Scrapinghub:
A job on Scrapinghub corresponds to a spider we defined. Open
https://dash.scrapinghub.com/p/<projectid>/spider/lianjia/ and click Run Spider in the top right corner of the page.
In the dialog box that pops up, select the priority for this run (it can be set to highest if you don't want to wait too long).
After the run finishes, click Items to view the captured information.
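If you prefer the command line to the dashboard, shub can also start a run and fetch a job's output (assuming scrapinghub.yml points at your project; the job ID appears in the dashboard and in shub's output):
shub schedule lianjia
shub items <projectid>/<spiderid>/<jobid>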
What's Next:
1. Change the spider's configuration as needed:
Project Settings -- Spiders
2. Set up scheduled crawls:
Periodic Jobs -- Add periodic job