Let's start by documenting a small example of what I did with Scrapy.
Version of software used: Python 2.7.11, Scrapy 1.0.5
1. Scrapy's basic workflow: UR2IM
URL, Request, Response, Items, more URLs. The cycle can be summarised briefly with the following diagram:
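As a rough illustration of the cycle (a hedged sketch, not the project code we build later), a spider callback touches every stage of UR2IM:

import scrapy

class Ur2imSketchSpider(scrapy.Spider):
    # Hypothetical spider, only meant to illustrate the UR2IM cycle.
    name = "ur2im_sketch"
    start_urls = ['http://example.com']            # URL: Scrapy turns it into a Request

    def parse(self, response):                     # Response: the downloaded page arrives here
        yield {'title': response.xpath('//title/text()').extract()}   # Item (plain dicts work in Scrapy 1.0)
        for href in response.xpath('//a/@href').extract():            # more URLs
            yield scrapy.Request(response.urljoin(href))              # back to the Request stage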
2. Scrapy Shell
The Scrapy shell is a very useful tool that is often used for debugging. After Scrapy has been installed successfully with pip install scrapy, open cmd and enter, for example:
>scrapy shell http://www.baidu.com
If output like the following is returned, Baidu's homepage has been fetched successfully:
Some websites block what they consider malicious crawlers, so we need to simulate a browser. We can do that by passing the USER_AGENT setting on the shell command line:
>scrapy shell -s USER_AGENT="Mozilla/6.0" http://www.baidu.com
You can then use XPath to extract the required fields from the HTML. For example, enter:
>>> print response.xpath('//title/text()').extract()[0]
which prints Baidu's page title ("Baidu it, and you'll know").
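A few more things that are handy to inspect in the same shell session (these are standard response and selector attributes, listed here only as a sketch):

>>> response.status                               # HTTP status code, e.g. 200
>>> response.url                                  # the URL that was actually fetched
>>> response.xpath('//title/text()').extract()    # the full list of matches, instead of [0]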
3. XPath expressions
Using the browser's Inspect Element view, you can see where the content we want to crawl sits in the HTML.
For example, the information this project needs to crawl is the title, location and price.
From this you can derive the XPath expressions:
>>> response.xpath('//h1/text()').extract()  # title
[u'must be SEEN!! BRAND NEW STUDIO flats in excellent !']
>>> response.xpath('//strong[@class="ad-price txt-xlarge txt-emphasis"]/text()')[0].extract()  # price
u'\n\xa3775.00pm'
>>> response.xpath('//*[@class="ad-location truncate-line set-left"]/text()').extract()  # address
[u'\nsouth Croydon, london\n']
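Note that the raw price comes back as u'\n\xa3775.00pm'. In the spider below, the selector's re() method is used to keep only digits, commas and dots; sketched in the shell it looks roughly like this (the cleaned value shown is just the expected result for the example above):

>>> response.xpath('//strong[@class="ad-price txt-xlarge txt-emphasis"]/text()')[0].re('[,.0-9]+')
[u'775.00']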
4. Create a Scrapy project
In cmd, cd to the directory where you want to create the project and run:
>scrapy startproject <project name>
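Assuming the project is named properties (as the items and spider code below suggests), the generated layout should look roughly like this:

properties/
    scrapy.cfg
    properties/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py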
Step one: Define the elements that need to be crawled
Find items.py and open it; you will see a class whose name is the project name combined with "Item". Edit it as follows:
from scrapy.item import Item, Field

class PropertiesItem(Item):
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()
    images = Field()
    location = Field()
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
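Once the fields are declared, a PropertiesItem behaves like a dict restricted to those keys. A quick sketch of how it is used (this is essentially what the spider's parse_item does later):

>>> from properties.items import PropertiesItem
>>> item = PropertiesItem()
>>> item['title'] = [u'some title']
>>> item['bogus'] = 1      # raises KeyError: only fields declared in the class are allowed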
Step two: Define the crawler
Inside the project directory, enter in cmd:
>scrapy genspider basic gumtree.com  # basic is the name of this spider; gumtree.com becomes the allowed_domains entry
You will now see a basic.py file under the spiders folder; edit it as follows:
# -*- coding: utf-8 -*-
import urlparse

import scrapy
from scrapy.http import Request

from properties.items import PropertiesItem


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["gumtree.com"]
    start_urls = (
        'https://www.gumtree.com/all/uk/flat/page1',
    )

    def parse(self, response):
        # Follow the link to the next page.
        next_selector = response.xpath(
            '//*[@data-analytics="gaevent:paginationnext"]/@href').extract()
        for url in next_selector:
            yield Request(urlparse.urljoin(response.url, url))

        # Follow the link to each item's detail page.
        item_selector = response.xpath('//*[@itemprop="url"]/@href').extract()
        for url in item_selector:
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        item = PropertiesItem()
        item['title'] = response.xpath('//h1/text()').extract()
        item['price'] = response.xpath(
            '//strong[@class="ad-price txt-xlarge txt-emphasis"]/text()')[0].re('[,.0-9]+')
        item['description'] = [response.xpath(
            '//p[@itemprop="description"]/text()')[0].extract().strip()]
        item['address'] = [response.xpath(
            '//*[@class="ad-location truncate-line set-left"]/text()').extract()[0].strip()]
        item['image_urls'] = [response.xpath('//img[@itemprop="image"]/@src')[0].extract()]
        item['url'] = [response.url]
        return item
In the parse function we use yield. My understanding is that yield is similar to return in that it hands a value back to the caller, but unlike return it does not exit the function or the for loop: execution resumes right after the yield and the loop carries on. This is how we collect every entry on a page while also following the pagination links.
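To see the difference outside of Scrapy, here is a minimal, self-contained sketch of a generator: the function hands back one value per yield and then resumes right after it, so the loop is never abandoned the way a return would abandon it.

def numbers():
    for i in range(3):
        yield i            # hand back a value, then resume here on the next request

print list(numbers())      # prints [0, 1, 2] -- the whole loop ran, one value at a time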
Then enter in cmd:
>scrapy crawl basic
The crawler will then run until every entry under this category has been crawled. Usually a limit is set on the number of scraped items so that the crawl terminates automatically:
>scrapy crawl basic -s CLOSESPIDER_ITEMCOUNT=100
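The -s flag is just a command-line override of a setting; if you prefer, the same limit can live in the project's settings.py (CLOSESPIDER_ITEMCOUNT belongs to Scrapy's CloseSpider extension; the value 100 is only an example):

# properties/settings.py
CLOSESPIDER_ITEMCOUNT = 100    # close the spider automatically after about 100 items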