Scrapy: Crawl home rental information and generate a mobile app (1)




Let's start by documenting a small example of what I did with scrapy.



Version of software used: Python 2.7.11, Scrapy 1.0.5






1. The workflow commonly used by Scrapy: UR2IM



URL, Request, Response, Items, more URLs: start from a URL, send a Request, get the Response back, extract Items from it, and collect more URLs to continue the cycle.
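The original diagram is not reproduced here, so here is a rough sketch of that loop in plain Python. This is not Scrapy's actual API: download() and parse() are placeholders for the work Scrapy's downloader and our spider callbacks do.

def ur2im(start_urls, download, parse):
    # Illustrative placeholders: download() fetches a URL, parse()
    # extracts items and further links from the response.
    queue = list(start_urls)                   # URLs to visit
    items = []
    while queue:
        url = queue.pop(0)                     # URL
        response = download(url)               # Request -> Response
        new_items, new_urls = parse(response)  # Items + more URLs
        items.extend(new_items)
        queue.extend(new_urls)                 # keep the cycle going
    return items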














2. Scrapy Shell



The Scrapy shell is a very useful tool and is often used for debugging. After installing Scrapy successfully with pip install scrapy, try it under cmd, for example:



>scrapy shell http://www.baidu.com



If the shell starts up and lists the request and response objects, Baidu's homepage has been fetched successfully.






To keep out malicious crawlers, some websites reject requests that do not look like they come from a browser, so we need to simulate one. We can do this by setting the user agent in the shell command:



>scrapy shell -s USER_AGENT="Mozilla/6.0" http://www.baidu.com



You can then use XPath to extract the fields you need from the HTML. For example, enter:


>>> print response.xpath('//title/text()').extract()[0]
百度一下，你就知道





3. XPath expressions



Using the browser's Inspect Element view, you can see where the content we want to crawl sits in the page's HTML:






For this project, the information we need to crawl is: title, location and price.



From this you can work out the corresponding XPath expressions:


>>> response.xpath('//h1/text()').extract()    # title
[u'MUST BE SEEN!! BRAND NEW STUDIO FLATS IN EXCELLENT !']
>>> response.xpath('//strong[@class="ad-price txt-xlarge txt-emphasis"]/text()')[0].extract()    # price
u'\n\xa3775.00pm'
>>> response.xpath('//*[@class="ad-location truncate-line set-left"]/text()').extract()    # address
[u'\nSouth Croydon, London\n']
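The values above still contain newlines and the £ sign. Still in the shell, they can be tidied up with the selector's .re() method and Python's strip(); this is just a sketch against the same response as above, and the exact output depends on the page:

>>> response.xpath('//strong[@class="ad-price txt-xlarge txt-emphasis"]/text()')[0].re('[,.0-9]+')
[u'775.00']
>>> response.xpath('//*[@class="ad-location truncate-line set-left"]/text()').extract()[0].strip()
u'South Croydon, London'

The spider in section 4 below does exactly this kind of cleanup in its parse_item callback.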





4. Create a Scrapy project



In cmd, cd to the directory where you want to create the project and run:



>scrapy startproject <project name>
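The rest of this article assumes the project is called properties (the spider below imports properties.items), so the command and the skeleton it generates would look roughly like this; the exact files vary slightly between Scrapy versions:

>scrapy startproject properties

properties/
    scrapy.cfg           # deployment configuration
    properties/
        __init__.py
        items.py         # item definitions (edited in step one below)
        pipelines.py     # item pipelines
        settings.py      # project settings
        spiders/
            __init__.py  # spiders are generated into this folder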



Step one: Define the items that need to be crawled



Open items.py and you will see a class whose name is the project name combined with "Item". Edit it as follows:


from scrapy.item import Item, Field


class PropertiesItem(Item):
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()
    images = Field()
    location = Field()
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
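As a quick sanity check (for example from scrapy shell inside the project), an Item behaves like a dictionary, and assigning to a field that was not declared raises a KeyError. The values here are made up purely for illustration:

>>> from properties.items import PropertiesItem
>>> item = PropertiesItem()
>>> item['title'] = u'Brand new studio flat'
>>> item['price'] = [u'775.00']
>>> item
{'price': [u'775.00'], 'title': u'Brand new studio flat'}
>>> item['bedrooms'] = 2
KeyError: 'PropertiesItem does not support field: bedrooms'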


Step two: Define the spider



Inside the project directory, run in cmd:



>scrapy genspider basic gumtree.com  # basic is the spider's name; gumtree.com becomes its allowed_domains
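genspider only creates a skeleton; depending on the template version it looks roughly like the following, and the next step replaces its contents entirely:

# -*- coding: utf-8 -*-
import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["gumtree.com"]
    start_urls = (
        'http://www.gumtree.com/',
    )

    def parse(self, response):
        pass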



You will see basic.py under the spiders folder; edit it as follows:


# -*- coding: utf-8 -*-
import scrapy
import urlparse

from properties.items import PropertiesItem
from scrapy.http import Request


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["gumtree.com"]
    start_urls = (
        'https://www.gumtree.com/all/uk/flat/page1',
    )

    def parse(self, response):
        # Get the link to the next page.
        next_selector = response.xpath(
            '//*[@data-analytics="gaEvent:PaginationNext"]/@href').extract()
        for url in next_selector:
            yield Request(urlparse.urljoin(response.url, url))

        # Get the link to each item page.
        item_selector = response.xpath('//*[@itemprop="url"]/@href').extract()
        for url in item_selector:
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        item = PropertiesItem()
        item['title'] = response.xpath('//h1/text()').extract()
        item['price'] = response.xpath(
            '//strong[@class="ad-price txt-xlarge txt-emphasis"]/text()')[0].re('[,.0-9]+')
        item['description'] = [response.xpath(
            '//p[@itemprop="description"]/text()')[0].extract().strip()]
        item['address'] = [response.xpath(
            '//*[@class="ad-location truncate-line set-left"]/text()').extract()[0].strip()]
        item['image_urls'] = [response.xpath(
            '//img[@itemprop="image"]/@src')[0].extract()]
        item['url'] = [response.url]
        return item


In the parse function we use yield. As I understand it, yield is similar to return in that it hands a value back, but unlike return it does not break out of the for loop: after the value is consumed, execution resumes and the loop continues. This is how we get every listing on the page while also following the pagination links.
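A tiny plain-Python example shows the difference described above, independent of Scrapy:

def doubled_with_return(values):
    for v in values:
        return v * 2               # leaves the loop on the first value

def doubled_with_yield(values):
    for v in values:
        yield v * 2                # hands back a value, then resumes the loop

print doubled_with_return([1, 2, 3])       # 2
print list(doubled_with_yield([1, 2, 3]))  # [2, 4, 6]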



Then, under cmd, run:



>scrapy crawl basic
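If you also want to keep what was scraped, Scrapy's feed export can write the items straight to a file; the file name here is only an example:

>scrapy crawl basic -o items.json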



The spider will keep running until every listing under this directory has been crawled. Usually we cap the number of items so that the crawl terminates automatically:



>scrapy crawl basic -s CLOSESPIDER_ITEMCOUNT=100
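The same limit can also live in the project's settings.py instead of being passed with -s every time:

# settings.py: stop the spider automatically after 100 items
CLOSESPIDER_ITEMCOUNT = 100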




