Let's start by documenting a small example of what I did with Scrapy.
Version of software used: Python 2.7.11, Scrapy 1.0.5
1. Scrapy's basic workflow: UR2IM
URL, Request, Response, Items, more URLs. The cycle can be summarised briefly with the following diagram:
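As a rough illustration of the cycle (a hedged sketch, not the project code we build later), a spider callback touches every stage of UR2IM:

import scrapy

class Ur2imSketchSpider(scrapy.Spider):
    # Hypothetical spider, only meant to illustrate the UR2IM cycle.
    name = "ur2im_sketch"
    start_urls = ['http://example.com']            # URL: Scrapy turns it into a Request

    def parse(self, response):                     # Response: the downloaded page arrives here
        yield {'title': response.xpath('//title/text()').extract()}   # Item (plain dicts work in Scrapy 1.0)
        for href in response.xpath('//a/@href').extract():            # more URLs
            yield scrapy.Request(response.urljoin(href))              # back to the Request stage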
2. Scrapy Shell
The Scrapy shell is a very useful tool that is often used for debugging. After Scrapy has been installed successfully with pip install scrapy, open cmd and enter, for example:
>scrapy shell http://www.baidu.com
If output like the following is returned, Baidu's homepage has been fetched successfully:
Some websites block what they consider malicious crawlers, so we need to simulate a browser. We can do that by passing the USER_AGENT setting on the shell command line:
>scrapy shell -s USER_AGENT="Mozilla/6.0" http://www.baidu.com
You can then use XPath to extract the required fields from the HTML. For example, enter:
>>> print response.xpath('//title/text()').extract()[0]
which prints Baidu's page title ("Baidu it, and you'll know").
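A few more things that are handy to inspect in the same shell session (these are standard response and selector attributes, listed here only as a sketch):

>>> response.status                               # HTTP status code, e.g. 200
>>> response.url                                  # the URL that was actually fetched
>>> response.xpath('//title/text()').extract()    # the full list of matches, instead of [0]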
3. XPath expressions
Using the browser's Inspect Element view, you can see where the content we want to crawl sits in the HTML.
For example, the information this project needs to crawl is the title, location and price.
From this you can derive the XPath expressions:
>>> response.xpath('//h1/text()').extract()  # title
[u'must be SEEN!! BRAND NEW STUDIO flats in excellent !']
>>> response.xpath('//strong[@class="ad-price txt-xlarge txt-emphasis"]/text()')[0].extract()  # price
u'\n\xa3775.00pm'
>>> response.xpath('//*[@class="ad-location truncate-line set-left"]/text()').extract()  # address
[u'\nsouth Croydon, london\n']
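Note that the raw price comes back as u'\n\xa3775.00pm'. In the spider below, the selector's re() method is used to keep only digits, commas and dots; sketched in the shell it looks roughly like this (the cleaned value shown is just the expected result for the example above):

>>> response.xpath('//strong[@class="ad-price txt-xlarge txt-emphasis"]/text()')[0].re('[,.0-9]+')
[u'775.00']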
4. Create a Scrapy project
In cmd, cd to the directory where you want to create the project and run:
>scrapy startproject <project name>
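Assuming the project is named properties (as the items and spider code below suggests), the generated layout should look roughly like this:

properties/
    scrapy.cfg
    properties/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py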
Step one: Define the elements that need to be crawled
Find items.py and open it; you will see a class whose name is the project name combined with "Item". Edit it as follows:
from scrapy.item import Item, Field

class PropertiesItem(Item):
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()
    images = Field()
    location = Field()
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
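Once the fields are declared, a PropertiesItem behaves like a dict restricted to those keys. A quick sketch of how it is used (this is essentially what the spider's parse_item does later):

>>> from properties.items import PropertiesItem
>>> item = PropertiesItem()
>>> item['title'] = [u'some title']
>>> item['bogus'] = 1      # raises KeyError: only fields declared in the class are allowed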
Step two: Define the crawler
Inside the project directory, enter in cmd:
>scrapy genspider basic gumtree.com  # basic is the name of this spider; gumtree.com becomes the allowed_domains entry
You will now see a basic.py file under the spiders folder; edit it as follows:
# -*- coding: utf-8 -*-
import urlparse

import scrapy
from scrapy.http import Request

from properties.items import PropertiesItem


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["gumtree.com"]
    start_urls = (
        'https://www.gumtree.com/all/uk/flat/page1',
    )

    def parse(self, response):
        # Follow the link to the next page.
        next_selector = response.xpath(
            '//*[@data-analytics="gaevent:paginationnext"]/@href').extract()
        for url in next_selector:
            yield Request(urlparse.urljoin(response.url, url))

        # Follow the link to each item's detail page.
        item_selector = response.xpath('//*[@itemprop="url"]/@href').extract()
        for url in item_selector:
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        item = PropertiesItem()
        item['title'] = response.xpath('//h1/text()').extract()
        item['price'] = response.xpath(
            '//strong[@class="ad-price txt-xlarge txt-emphasis"]/text()')[0].re('[,.0-9]+')
        item['description'] = [response.xpath(
            '//p[@itemprop="description"]/text()')[0].extract().strip()]
        item['address'] = [response.xpath(
            '//*[@class="ad-location truncate-line set-left"]/text()').extract()[0].strip()]
        item['image_urls'] = [response.xpath('//img[@itemprop="image"]/@src')[0].extract()]
        item['url'] = [response.url]
        return item
In the parse function we use yield. My understanding is that yield is similar to return in that it hands a value back to the caller, but unlike return it does not exit the function or the for loop: execution resumes right after the yield and the loop carries on. This is how we collect every entry on a page while also following the pagination links.
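To see the difference outside of Scrapy, here is a minimal, self-contained sketch of a generator: the function hands back one value per yield and then resumes right after it, so the loop is never abandoned the way a return would abandon it.

def numbers():
    for i in range(3):
        yield i            # hand back a value, then resume here on the next request

print list(numbers())      # prints [0, 1, 2] -- the whole loop ran, one value at a time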
Then enter in cmd:
>scrapy crawl basic
The crawler will then run until every entry under this category has been crawled. Usually a limit is set on the number of scraped items so that the crawl terminates automatically:
>scrapy crawl basic -s CLOSESPIDER_ITEMCOUNT=100
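The -s flag is just a command-line override of a setting; if you prefer, the same limit can live in the project's settings.py (CLOSESPIDER_ITEMCOUNT belongs to Scrapy's CloseSpider extension; the value 100 is only an example):

# properties/settings.py
CLOSESPIDER_ITEMCOUNT = 100    # close the spider automatically after about 100 items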