Scrapy is a Python application framework for writing spiders: it lets you crawl websites, extract structured data, and process and manage the crawled data quickly and systematically.
That is my simple understanding of Scrapy.
This article does not go into the underlying principles or illustrate them with diagrams. (You should already have a basic understanding of simple crawlers, or be learning Scrapy directly.)
If you are just getting ready to learn Scrapy, read on carefully. If you have already been studying Scrapy for a while, this article may not be for you: it only covers getting started, at a level a beginner can follow.
Straight to the practical part. (Goal: crawl the Douban Top 250 movie list.)
You may not have Scrapy installed yet. I won't go into detail here; the short version is that before you run pip install scrapy you need to install the environments Scrapy depends on
(pip install parsel, pip install twisted, pip install lxml). You can look up these dependencies online.
Find a place where you want to store your Scrapy files, then run the command: scrapy startproject Get_douban
A folder is generated. It contains some files that Scrapy needs; as beginners we don't need to worry about them.
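For orientation, the layout produced by scrapy startproject looks roughly like this (the exact file list may vary slightly between Scrapy versions):

```
Get_douban/
    scrapy.cfg          # deployment configuration
    Get_douban/         # the project's Python module
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/        # spider files go here
            __init__.py
```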
Now create a new file named douban.py inside Get_douban/spiders. This is the file we use to write the crawler; the code of douban.py is below.
You may need to know some XPath (http://www.w3school.com.cn/xpath/); a quick look at that tutorial is enough.
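To get a feel for how an XPath expression picks elements out of markup, here is a tiny sketch using only the Python standard library. The fragment below is made up to resemble the shape of Douban's list markup; it is an assumption for illustration only. Note that xml.etree supports just a small XPath subset, while Scrapy's selectors support full XPath 1.0, but the matching idea is the same.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like Douban's movie list (an assumption,
# not the real page source).
html = """
<ol class="grid_view">
  <li><div><em class="">1</em><span class="title">The Shawshank Redemption</span></div></li>
  <li><div><em class="">2</em><span class="title">Farewell My Concubine</span></div></li>
</ol>
"""

root = ET.fromstring(html)
# ".//span[@class='title']" matches every descendant <span> whose class is "title"
titles = [span.text for span in root.findall(".//span[@class='title']")]
print(titles)  # → ['The Shawshank Redemption', 'Farewell My Concubine']
```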
```python
import scrapy
from scrapy.http import Request


class DoubanSpider(scrapy.Spider):
    name = "Douban"  # a unique name is required; we use it when running the spider
    start_urls = ["https://movie.douban.com/top250"]  # this list can hold several URLs,
                                                      # crawled in order; we only need one
    url = "https://movie.douban.com/top250"  # Douban Top 250 is paginated, so we keep
                                             # the base URL to build next-page links

    def parse(self, response):  # the default callback
        # xpath is one of Scrapy's matching mechanisms, similar to regular
        # expressions; there are several other ways to match as well.
        # First we grab the big block that contains the information we need.
        sites = response.xpath('//ol[@class="grid_view"]')
        print("..... the returned information is:")
        info = sites.xpath('./li')  # from that block, get every movie's information
        for i in info:  # i is one movie's information
            # ranking
            num = i.xpath('./div//em[@class=""]//text()').extract()
            # extract() is the extractor that pulls out what we matched;
            # what we get back is a list
            print(num[0], end="; ")
            # title
            title = i.xpath('.//span[@class="title"]/text()').extract()
            print(title[0], end="; ")
            # one-line comment
            remark = i.xpath('.//span[@class="inq"]//text()').extract()
            # score
            score = i.xpath('./div//span[@class="rating_num"]//text()').extract()
            print(score[0])
        # Remember the url we defined earlier? Because there are too many movies
        # for one page, here we get the link behind the "next page" button.
        nextlink = response.xpath('//span[@class="next"]/link/@href').extract()
        if nextlink:  # the last page has no next link, so we must check
            nextlink = nextlink[0]
            print(nextlink)
            # yield a Request for the next page; callback tells Scrapy which
            # function handles the response, so parse crawls the next page too
            yield Request(self.url + nextlink, callback=self.parse)
```
The above is the code inside spiders/douban.py. Now let's run it.
Open cmd in the Get_douban folder and enter the command to run the spider: scrapy crawl Douban, then press Enter.
You will see the crawled information, such as each movie's ranking, title, and score, printed in the console.
With that, you have implemented a simple crawler with Scrapy and crawled the Douban Top 250. Any comments are welcome.
We have not yet covered Scrapy's other features, such as items.py and so on; get familiar with the simple parts first.