Python 3's Scrapy: a basic explanation

Source: Internet
Author: User
Tags: function definition, XPath

Scrapy is an application framework for Python crawlers: it extracts structured data from web pages and lets you process and manage the data you crawl quickly and systematically.

This is my simple understanding of Scrapy.


I will not cover the underlying principles or show diagrams here. (You should already have a good understanding of simple crawlers, otherwise it is hard to learn Scrapy directly.)

If you are just getting ready to learn Scrapy, you should read closely. If you have already been studying Scrapy for some time, this article may not suit you: it only covers getting started.


Everything here is achievable with beginner-level skills.

Straight to the practical part. (Objective: crawl the Douban Top 250 movie list.)

Of course, you may not have Scrapy installed yet. I won't belabor the details here: the short version is that you run pip install scrapy, but before that you need to install the environments Scrapy depends on:

(pip install parsel, pip install twisted, pip install lxml) If your environment has problems, you can look up the fixes online.


Find a directory where you want to store your Scrapy project files, then execute the command: scrapy startproject Get_douban

This generates a folder containing the files Scrapy needs; as novices we don't need to worry about most of them.

Now create a new douban.py file inside the project's spiders folder; this is the file where we write the crawler. Here is the douban.py code.


You may need to know a bit of XPath (http://www.w3school.com.cn/xpath/); you can get up to speed there quickly.

import scrapy
from scrapy.http import Request


class DoubanSpider(scrapy.Spider):
    name = "Douban"  # every spider must have a unique name; it is the name we use to run the file
    start_urls = ["https://movie.douban.com/top250"]  # this list can hold multiple URLs, crawled in order; here we crawl just one
    url = "https://movie.douban.com/top250"  # Douban Top 250 is paginated, so we keep this base URL for building next-page links

    def parse(self, response):  # parse() is the default callback
        # xpath() is one of Scrapy's matching mechanisms, similar to regular
        # expressions; there are several other ways to match.
        # ("The path to match the information you need")
        # First we grab the big block that holds all the information we need.
        sites = response.xpath('//ol[@class="grid_view"]')
        print("..... The returned information is:")
        info = sites.xpath('./li')  # from sites, narrow down to all the movies' information
        for i in info:  # i holds the information of a single movie
            # rank
            num = i.xpath('./div//em[@class=""]//text()').extract()  # extract() pulls out what we matched, as a list
            print(num[0], end="; ")
            # title
            title = i.xpath('.//span[@class="title"]/text()').extract()
            print(title[0], end="; ")
            # short comment
            remark = i.xpath('.//span[@class="inq"]//text()').extract()
            # score
            score = i.xpath('./div//span[@class="rating_num"]//text()').extract()
            print(score[0])
        # Remember the url we defined earlier? There are too many movies for one
        # page, so here we grab the link behind the page-flip button.
        nextlink = response.xpath('//span[@class="next"]/link/@href').extract()
        if nextlink:  # the last page has no next link, so we have to check
            nextlink = nextlink[0]
            print(nextlink)
            # yield hands the next page's Request back to Scrapy so crawling
            # restarts from that page; callback says which function handles it.
            yield Request(self.url + nextlink, callback=self.parse)
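To make the page-flipping logic concrete: the value extracted from the "next" button is a relative query string, and the spider simply appends it to the base url. This standalone sketch mimics that join (the ?start= value follows Douban's real paging scheme; no network access involved):

```python
base_url = "https://movie.douban.com/top250"

def next_page(base, nextlink):
    # nextlink plays the role of what the spider extracts with
    # '//span[@class="next"]/link/@href', e.g. "?start=25&filter=".
    # On the last page nothing is extracted, so we return None to stop.
    return base + nextlink if nextlink else None

print(next_page(base_url, "?start=25&filter="))
# https://movie.douban.com/top250?start=25&filter=
print(next_page(base_url, ""))
# None
```

This is why the spider checks `if nextlink:` before yielding the next Request — without the check, the crawl would try to build a URL from nothing on the last page.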

The above is the code inside spiders/douban.py. Now, how do we run it?

Open cmd in the Get_douban folder and enter the command to run the spider: scrapy crawl Douban, then press Enter.

You will then see the crawled information (rank; title; score for each movie) printed in the console.



And with that you have implemented a simple crawler with Scrapy and scraped the Douban Top 250. Any comments are welcome.

We have not yet covered Scrapy's other components, such as items.py and so on; for now, let's just get familiar with the simple parts.





