Scrapy is a Python application framework for writing spiders: it lets you crawl websites, extract structured data, and process and manage the crawled data quickly and systematically.
That is my simple understanding of Scrapy.
This article does not go into the underlying principles or illustrate them with diagrams. (You should already have a basic understanding of simple crawlers, or be learning Scrapy directly.)
If you are just getting ready to learn Scrapy, read on carefully. If you have already been studying Scrapy for a while, this article may not be for you: it only covers getting started, at a level a beginner can follow.
Straight to the practical part. (Goal: crawl the Douban Top 250 movie list.)
You may not have Scrapy installed yet. I won't go into detail here; the short version is that before you run pip install scrapy you need to install the environments Scrapy depends on
(pip install parsel, pip install twisted, pip install lxml). You can look up these dependencies online.
Find a place where you want to store your Scrapy files, then run the command: scrapy startproject Get_douban
A folder is generated. It contains some files that Scrapy needs; as beginners we don't need to worry about them.
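For orientation, the layout produced by scrapy startproject looks roughly like this (the exact file list may vary slightly between Scrapy versions):

```
Get_douban/
    scrapy.cfg          # deployment configuration
    Get_douban/         # the project's Python module
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/        # spider files go here
            __init__.py
```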
Now create a new file named douban.py inside Get_douban/spiders. This is the file we use to write the crawler; the code of douban.py is below.
You may need to know some XPath (http://www.w3school.com.cn/xpath/); a quick look at that tutorial is enough.
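To get a feel for how an XPath expression picks elements out of markup, here is a tiny sketch using only the Python standard library. The fragment below is made up to resemble the shape of Douban's list markup; it is an assumption for illustration only. Note that xml.etree supports just a small XPath subset, while Scrapy's selectors support full XPath 1.0, but the matching idea is the same.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like Douban's movie list (an assumption,
# not the real page source).
html = """
<ol class="grid_view">
  <li><div><em class="">1</em><span class="title">The Shawshank Redemption</span></div></li>
  <li><div><em class="">2</em><span class="title">Farewell My Concubine</span></div></li>
</ol>
"""

root = ET.fromstring(html)
# ".//span[@class='title']" matches every descendant <span> whose class is "title"
titles = [span.text for span in root.findall(".//span[@class='title']")]
print(titles)  # → ['The Shawshank Redemption', 'Farewell My Concubine']
```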
```python
import scrapy
from scrapy.http import Request


class DoubanSpider(scrapy.Spider):
    name = "Douban"  # a unique name is required; we use it when running the spider
    start_urls = ["https://movie.douban.com/top250"]  # this list can hold several URLs,
                                                      # crawled in order; we only need one
    url = "https://movie.douban.com/top250"  # Douban Top 250 is paginated, so we keep
                                             # the base URL to build next-page links

    def parse(self, response):  # the default callback
        # xpath is one of Scrapy's matching mechanisms, similar to regular
        # expressions; there are several other ways to match as well.
        # First we grab the big block that contains the information we need.
        sites = response.xpath('//ol[@class="grid_view"]')
        print("..... the returned information is:")
        info = sites.xpath('./li')  # from that block, get every movie's information
        for i in info:  # i is one movie's information
            # ranking
            num = i.xpath('./div//em[@class=""]//text()').extract()
            # extract() is the extractor that pulls out what we matched;
            # what we get back is a list
            print(num[0], end="; ")
            # title
            title = i.xpath('.//span[@class="title"]/text()').extract()
            print(title[0], end="; ")
            # one-line comment
            remark = i.xpath('.//span[@class="inq"]//text()').extract()
            # score
            score = i.xpath('./div//span[@class="rating_num"]//text()').extract()
            print(score[0])
        # Remember the url we defined earlier? Because there are too many movies
        # for one page, here we get the link behind the "next page" button.
        nextlink = response.xpath('//span[@class="next"]/link/@href').extract()
        if nextlink:  # the last page has no next link, so we must check
            nextlink = nextlink[0]
            print(nextlink)
            # yield a Request for the next page; callback tells Scrapy which
            # function handles the response, so parse crawls the next page too
            yield Request(self.url + nextlink, callback=self.parse)
```
The above is the code inside spiders/douban.py. Now let's run it.
Open cmd in the Get_douban folder and enter the command to run the spider: scrapy crawl Douban, then press Enter.
You will see the crawled information, such as each movie's ranking, title, and score, printed in the console.
With that, you have implemented a simple crawler with Scrapy and crawled the Douban Top 250. Any comments are welcome.
We have not yet covered Scrapy's other features, such as items.py and so on; get familiar with the simple parts first.