The syntax of XPath
XPath syntax: predicates
Create a Scrapy project
scrapy startproject ArticleSpider
Create a Scrapy spider
cd ArticleSpider
scrapy genspider jobbole blog.jobbole.com
How to use
You can copy an XPath expression directly from the Chrome developer tools (F12). Here is a way to test it.
Normally we would have to re-issue the request in PyCharm or another editor on every debugging run; instead, we can use the following method.
Suppose we want to crawl an article from Jobbole, at http://blog.jobbole.com/112614/
We will extract the article's title, publication time, number of upvotes, number of favorites, and number of comments. All fields are stored as strings.
Run the following on the command line:
scrapy shell http://blog.jobbole.com/112614/
After that, we can get the corresponding content with the commands below. response.xpath() returns a <class 'scrapy.selector.unified.SelectorList'>; calling extract() on it returns the matched contents as a list of strings.
In [1]: title = response.xpath('//*[@id="post-112614"]/div[1]/h1/text()')

In [2]: print(title)
[<Selector xpath='//*[@id="post-112614"]/div[1]/h1/text()' data='Why SQL is defeating NoSQL, and what the future of data is.'>]

In [3]: print(title.extract())
['Why SQL is defeating NoSQL, and what the future of data is.']
#### Get the publication time
Next, the publication time. This line is longer because I debugged it piece by piece before condensing it into one chain; the process is shown in the figure below.
create_date = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·", "").strip()
strip() removes the specified characters (whitespace by default) from both ends of a string.
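The cleanup chain is easy to sanity-check on its own; the raw string below is an assumed sample of what the page's date node looks like, not actual crawl output:

```python
# Assumed sample of the raw text node: leading whitespace, a date, a separator dot.
raw = "\r\n            2017/09/26 ·  "
# Strip whitespace from both ends, drop the dot, then strip the space it leaves behind.
create_date = raw.strip().replace("·", "").strip()
print(create_date)  # → 2017/09/26
```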
#### Get the title
Earlier the code simply copied the XPath path from Chrome, but that can only extract data from this one article: as you can verify, id="post-112614" exists only on this page and nowhere else, so we need to replace the copied XPath selector with a more general one.
create_date = response.xpath('//*[@id="post-112614"]/div[2]/p/text()').extract()[0].strip().replace("·", "").strip()
Testing shows that the entry-header class is globally unique, so we can extract with it:
title = response.xpath('//*[@class="entry-header"]/h1/text()').extract()[0]
#### Get the number of upvotes
praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]
contains(): matches when the attribute value contains the given string.
#### Get the number of favorites
The extracted text contains the characters "收藏" (favorites) as well as the count, so we pull the number out with a regex:
fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0].strip()
match_re = re.match(r".*?(\d+).*", fav_nums)
if match_re:
    # get the favorite count
    fav_nums = int(match_re.group(1))
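The regex step can be verified standalone; "2 收藏" is an assumed sample of the span text, not real crawl output:

```python
import re

fav_nums = "2 收藏"  # assumed sample of the stripped span text
# Non-greedy prefix, then capture the first run of digits.
match_re = re.match(r".*?(\d+).*", fav_nums)
fav_nums = int(match_re.group(1)) if match_re else 0
print(fav_nums)  # → 2
```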
#### Get the number of comments
comment_nums = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/a[2]/text()').extract()[0].strip()
#### Get the article's tags
This step needs de-duplication: the comment count appears both at the top of the article and at the end of the body (the red part in the figure below), so it would show up twice in the tag list. We remove the duplicated part with a condition.
tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
tag = ",".join(tag_list)
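The filter can be checked with a made-up extract() result; the tag names are illustrative, only the entry ending in "评论" (comments) matters:

```python
# Assumed raw extract() result: two real tags plus the duplicated comment count.
tag_list = ["职场", " 9 评论", "数据库"]
# Drop any entry whose stripped text ends with "评论".
tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
tag = ",".join(tag_list)
print(tag)  # → 职场,数据库
```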
This process is shown below
#### Get the article content
content = response.xpath('//*[@class="entry"]').extract()[0]
The complete code:
import re

def parse_detail(self, response):
    # get the title
    # //*[@id="post-112614"]/div[1]/h1/text() also works, but only for this article
    title = response.xpath('//*[@class="entry-header"]/h1/text()').extract()[0]
    # print('title', title)
    # re1_selector = response.xpath('//div[@class="entry-header"]/h1/text()')

    # get the publication time as a string
    create_date = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/text()') \
        .extract()[0].strip().replace("·", "").strip()

    # get the number of upvotes
    praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]

    # get the number of favorites; the text contains the characters "收藏" as well as the count
    fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0].strip()
    match_re = re.match(r".*?(\d+).*", fav_nums)
    if match_re:
        fav_nums = int(match_re.group(1))
    else:
        fav_nums = 0

    # get the number of comments
    comment_nums = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/a[2]/text()').extract()[0].strip()
    match_re = re.match(r".*?(\d+).*", comment_nums)
    if match_re:
        comment_nums = int(match_re.group(1))
    else:
        comment_nums = 0

    # get the article's tags, filtering out the duplicated comment-count entry
    tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
    tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
    tag = ",".join(tag_list)

    # get the article content
    content = response.xpath('//*[@class="entry"]').extract()[0]
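The same strip-and-match dance is repeated for both the favorite count and the comment count, so a small helper (a refactoring sketch, not part of the original tutorial) would keep parse_detail shorter:

```python
import re

def extract_num(text):
    """Return the first integer found in text, or 0 if there is none."""
    match_re = re.match(r".*?(\d+).*", text)
    return int(match_re.group(1)) if match_re else 0

# Inside parse_detail, both counts then become one-liners, e.g.:
# fav_nums = extract_num(response.xpath(...).extract()[0].strip())
print(extract_num("2 收藏"))  # → 2
print(extract_num("收藏"))    # → 0
```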