The syntax of XPath
XPath syntax: predicates
Create a Scrapy project
scrapy startproject ArticleSpider
Create a Scrapy spider
cd ArticleSpider
scrapy genspider jobbole blog.jobbole.com
How to use
You can copy an XPath expression directly from the Chrome developer tools (F12). Here is a way to test it.
Normally we would have to re-issue the request in PyCharm or another editor on every debugging run; instead, we can use the following method.
Suppose we want to crawl an article from Jobbole, at http://blog.jobbole.com/112614/
We will extract the article's title, publication time, number of upvotes, number of favorites, and number of comments. All fields are stored as strings.
Run the following on the command line:
scrapy shell http://blog.jobbole.com/112614/
After that, we can get the corresponding content with the commands below. response.xpath() returns a <class 'scrapy.selector.unified.SelectorList'>; calling extract() on it returns the matched contents as a list of strings.
In [1]: title = response.xpath('//*[@id="post-112614"]/div[1]/h1/text()')

In [2]: print(title)
[<Selector xpath='//*[@id="post-112614"]/div[1]/h1/text()' data='Why SQL is defeating NoSQL, and what the future of data is.'>]

In [3]: print(title.extract())
['Why SQL is defeating NoSQL, and what the future of data is.']
#### Get the publication time
Next, the publication time. This line is longer because I debugged it piece by piece before condensing it into one chain; the process is shown in the figure below.
create_date = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·", "").strip()
strip() removes the specified characters (whitespace by default) from both ends of a string.
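The cleanup chain is easy to sanity-check on its own; the raw string below is an assumed sample of what the page's date node looks like, not actual crawl output:

```python
# Assumed sample of the raw text node: leading whitespace, a date, a separator dot.
raw = "\r\n            2017/09/26 ·  "
# Strip whitespace from both ends, drop the dot, then strip the space it leaves behind.
create_date = raw.strip().replace("·", "").strip()
print(create_date)  # → 2017/09/26
```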
#### Get the title
Earlier the code simply copied the XPath path from Chrome, but that can only extract data from this one article: as you can verify, id="post-112614" exists only on this page and nowhere else, so we need to replace the copied XPath selector with a more general one.
create_date = response.xpath('//*[@id="post-112614"]/div[2]/p/text()').extract()[0].strip().replace("·", "").strip()
Testing shows that the entry-header class is globally unique, so we can extract with it:
title = response.xpath('//*[@class="entry-header"]/h1/text()').extract()[0]
#### Get the number of upvotes
praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]
contains(): matches when the attribute value contains the given string.
#### Get the number of favorites
The extracted text contains the characters "收藏" (favorites) as well as the count, so we pull the number out with a regex:
fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0].strip()
match_re = re.match(r".*?(\d+).*", fav_nums)
if match_re:
    # get the favorite count
    fav_nums = int(match_re.group(1))
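The regex step can be verified standalone; "2 收藏" is an assumed sample of the span text, not real crawl output:

```python
import re

fav_nums = "2 收藏"  # assumed sample of the stripped span text
# Non-greedy prefix, then capture the first run of digits.
match_re = re.match(r".*?(\d+).*", fav_nums)
fav_nums = int(match_re.group(1)) if match_re else 0
print(fav_nums)  # → 2
```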
#### Get the number of comments
comment_nums = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/a[2]/text()').extract()[0].strip()
#### Get the article's tags
This step needs de-duplication: the comment count appears both at the top of the article and at the end of the body (the red part in the figure below), so it would show up twice in the tag list. We remove the duplicated part with a condition.
tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
tag = ",".join(tag_list)
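The filter can be checked with a made-up extract() result; the tag names are illustrative, only the entry ending in "评论" (comments) matters:

```python
# Assumed raw extract() result: two real tags plus the duplicated comment count.
tag_list = ["职场", " 9 评论", "数据库"]
# Drop any entry whose stripped text ends with "评论".
tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
tag = ",".join(tag_list)
print(tag)  # → 职场,数据库
```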
This process is shown below
#### Get the article content
content = response.xpath('//*[@class="entry"]').extract()[0]
The complete code:
import re

def parse_detail(self, response):
    # get the title
    # //*[@id="post-112614"]/div[1]/h1/text() also works, but only for this article
    title = response.xpath('//*[@class="entry-header"]/h1/text()').extract()[0]
    # print('title', title)
    # re1_selector = response.xpath('//div[@class="entry-header"]/h1/text()')

    # get the publication time as a string
    create_date = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/text()') \
        .extract()[0].strip().replace("·", "").strip()

    # get the number of upvotes
    praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]

    # get the number of favorites; the text contains the characters "收藏" as well as the count
    fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0].strip()
    match_re = re.match(r".*?(\d+).*", fav_nums)
    if match_re:
        fav_nums = int(match_re.group(1))
    else:
        fav_nums = 0

    # get the number of comments
    comment_nums = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/a[2]/text()').extract()[0].strip()
    match_re = re.match(r".*?(\d+).*", comment_nums)
    if match_re:
        comment_nums = int(match_re.group(1))
    else:
        comment_nums = 0

    # get the article's tags, filtering out the duplicated comment-count entry
    tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
    tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
    tag = ",".join(tag_list)

    # get the article content
    content = response.xpath('//*[@class="entry"]').extract()[0]
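The same strip-and-match dance is repeated for both the favorite count and the comment count, so a small helper (a refactoring sketch, not part of the original tutorial) would keep parse_detail shorter:

```python
import re

def extract_num(text):
    """Return the first integer found in text, or 0 if there is none."""
    match_re = re.match(r".*?(\d+).*", text)
    return int(match_re.group(1)) if match_re else 0

# Inside parse_detail, both counts then become one-liners, e.g.:
# fav_nums = extract_num(response.xpath(...).extract()[0].strip())
print(extract_num("2 收藏"))  # → 2
print(extract_num("收藏"))    # → 0
```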