Scrapy XPath Usage and Examples

Source: Internet
Author: User
Tags: xpath
The syntax of XPath

XPath syntax: predicates

Create a Scrapy project:

```shell
scrapy startproject ArticleSpider
```

Create a Scrapy spider:

```shell
cd ArticleSpider
scrapy genspider jobbole blog.jobbole.com
```
How to use

You can copy an XPath expression directly from the Chrome developer tools (F12). Here's a way to test it.

Normally, debugging in PyCharm or another editor means re-sending the request every time; instead we can use the method below.
Suppose we want to crawl an article from Jobbole, at http://blog.jobbole.com/112614/.
We will extract the article's title, publication date, up-vote count, favorites count, and comment count, storing every field as a string.

Run the following on the command line:

```shell
scrapy shell http://blog.jobbole.com/112614/
```

After that, we can get the corresponding content through the commands below. `response.xpath()` returns a `<class 'scrapy.selector.unified.SelectorList'>`, whose `extract()` method returns the matched content as a list of strings.

```python
In [1]: title = response.xpath('//*[@id="post-112614"]/div[1]/h1/text()')

In [2]: print(title)
[<Selector xpath='//*[@id="post-112614"]/div[1]/h1/text()' data='Why SQL is defeating NoSQL, and what the future of data is'>]

In [3]: print(title.extract())
['Why SQL is defeating NoSQL, and what the future of data is']
```
#### Get the publication date

Next, let's extract the date. This line is longer because I condensed the whole chain into one statement after debugging it step by step:

```python
create_date = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·", "").strip()
```

`strip()` removes the specified characters (whitespace by default) from both ends of a string.
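As a quick illustration of this `strip()`/`replace()` chain, here is how a raw date string is cleaned up (the sample string below is a hypothetical stand-in for what the site returns):

```python
# Hypothetical raw text as scraped from the entry-meta block
raw = "\r\n\n            2017/10/12 ·  "

# strip() removes the surrounding whitespace; replace() drops the "·"
# separator; the final strip() cleans up the space it leaves behind.
create_date = raw.strip().replace("·", "").strip()
print(create_date)  # 2017/10/12
```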
#### Get the title

The code above uses an XPath path I copied directly from Chrome, but it can only extract data from this one article: as you can verify, the `id="post-112614"` selector applies to this article alone and to no other, so we need to replace the XPath selector with something more general.

```python
create_date = response.xpath('//*[@id="post-112614"]/div[2]/p/text()').extract()[0].strip().replace("·", "").strip()
```


Testing shows that the `entry-header` class is globally unique, so we can extract the title with:

```python
title = response.xpath('//*[@class="entry-header"]/h1/text()').extract()[0]
```
#### Get the up-vote count

```python
praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]
```

`contains()` matches when the attribute value contains the given string.

#### Get the favorites count

The extracted text here contains the word for "favorites" as well as the number, so we use a regular expression to pull out the digits:

```python
import re

fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0].strip()
match_re = re.match(r'.*?(\d+).*', fav_nums)
if match_re:
    # get the favorites count
    fav_nums = int(match_re.group(1))
```
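The regex can be tested on its own; the sample string below is a hypothetical favorites label like the one the span contains:

```python
import re

# Hypothetical text from the bookmark-btn span: the count plus surrounding words
fav_text = " 8 收藏"

# '.*?(\d+).*' lazily skips leading characters, then captures the first
# run of digits; group(1) holds the captured number.
match_re = re.match(r'.*?(\d+).*', fav_text)
fav_nums = int(match_re.group(1)) if match_re else 0
print(fav_nums)  # 8
```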
#### Get the comment count

```python
comment_nums = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/a[2]/text()').extract()[0].strip()
```
#### Get the article's tags

This step involves de-duplication: the tag links appear both at the top of the article and at the end of the body, and the copy at the top also includes the comment-count link, so we use a condition to filter out the duplicated part:

```python
tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
tag_list = [element for element in tag_list if not element.strip().endswith('comments')]
tag = ','.join(tag_list)
```
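Here is the same filtering logic applied to a hypothetical tag list, where the scraped anchor texts include the comment-count link:

```python
# Hypothetical anchor texts scraped from the meta line: tags plus a comment link
tag_list = ['Databases', 'SQL', 'NoSQL', '2 comments']

# Drop any entry that is really the comment-count link
tag_list = [element for element in tag_list if not element.strip().endswith('comments')]
tag = ','.join(tag_list)
print(tag)  # Databases,SQL,NoSQL
```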


#### Get the article content

```python
content = response.xpath('//*[@class="entry"]').extract()[0]
```
Complete code:

```python
import re


def parse_detail(self, response):
    # Get the title
    # //*[@id="post-112614"]/div[1]/h1/text() also works, but only for this article
    title = response.xpath('//*[@class="entry-header"]/h1/text()').extract()[0]
    # re1_selector = response.xpath('//div[@class="entry-header"]/h1/text()')

    # Get the publication date as a string
    create_date = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·", "").strip()

    # Get the up-vote count
    praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]

    # Get the favorites text; it contains the word for "favorites" as well as the number
    fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0].strip()
    match_re = re.match(r'.*?(\d+).*', fav_nums)
    if match_re:
        # get the favorites count
        fav_nums = int(match_re.group(1))
    else:
        fav_nums = 0

    # Get the comment count
    comment_nums = response.xpath('//*[@class="entry-meta-hide-on-mobile"]/a[2]/text()').extract()[0].strip()
    match_re = re.match(r'.*?(\d+).*', comment_nums)
    if match_re:
        comment_nums = int(match_re.group(1))
    else:
        comment_nums = 0

    # Get the article's tags, dropping the comment-count link
    tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
    tag_list = [element for element in tag_list if not element.strip().endswith('comments')]
    tag = ','.join(tag_list)

    # Get the article content
    content = response.xpath('//*[@class="entry"]').extract()[0]
```
