With the basic CSS selector syntax covered above, we can now implement field parsing with CSS selectors. The first field to parse is the title. Open the browser's web developer tools and locate the title in the page source.
Inspection shows that the title sits in the h1 node under the element with class="entry-header", so open the Scrapy shell to debug the selector.
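But I do not want the surrounding h1 tags, only the text inside them, so I append the ::text pseudo-element to the selector. Note that there are two colons. CSS selectors are really convenient to use. The remaining fields are parsed with CSS in the same way. The code is as follows: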
    # -*- coding: utf-8 -*-
    import re

    import scrapy


    class JobboleSpider(scrapy.Spider):
        name = 'jobbole'
        allowed_domains = ['blog.jobbole.com']
        start_urls = ['http://blog.jobbole.com/113549/']

        def parse(self, response):
            # Earlier XPath version, kept for comparison:
            # title = response.xpath('//p[@class="entry-header"]/h1/text()').extract()[0]
            # create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace("•", "").strip()
            # praise_numbers = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]
            # fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
            # match_re = re.match(r".*?(\d+).*", fav_nums)
            # if match_re:
            #     fav_nums = match_re.group(1)
            # comment_nums = response.xpath("//a[@href='#article-comment']/span").extract()[0]
            # match_re = re.match(r".*?(\d+).*", comment_nums)
            # if match_re:
            #     comment_nums = match_re.group(1)
            # content = response.xpath("//p[@class='entry']").extract()[0]

            # Extract fields with CSS selectors
            title = response.css(".entry-header h1::text").extract()[0]
            create_date = response.css(".entry-meta-hide-on-mobile::text").extract()[0].strip().replace("•", "").strip()
            praise_numbers = response.css(".vote-post-up h10::text").extract()[0]
            # The bookmark and comment labels mix a count with text, so keep only the digits
            fav_nums = response.css("span.bookmark-btn::text").extract()[0]
            match_re = re.match(r".*?(\d+).*", fav_nums)
            if match_re:
                fav_nums = match_re.group(1)
            comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
            match_re = re.match(r".*?(\d+).*", comment_nums)
            if match_re:
                comment_nums = match_re.group(1)
            content = response.css("p.entry").extract()[0]
            tags = response.css("p.entry-meta-hide-on-mobile a::text").extract()[0]
            pass
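The spider repeats the same regular-expression step for the bookmark and comment counts. As a minimal sketch, that step could be factored into a helper. The name extract_number, the conversion to int, the default of 0, and the sample strings are all assumptions for illustration; the article's code keeps the matched digits as a string and leaves the field unchanged when nothing matches.

```python
import re


def extract_number(text, default=0):
    """Return the first run of digits in text as an int, or default.

    Hypothetical helper: it uses the same pattern as the spider, but
    converts the match to int and falls back to a default value.
    """
    # The lazy ".*?" stops at the first digit, so "\d+" captures the
    # whole number rather than only its last digit.
    match = re.match(r".*?(\d+).*", text)
    return int(match.group(1)) if match else default


print(extract_number(" 2 bookmarks"))
print(extract_number(" 12 comments"))
print(extract_number("bookmark"))
```

With such a helper, a label that contains no digits yields the default instead of leaving the raw text in the field.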