Scrapy Introductory Learning Notes (2) -- XPath and CSS Parsing, with a Web Page Parsing Example


I recently learned to use the Scrapy framework to write a crawler. A simple crawler fetches web pages, parses them, and then stores and analyzes the data, taking it all the way from page parsing to converted, stored records. In these notes I record the parsing techniques learned along the way, the Scrapy modules used, the more advanced material on distributed crawling, and the problems encountered together with their solutions, both as a summary and memo for myself and, I hope, as a help to others who need it.

This article focuses on the XPath and CSS web page parsing syntax and how to use both in Scrapy.

One. XPath overview and syntax

XPath is a W3C standard. Simply put, it lets us address each element of an HTML page as a path. The two main path separators are / and //: // selects all descendant elements under a path, while / selects only direct child elements. The specific syntax is as follows. Basic syntax:

question        # Select all child elements of question elements
/question       # Select the root element question
question/a      # Select all a child elements of question elements
//div           # Select all div elements, regardless of where they appear in the document
question//div   # Select all div descendant elements under question elements
                # (/ selects direct child elements; // selects all descendant elements)

question//span/text()  # Select the text values of all span elements under question elements
question//a/@href      # Select the href attribute values of all a elements under question elements;
                       # @ can be followed by any attribute name to get that attribute's value
  
Syntax with predicates:
/question/div[1]         # Select the first div child element of question; note that indexing starts at 1
/question/div[last()]    # Select the last div child element of question
/question/div[last()-1]  # Select the second-to-last div child element of question
//div[@lang]             # Select all div elements that have a lang attribute
//div[@lang='eng']       # Select all div elements whose lang attribute is eng
Other syntax:
/div/*             # Select all child elements of the div element
//*                # Select all elements
//div/a | //div/p  # Select all a and p elements under div elements
//span | //input   # Select all span and input elements in the document
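
To see these expressions in action, here is a minimal sketch that runs a few of them through Scrapy's Selector class; the HTML fragment and its values are made up purely for illustration:

from scrapy.selector import Selector

# A made-up HTML fragment to exercise a few of the XPath expressions above
html = """
<question>
    <div lang="eng">first</div>
    <div lang="fr">second</div>
    <span><a href="/q/1">link text</a></span>
</question>
"""

# // is used instead of a /question root path because Selector(text=...)
# wraps the fragment in <html><body>
sel = Selector(text=html)
print(sel.xpath('//div[@lang="eng"]/text()').extract())      # ['first']
print(sel.xpath('//question/div[last()]/text()').extract())  # ['second']
print(sel.xpath('//question//a/@href').extract())            # ['/q/1']
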
Two. CSS syntax overview

Anyone familiar with front-end development will recognize CSS selectors; jQuery, for example, performs its DOM operations through various CSS selector expressions. Here is a brief summary of the syntax for easy review. Basic query syntax:

*                # Select all nodes
#title           # Select the element with id title
.col-md          # Select all elements with class col-md
li a             # Select all a elements under li elements
ul + p           # Select the first p element immediately following a ul
div#title > ul   # Select the ul elements that are direct children of the div with id title
ul ~ p           # Select all p sibling elements that follow a ul

span#title::text    # Select the text value of the span element with id title
a.link::attr(href)  # Select the href attribute values of a elements with class link
Attribute-related query syntax:
a[title]                             # Select all a elements that have a title attribute
a[href='http://stackoverflow.com/'] # Select all a elements whose href attribute is http://stackoverflow.com/
a[href*='StackOverflow']            # Select all a elements whose href attribute contains StackOverflow
a[href^='https']                    # Select all a elements whose href attribute begins with https
a[href$='.jpg']                     # Select all a elements whose href attribute ends with .jpg
input[type=radio]:checked           # Select the checked radio input elements
Other syntax:
div:not(.title)   # Select all div elements whose class is not title
li:nth-child(3)   # Select the li elements that are the third child of their parent
tr:nth-child(2n)  # Select the even-numbered tr elements
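
Similarly, a minimal sketch running a few of the CSS selectors above through Scrapy's Selector; the fragment and its values are again made up:

from scrapy.selector import Selector

# A made-up HTML fragment to exercise a few of the CSS selectors above
html = """
<div id="title">
    <ul><li><a class="link" href="https://stackoverflow.com/a.jpg">one</a></li></ul>
    <p>para</p>
</div>
"""

sel = Selector(text=html)
print(sel.css('li a::text').extract_first())             # 'one'
print(sel.css('a.link::attr(href)').extract_first())     # 'https://stackoverflow.com/a.jpg'
print(sel.css("a[href$='.jpg']::attr(href)").extract())  # ['https://stackoverflow.com/a.jpg']
print(sel.css('ul + p::text').extract_first())           # 'para'
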
Three. Selector usage and a StackOverflow question list parsing example

Having introduced the parsing syntax above, let's look concretely at how it is used in Scrapy.

Scrapy provides a Selector class for web pages that can be constructed from a piece of HTML. The response passed to our parse method is an HtmlResponse object, which comes with two methods, css() and xpath(), that let us conveniently apply the two kinds of parsing described above.

Parsing with css() or xpath() returns a SelectorList object. To get the element text or attribute values from it, use the extract() or extract_first() method. extract_first() returns None when there is no matching value, whereas indexing the result list directly with [0] raises an error when it is empty, so the former is recommended.
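
A quick sketch of the difference, with a made-up fragment:

from scrapy.selector import Selector

sel = Selector(text='<p>hello</p>')

print(sel.css('p::text').extract_first())   # 'hello'
print(sel.css('h1::text').extract_first())  # None: no match, but no exception either
# sel.css('h1::text').extract()[0]          # this would raise IndexError on the empty list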

Below, both approaches are used to parse the StackOverflow question list and extract the data of each question.

The question page and its source code are as follows:

(Screenshot: question list items)

(Screenshot: question list source code)

(Screenshot: source code of a single question)

As you can see, the question list is located under the div element with id question; each question's layout is a div with class=question-summary, and the id below it identifies each question and can be joined with the domain name to reach the question's detail page. Below, we extract each question's title, vote count, answer count, view count, and tags. Tip: XPath and CSS each have their own strengths, but in practice the two are generally combined to achieve the most concise parsing; the official documentation suggests using CSS when selecting by class.

CSS parsing:

def parse_by_css(self, response):
    """
    There are 50 questions on each page; after traversal and parsing they are stored in the MongoDB database
    :param response:
    :return:
    """
    questions = response.css('div.question-summary')
    for question in questions:
        # The vote count is in the strong under the div with class=votes.
        # CSS gets a text or attribute value via ::text or ::attr(attribute name);
        # since there is at most one value, extract_first() is used directly to get the text value
        question_votes = question.css('.votes strong::text').extract_first()
        # The title is in the a element with class=question-hyperlink
        question_title = question.css('a.question-hyperlink::text').extract_first()
        # The answer count is in the strong element under the div whose class contains answered
        question_answers = question.css('.answered strong::text').extract_first()
        # The view count is the title attribute value of the element with class views
        question_views = question.css('.views::attr(title)').extract_first()
        # The tags are the text values of all a elements under the div with class tags;
        # since there may be more than one tag, extract() is used, returning a list of tag texts
        tags = question.css('.tags a::text').extract()
        pass
XPath parsing:

# The element explanations are the same as for the CSS parsing above; only the code is listed here, to avoid repetition
def parse_by_xpath(self, response):
    questions = response.xpath("//div[@class='question-summary']")
    for question in questions:
        question_votes = question.xpath(".//div[@class='votes']//strong/text()").extract_first()
        question_title = question.xpath(".//a[@class='question-hyperlink']/text()").extract_first()
        question_answers = question.xpath(".//div[contains(@class, 'answered')]/strong/text()").extract_first()
        question_views = question.xpath(".//div[contains(@class, 'views')]/@title").extract_first()
        tags = question.xpath(".//div[contains(@class, 'tags')]/a/text()").extract()
        pass
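
Before looking at the results, here is a minimal sketch of how such parse methods could be wired into a spider; the spider name and start URL below are illustrative assumptions, not from the original project, and parse_by_css / parse_by_xpath are assumed to be defined on the class as above:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    # Hypothetical wiring; name and start_urls are illustrative assumptions
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions']

    def parse(self, response):
        # Delegate to either of the parsing implementations shown above
        return self.parse_by_css(response)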

The parsed results are as follows:

(Screenshot: parsed results)

As you can see, the data is all there; the specific explanations are in the code comments, and more detailed Selector syntax can be found in Scrapy's official documentation. That concludes this brief introduction. Now that the data has been parsed, the next step is to package it into an Item for transfer and processing; the next article will cover Items, including defining fields, the Item Loader mechanism, data filtering, and the related processing methods.
