I recently learned to use the Scrapy framework to write a crawler. A simple crawler fetches pages from the web, parses them, and then stores and analyzes the data, going from page parsing all the way to data conversion and storage. Here I record the parsing techniques learned along the way, the Scrapy modules used, the knowledge points involved in advancing to a distributed crawler, and the problems encountered together with their solutions, as a summary and memo that I hope will also help others who need it.
This article focuses on the XPath and CSS syntaxes for parsing web pages and on how to use them in Scrapy.
one. XPath overview and syntax
XPath is a W3C standard. Simply put, it lets us access each element of an HTML page as a path. The two main operators are / and //: the former selects direct child elements under the path, and the latter selects all descendant elements under the path. The specific syntax is as follows.
Basic syntax:
question # select all child nodes of the question element
/question # select the root element question
question/a # select all a child elements under the question element
//div # select all div elements, regardless of where they appear in the document
question//div # select all div descendant elements under the question element (/ selects only direct children, // selects all descendants)
question//span/text() # select the text value of all span elements under the question element
question//a/@href # select the href attribute value of all a elements under the question element; @ can be followed by any attribute name to get its value
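As a quick check of these basic expressions, the snippet below runs a few of them through Scrapy's Selector class (introduced in section three). This is a minimal sketch, not from the original article; the HTML fragment and printed values are invented for illustration.

from scrapy.selector import Selector

html = """
<question>
  <a href="https://example.com/q/1">first link</a>
  <div><span>42 votes</span></div>
</question>
"""

sel = Selector(text=html)
print(sel.xpath('//div').extract())                           # every div, wherever it appears
print(sel.xpath('//question//span/text()').extract_first())   # '42 votes'
print(sel.xpath('//question//a/@href').extract_first())       # 'https://example.com/q/1'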
Syntax with qualifiers:
/question/div[1] # select the first div child element of question; note that indexing starts at 1
/question/div[last()] # select the last div child element of question
/question/div[last()-1] # select the second-to-last div child element of question
//div[@lang] # select all div elements that have a lang attribute
//div[@lang='eng'] # select all div elements whose lang attribute is eng
Other syntax:
/div/* # select all child nodes of the div element
//* # select all elements in the document
//div/a | //div/p # select all a and p elements under div elements
//span | //input # select all span and input elements in the document
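A small sketch of the qualifier and union syntax above, again on an invented fragment (not from the original article); note especially that XPath indexing starts at 1.

from scrapy.selector import Selector

html = """
<body>
  <div lang="eng">one</div>
  <div lang="fr">two</div>
  <div>three</div>
  <span>four</span><input value="five"/>
</body>
"""

sel = Selector(text=html)
print(sel.xpath('//div[1]/text()').extract_first())            # 'one' (indexing starts at 1)
print(sel.xpath('//div[last()]/text()').extract_first())       # 'three'
print(sel.xpath('//div[last()-1]/text()').extract_first())     # 'two'
print(len(sel.xpath('//div[@lang]')))                          # 2 divs have a lang attribute
print(sel.xpath("//div[@lang='eng']/text()").extract_first())  # 'one'
print(len(sel.xpath('//span | //input')))                      # 2: all span and input elements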
two. CSS syntax overview
Anyone familiar with front-end work will know CSS selectors; jQuery, for example, performs DOM operations through the various CSS selector syntaxes. Here is a brief summary of the syntax for easy review.
Basic query syntax:
* # select all nodes
#title # select the element with id title
.col-md # select all elements whose class contains col-md
li a # select all a elements under li elements
ul + p # select the first p element immediately after a ul
div#title > ul # select the direct ul child elements of the div with id title
ul ~ p # select all p sibling elements that follow a ul
span#title::text # select the text value of the span element with id title
a.link::attr(href) # select the href attribute value of a elements with class link
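The same Selector class accepts CSS through its css() method; note that ::text and ::attr() are Scrapy extensions rather than standard CSS. A minimal sketch on an invented fragment (not from the original article):

from scrapy.selector import Selector

html = """
<div id="title">
  <ul><li><a class="link" href="/page">go</a></li></ul>
</div>
<span id="caption">hello</span>
"""

sel = Selector(text=html)
print(sel.css('li a::text').extract_first())          # 'go'
print(sel.css('a.link::attr(href)').extract_first())  # '/page'
print(sel.css('span#caption::text').extract_first())  # 'hello'
print(len(sel.css('div#title > ul')))                 # 1 direct ul child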
Attribute query syntax:
a[title] # select all a elements that have a title attribute
a[href='http://stackoverflow.com/'] # select all a elements whose href attribute is http://stackoverflow.com/
a[href*='stackoverflow'] # select all a elements whose href attribute contains stackoverflow
a[href^='https'] # select all a elements whose href attribute value begins with https
a[href$='.jpg'] # select all a elements whose href attribute value ends with .jpg
input[type=radio]:checked # select the selected radio elements
Other syntax:
div:not(.title) # select all div elements whose class is not title
li:nth-child(3) # select the third li element
tr:nth-child(2n) # select the even-numbered tr elements
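A sketch of the attribute and structural selectors, with invented links (not from the original article):

from scrapy.selector import Selector

html = """
<ul>
  <li><a href="https://stackoverflow.com/q/1" title="q1">one</a></li>
  <li><a href="/img/photo.jpg">two</a></li>
  <li><a href="http://example.com">three</a></li>
</ul>
"""

sel = Selector(text=html)
print(sel.css('a[title]::text').extract())                  # ['one'], has a title attribute
print(sel.css("a[href^='https']::text").extract())          # ['one'], href starts with https
print(sel.css("a[href$='.jpg']::text").extract())           # ['two'], href ends with .jpg
print(sel.css("a[href*='stackoverflow']::text").extract())  # ['one'], href contains stackoverflow
print(sel.css('li:nth-child(3) a::text').extract_first())   # 'three', the third li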
three. Selector syntax introduction and a StackOverflow question-list parsing example
Having introduced the parsing syntax above, let us now look concretely at how it is used in Scrapy.
Scrapy provides a Selector class for web pages, which can be constructed from a piece of HTML code. The response passed back to our parse method is an HtmlResponse object that comes with the two methods css() and xpath(), letting us conveniently parse with either of the two syntaxes above.
Parsing through css() or xpath() returns a SelectorList object; to get the element, text, or attribute value from it, use the extract() or extract_first() method. extract_first() returns None when there is no match, whereas indexing the extract() result with [0] raises an error when the list is empty, so extract_first() is recommended.
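A quick sketch of that difference; the h1 query below deliberately matches nothing.

from scrapy.selector import Selector

sel = Selector(text='<p>hello</p>')
print(sel.css('p::text').extract_first())    # 'hello'
print(sel.css('h1::text').extract_first())   # None: no match, no exception
# sel.css('h1::text').extract()[0]           # would raise IndexError instead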
Below, both approaches are used to parse the StackOverflow question list and extract the data of each question.
The question page and its source code are as follows:
[Image: question list page]
[Image: question list source code]
[Image: source code of a specific question]
As you can see, the question list sits under the div element with id questions. Each question is laid out in a div with class=question-summary, and its id identifies the question; joined with the domain name, it yields the URL of the question's detail page. Below we parse each question's title, number of votes, number of answers, number of views, and tags.
Tip: XPath and CSS are demonstrated separately below, but in practice the two are usually combined to achieve the most concise parsing (a short mixed-syntax sketch follows the two examples). The official documentation suggests parsing with CSS when selecting by class.
CSS parsing:
def parse_by_css(self, response):
    """
    Each page holds 50 questions; after traversal and parsing they are stored in MongoDB.
    :param response:
    :return:
    """
    questions = response.css('div.question-summary')
    for question in questions:
        # The vote count is in the strong under the div with class=votes.
        # CSS gets a text or attribute value via ::text or ::attr(name);
        # since there is at most one value here, extract_first() is used directly.
        question_votes = question.css('.votes strong::text').extract_first()
        # The title is in the a element with class=question-hyperlink
        question_title = question.css('a.question-hyperlink::text').extract_first()
        # The answer count is in the strong element under the div whose class contains answered
        question_answers = question.css('.answered strong::text').extract_first()
        # The view count is the title attribute value of the element with class views
        question_views = question.css('.views::attr(title)').extract_first()
        # The tags are the text values of all a elements under the div with class tags;
        # there may be several, so extract() is used and returns a list of tag texts
        tags = question.css('.tags a::text').extract()
        pass
XPath parsing:
# The element explanations are the same as in the CSS version above; only the code is listed here, without repeating the comments
def parse_by_xpath(self, response):
    questions = response.xpath("//div[@class='question-summary']")
    for question in questions:
        question_votes = question.xpath(".//div[@class='votes']//strong/text()").extract_first()
        question_title = question.xpath(".//a[@class='question-hyperlink']/text()").extract_first()
        question_answers = question.xpath(".//div[contains(@class, 'answered')]/strong/text()").extract_first()
        question_views = question.xpath(".//div[contains(@class, 'views')]/@title").extract_first()
        tags = question.xpath(".//div[contains(@class, 'tags')]/a/text()").extract()
        pass
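As the tip above mentioned, the two syntaxes can also be mixed: css() and xpath() both return a SelectorList, so the calls chain freely. A hedged sketch, not from the original code:

def parse_mixed(self, response):
    for question in response.css('div.question-summary'):  # class match: CSS is the concise choice
        # switch to XPath on the sub-selector for the rest of the path
        question_votes = question.css('.votes').xpath('.//strong/text()').extract_first()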
The parsing results are as follows:
As you can see, the data is all there; the specific explanations are in the comments, and more detailed Selector syntax can be found in Scrapy's official documentation. That concludes this article's brief introduction. With the data parsed, it next needs to be packaged into an Item for transfer and processing; the next article will cover Item-related content, including the creation of its fields, the Item Loader mechanism, data filtering, and the related processing methods.