Several libraries are commonly used to parse data out of HTML source. BeautifulSoup is a very popular HTML parsing library among programmers; it builds a Python object based on the structure of the HTML code and handles bad markup reasonably well, but it has one drawback: it is slow. lxml is a Python library for parsing XML (and HTML) whose API is based on ElementTree (lxml itself is not part of the Python standard library).
Scrapy has its own mechanism for extracting data. These are called selectors because they "select" parts of the HTML document through a specific XPath or CSS expression. XPath is a language for selecting nodes in XML documents, and it can also be used on HTML. CSS is a language for applying styles to HTML documents; it defines selectors that associate those styles with particular HTML elements.
Scrapy selectors are built on top of the lxml library, which means they are very similar in speed and parsing accuracy. Unlike the large lxml API, the selector API is small and simple, because lxml can be used for many tasks other than selecting parts of markup documents.
1. Using selectors
1.1 Constructing selectors
Scrapy selectors are instances of the Selector class, constructed by passing in either text or a TextResponse object. The parsing rules (XML or HTML) are chosen automatically based on the type of the input:
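from scrapy.selector import Selector
from scrapy.http import HtmlResponse
body = '<html><body><span>good</span></body></html>'
Build from response:
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
Selector(response=response).xpath('//span/text()').extract()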
The response object exposes a selector through its .selector attribute, which you can use at any time:
response.selector.xpath('//span/text()').extract()
1.2 Using Selectors
The examples in this section use a sample HTML document, the page loaded by the shell command below.
Open the shell:
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Once the shell loads, the response is available as the shell variable response, with a selector bound to its response.selector attribute.
To view the text in the title:
response.selector.xpath('//title/text()')
Because XPath and CSS queries on responses are so common, Scrapy provides two convenient shortcuts, response.xpath() and response.css():
response.xpath('//title/text()')
response.css('title::text')
The .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API makes it quick to select nested data. To extract the actual text data, call the .extract() method as follows:
response.css('img').xpath('@src').extract()
If you only want the first matching element, you can use .extract_first():
response.xpath('//div[@id="images"]/a/text()').extract_first()
Note that CSS selectors can use CSS3 pseudo-elements to select text or attribute nodes:
response.css('title::text').extract()
Now we'll get the base URL and some image links:
1. response.xpath('//base/@href').extract()
1. response.css('base::attr(href)').extract()
2. response.xpath('//a[contains(@href, "image")]/@href').extract()
2. response.css('a[href*=image]::attr(href)').extract()
3. response.xpath('//a[contains(@href, "image")]/img/@src').extract()
3. response.css('a[href*=image] img::attr(src)').extract()
1.3 Nesting selectors
The selection methods (.xpath() and .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors as well:
links = response.xpath('//a[contains(@href, "image")]')
links.extract()
for index, link in enumerate(links):
    args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
    print('Link number %d points to url %s and image %s' % args)
1.4 Using selectors with regular expressions
Selector also has a .re() method for extracting data with regular expressions. Unlike .xpath() or .css(), however, .re() returns a list of unicode strings, so you cannot construct nested .re() calls.
Here is an example that extracts the image names from the HTML code above:
response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
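To make the return type concrete, here is a minimal sketch that uses only Python's standard re module rather than Scrapy itself. The strings in link_texts are made up for illustration and stand in for the text nodes an XPath query might return. Applying the pattern to each string and flattening the captured groups gives a plain list of strings, which is why a further .re() call cannot be chained onto such a result.

```python
import re

# Hypothetical text nodes, standing in for what an XPath query could return.
link_texts = ["Name: My image 1", "Name: My image 2", "No label here"]

pattern = re.compile(r"Name:\s*(.*)")

# Apply the pattern to every string and flatten the captured groups
# into one list of plain strings (strings that do not match add nothing).
names = [match for text in link_texts for match in pattern.findall(text)]

print(names)
```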
1.5 Working with relative XPaths
If you are nesting selectors and use an XPath that starts with /, that XPath is absolute to the document, not relative to the selector you call it from.
For example, suppose you want to extract all the <p> elements inside <div> elements. First, you get all the <div> elements:
divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong because it extracts every <p> element in the document, not only those inside <div> elements:
for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
    print(p.extract())
This is the proper way to do it (note the dot prefixing the .//p XPath):
for p in divs.xpath('.//p'):  # extracts all <p> inside
    print(p.extract())
Another common case is to extract all direct <p> children:
for p in divs.xpath('p'):
    print(p.extract())
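The same distinction can be tried without Scrapy installed. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a small XPath subset, and runs against a made-up document. It compares a search over the whole document, a search relative to one <div>, and a search for direct children only.

```python
import xml.etree.ElementTree as ET

# Made-up document: one <p> outside the <div>, two inside it.
root = ET.fromstring(
    "<body>"
    "<p>outside</p>"
    "<div><p>direct child</p><section><p>nested</p></section></div>"
    "</body>"
)
div = root.find("div")

# Searching from the document root finds every <p>, including the outer one.
whole_document = [p.text for p in root.iter("p")]

# './/p' evaluated on the <div> finds only its descendants.
inside_div = [p.text for p in div.findall(".//p")]

# 'p' evaluated on the <div> finds only its direct children.
direct_children = [p.text for p in div.findall("p")]

print(whole_document, inside_div, direct_children)
```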
1.6 Using EXSLT extensions
Because they are built on top of lxml, Scrapy selectors also support some EXSLT extensions, and come with these pre-registered namespaces for use in XPath expressions:
Prefix | Namespace | Usage
re | http://exslt.org/regular-expressions | Regular expressions
set | http://exslt.org/sets | Set operations
Regular expressions:
The test() function, for example, can be useful when XPath's starts-with() or contains() are not sufficient.
For example, to select the links in list items whose "class" attribute ends with a digit:
from scrapy import Selector
doc = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""
sel = Selector(text=doc, type="html")
sel.xpath('//li//@href').extract()
sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()
The C library libxslt does not natively support EXSLT regular expressions, so lxml's implementation uses hooks into Python's re module. As a result, using regexp functions in XPath expressions may add a small performance penalty.
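The effect of re:test() can be approximated in plain Python, which may help when checking what a pattern is expected to match. This sketch is not how lxml implements the extension: it parses the same list with the standard library's xml.etree.ElementTree (a swapped-in parser) and applies Python's re module to each class attribute by hand.

```python
import re
import xml.etree.ElementTree as ET

doc = """
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
"""
root = ET.fromstring(doc)

# Same pattern as in the XPath expression: class ends with "item-<digit>".
ends_with_digit = re.compile(r"item-\d$")

# Keep the href of every link inside an <li> whose class matches.
hrefs = [
    a.get("href")
    for li in root.iter("li")
    if ends_with_digit.search(li.get("class", ""))
    for a in li.iter("a")
]

print(hrefs)
```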
Set operations:
Set operations can be handy for excluding parts of a document tree before extracting text elements.
For example, to extract microdata (sample content taken from http://schema.org/Product) using groups of itemscopes and their corresponding itemprops:
>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...   Customer reviews:
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
...   ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
...     print("current scope: %s" % scope.xpath('@itemtype').extract())
...     props = scope.xpath('''
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)''')
...     print("    properties: %s" % props.extract())
...     print("")
current scope: [u'http://schema.org/Product']
    properties: [u'name', u'aggregateRating', u'offers', u'description', u'review', u'review']

current scope: [u'http://schema.org/AggregateRating']
    properties: [u'ratingValue', u'reviewCount']

current scope: [u'http://schema.org/Offer']
    properties: [u'price', u'availability']

current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']

current scope: [u'http://schema.org/Review']
    properties: [u'name', u'author', u'datePublished', u'reviewRating', u'description']

current scope: [u'http://schema.org/Rating']
    properties: [u'worstRating', u'ratingValue', u'bestRating']
Here we first iterate over the itemscope elements, and for each one we look for all itemprop elements and exclude those that are themselves inside another itemscope.
1.7 Some XPath tips
1.7.1 Use text nodes in conditions with care
When you need to use the text content as an argument to an XPath string function, avoid using .//text() and use just . instead.
This is because the expression .//text() yields a collection of text elements, that is, a node-set. When a node-set is converted to a string, which happens when it is passed as an argument to a string function such as contains() or starts-with(), the result is the text of the first element only.
Example:
>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
Converting a node-set to a string:
>>> sel.xpath('//a//text()').extract()  # take a peek at the node-set
[u'Click here to go to the ', u'Next Page']
>>> sel.xpath("string(//a[1]//text())").extract()  # convert it to a string
[u'Click here to go to the ']
A node converted to a string, however, puts together its own text plus the text of all its descendants:
>>> sel.xpath("//a[1]").extract()  # select the first node
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").extract()  # convert it to a string
[u'Click here to go to the Next Page']
So, using the .//text() node-set won't select anything in this case:
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()
[]
But using . to mean the node works:
>>> sel.xpath("//a[contains(., 'Next Page')]").extract()
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
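The difference between "first text node" and "string value of the node" can also be seen with the standard library alone. The sketch below uses xml.etree.ElementTree instead of Scrapy (a deliberate swap, so no XPath string functions are involved) and mimics the two conversions by hand on the same <a> element.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring(
    '<a href="#">Click here to go to the <strong>Next Page</strong></a>'
)

# All text nodes under <a>, in document order (like .//text()).
text_nodes = list(a.itertext())

# Converting a node-set to a string keeps only its first node.
first_text_node = text_nodes[0]

# Converting the element itself to a string joins all descendant text.
string_value = "".join(text_nodes)

found_in_first_node = "Next Page" in first_text_node
found_in_string_value = "Next Page" in string_value

print(found_in_first_node, found_in_string_value)
```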
1.7.2 Beware of the difference between //node[1] and (//node)[1]
//node[1] selects all the nodes that occur first under their respective parents.
(//node)[1] selects all the nodes in the document, and then takes only the first of them.
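A small experiment shows the two readings side by side. This sketch uses the standard library's xml.etree.ElementTree rather than Scrapy. ElementTree's limited XPath subset has no parenthesised form, so the second case is emulated by indexing the result list in Python. The document is made up.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<body>"
    "<ul><li>a1</li><li>a2</li></ul>"
    "<ul><li>b1</li><li>b2</li></ul>"
    "</body>"
)

# Like //li[1]: every <li> that is the first <li> under its own parent.
first_under_each_parent = [li.text for li in root.findall(".//li[1]")]

# Like (//li)[1]: collect all <li> in the document, then take the first.
first_in_document = root.findall(".//li")[0].text

print(first_under_each_parent, first_in_document)
```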
1.7.3 When querying by class, consider using CSS
Because an element can have multiple CSS classes, the XPath way to select elements by class is rather verbose:
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
If you use @class='someclass' you may miss elements that also have other classes, and if you use contains(@class, 'someclass') to make up for that, you may end up with extra elements whose class name merely contains the string someclass.
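To see why the verbose expression is needed, the three tests can be written out in plain Python. The function names below are hypothetical helpers introduced only for this illustration; each one mirrors one of the XPath conditions discussed above, applied to the raw value of a class attribute.

```python
def equals_class(class_attr, name):
    # Mirrors @class='someclass': exact match on the whole attribute.
    return class_attr == name


def contains_substring(class_attr, name):
    # Mirrors contains(@class, 'someclass'): plain substring test.
    return name in class_attr


def has_class_token(class_attr, name):
    # Mirrors contains(concat(' ', normalize-space(@class), ' '), ' someclass '):
    # collapse whitespace, pad with spaces, then look for the padded name.
    normalized = " ".join(class_attr.split())
    return (" " + name + " ") in (" " + normalized + " ")


class_values = ["shout", "hero shout", "shoutbox"]

exact = [equals_class(value, "shout") for value in class_values]
substring = [contains_substring(value, "shout") for value in class_values]
token = [has_class_token(value, "shout") for value in class_values]

print(exact, substring, token)
```

Only the token test accepts "hero shout" while rejecting "shoutbox", which is the behaviour a CSS class selector gives directly.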
As it turns out, Scrapy selectors can be chained, so in most cases you can select by class using CSS and then switch to XPath when needed:
>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').extract()
[u'2014-07-23 19:00']
This is much cleaner than the verbose XPath expression shown above.
2. Selector in detail
class scrapy.selector.Selector(response=None, text=None, type=None)
A Selector wraps a response so that specific parts of its content can be selected.
The main parameters of Selector are:
response: an HtmlResponse or XmlResponse object.
text: a unicode string or utf-8 encoded text, used when no response is available. Using text and response together is undefined behavior.
type: the selector type, which can be "html", "xml" or None (the default). If type is None, the selector automatically chooses the best type based on the response type, or defaults to "html" when text is used. The type is determined from the response as follows:
"xml" for XmlResponse
"html" for HtmlResponse
"html" for any other type
If type is set, the selector type is forced and no detection occurs.
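The type selection rule can be summarized in a few lines. This is only a sketch of the rule as described here, not Scrapy's source code: the Fake... classes and the pick_selector_type function are hypothetical names that stand in for the real response classes and the selector's internal logic.

```python
class FakeResponse(object):
    """Hypothetical stand-in for a generic response class."""


class FakeHtmlResponse(FakeResponse):
    """Hypothetical stand-in for an HTML response."""


class FakeXmlResponse(FakeResponse):
    """Hypothetical stand-in for an XML response."""


def pick_selector_type(forced_type=None, response=None):
    # An explicitly given type always wins; no detection happens.
    if forced_type is not None:
        return forced_type
    # Otherwise only XML responses select the "xml" rules.
    if isinstance(response, FakeXmlResponse):
        return "xml"
    # HTML responses, other responses, and plain text all fall back to "html".
    return "html"


print(pick_selector_type(response=FakeXmlResponse()))
print(pick_selector_type(response=FakeHtmlResponse()))
print(pick_selector_type(forced_type="xml", response=FakeHtmlResponse()))
```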
Main methods:
xpath(query): finds the nodes matching the XPath query and returns the result as a SelectorList instance with all elements flattened. The list elements implement the Selector interface too. query is a string containing the XPath query to apply.
css(query): applies the given CSS selector and returns a SelectorList instance. Behind the scenes, CSS queries are translated into XPath queries using the cssselect library and run through the .xpath() method.
extract(): serializes the matched nodes and returns them as a list of unicode strings. Percent-encoded content is unquoted.
re(regex): applies the given regex and returns a list of unicode strings with the matches. regex can be either a compiled regular expression or a string, which will be compiled to a regular expression using re.compile(regex).
register_namespace(prefix, uri): registers the given namespace for use in this Selector. Without registering namespaces, you cannot select or extract data from non-standard namespaces.
remove_namespaces(): removes all namespaces, so the document can be traversed with namespace-less XPaths.
__nonzero__(): returns True if any real content is selected, and False otherwise. In other words, the boolean value of a Selector is given by the content it selects.
SelectorList objects
class scrapy.selector.SelectorList
The SelectorList class is a subclass of the built-in list class that provides a few additional methods:
xpath(query): calls the .xpath() method on each element of the list and returns the results flattened as another SelectorList.
css(query): calls the .css() method on each element of the list and returns the results flattened as another SelectorList.
extract(): calls the .extract() method on each element of the list and returns the results flattened as a list of unicode strings.
re(regex): calls the .re() method on each element of the list and returns the results flattened as a list of unicode strings.
__nonzero__(): returns True if the list is not empty, and False otherwise.
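The "call on each element and flatten" pattern is easy to sketch with a list subclass. The NodeList class below is hypothetical and is not Scrapy's SelectorList. It wraps elements from the standard library's xml.etree.ElementTree to show how returning the same list type from a query method makes chained calls possible, and how an empty result is falsy.

```python
import xml.etree.ElementTree as ET


class NodeList(list):
    """Hypothetical list subclass that applies a query to every element."""

    def findall(self, path):
        # Run the query on each element and flatten all hits
        # into another NodeList, so calls can be chained.
        return NodeList(hit for node in self for hit in node.findall(path))

    def texts(self):
        # Return plain strings; nothing further can be chained on these.
        return [node.text for node in self]


root = ET.fromstring(
    "<body><div><a>one</a><a>two</a></div><div><a>three</a></div></body>"
)

divs = NodeList(root.findall("div"))
links = divs.findall("a")        # flattened across both <div> elements
missing = divs.findall("span")   # no matches: an empty, falsy list

print(links.texts(), bool(missing))
```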
Selector examples on XML responses
We assume there is already a Selector instantiated with an XmlResponse object, like this:
sel = Selector(xml_response)
Select all the <product> elements and return a SelectorList:
sel.xpath("//product")
Extract all the prices from a Google Base XML feed, which requires registering a namespace:
sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").extract()
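The need for a prefix-to-URI mapping is not specific to Scrapy. As a rough stand-in, the sketch below uses the standard library's xml.etree.ElementTree, where the mapping is passed to findall() as a dictionary instead of being registered on a selector. The feed content is invented for illustration; only the namespace URI comes from the example above.

```python
import xml.etree.ElementTree as ET

# Invented feed using the Google Base namespace for the price element.
feed = """
<rss xmlns:g="http://base.google.com/ns/1.0">
  <channel>
    <item><title>Widget</title><g:price>10.00 USD</g:price></item>
    <item><title>Gadget</title><g:price>25.50 USD</g:price></item>
  </channel>
</rss>
"""
root = ET.fromstring(feed)

# Without the namespace, the plain name matches nothing.
unqualified = root.findall(".//price")

# With the prefix mapped to its URI, the namespaced elements are found.
namespaces = {"g": "http://base.google.com/ns/1.0"}
prices = [el.text for el in root.findall(".//g:price", namespaces)]

print(unqualified, prices)
```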
Removing namespaces
When working on scraping projects, it is often convenient to get rid of namespaces altogether and work with element names only, which makes it easier to write simple, practical XPaths. You can use the Selector.remove_namespaces() method for this.
Let's look at an example that illustrates this with the GitHub blog's Atom feed.
First, we open the shell with the URL we want to scrape:
scrapy shell https://github.com/blog.atom
Once in the shell, we can try selecting all <link> objects and see that it doesn't work (because the Atom XML namespace obscures those nodes):
>>> response.xpath("//link")
[]
But once we call the Selector.remove_namespaces() method, all nodes can be accessed directly by their names:
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
 <Selector xpath='//link' data=u'<link xmlns="http://www.w3.org/2005/Atom'>,
 ...
You may wonder why namespace removal is not always performed by default and has to be called manually. There are two reasons, in order of relevance:
1. Removing namespaces requires iterating over and modifying all the nodes in the document, which is a reasonably expensive operation to perform for every document that Scrapy crawls.
2. In some cases namespaces are actually needed, because some element names clash between namespaces. These cases are very rare, though.
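The cost mentioned in the first reason is visible in even a naive implementation. The sketch below is not Scrapy's remove_namespaces(). It uses the standard library's xml.etree.ElementTree on a made-up Atom-like document, and the strip_namespaces helper is a hypothetical name. It has to visit every node in the tree and rewrite its tag.

```python
import xml.etree.ElementTree as ET

# Made-up feed with a default namespace, as in an Atom document.
atom = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://example.com/"/>
  <entry><link href="https://example.com/post-1"/></entry>
</feed>
"""
root = ET.fromstring(atom)

# The default namespace hides the elements from a plain name query.
before = root.findall(".//link")


def strip_namespaces(element):
    # Visit every node in the tree and drop the "{uri}" part of its tag.
    for node in element.iter():
        if isinstance(node.tag, str) and node.tag.startswith("{"):
            node.tag = node.tag.split("}", 1)[1]


strip_namespaces(root)
after = [link.get("href") for link in root.findall(".//link")]

print(before, after)
```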
How XPath handles documents with namespaces:
If the XPath expression does not include a prefix, it is assumed that the namespace URI is the empty namespace. If your XML includes a default namespace, you must still add a prefix and namespace URI to the XmlNamespaceManager; otherwise, you will not get any nodes selected.
In other words, if an XPath expression does not specify a namespace prefix, its namespace is empty. If the XML document being parsed has a default namespace, you must still add a prefix for that namespace and add the namespace URI to the XmlNamespaceManager; otherwise the query returns no results.
For this situation, Scrapy provides two methods: register_namespace(prefix, uri) and remove_namespaces().