1. Installing Scrapy on Windows: open a cmd command line, cd to Python's Scripts directory, then run the pip install command (pip install scrapy).
Scrapy then appears in the PyCharm IDE:
Running the scrapy command in cmd, however, fails with an error:
Workaround:
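Create a new sitecustomize.py in Python's lib\site-packages folder with the following two lines (this applies to Python 2 only; sys.setdefaultencoding does not exist in Python 3):
import sys
sys.setdefaultencoding('gb2312')
Run scrapy again in cmd; this time it succeeds: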
2. Scrapy selectors, XPath and CSS: a selector picks out part of an HTML document using an XPath or CSS expression.
(1) XPath is a language for selecting nodes and finding information in an XML document; it can also be used with HTML. XPath can be used to traverse the elements and attributes of a document.
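XPath (as of version 2.0) contains more than 100 built-in functions for string values, numeric values, date and time comparisons, node and QName processing, sequence processing, logical values, and more. Scrapy's selectors are built on lxml, which implements XPath 1.0, so only the smaller 1.0 function set is available there.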
(2) In XPath there are 7 types of nodes: element, attribute, text, namespace, processing instruction, comment, and document node (also called the root node). An XML document is treated as a tree of nodes, and the root of the tree is called the document node or root node.
Take a simple XML file as an example:
<superhero>
  <class>
    <name lang="en">Tony Stark</name>
    <alias>Iron Man</alias>
    <sex>male</sex>
    <birthday>1969</birthday>
    <age>47</age>
  </class>
  <class>
    <name lang="en">Peter Benjamin Parker</name>
    <alias>Spider Man</alias>
    <sex>male</sex>
    <birthday>unknown</birthday>
    <age>unknown</age>
  </class>
  <class>
    <name lang="en">Steven Rogers</name>
    <alias>Captain America</alias>
    <sex>male</sex>
    <birthday>19200704</birthday>
    <age>96</age>
  </class>
</superhero>
(3) XPath uses path expressions to select nodes in an XML document. Common path expressions are as follows:
nodename: Selects all child nodes with this name
/: Selects from the root node
//: Selects matching nodes anywhere in the document below the current node, regardless of their location
.: Selects the current node
..: Selects the parent of the current node
@: Selects attributes
*: Matches any element node
@*: Matches any attribute node
node(): Matches nodes of any type
(4) How the XPath selector collects data:
(5) Nested selector:
3. CSS selectors (Cascading Style Sheets): a CSS rule consists of two main parts, a selector and one or more declarations:
selector {declaration1; declaration2; ... declarationN}
CSS selectors, with examples:
Selector             Example            Description
.class               .intro             Selects all elements with class="intro"
#id                  #firstname         Selects the element with id="firstname"
*                    *                  Selects all elements
element              p                  Selects all <p> elements
element,element      div,p              Selects all <div> elements and all <p> elements
element element      div p              Selects all <p> elements inside <div> elements
[attribute]          [target]           Selects all elements with a target attribute
[attribute=value]    [target=_blank]    Selects all elements with target="_blank"
4.CSS Selector Test:
5. Additional selectors:
Selectors also have a .re() method, which extracts data using regular expressions. Unlike .xpath() or .css(), however, .re() returns a list of Unicode strings and not selectors, so .re() calls cannot be nested.
2017.07.26 Python web crawler: the Scrapy crawler framework