1. Build the environment
Here I am using Anaconda, a Python distribution that bundles many third-party scientific-computing libraries, which makes installation easy; it also ships with the Spyder IDE.
Download Anaconda here
Python 2.7 is the recommended version.
Installing Scrapy under Anaconda is also very simple: open a command prompt (cmd), type conda install scrapy, and press "y" when prompted; the installation completes in no time.
That completes the environment setup.

2. A first look at Scrapy
The official Scrapy tutorial is also well worth a look.
The first question is how to create a new Scrapy project. From the command line, change into the directory where you want the project to live, then run: scrapy startproject newone
Open that directory and you will find a new folder; inside it you will see scrapy.cfg along with a newone package containing items.py, pipelines.py, settings.py, and a spiders directory.
The Item classes in items.py act as containers for the crawled data; an Item is structured much like a Python dictionary. Open items.py and you will see code like the following:

import scrapy

class NewoneItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

Here name = scrapy.Field() is a typical field declaration: it serves as the container for the crawled "name" data.
That covers the container for storing crawled data, but how do we actually crawl it? Before that, we need some knowledge of XPath.
XPath tutorials
A few simple examples of XPath usage:

/html/head/title: selects the title element inside the head element of the HTML document
/html/head/title/text(): selects the text content of that title element
//td: selects all td elements
//div[@class="mine"]: selects all div elements with the attribute class="mine"
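These expressions can be tried out even without Scrapy. Python's standard-library ElementTree supports a small XPath subset (descendant searches and attribute predicates, though not text()), which is enough to sketch the examples above; the HTML snippet below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document to exercise the expressions
html = (
    '<html><head><title>Example</title></head>'
    '<body><table><tr><td>a</td><td>b</td></tr></table>'
    '<div class="mine">hello</div></body></html>'
)
root = ET.fromstring(html)

# /html/head/title -> the title element (root is already <html>)
title = root.find("head/title")
print(title.text)  # the text content, i.e. what text() would select

# //td -> all td elements (ElementTree spells descendant search as .//)
tds = root.findall(".//td")
print(len(tds))

# //div[@class="mine"] -> all divs whose class attribute equals "mine"
mine = root.findall(".//div[@class='mine']")
print(mine[0].text)
```

Scrapy itself uses full XPath (via lxml), so everything in the list above works there without these limitations.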
To make selection convenient, Scrapy provides a Selector class. Within Scrapy, selectors have four basic methods:

xpath(): returns a list of selectors, each representing a node matched by the XPath expression
css(): returns a list of selectors, each representing a node matched by the CSS expression
extract(): returns the selected data as a unicode string
re(): returns a list of unicode strings extracted by applying a regular expression
Let's deepen our understanding with a practical example.
Use the following Web site as an example
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
Use the Scrapy shell to fetch the page and observe what XPath does.
On the command line, enter:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
After the shell loads, you get a Response object stored in the local variable response.
If you enter response.body, you will see the body of the response, that is, the content of the crawled page.
Similarly, you can enter response.headers to see the response headers.
Next, take a look at the web page itself.
Suppose we want to crawl the Computers, Programming, Languages, Python breadcrumb elements; right-click in the browser and inspect to see the corresponding HTML code.
From that code we can write:
sel.xpath('//a[@class="breadcrumb"]/text()').extract()
where sel is the Selector object.
XPath is flexible: different expressions can produce the same result.
Next we need to write the code that actually crawls the data. We will work through a practical example in the next post, so stay tuned.