(It is also recommended that everyone read the official website's tutorial: tutorial address)
We will use the dmoz.org site as a small target to demonstrate the workflow.
First, you have to answer one question.
Q: How many steps does it take to turn a website into a crawler?
The answer is simple, four steps:
New Project (Project): create a new crawler project
Clear Goal (Items): define the target you want to crawl
Make Spider (Spider): write the spider and start crawling pages
Store Content (Pipeline): design a pipeline to store the crawled content
OK, now that the basic process is clear, we can go through it step by step.
1. Create a New Project (Project)
Hold down the SHIFT key, right-click in an empty directory, select "Open command window here", and enter the command:
scrapy startproject tutorial
Here tutorial is the project name.
You can see that a tutorial folder is created with the following directory structure:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
Here is a brief introduction to the role of each file:
scrapy.cfg: the project's configuration file
tutorial/: the project's Python module; you will import your code from here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory where the spiders are stored
2. Define the Goal (Item)
In Scrapy, items are containers used to hold the crawled content. They work somewhat like Python dictionaries (dict), but provide some additional protection against common mistakes.
In general, an item is created by subclassing the scrapy.item.Item class, and its attributes are defined with scrapy.item.Field objects (you can think of this as something like an ORM mapping).
Next, we start to build the item model.
First of all, what we want is: the title, the link (URL), and the description.
Modify the items.py file in the tutorial directory and add our own class after the original one.
Because we want to capture content from the dmoz.org website, we can name it DmozItem:
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
At first this may seem a bit hard to grasp, but defining these items lets you know exactly what your item contains when you use the other components.
You can simply think of an item as an encapsulated class object.
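To make the "dict with extra protection" idea concrete, here is a small sketch (assuming the DmozItem defined above and a Python 2 session, as used throughout this tutorial) of how an item behaves:

from tutorial.items import DmozItem

item = DmozItem(title='Example Book', link='http://example.com')   # hypothetical values
print item['title']                    # dict-style access prints: Example Book
item['desc'] = 'a sample description'  # assigning a declared field works fine

# Assigning a field that was never declared is exactly the kind of mistake Item guards against:
# item['author'] = 'someone'   # raises KeyError, unlike a plain dict, which would silently accept it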
3. Making the Spider (Spider)
Making a spider involves two steps overall: first crawl, then extract.
In other words, first you fetch the entire content of the page, and then you take out the parts that are useful to you.
3.1 Crawl
A Spider is a class you write yourself to scrape information from a domain (or a group of domains).
It defines the list of URLs to download, how to follow links, and how to parse page content to extract items.
To build a Spider, you must subclass scrapy.spider.BaseSpider and define three mandatory attributes:
name: the identifier of the spider. It must be unique; you must give each spider a different name.
start_urls: the list of URLs to crawl. The spider starts crawling from these, so the first pages downloaded will be these URLs; the other URLs to crawl are derived from these starting ones.
parse(): the parsing method. When it is called, the Response object returned for each URL is passed in as its only argument. It is responsible for parsing the response, matching the crawled data (resolving it into items), and following more URLs.
Here you can borrow the ideas from the breadth-first crawler tutorial to help understanding: [Java] Zhihu Chin Part 5: using the HttpClient toolkit and a breadth-first crawler.
That is, store the URLs and gradually spread outward from the starting point, crawling every eligible page URL and storing it so that crawling can continue.
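The breadth-first idea described above can be sketched in plain Python (this is only a conceptual illustration, not Scrapy code; fetch, extract_links and in_scope are hypothetical helpers):

from collections import deque

def breadth_first_crawl(start_urls, fetch, extract_links, in_scope):
    seen = set(start_urls)
    queue = deque(start_urls)          # URLs waiting to be crawled
    while queue:
        url = queue.popleft()
        page = fetch(url)              # download the page
        yield url, page                # hand the stored page over for later processing
        for link in extract_links(page):
            # only follow links that are in scope (e.g. under allowed_domains) and not seen yet
            if in_scope(link) and link not in seen:
                seen.add(link)
                queue.append(link)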
Here we write our first spider, named dmoz_spider.py, and save it in the tutorial\spiders directory.
The dmoz_spider.py code is as follows:
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
allowed_domains is the domain scope of the search; it is the spider's restricted area, specifying that the spider only crawls pages under this domain name.
As you can see from the parse function, the second-to-last segment of the URL (e.g. Books or Resources) is taken as the filename under which the page is stored.
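As a quick check of that filename logic, here is what split does on one of our start URLs (plain Python; it assumes the trailing slash is present, as in start_urls above):

url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
print url.split("/")        # ['http:', '', 'www.dmoz.org', ..., 'Books', '']
print url.split("/")[-2]    # prints: Books  -> used as the filename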
Then run it: in the tutorial directory, hold down SHIFT, right-click, choose "Open command window here", and enter:
scrapy crawl dmoz
The results of the run are shown in the figure:
An error occurred:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 1: ordinal not in range(128)
So running our very first Scrapy project ends with an error.
It looks like an encoding problem; a quick Google search turns up the solution:
Under Python's Lib\site-packages folder, create a new sitecustomize.py containing:
import sys
sys.setdefaultencoding('gb2312')
Run it again: OK, problem solved. Look at the results:
The last line, INFO: Closing spider (finished), indicates that the spider ran successfully and shut itself down.
The lines containing [dmoz] correspond to the results of our spider's run.
You can see that there is a log line for each URL defined in start_urls.
Do you remember our start_urls?
http://www.dmoz.org/Computers/Programming/Languages/Python/Books
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources
Because these URLs are the starting pages and nothing refers to them (no referrers), you will see (referer: <None>) at the end of each of their log lines.
Under the effect of the parse method, two files are created, Books and Resources, containing the page content of the two URLs.
So what actually happened behind the scenes just now?
First, Scrapy creates a scrapy.http.Request object for every URL in the spider's start_urls attribute and assigns the spider's parse method to each Request as its callback function.
These Requests are then scheduled and executed; the resulting scrapy.http.Response objects are returned and fed back to the spider through the parse() method.
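Roughly speaking, this is as if the spider had defined the following start_requests() method itself; the sketch below is only an approximation of the default behaviour, using the same imports as this tutorial:

from scrapy.http import Request
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def start_requests(self):
        for url in self.start_urls:
            # each start URL becomes a Request whose callback is the spider's parse() method;
            # the scheduler downloads it and the resulting Response is handed to parse()
            yield Request(url, callback=self.parse)

    def parse(self, response):
        pass   # parse the Response here, as in the spider above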
3.2 Extract
Having crawled the whole page, the next step is the extraction process.
Storing an entire web page as-is is not enough.
In a basic crawler, this step could be handled with regular expressions.
In Scrapy, a mechanism called XPath selectors is used instead, which is based on XPath expressions.
If you want to learn more about selectors and other mechanisms, you can consult the documentation: click here.
Here is an example of an XPath expression and its meaning:
/html/head/title: selects the <title> element inside the <head> of the HTML document
To try selectors out, we can use the built-in Scrapy shell. Inside the project's top-level directory, enter:
scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
After you hit enter, you will see something like the following:
After the shell loads, you will get a response, stored in the local variable response.
So if you enter response.body, you will see the body of the response, which is the content of the page that was crawled:
Or enter response.headers to view its headers:
Now it is as if you have a handful of sand in your hands that hides the gold we want, so the next step is to shake it through a sieve, get rid of the impurities, and pick out the key content.
A selector is exactly such a sieve.
In older versions, the shell instantiated two kinds of selectors: an hxs variable for parsing HTML, and an xxs variable for parsing XML.
The current shell instead prepares a ready-made selector object for us, sel, which automatically chooses the best parsing rules (XML or HTML) based on the type of data returned.
Then let's get cracking!
To get a thorough understanding of this, first you need to know what the captured page looks like.
For example, suppose we want to grab the page's title, that is, the <title> tag.
You can enter:
sel.xpath('//title')
The result is:
This pulls out the tag; it can be processed further with extract() and text().
Note: here is a quick list of the useful XPath path expressions:

Expression | Description
nodename   | Selects all child nodes of this node.
/          | Selects from the root node.
//         | Selects nodes in the document from the current node that match the selection, no matter where they are.
.          | Selects the current node.
..         | Selects the parent node of the current node.
@          | Selects attributes.
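To see these path expressions in action outside of a live page, here is a small self-contained sketch (it assumes the same scrapy.selector.Selector class used later in this tutorial, fed with an inline HTML snippet):

from scrapy.selector import Selector

html = "<html><body><ul><li><a href='http://example.com'>Example</a> a demo link</li></ul></body></html>"
sel = Selector(text=html)

print sel.xpath('//li').extract()        # '//' finds <li> nodes anywhere in the document
print sel.xpath('//a/@href').extract()   # '@' selects the href attribute of the <a> element
print sel.xpath('//a/..').extract()      # '..' steps up from <a> to its parent <li>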
All the experimental results are listed below; In[i] is the input of the i-th experiment and Out[i] is the output of its result (recommended reference: the w3school XPath tutorial):
In [1]: sel.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: sel.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: sel.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
Of course, the <title> tag itself is not of much value to us; below we will grab something genuinely meaningful.
Using Firefox's Inspect Element, we can clearly see that the content we need looks like this:
We can use the following code to grab these <li> tags:
sel.xpath('//ul/li')
From the <li> tags, the site descriptions can be obtained like this:
sel.xpath('//ul/li/text()').extract()
The site titles can be obtained like this:
sel.xpath('//ul/li/a/text()').extract()
And the site hyperlinks like this:
sel.xpath('//ul/li/a/@href').extract()
Of course, the preceding examples all access content directly.
We noticed that the XPath calls return a list of selector objects,
so we can also call the methods of the objects in this list directly to dig into deeper nodes
(see: Nesting selectors and Working with relative XPaths in the Selectors documentation):
sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc
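One caution from the "Working with relative XPaths" reference above, sketched here for the same shell session: inside a nested selector, an expression that starts with // still searches the whole document, so relative forms are usually what you want.

for site in sel.xpath('//ul/li'):
    all_links = site.xpath('//a/@href').extract()    # every <a> in the whole page - usually not intended
    own_links = site.xpath('.//a/@href').extract()   # only the <a> elements inside this <li>
    print own_links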
3.4 XPath in Practice
We have been working in the shell for a while; now we can finally apply what we have learned to our spider, dmoz_spider.
Make the following modifications to the original spider's parse function:
from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title
Note that we imported the Selector class from scrapy.selector and instantiated a new Selector object, so we can work with XPath in the same way as in the shell.
Let's try running the spider by typing the command (inside the tutorial root directory):
scrapy crawl dmoz
The results of the run are as follows:
Sure enough, we succeeded in grabbing all the titles. But something isn't quite right: why did the Top and Python entries from the navigation bar get crawled as well?
We only need the contents inside the red circle:
There seems to be a problem with our XPath statement: it grabbed not only the names of the items we need, but also some innocent elements that happen to match the same XPath.
Inspecting the elements, we find that the <ul> we need has the attribute class='directory-url',
so we just need to change the XPath statement to sel.xpath('//ul[@class="directory-url"]/li').
Adjust the XPath statement as follows:
from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title
This time we succeeded in grabbing all the titles, and absolutely no innocents were harmed:
3.5 Use Item