Writing a Python crawler from scratch: building a crawler with the Scrapy framework


A web crawler is a program that crawls data from the web; we use it to fetch the HTML of specific pages. Although a crawler can be developed with a handful of standalone libraries, using a framework greatly improves efficiency and shortens development time. Scrapy is written in Python, is light and simple, and is easy to use. With Scrapy you can collect web data very conveniently: it already does a great deal of the work for you, so you don't have to develop everything yourself.

First you have to answer a question.
Q: How many steps does it take to put a website into a crawler?
The answer is simple, four steps:
Create a new project (Project): create a new crawler project
Define the target (Items): define the targets you want to crawl
Make the spider (Spider): write the spider that crawls the web pages
Store the content (Pipeline): design a pipeline to store the crawled content

OK, now that the basic process is clear, let's work through it step by step.

1. Create a new project (Project)
In an empty directory, hold down the SHIFT key, right-click, select "Open command window here", and enter the command:

scrapy startproject tutorial

Where tutorial is the project name.
You can see that a tutorial folder will be created with the following directory structure:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

Here's a quick overview of how each file works:
scrapy.cfg: the project's configuration file
tutorial/: the project's Python module; the code you write will be imported from here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory where the spiders are stored

2. Define the target (Item)
In Scrapy, items are containers used to hold the crawled content; they work somewhat like a Python dict, but provide some extra protection against mistakes.
In general, an item is declared by subclassing the scrapy.item.Item class, and its attributes are defined with scrapy.item.Field objects (you can think of this as something like an ORM mapping).
Next, we start to build the item model.
First of all, the data we want is:
the name (title)
the link (link)
the description (desc)

Modify the items.py file in the tutorial directory and add our own class after the existing one.
Because we want to capture the content of the dmoz.org website, we can name it DmozItem:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here:
    # name = Field()
    pass

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

It may seem a little hard to grasp at first, but defining these items lets you know exactly what your item contains when you use other components.
You can simply think of an Item as a packaged container class.

3. Make the spider (Spider)

Making a spider comes down to two steps overall: first crawl, then extract.
In other words, first you fetch the entire content of the page, and then you take out the parts that are useful to you.
3.1 Crawl
A Spider is a class that you write yourself to scrape information from a domain (or group of domains).
It defines the list of URLs to download, how to follow links, and how to parse page content to extract items.
To build a spider, you subclass scrapy.spider.BaseSpider and define three mandatory attributes:
name: the spider's identifier; it must be unique, and different spiders must have different names.
start_urls: the list of URLs to crawl. The spider starts crawling from here, so the first pages downloaded are these URLs; further URLs are generated from these starting ones.
parse(): the parsing method. When called, it is passed the Response object returned for each URL as its only argument; it is responsible for parsing the response, extracting the crawled data (into items), and following more URLs.

Here you can refer back to the ideas in the breadth-first crawler tutorial for help with understanding this: [Java] part 5 of the earlier series, using the HttpClient toolkit and a breadth-first crawler.
That is, store the starting URLs and gradually spread outward from them, crawling every eligible page URL and storing it so that crawling can continue.

Now we write our first spider, named dmoz_spider.py, and save it in the tutorial\spiders directory.
The dmoz_spider.py code is as follows:

from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

allowed_domains is the domain scope of the search; it acts as the spider's restricted area and stipulates that the spider only crawls pages under this domain name.
As you can see from the parse function, the next-to-last segment of each URL (the part before the trailing slash) is used as the filename for storage.
Then go back to the tutorial directory, hold down SHIFT, right-click, open a command window there, and enter:

scrapy crawl dmoz

Run it and check the output. This time an error appears:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 1: ordinal not in range(128)

So our first Scrapy run ends in an error; it looks like an encoding problem. A quick search turns up a workaround: under Python's Lib\site-packages folder, create a new sitecustomize.py:

# Python 2 only: sitecustomize.py runs at interpreter startup,
# while sys.setdefaultencoding is still available.
import sys
sys.setdefaultencoding('gb2312')

Run it again: OK, the problem is solved. Now look at the results:

The last line, INFO: Closing spider (finished), indicates that the spider ran successfully and shut itself down.
The lines that contain [dmoz] correspond to the output of our spider.
You can see a log line for each URL defined in start_urls.
Do you remember our start_urls?
http://www.dmoz.org/Computers/Programming/Languages/Python/Books
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources
Because these URLs are the starting pages, they have no referrer, so at the end of their log lines you will see (referer: <None>).
Under the parse method, two files are created, Books and Resources, containing the content of the two URL pages.

So what actually happened amid all that thunder and lightning just now?
First, Scrapy creates a scrapy.http.Request object for every URL in the spider's start_urls attribute and assigns the spider's parse method to it as the callback function.
The requests are then scheduled and executed, and the resulting scrapy.http.Response objects are fed back to the spider through the parse() method.
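To make that flow a little more concrete, here is a rough sketch (illustrative only, not the tutorial's code) of a spider that builds those Request objects by hand in start_requests and attaches the callback explicitly; the spider name used here is made up:

# Illustrative sketch: building the Request objects explicitly instead of
# relying on start_urls; this mirrors what Scrapy does automatically above.
from scrapy.spider import Spider
from scrapy.http import Request

class ManualDmozSpider(Spider):
    name = "dmoz_manual"  # hypothetical name, just for this sketch
    allowed_domains = ["dmoz.org"]

    def start_requests(self):
        urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]
        for url in urls:
            # Each Request carries the callback that will receive its Response.
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # The Response object for each URL ends up here.
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)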

3.2 Extract
Having crawled the whole page, the next step is the extraction process.
Storing an entire web page as-is is not enough.
In a basic crawler, this step could be handled with regular expressions.
In Scrapy, a mechanism called XPath selectors is used instead, based on XPath expressions.
If you want to learn more about selectors and other extraction mechanisms, refer to the linked documentation.

Here are a few examples of XPath expressions and their meaning:
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text content of that <title> element
//td: selects all <td> elements
//div[@class="mine"]: selects all <div> elements that have a class="mine" attribute
These are just a few simple examples of XPath in use; in fact XPath is very powerful.
For more, you can refer to the W3C XPath material; a small runnable illustration follows.
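To make those expressions concrete, here is a tiny illustration (a sketch only, using Scrapy's Selector on a made-up HTML snippet that is not part of the tutorial):

# Sketch: evaluating the XPath expressions above on a made-up HTML snippet.
from scrapy.selector import Selector

html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <table><tr><td>cell 1</td><td>cell 2</td></tr></table>
    <div class="mine">my div</div>
  </body>
</html>
"""
sel = Selector(text=html)

print sel.xpath('/html/head/title').extract()         # the whole <title> element
print sel.xpath('/html/head/title/text()').extract()  # just its text
print sel.xpath('//td/text()').extract()              # text of every <td>
print sel.xpath('//div[@class="mine"]').extract()     # divs with class="mine"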

To make XPath easier to use, Scrapy provides XPathSelector classes. There are two of them: HtmlXPathSelector (for parsing HTML data) and XmlXPathSelector (for parsing XML data).
They must be instantiated with a Response object.
You will find that a Selector object exposes the node structure of the document, so the first selector you instantiate is associated with the root node, that is, the entire document.
In Scrapy, selectors have four basic methods (see the API documentation; a short sketch follows this list):
xpath(): returns a list of selectors, each representing a node selected by the XPath expression given as argument
css(): returns a list of selectors, each representing a node selected by the CSS expression given as argument
extract(): returns a unicode string with the selected data
re(): returns a list of unicode strings extracted by applying the given regular expression
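Here is a short sketch of those four methods side by side (again on a made-up snippet, purely for illustration; the shell session below shows them on the real page):

# Sketch: the four basic selector methods on a small, made-up snippet.
from scrapy.selector import Selector

sel = Selector(text='<html><body><p class="note">Price: 42 USD</p></body></html>')

sel.xpath('//p')                      # xpath(): list of selectors for the <p> nodes
sel.css('p.note')                     # css(): the same nodes, selected with CSS
sel.xpath('//p/text()').extract()     # extract(): [u'Price: 42 USD']
sel.xpath('//p/text()').re(r'(\d+)')  # re(): [u'42']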

3.3 XPath experiments
Now let's try out selectors in the Scrapy shell.
The page for the experiment: http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

Now that we are familiar with our experimental lab mouse, it's time to crawl the page with the shell.
Go to the top-level directory of the project, that is, the first tutorial folder, and enter at the command prompt:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

After you press Enter, the shell starts up and prints its status output.

Once the shell has loaded, you will have the response stored in the local variable response.
If you enter response.body you will see the body of the response, i.e. the content of the crawled page;
entering response.headers shows its headers.

Now it's as if you were holding a handful of sand with the gold we want hidden inside it: the next step is to shake it through a sieve, throw away the impurities, and pick out the key content.
A selector is exactly that kind of sieve.
In older versions, the shell instantiated two kinds of selectors: an hxs variable for parsing HTML and an xxs variable for parsing XML.
Today's shell prepares a ready-made selector object, sel, which automatically chooses the best parser (XML or HTML) based on the type of data returned.
Now let's get digging! ~
To understand the problem thoroughly, you first need to know what the captured page looks like.
For example, suppose we want to grab the title of the page, that is, the <title> tag:

You can enter:

sel.xpath('//title')

The result is a list of Selector objects for the <title> tag.

In this way the tag can be picked out, and it can be processed further with extract() and text().
Note: here is a short list of useful XPath path expressions (expression: what it selects):
nodename: selects all child nodes of the named node.
/: selects from the root node.
//: selects matching nodes anywhere in the document, regardless of their position.
.: selects the current node.
..: selects the parent of the current node.
@: selects attributes.
All the experimental results are listed below; In[i] is the input of the i-th command and Out[i] is its output (for the expression syntax itself, the W3C tutorial mentioned above is recommended):

In [1]: sel.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: sel.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: sel.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

Of course, the title tag is not of much value to us; below we will grab something genuinely meaningful.
Using Firefox's Inspect Element, we can clearly see that the content we need sits inside <ul>/<li> structures.

We can use the following code to grab those <li> tags:

sel.xpath('//ul/li')

From the <li> tags, we can get each site's description:

sel.xpath('//ul/li/text()').extract()

You can get the title of the site like this:

sel.xpath('//ul/li/a/text()').extract()

You can get a hyperlink to a Web site like this:

sel.xpath('//ul/li/a/@href').extract()

Of course, the preceding examples fetch attributes directly.
Notice that xpath() returns a list of selector objects,
so we can also call methods on the objects in that list to dig into deeper nodes
(see "Nesting selectors" and "Working with relative XPaths" in the Selectors documentation):

sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc

3.4 XPath in action
We've spent a long time in the shell; now we can finally apply what we've learned to our spider, dmoz_spider.
Make the following modifications to the parse function of the original spider:

from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

Note that we import the Selector class from scrapy.selector and instantiate a new selector object, so we can work with XPath in exactly the same way as in the shell.
Now let's run the spider by typing the command (inside the tutorial root directory):

scrapy crawl dmoz

Run it and look at the output.

Sure enough, all the titles were captured. But something is not quite right: why did navigation-bar entries such as Top and Python get crawled as well?
We only want the actual directory entries.

There seems to be something wrong with our XPath statement: it grabs not only the item names we want but also some innocent elements that merely match the same pattern.
Inspecting the elements again, we find that the <ul> we need has a class="directory-url" attribute,
so we just need to change the XPath statement to sel.xpath('//ul[@class="directory-url"]/li').
Adjust the XPath statement in the spider as follows:

from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

This time all the titles we want are captured, with absolutely no innocent bystanders caught up in it.

3.5 Using Item
Next, let's look at how to use Item.
As mentioned earlier, an Item object is a custom Python dictionary, and you can use standard dictionary syntax to read and set its fields:

>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'

As a crawler, the spider should store the data it captures in Item objects. To return the crawled data, the spider's final code should look like this:

from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items
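As a side note (not part of the original tutorial), parse() can also yield each item as soon as it is built instead of collecting them into a list; a sketch of that equivalent form, reusing the same DmozItem and XPath, might look like this (the spider name here is made up to avoid clashing with "dmoz"):

# Sketch: the same extraction logic, yielding items one by one from parse().
from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem

class DmozYieldSpider(Spider):
    name = "dmoz_yield"  # hypothetical name for this sketch
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        for site in sel.xpath('//ul[@class="directory-url"]/li'):
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            yield item  # hand each item to Scrapy as soon as it is ready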

4. Store the content (Pipeline)
The simplest way to save the information is to use Feed Exports, which come in four main formats: JSON, JSON lines, CSV, and XML.
Here we export the results as JSON, the most commonly used format; the command is as follows:

scrapy crawl dmoz -o items.json -t json

-o is followed by the export file name, and -t specifies the export format.
Then look at the exported results by opening the JSON file in a text editor (for easier display, all attributes except title were removed from the items).

Because this is just a small example, this simple handling is enough.
If you want to do more complicated things with the crawled items, you can write an Item Pipeline; a minimal sketch follows.
We'll explore that properly later ^_^
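For a taste of what that could look like, here is a minimal sketch of a pipeline (the class name, output file, and behaviour are assumptions for illustration, not something this tutorial builds): it receives every item the spider returns and writes it out as one JSON object per line.

# tutorial/pipelines.py -- minimal illustrative pipeline (assumed name and behaviour).
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # Called when the spider starts.
        self.file = open('items_pipeline.jl', 'wb')

    def close_spider(self, spider):
        # Called when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works because an Item behaves like a dictionary.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

To actually use it, the pipeline would also have to be enabled in settings.py (in Scrapy versions of this era, by listing it under ITEM_PIPELINES).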

That is the whole process of using the Python crawler framework Scrapy to build a spider and crawl a site's content, described in some detail. I hope it helps everyone; if you need anything, feel free to contact me, and we can improve together.
