Writing a Python Crawler from Scratch: Using the Scrapy Framework

Source: Internet
Author: User
Tags: xpath

A web crawler is a program that fetches data from the web, typically by downloading the HTML of particular pages and extracting information from it. You can build a crawler with low-level libraries alone, but using a framework greatly improves efficiency and shortens development time. Scrapy is written in Python; it is lightweight, simple, and very convenient to use. With Scrapy you can collect online data with very little effort, because the framework already does most of the heavy lifting for you.

First, we need to answer a question.
Q: How many steps does it take to put a website into a crawler?
The answer is simple, four steps:
Create a new project (Project): create a new crawler project
Define the target (Items): decide which items you want to crawl
Make the spider (Spider): write the spider that crawls the web pages
Store the content (Pipeline): design a pipeline to store the crawled content

OK, now that the basic process is clear, the next step is to carry it out.

1. Create a New Project (Project)
In an empty directory, hold down the Shift key, right-click, choose "Open command window here", and enter the command:


scrapy startproject tutorial

Where tutorial is the project name.
You can see that a tutorial folder will be created with the following directory structure:


The code is as follows:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...


Here is a brief look at the role of each file:
scrapy.cfg: the project's configuration file
tutorial/: the project's Python module; you will import your code from here later
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory where the spiders are stored

2. Define the Target (Items)
In Scrapy, an Item is the container that holds the crawled content. It works a bit like a Python dict, but provides some extra protection against mistakes such as assigning to undeclared fields.
In general, an Item is created by subclassing the scrapy.item.Item class, and its attributes are defined with scrapy.item.Field objects (you can think of this as an ORM-like mapping).
Next, let's build the item model.
First of all, what we want to capture is:
names (name)
links (url)
descriptions (description)

Modify the items.py file in the tutorial directory and add our own class after the original one.
Because we want to capture content from the dmoz.org site, we can name it DmozItem:

The code is as follows:
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

At first this may seem a bit pointless, but defining these items lets other components know what is in your item.
An Item can simply be understood as an encapsulated class object.

3. Make the Spider (Spider)

Making the spider takes two overall steps: first crawl, then extract.
In other words, first fetch the entire content of the page, then take out the parts that are useful to you.
3.1 Crawl
A Spider is a class the user writes to crawl information from a domain (or group of domains).
It defines a list of URLs to download, a scheme for following links, and a way to parse web page content in order to extract items.
To build a spider, you must subclass scrapy.spider.BaseSpider and define three mandatory attributes:
name: the name of the crawler. It must be unique; different spiders must have different names.
start_urls: the list of URLs to crawl. The crawler starts crawling from here, so the first data downloaded will come from these URLs; other sub-URLs are generated by following links from these starting pages.
parse(): the parsing method. When called, it receives the Response object returned from each URL as its only argument. It is responsible for parsing the downloaded data, matching the crawled content (parsing it into items), and following more URLs.

Here you can refer to the idea described in the breadth-first crawler tutorial to help understand it (link: [Java] The Chin, Episode 5: Using the HttpClient Toolkit and the Breadth-First Crawler).
That is, URLs are stored and used as a starting point from which the crawl gradually spreads outward, storing every eligible web page URL so that crawling can continue, as sketched below.
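The breadth-first idea itself can be sketched in a few lines of plain Python, outside Scrapy: keep a queue of URLs to visit and a set of URLs already seen, and expand outward level by level. The snippet below is only a conceptual illustration; the link graph is a made-up dictionary standing in for real pages.

The code is as follows:
from collections import deque

# A toy "web": each URL maps to the URLs it links to (made up for illustration).
links = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

def breadth_first_crawl(start_url):
    seen = set([start_url])     # URLs already queued, so nothing is visited twice
    queue = deque([start_url])  # FIFO queue: the oldest URL comes out first
    while queue:
        url = queue.popleft()
        print "crawling", url   # a real crawler would download and parse the page here
        for next_url in links.get(url, []):
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)

breadth_first_crawl("http://example.com/")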

Let's write our first spider, named dmoz_spider.py, and save it in the tutorial\spiders directory.
The dmoz_spider.py code is as follows:

The code is as follows:
from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

allowed_domains is the domain range of the search, that is, the crawler's constrained area; it requires the crawler to crawl only web pages under this domain name.
As can be seen from the parse function, the next-to-last segment of each URL (Books or Resources) is taken as the file name for storage.
Then run it: hold Shift, right-click in the tutorial directory, choose "Open command window here", and enter:

The code is as follows:
scrapy crawl dmoz

The run output appears in the console.

An error pops up:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 1: ordinal not in range(128)
Running the first Scrapy project and hitting an error right away is a rough start.
It looks like an encoding problem; a quick Google search turns up a solution:
create a new sitecustomize.py under Python's Lib\site-packages folder:

The code is as follows:
import sys
sys.setdefaultencoding('gb2312')

Run it again, and the problem is solved. Look at the results:

The last line, INFO: Closing spider (finished), indicates that the crawler ran successfully and shut itself down.
The lines containing [dmoz] correspond to the results of our crawler run.
You can see that there is a log line for each URL defined in start_urls.
Do you remember our start_urls?
http://www.dmoz.org/Computers/Programming/Languages/Python/Books
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources
Because these URLs are the start pages, they have no referrers, so at the end of each of their lines you will see (referer: <None>).
Under the action of the parse method, two files are created, Books and Resources, containing the page content of the two URLs.

So what actually happened behind the scenes?
First, Scrapy creates a scrapy.http.Request object for every URL in the spider's start_urls attribute and assigns the spider's parse method to the Request as its callback function.
The Requests are then scheduled and executed, and the resulting scrapy.http.Response objects are fed back to the spider through the parse() method, as sketched below.
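To make that mechanism concrete, here is roughly what Scrapy does for us, written out by hand in a spider. This is only a sketch: the spider name manual_dmoz is made up, and it uses the start_requests() hook together with scrapy.http.Request, which Scrapy normally calls on its own; exact import paths can differ between Scrapy versions.

The code is as follows:
from scrapy.http import Request
from scrapy.spider import Spider

class ManualStartSpider(Spider):
    # Hypothetical spider, only to illustrate the Request/callback mechanism.
    name = "manual_dmoz"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def start_requests(self):
        # This is what Scrapy normally does by itself: one Request per start URL,
        # with the spider's parse() method registered as the callback.
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # The downloader hands the resulting Response object to the callback.
        self.log("got %s (%d bytes)" % (response.url, len(response.body)))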

3.2 Extract
Having crawled the whole page, the next step is the extraction process.
Simply storing an entire web page is not enough.
In a basic crawler, this step could be handled with regular expressions.
In Scrapy, a mechanism called XPath selectors is used instead, based on XPath expressions.
If you want to learn more about selectors and other matching mechanisms, see the Scrapy documentation on selectors.

Here are some examples of XPath expressions and their meanings:
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text content of the <title> element above
//td: selects all <td> elements
//div[@class="mine"]: selects all <div> elements that have the attribute class="mine"
These are just a few simple examples of XPath use; in fact XPath is very powerful.
For more, you can refer to a dedicated XPath tutorial.

To make XPath easier to use, Scrapy provides XPathSelector classes. There are two of them: HtmlXPathSelector (for parsing HTML data) and XmlXPathSelector (for parsing XML data).
They must be instantiated from a Response object.
You will find that Selector objects expose the node structure of the document, so the first instantiated selector is associated with the root node, that is, the whole document.
In Scrapy, selectors have four basic methods (click to see the API documentation); a short sketch of all four follows the list:
xpath(): returns a list of selectors, each representing a node selected by the XPath expression given as an argument
css(): returns a list of selectors, each representing a node selected by the CSS expression given as an argument
extract(): returns a unicode string with the selected data
re(): returns a list of unicode strings extracted by applying the regular expression given as an argument
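A quick, self-contained illustration of all four methods, using a tiny hand-written HTML fragment instead of a live page. Note this is a sketch under one assumption: the Selector(text=...) constructor exists in newer Scrapy versions, while older versions would instead use HtmlXPathSelector(response) on a real response.

The code is as follows:
from scrapy.selector import Selector

html = '<ul><li class="mine"><a href="/a">First</a> intro text</li></ul>'
sel = Selector(text=html)

print sel.xpath('//li/a')                      # xpath(): a list of selectors
print sel.css('li.mine a::text').extract()     # css(): same idea, CSS syntax
print sel.xpath('//li/a/@href').extract()      # extract(): plain unicode strings
print sel.xpath('//li/a/text()').re(r'(\w+)')  # re(): regex applied to the selection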

3.3 XPath Experiments
Let's try out selectors in the shell.
The site for our experiment: http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

Now that we are familiar with our experimental guinea pig, the next step is to use the shell to crawl the page.
Enter the top-level directory of the project, that is, the outermost tutorial folder, and type in cmd:

The code is as follows:
scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

When you press Enter, you will see output like the following:

After the shell loads, you get the response, stored in the local variable response.
So if you enter response.body, you will see the body of the response, that is, the content of the crawled page:

Or enter response.headers to view its headers:

Now it is as if you were holding a handful of sand that hides the gold we want, so the next step is to use a sieve to shake out the impurities and pick out the key content.
Selectors are that sieve.
In older versions, the shell instantiated two kinds of selectors: an hxs variable for parsing HTML and an xxs variable for parsing XML.
The current shell instead prepares a Selector object for us, sel, which automatically chooses the best parsing scheme (XML or HTML) based on the type of data returned.
Now let's get our hands dirty!
To understand this thoroughly, you first need to know what the captured page looks like.
For example, suppose we want to grab the page title, that is, the <title> tag:

You can enter:

The code is as follows:
sel.xpath('//title')

The result is:

This extracts the tag, which can then be processed further with extract() and text().
Note: a simple list of useful XPath path expressions:
Expression: Description
nodename: selects all child nodes of the named node
/: selects from the root node
//: selects matching nodes anywhere in the document, regardless of their position
.: selects the current node
..: selects the parent node of the current node
@: selects attributes
The results of the experiment are as follows, where In [i] indicates the input of the i-th experiment and Out[i] indicates its output:

The code is as follows:
In [1]: sel.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: sel.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: sel.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

Of course, the title tag is not of much value to us by itself; below we will really grab something meaningful.
Using Firefox's Inspect Element, we can clearly see that what we need looks like this:

We can use the following code to grab this <li> tag:

The code is as follows:
sel.xpath('//ul/li')

From the <li> tags, the site descriptions can be obtained like this:

The code is as follows:
sel.xpath('//ul/li/text()').extract()

You can get the title of the Web site this way:

The code is as follows:
sel.xpath('//ul/li/a/text()').extract()

You can get a hyperlink to a Web site like this:

The code is as follows:
sel.xpath('//ul/li/a/@href').extract()

Of course, the preceding examples show methods of obtaining attributes directly.
We noticed that xpath() returns a list of selector objects,
so we can also call xpath() on the objects in this list to dig into deeper nodes
(see: Nesting selectors and Working with relative XPaths in the selectors documentation):
sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc

3.4 XPath in Action
We have been working in the shell for quite a while; now let's apply what we have learned to the dmoz_spider crawler.
Make the following changes to the parse function of the original spider:

The code is as follows:
from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

Note that we import the Selector class from scrapy.selector and instantiate a new Selector object, so that we can work with XPath just as we did in the shell.
Let's try entering the command to run the crawler (from the tutorial root directory):

The code is as follows:
scrapy crawl dmoz

The results of the operation are as follows:

Sure enough, it successfully grabbed all the titles. But something is not quite right: why did navigation-bar entries such as Top and Python get crawled as well?
We only need the content inside the red circle:

It seems our XPath expression is a bit off: it grabbed not only the project names we need, but also some innocent elements that happen to match the same XPath structure.
Inspecting the elements, we find that the <ul> we need has the attribute class="directory-url",
so we just need to change the XPath expression to sel.xpath('//ul[@class="directory-url"]/li').
Make the following adjustment to the XPath statement in the spider:

The code is as follows:
from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

This time it successfully grabbed all the titles, and no innocents were harmed:

3.5 Using Item
Now let's take a look at how to use Item.
As we said earlier, an Item object is a custom Python dictionary; you can access the values of its fields using standard dictionary syntax:

The code is as follows:
>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
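This is also where the "extra protection" mentioned in section 2 shows up: unlike a plain dict, an Item refuses fields that were never declared. A rough sketch of what that looks like (the exact wording of the error message may vary between Scrapy versions):

The code is as follows:
>>> item = DmozItem()
>>> item['foo'] = 'bar'
Traceback (most recent call last):
  ...
KeyError: 'DmozItem does not support field: foo'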

As a crawler, the spider wants to store the data it crawls in Item objects. To return the crawled data, the spider's final code should look like this:

The code is as follows:
from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items

4. Store the Content (Pipeline)
The simplest way to save the information is through Feed exports; there are four main formats: JSON, JSON lines, CSV, and XML.
Here we export the results as JSON, the most commonly used format, with the following command:

The code is as follows:
scrapy crawl dmoz -o items.json -t json

-o is followed by the name of the export file, and -t by the export format.
Now take a look at the exported results by opening the JSON file with a text editor (for easier display, all attributes except title were removed from the items):

Because this is only a small example, this simple handling is enough.
If you want to do something more complex with the crawled items, you can write an Item Pipeline.
We'll play with that later. ^_^
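As a preview, a minimal pipeline might look roughly like the sketch below. The class name JsonWriterPipeline and the output file name are made up for illustration, and to take effect the class would also have to be enabled through the ITEM_PIPELINES setting in settings.py (a dict such as {'tutorial.pipelines.JsonWriterPipeline': 300} in recent Scrapy versions, a plain list in some older ones).

The code is as follows:
# tutorial/pipelines.py (illustrative sketch, not part of the tutorial above)
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open('items_from_pipeline.json', 'w')

    def process_item(self, item, spider):
        # Called for every item the spider returns: write it out as one JSON line.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider closes: clean up.
        self.file.close()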

That is the whole process of using the Python crawler framework Scrapy to make a crawler and crawl a site's content. It is quite detailed, and I hope it helps you. If you need anything, feel free to contact me so we can improve together.

