First use of the Python Scrapy crawler framework

Tags: xpath, python, scrapy

This example comes from the Turtle's course.

Instructions for installing Scrapy are easy to find online, so installation is not covered here.

Using Scrapy to crawl a website takes four steps:

0. Create a Scrapy project;

1. Define the item container;

2. Write the crawler;

3. Store the content.

The target of this crawl is the world's largest web directory site, http://www.dmoztools.net. Because the site holds far too much data, we only take two of its sub-pages for the test:

http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/

http://www.dmoztools.net/Computers/Programming/Languages/Python/Resources/

What we need to crawl from each of these sub-pages is its list of entries: the titles, hyperlinks, and descriptions.

---

0. Create a Scrapy project

First, open a cmd window and enter the following command to change to the desktop (any other location also works):

cd Desktop

Then create a Scrapy project. Continue by entering the following command at the command line (please do not close the window after the command finishes; we will need it again):

scrapy startproject tutorial

This creates a new Scrapy project folder named tutorial. The folder contains a scrapy.cfg configuration file and a tutorial subfolder, which in turn contains the following:

0. __init__.py: the module's initialization file; we can leave it alone.

1. items.py: the project's item containers.

2. pipelines.py: the item pipelines.

3. settings.py: the settings file.

4. spiders/: a folder that for now contains only an initialization file __init__.py; this is where we will add our spider.

…and so on (there may be a few other files, which are not important here).
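For reference, the generated layout typically looks something like this (exact contents can vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # project configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py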

1. Define the item container

What is the item container? It is a container for the crawled data, similar to a Python dictionary, but with an additional protection mechanism that prevents errors from undefined fields caused by typos.

Next we model the data. Why model it? The item container stores the web content we crawl, but a crawl returns the entire page, and usually we only need part of it. For the page http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/, for example, we only need the titles, descriptions, and hyperlinks.

So we need to edit the item container: open tutorial/items.py and change the TutorialItem class to the following (with comments):

import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # title
    link = scrapy.Field()   # hyperlink
    desc = scrapy.Field()   # description

Save and exit.

2. Write the crawler

The next step is to write the spider. A spider is a class the user writes to crawl data from a website.

It contains the initial URLs to download, rules for following links in the pages, and a method for parsing page content and extracting items.

Next, create a new file called dmoz_spider.py in the spiders folder and write the following code (where name is the spider's name, which we will use later to launch the crawl):

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # allowed_domains prevents the crawler from wandering onto other sites
    allowed_domains = ['dmoztools.net']
    # initial crawl locations
    start_urls = [
        'http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoztools.net/Computers/Programming/Languages/Python/Resources/'
    ]

    # define an analysis (parse) method
    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

Save and close. A rough interpretation of the code follows (an informal explanation, just to get the idea across):

The parse method creates two files in the local directory, called Books and Resources. The two initial URLs in start_urls are submitted to the Scrapy engine; the engine's downloader fetches the source code of each page, and the downloaded content is then written to Books and Resources by the parse method.
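As a quick check of where those file names come from: splitting the Books URL on "/" and taking the second-to-last piece gives "Books" (the Resources URL gives "Resources" in the same way):

url = 'http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/'
print(url.split("/"))
# ['http:', '', 'www.dmoztools.net', 'Computers', 'Programming', 'Languages', 'Python', 'Books', '']
print(url.split("/")[-2])  # 'Books'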

Now go back to the cmd window and enter the following command to switch to the tutorial folder:

cd tutorial

Continue to enter the following command at the command line:

scrapy crawl dmoz

Here crawl tells Scrapy to start crawling, and dmoz is the spider's name (the name we set in the spider class).

Press Enter to execute it. If it succeeds, Scrapy's log output will show the crawl completing.

After the command finishes, two new files named Books and Resources appear in the current directory. Their contents are the crawled pages (that is, the sites' source code).

3. Store the content

In the past, we used regular expressions to extract data from HTML, but Scrapy provides an XPath and CSS based expression mechanism: Scrapy Selectors.

A Selector is a selector object with four basic methods (a short standalone sketch follows this list):

xpath(): takes an XPath expression and returns the selector list of all nodes matching the expression;

css(): takes a CSS expression and returns the selector list of all nodes matching the expression;

extract(): serializes the matched nodes to Unicode strings and returns them as a list;

re(): extracts data using the given regular expression and returns a list of Unicode strings.
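As a minimal, self-contained sketch of these four methods, run against a small made-up HTML snippet rather than a live response (the tag structure and values here are purely illustrative):

from scrapy.selector import Selector

html = '<html><body><div class="mine"><a href="/books">Python Books</a></div></body></html>'
sel = Selector(text=html)

print(sel.xpath('//a/text()').extract())            # ['Python Books']
print(sel.css('div.mine a::attr(href)').extract())  # ['/books']
print(sel.xpath('//a/text()').re(r'Python (\w+)'))  # ['Books']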

Before storing the data, we first test in Scrapy's shell. In the same cmd window, enter the following command to open the shell:

" http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/ "

This opens a Scrapy shell session for that page.

At this point we can work with it interactively.

The shell gives us back a response object that we can run a series of operations on: for example, response.body prints the page's source code, response.headers shows the response headers, and so on.
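For example, inside the shell you might try something along these lines (the output shown is illustrative):

>>> response.url        # the URL we fetched
'http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/'
>>> response.status     # HTTP status code; 200 means the fetch succeeded
200
>>> response.body[:50]  # first 50 bytes of the page source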

Now let's list a few example XPath expressions:

/html/head/title: selects the <title> element inside the <head> of the HTML document;

/html/head/title/text(): selects the text of the <title> element above;

//td: selects all <td> elements;

//div[@class='mine']: selects all <div> elements that have a class='mine' attribute.

The shell also provides a selector object, sel; next we use it to test extracting the list entries.

Enter the following command (the XPath used here comes from inspecting the web page: right-click a list entry and choose Inspect to see its structure):

sel.xpath('//section/div/div/div/div/a/div/text()').extract()

After execution, all the list titles on the page are returned.

Similarly, the commands below return the hyperlinks and the descriptions for all the entries:

sel.xpath('//section/div/div/div/div/a/@href').extract()     # returns the hyperlinks
sel.xpath('//section/div/div/div/div/div/text()').extract()  # returns the list descriptions

OK, with the tests done, let's go back and finish the code.

Now let's filter the crawled content.

Open the dmoz_spider.py file we just wrote and change its content to the following:

import scrapy
from tutorial.items import TutorialItem  # import the item class we defined

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # allowed_domains prevents the crawler from wandering onto other sites
    allowed_domains = ['dmoztools.net']
    # initial crawl locations
    start_urls = [
        'http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoztools.net/Computers/Programming/Languages/Python/Resources/'
    ]

    # define an analysis (parse) method
    def parse(self, response):
        sel = scrapy.selector.Selector(response)  # wrap the returned response in a selector
        # //section/div/div/div/div/a/div/text()
        sites = sel.xpath('//section/div/div/div/div')  # filter down to the list entries
        items = []
        for site in sites:
            # further filter each entry
            item = TutorialItem()
            item['title'] = site.xpath('a/div/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('div/text()').extract()
            items.append(item)  # add the filtered entry to the items list
        return items
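A side note, not part of the original course: in more recent Scrapy versions the same loop is usually written to yield each item as it is built, instead of collecting everything into a list and returning it. A minimal sketch of that variant of the parse method:

    def parse(self, response):
        # response.xpath() queries the page directly, without building a Selector by hand
        for site in response.xpath('//section/div/div/div/div'):
            item = TutorialItem()
            item['title'] = site.xpath('a/div/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('div/text()').extract()
            yield item

Both forms produce the same exported data.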

Save and exit. Then go back to the cmd window and exit the shell first:

exit()

Then start crawling and export the filtered data with the following command:

scrapy crawl dmoz -o items.json -t json

-o is followed by the output file name, and -t by the export format. Four formats are commonly used: json, xml, jsonlines and csv; here I use json.
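The other formats work the same way; for example (depending on your Scrapy version, the -t flag may be unnecessary because the format can be inferred from the file extension):

scrapy crawl dmoz -o items.csv -t csv
scrapy crawl dmoz -o items.xml -t xml
scrapy crawl dmoz -o items.jl -t jsonlines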

After execution, an items.json file is produced in the current path. Open it with a text editor and you will find the filtered content from our second crawl.

And with that, the crawl is complete!

Thanks for reading!

I acted in so many tragedies, and in the end you all said they were comedies. ------------------ Stephen Chow
