Python web crawler based on the Scrapy framework (1)

Source: Internet
Author: User
Tags: xpath, python, web crawler
1. Build the environment

Here I am using Anaconda, which bundles many third-party Python scientific-computing libraries; I chose it mainly because it makes installation easy and it ships with Spyder.
Download Anaconda here
I would recommend using Python 2.7.
Installing Scrapy under Anaconda is also very simple: open a command prompt (cmd), type conda install scrapy, answer "y" when prompted, and it installs successfully.
That completes the environment setup.

2. Preliminary understanding of Scrapy

The official Scrapy website tutorial is recommended reading.
The first thing to work out is how to create a new Scrapy project. From the command line, change to the directory where you want the project to live and run scrapy startproject newone.
Open that directory and you will see a new folder; inside it you can see:
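The generated folder follows Scrapy's standard project layout (sketched here from memory of the Scrapy docs; the exact set of files depends on your Scrapy version):

```text
newone/
    scrapy.cfg            # deploy configuration file
    newone/               # the project's Python module
        __init__.py
        items.py          # item definitions (containers for crawled data)
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders go
            __init__.py
```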
The Item classes in items.py act as containers for the crawled data; an Item is structured much like a Python dictionary. Open items.py and you will see code like the following:
name = scrapy.Field() is a typical field definition, serving as the container for the crawled name data.

import scrapy

class NewoneItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

That describes the container for storing crawled data; so how do we actually crawl the data? Before that, we need some knowledge of XPath.

XPath tutorials
Take a few simple examples of XPath usage:

/html/head/title selects the title element under the head element of the HTML document
/html/head/title/text() selects the text content of the title element
//td selects all td elements
//div[@class="mine"] selects all div elements whose class attribute is "mine"
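These expressions can be tried even without Scrapy: Python's standard-library xml.etree.ElementTree supports this XPath subset. The sample document below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# A tiny HTML-like document to try the expressions on (invented for this demo).
doc = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <table><tr><td>cell 1</td><td>cell 2</td></tr></table>
    <div class="mine">first</div>
    <div class="other">second</div>
  </body>
</html>
"""
root = ET.fromstring(doc)

# /html/head/title/text() -> the text content of the title element
print(root.find("head/title").text)                    # Example Page

# //td -> all td elements
print(len(root.findall(".//td")))                      # 2

# //div[@class="mine"] -> divs whose class attribute is "mine"
print(root.findall(".//div[@class='mine']")[0].text)   # first
```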

To make XPath easier to use, Scrapy provides a Selector class. Within Scrapy, selectors have four basic methods:

xpath(): returns a list of selectors, each representing a node selected by the XPath expression
css(): returns a list of selectors, each representing a node selected by the CSS expression
extract(): returns Unicode strings for the selected data
re(): returns a list of Unicode strings matched by the given regular expression

Let's get a deeper understanding through a practical example, using the following website:
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
Use the shell to fetch the page and observe what XPath can do.
On the command line, enter scrapy shell followed by the URL above:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

After the shell loads, the fetched response is stored in the local variable response.
So if you enter response.body, you will see the body of the response, which is the content of the crawled page:

Similarly, you can enter response.headers:

Next look at the Web page:

Suppose we want to crawl the Computers, Programming, Languages, Python breadcrumb elements; right-click in the browser and inspect to see the HTML code.

From this code we can write:

sel.xpath('//a[@class="breadcrumb"]/text()').extract()

where sel is the Selector object.

XPath is flexible enough to produce the same result in different ways.
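For instance, two differently written expressions can select the same node; illustrated here with the standard library's xml.etree.ElementTree on a made-up snippet:

```python
import xml.etree.ElementTree as ET

# Invented markup loosely resembling a breadcrumb list
doc = '<ul><li><a class="breadcrumb" href="/Computers/">Computers</a></li></ul>'
root = ET.fromstring(doc)

# Route 1: match by attribute anywhere in the tree
by_class = root.findall(".//a[@class='breadcrumb']")
# Route 2: walk the explicit path from the root element
by_path = root.findall("li/a")

print(by_class[0].text == by_path[0].text)  # True
```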

Next we need to write the actual crawling code; we will work through a practical example in the next post, so stay tuned.
