1. Build the environment
Here I am using Anaconda, a Python distribution that bundles many third-party scientific-computing libraries, which makes installation easy; it also ships with the Spyder IDE.
Download Anaconda here
Python 2.7 is the recommended version.
Installing Scrapy under Anaconda is also very simple: open a command prompt (cmd), type conda install scrapy, and press "y" when prompted; the installation completes in no time.
That completes the environment setup.

2. A first look at Scrapy
The official Scrapy tutorial is also well worth a look.
The first question is how to create a new Scrapy project. From the command line, change into the directory where you want the project to live, then run: scrapy startproject newone
Open that directory and you will find a new folder; inside it you will see scrapy.cfg along with a newone package containing items.py, pipelines.py, settings.py, and a spiders directory.
The Item classes in items.py act as containers for the crawled data; an Item is structured much like a Python dictionary. Open items.py and you will see code like the following:

import scrapy

class NewoneItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

Here name = scrapy.Field() is a typical field declaration: it serves as the container for the crawled "name" data.
That covers the container for storing crawled data, but how do we actually crawl it? Before that, we need some knowledge of XPath.
XPath tutorials
A few simple examples of XPath usage:

/html/head/title: selects the title element inside the head element of the HTML document
/html/head/title/text(): selects the text content of that title element
//td: selects all td elements
//div[@class="mine"]: selects all div elements with the attribute class="mine"
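These expressions can be tried out even without Scrapy. Python's standard-library ElementTree supports a small XPath subset (descendant searches and attribute predicates, though not text()), which is enough to sketch the examples above; the HTML snippet below is made up for illustration:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document to exercise the expressions
html = (
    '<html><head><title>Example</title></head>'
    '<body><table><tr><td>a</td><td>b</td></tr></table>'
    '<div class="mine">hello</div></body></html>'
)
root = ET.fromstring(html)

# /html/head/title -> the title element (root is already <html>)
title = root.find("head/title")
print(title.text)  # the text content, i.e. what text() would select

# //td -> all td elements (ElementTree spells descendant search as .//)
tds = root.findall(".//td")
print(len(tds))

# //div[@class="mine"] -> all divs whose class attribute equals "mine"
mine = root.findall(".//div[@class='mine']")
print(mine[0].text)
```

Scrapy itself uses full XPath (via lxml), so everything in the list above works there without these limitations.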
To make selection convenient, Scrapy provides a Selector class. Within Scrapy, selectors have four basic methods:

xpath(): returns a list of selectors, each representing a node matched by the XPath expression
css(): returns a list of selectors, each representing a node matched by the CSS expression
extract(): returns the selected data as a unicode string
re(): returns a list of unicode strings extracted by applying a regular expression
Let's deepen our understanding with a practical example.
Use the following Web site as an example
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
Use the Scrapy shell to fetch the page and observe what XPath does.
On the command line, enter:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
After the shell loads, you get a Response object stored in the local variable response.
If you enter response.body, you will see the body of the response, that is, the content of the crawled page.
Similarly, you can enter response.headers to see the response headers.
Next, take a look at the web page itself.
Suppose we want to crawl the Computers, Programming, Languages, Python breadcrumb elements; right-click in the browser and inspect to see the corresponding HTML code.
From that code we can write:
sel.xpath('//a[@class="breadcrumb"]/text()').extract()
where sel is the Selector object.
XPath is flexible: different expressions can produce the same result.
Next we need to write the code that actually crawls the data. We will work through a practical example in the next post, so stay tuned.