Scrapy
pip install scrapy
On Windows the installation may fail; you need a C++ build environment, or you can install Twisted first: pip install twisted
scrapy startproject tutorial
The command will create a tutorial directory with the following content:
```
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
```

- scrapy.cfg: the project's configuration file.
- tutorial/: the project's Python module. You will add your code here.
- tutorial/items.py: the project's items file.
- tutorial/pipelines.py: the project's pipelines file.
- tutorial/settings.py: the project's settings file.
- tutorial/spiders/: the directory where spider code is placed.
To create a spider, you must subclass scrapy.Spider and define the following three attributes.
(You can also generate the spider skeleton directly with the terminal command: scrapy genspider dmoz dmoz.com)
- Attributes
- name: identifies the spider. The name must be unique; you cannot give different spiders the same name.
- start_urls: a list of URLs the spider crawls at startup. The first pages fetched will come from this list; subsequent URLs are extracted from the data retrieved from the initial URLs.
- parse(): a method of the spider. When each initial URL finishes downloading, this method is called with the resulting Response object as its only argument. It is responsible for parsing the returned data (response data), extracting it (generating items), and generating Request objects for URLs that need further processing.
```python
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```
scrapy crawl dmoz
How it works: Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the parse method to it as a callback. These Request objects are scheduled and executed, and the resulting scrapy.http.Response objects are passed back to the spider's parse() method.
- xpath(): takes an XPath expression and returns a SelectorList of all nodes matching that expression.
- css(): takes a CSS expression and returns a SelectorList of all nodes matching that expression.
- extract(): serializes the node(s) to unicode strings and returns them as a list.
- re(): extracts data using the given regular expression and returns a list of unicode strings.
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
- Response
- response.body: the body of the response
- response.headers: the headers of the response
- response.xpath(): XPath selector shortcut
- response.css(): CSS selector shortcut
```python
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print(title, link, desc)
```
An Introduction to Web Crawlers: Scrapy