Scrapy getting started


  1. What is Scrapy?

    Scrapy is an open-source Python crawler framework built on Twisted. We only need to customize a few simple modules to crawl data from the web.

  2. Overall architecture of Scrapy

Briefly, the architecture works like this:
The raw material the crawler processes is one or more URLs. During the crawl, the Scheduler hands a URL to the Downloader, which makes the network request; when the request completes, the Downloader passes the response to the Spiders. Data we want to keep is packaged into the corresponding Items and handed to the Item Pipeline for storage and further processing; any URLs that still need to be crawled are sent back to the Scheduler for another round.

  3. Scrapy Installation

sudo pip install scrapy or sudo easy_install scrapy

Enter your password to complete the installation. Then run scrapy; if the command is found (no "command not found" error), the installation succeeded.
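A quick way to double-check is to print the installed version (assuming the scrapy executable ended up on your PATH):

scrapy version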

  4. Create a project

scrapy startproject project_name
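For this tutorial the project is called AppScrapy, so the concrete commands would be (the project name itself is our choice, not something Scrapy requires):

scrapy startproject AppScrapy
cd AppScrapy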

Scrapy prints a confirmation message when the project is created successfully. Switch into the project directory and we will see the following directory structure:

AppScrapy/
    scrapy.cfg          configuration information of the entire project
    AppScrapy/          folder storing all of our custom Python modules
        __init__.py
        items.py        stores the data structure of the crawled data
        pipelines.py    processes the crawled data stream
        settings.py     settings file; the database is configured here
        spiders/        stores our custom crawlers
            __init__.py
            ...

Let's look at the crawl target: the App Store entertainment ranking at https://itunes.apple.com/cn/genre/ios-yu-le/id6016?mt=8

The data we want to crawl is the name of each app in the list and the URL of its detail page.

First, we customize items.py, which defines the data types used to save the crawled data. Open items.py and add the following code:

import scrapy

class AppscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    url = scrapy.Field()

All items inherit from scrapy.Item, and their fields are of type scrapy.Field(). A scrapy.Field() can hold any data type.
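As a quick illustration (the values here are made up), an item behaves much like a dictionary keyed by its fields:

item = AppscrapyItem()
item["name"] = "Some App"               # any data type is accepted
item["url"] = "https://example.com/app"
print(item["name"])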

Now it's time to customize our crawler.

Create an AppScrapy.py file in the spiders folder and add the following code.

from scrapy.spider import BaseSpider
from appScrapy.items import AppscrapyItem

class AppScrapy(BaseSpider):
    name = 'app_scrapy'
    start_urls = ["https://itunes.apple.com/cn/genre/ios-yu-le/id6016?mt=8"]

    def parse(self, response):
        result = []
        lis = response.xpath("//div[@class='grid3-column']/div")
        for li in lis:
            array = li.xpath("./ul/li")
            for node in array:
                item = AppscrapyItem()
                item["name"] = node.xpath("./a/text()").extract()
                item["url"] = node.xpath("./a/@href").extract()
                result.append(item)
        return result

All crawler classes must inherit from BaseSpider and define a name, because the name is used to start the crawler. The crawler also needs a start_urls array so it knows where to begin, and it must implement the parse method, which is where we filter the crawled data down to what we want.

When we start this crawler (scrapy crawl app_scrapy), Scrapy extracts the first url from start_urls, initiates a request using it, and registers parse as the callback of that request; the response parameter of the callback is the response to that request.
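To actually keep the items the spider returns, a feed export is the simplest option; for example (the output file name apps.json is our choice):

scrapy crawl app_scrapy -o apps.json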

We use the xpath method for content selection. It takes a path expression and returns an array of selectors.

To get the path, we can use Chrome's developer tools: select the content we want under the Elements tab, then right-click it and choose Copy XPath.

lis = response.xpath("//div[@class='grid3-column']/div")

First, we use xpath to get every div inside the div whose class is 'grid3-column'; this returns an array of selectors. From the page structure, the array should contain three selectors, one for each div.
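A convenient way to test such an expression before putting it in the spider is scrapy shell, which fetches the page and drops us into an interactive session with response already defined (a quick sketch; the exact count depends on the live page):

scrapy shell "https://itunes.apple.com/cn/genre/ios-yu-le/id6016?mt=8"
>>> lis = response.xpath("//div[@class='grid3-column']/div")
>>> len(lis)   # should be 3 for the layout described above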


Each of these divs contains a ul of li elements. We take each div in turn and parse its content.

for li in lis:
    array = li.xpath("./ul/li")
    for node in array:
        item = AppscrapyItem()
        item["name"] = node.xpath("./a/text()").extract()
        item["url"] = node.xpath("./a/@href").extract()
        result.append(item)

The outer for loop takes each div in turn, and then we select every li in the ul under the current div. This again gives us an array of selectors, one per li. Let's look at the structure of an li.

Each li contains an a tag whose text is the app name and whose href attribute is the app's URL. The text is obtained with text(), so the path for the name is "./a/text()"; the leading "." means the expression is evaluated relative to the current selector. xpath() returns selectors, so to get the real values we also call extract(), which returns an array of the literal values.
To read an attribute we use @, as in "./a/@href" for the link URL. These values are then assigned to the fields of the item we defined earlier.
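As a small illustration (the values shown are hypothetical), extract() always returns a list of strings, even when the path matches a single node:

>>> node.xpath("./a/text()").extract()
[u'Some App Name']
>>> node.xpath("./a/@href").extract()
[u'https://itunes.apple.com/cn/app/some-app/id000000000?mt=8']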

Of course, the job is not finished until the data is saved. How to store the crawled data in a database will be covered next time.
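As a preview of that pipeline step (a minimal sketch only, not the database version promised above), pipelines.py can define a class whose process_item method receives every item returned by the spider; this one simply appends each item to a JSON lines file, and it would need to be enabled through the ITEM_PIPELINES setting in settings.py:

import json

class AppscrapyPipeline(object):

    def open_spider(self, spider):
        # open the output file once when the spider starts; the file name is our choice
        self.file = open('apps.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) turns the scrapy Item into a plain dict that json can serialize
        self.file.write(json.dumps(dict(item)) + "\n")
        return item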
