Scrapy getting started


  1. What is Scrapy?

    Scrapy is an open-source Python crawler framework built on Twisted. We only need to customize a few simple modules to crawl data from the web.

  2. Overall architecture of Scrapy

Briefly, the architecture works like this:
The raw material the crawler processes is one or more URLs. During the crawl, the Scheduler hands a URL to the Downloader, which makes the network request; when the request completes, the Downloader passes the response to the Spiders. Data we want to keep is packaged into the corresponding Items and handed to the Item Pipeline for storage and further processing; any URLs that still need to be crawled are sent back to the Scheduler for another round.

  3. Scrapy Installation

sudo pip install scrapy or sudo easy_install scrapy

Enter your password to complete the installation. Then run scrapy; if the command is found (no "command not found" error), the installation succeeded.
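A quick way to double-check is to print the installed version (assuming the scrapy executable ended up on your PATH):

scrapy version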

  4. Create a project

scrapy startproject project_name
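For this tutorial the project is called AppScrapy, so the concrete commands would be (the project name itself is our choice, not something Scrapy requires):

scrapy startproject AppScrapy
cd AppScrapy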

Scrapy prints a confirmation message when the project is created successfully. Switch into the project directory and we will see the following directory structure:

AppScrapy/
    scrapy.cfg          configuration information of the entire project
    AppScrapy/          folder storing all of our custom Python modules
        __init__.py
        items.py        stores the data structure of the crawled data
        pipelines.py    processes the crawled data stream
        settings.py     settings file; the database is configured here
        spiders/        stores our custom crawlers
            __init__.py
            ...

Let's look at the crawl target: the App Store entertainment ranking at https://itunes.apple.com/cn/genre/ios-yu-le/id6016?mt=8

The data we want to crawl is the name of each app in the list and the URL of its detail page.

First, we customize items.py, which defines the data types used to save the crawled data. Open items.py and add the following code:

import scrapy

class AppscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    url = scrapy.Field()

All items inherit from scrapy.Item, and their fields are of type scrapy.Field(). A scrapy.Field() can hold any data type.
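As a quick illustration (the values here are made up), an item behaves much like a dictionary keyed by its fields:

item = AppscrapyItem()
item["name"] = "Some App"               # any data type is accepted
item["url"] = "https://example.com/app"
print(item["name"])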

Now it's time to customize our crawler.

Create an AppScrapy.py file in the spiders folder and add the following code.

from scrapy.spider import BaseSpider
from appScrapy.items import AppscrapyItem

class AppScrapy(BaseSpider):
    name = 'app_scrapy'
    start_urls = ["https://itunes.apple.com/cn/genre/ios-yu-le/id6016?mt=8"]

    def parse(self, response):
        result = []
        lis = response.xpath("//div[@class='grid3-column']/div")
        for li in lis:
            array = li.xpath("./ul/li")
            for node in array:
                item = AppscrapyItem()
                item["name"] = node.xpath("./a/text()").extract()
                item["url"] = node.xpath("./a/@href").extract()
                result.append(item)
        return result

All crawler classes must inherit from BaseSpider and define a name, because the name is used to start the crawler. The crawler also needs a start_urls array so it knows where to begin, and it must implement the parse method, which is where we filter the crawled data down to what we want.

When we start this crawler (scrapy crawl app_scrapy), Scrapy extracts the first url from start_urls, initiates a request using it, and registers parse as the callback of that request; the response parameter of the callback is the response to that request.
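To actually keep the items the spider returns, a feed export is the simplest option; for example (the output file name apps.json is our choice):

scrapy crawl app_scrapy -o apps.json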

We use the xpath method for content selection. It takes a path expression and returns an array of selectors.

To get the path, we can use Chrome's developer tools: select the content we want under the Elements tab, then right-click it and choose Copy XPath.

lis = response.xpath("//div[@class='grid3-column']/div")

First, we use xpath to get every div inside the div whose class is 'grid3-column'; this returns an array of selectors. From the page structure, the array should contain three selectors, one for each div.
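A convenient way to test such an expression before putting it in the spider is scrapy shell, which fetches the page and drops us into an interactive session with response already defined (a quick sketch; the exact count depends on the live page):

scrapy shell "https://itunes.apple.com/cn/genre/ios-yu-le/id6016?mt=8"
>>> lis = response.xpath("//div[@class='grid3-column']/div")
>>> len(lis)   # should be 3 for the layout described above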


Each of these divs contains a ul of li elements. We take each div in turn and parse its content.

for li in lis:
    array = li.xpath("./ul/li")
    for node in array:
        item = AppscrapyItem()
        item["name"] = node.xpath("./a/text()").extract()
        item["url"] = node.xpath("./a/@href").extract()
        result.append(item)

The outer for loop takes each div in turn, and then we select every li in the ul under the current div. This again gives us an array of selectors, one per li. Let's look at the structure of an li.

Each li contains an a tag whose text is the app name and whose href attribute is the app's URL. The text is obtained with text(), so the path for the name is "./a/text()"; the leading "." means the expression is evaluated relative to the current selector. xpath() returns selectors, so to get the real values we also call extract(), which returns an array of the literal values.
To read an attribute we use @, as in "./a/@href" for the link URL. These values are then assigned to the fields of the item we defined earlier.
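As a small illustration (the values shown are hypothetical), extract() always returns a list of strings, even when the path matches a single node:

>>> node.xpath("./a/text()").extract()
[u'Some App Name']
>>> node.xpath("./a/@href").extract()
[u'https://itunes.apple.com/cn/app/some-app/id000000000?mt=8']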

Of course, the job is not finished until the data is saved. How to store the crawled data in a database will be covered next time.
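As a preview of that pipeline step (a minimal sketch only, not the database version promised above), pipelines.py can define a class whose process_item method receives every item returned by the spider; this one simply appends each item to a JSON lines file, and it would need to be enabled through the ITEM_PIPELINES setting in settings.py:

import json

class AppscrapyPipeline(object):

    def open_spider(self, spider):
        # open the output file once when the spider starts; the file name is our choice
        self.file = open('apps.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) turns the scrapy Item into a plain dict that json can serialize
        self.file.write(json.dumps(dict(item)) + "\n")
        return item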
