Python crawler (6): Principles of the Scrapy framework
Scrapy framework
About Scrapy
Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It is widely used.
Thanks to the framework, users can implement a crawler simply by customizing and developing a few modules, making it very convenient to scrape web pages, images, and other content.
Scrapy is built on the Twisted asynchronous networking framework (whose main alternative is Tornado). Twisted handles the network communication and speeds up downloads, so you do not have to implement the asynchronous machinery yourself. Scrapy also exposes a variety of middleware interfaces, so many different requirements can be met flexibly.
Scrapy Architecture
Scrapy Engine (Engine): responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: accepts the Requests sent by the Engine, queues and orders them in a certain way, and returns them to the Engine when the Engine asks for them.
Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which hands them to the Spider for processing.
Spider: processes all Responses, analyzes and extracts data from them to fill the fields required by the Item, and submits follow-up URLs to the Engine, which puts them back into the Scheduler.
Item Pipeline (Pipeline): receives the Items extracted by the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).
Downloader Middlewares (download middleware): components you can use to customize and extend the download functionality.
Spider Middlewares (spider middleware): components you can use to customize and extend the communication between the Engine and the Spider (for example, the Responses entering the Spider and the Requests leaving it).
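The data flow among these components can be illustrated with a toy simulation in plain Python (no Scrapy involved; all function names and URLs below are made up for illustration):

```python
from collections import deque

def spider_parse(response):
    """Toy Spider: extract one 'item' from a fake response and
    possibly emit follow-up URLs."""
    items = [{"page": response}]
    next_urls = ["/page2"] if response == "/page1" else []
    return items, next_urls

def downloader(request):
    """Toy Downloader: pretend to fetch; the 'response' is just the URL."""
    return request

def engine(start_urls):
    scheduler = deque(start_urls)   # Scheduler: queues pending Requests
    pipeline = []                   # Item Pipeline: collects processed Items
    while scheduler:                # Engine loop: runs until no Requests remain
        request = scheduler.popleft()
        response = downloader(request)             # Downloader fetches
        items, next_urls = spider_parse(response)  # Spider parses the Response
        pipeline.extend(items)                     # Items go to the Pipeline
        scheduler.extend(next_urls)                # New URLs go back to the Scheduler
    return pipeline

print(engine(["/page1"]))  # [{'page': '/page1'}, {'page': '/page2'}]
```

The real Engine is asynchronous (built on Twisted), but the shape of the loop is the same: Requests circulate through the Scheduler and Downloader, while extracted Items flow out to the Pipeline.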
Scrapy's workflow in plain language
You write the code and run the program; the Engine then drives the loop: it takes the start URLs from the Spider, queues them in the Scheduler, passes each Request to the Downloader, and hands the returned Response back to the Spider. The Items the Spider extracts go to the Item Pipeline, any new URLs go back into the Scheduler, and the crawl ends when no Requests are left.
Steps for creating a Scrapy Crawler
1. Create a project
scrapy startproject mySpider
scrapy.cfg: the project configuration file
mySpider/: the project's Python module; your code lives here
mySpider/items.py: the project's item definitions
mySpider/pipelines.py: the project's pipelines file
mySpider/settings.py: the project's settings file
mySpider/spiders/: the directory that stores the spider code
2. Define the target (mySpider/items.py)
Decide what information you want to crawl, and define structured fields in an Item to hold the scraped data.
3. Create a spider (spiders/xxxxSpider.py)

import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = (
        'http://www.itcast.cn/',
    )

    def parse(self, response):
        pass
name = "": the spider's identifying name. It must be unique; different spiders must be given different names.
allowed_domains = []: the domain scope of the spider, i.e. its restricted area. The spider only crawls pages under these domains; URLs outside them are ignored.
start_urls = (): the tuple/list of URLs to be crawled first. The spider starts fetching from here, so the first downloads begin with these URLs; further URLs are generated from these starting ones.
parse(self, response): the parsing method. It is called once each initial URL finishes downloading, and receives the Response object returned for that URL as its only argument. Its two main jobs are: (1) parsing the response and extracting structured data (Items); (2) generating Request objects for further URLs to crawl.
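Scrapy inspects each object that parse() yields and routes it by type: Request objects go back to the Scheduler, while everything else is treated as scraped data and sent to the pipeline. A rough sketch of that dispatch logic, using a stub Request class rather than Scrapy's real internals:

```python
class Request:
    """Stub standing in for scrapy.Request; illustration only."""
    def __init__(self, url):
        self.url = url

def dispatch(parse_results):
    """Route what parse() yielded: Requests are scheduled for crawling,
    anything else is treated as an extracted item."""
    scheduled, items = [], []
    for obj in parse_results:
        if isinstance(obj, Request):
            scheduled.append(obj.url)
        else:
            items.append(obj)
    return scheduled, items

# parse() may yield a mix of items (e.g. dicts) and follow-up Requests:
results = [{"name": "teacher A"}, Request("http://www.itcast.cn/page2")]
scheduled, items = dispatch(results)
print(scheduled)  # ['http://www.itcast.cn/page2']
print(items)      # [{'name': 'teacher A'}]
```

This is why a single parse() method can both extract data and keep the crawl going: the engine separates the two kinds of output for you.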
4. Save data (pipelines.py)
Define how the data is saved in the pipelines file: either locally or to a database.
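As a rough sketch of the "save locally" idea, independent of Scrapy: each scraped item can be serialized as one JSON line, which is the approach the pipeline example later in this post takes. ensure_ascii=False keeps non-ASCII text (such as Chinese) readable in the output file:

```python
import json

def item_to_json_line(item):
    """Serialize one scraped item (any dict-like object) as a single
    JSON line, suitable for appending to a .json/.jsonl file."""
    return json.dumps(dict(item), ensure_ascii=False) + "\n"

line = item_to_json_line({"name": "老师", "title": "讲师"})
print(line, end="")  # {"name": "老师", "title": "讲师"}
```

A Scrapy pipeline would call a function like this from its process_item method and write the result to an open file.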
Reminder
If the first run of a Scrapy project fails with a "DLL load failed" error message, you must install the pypiwin32 module.
First, a simple getting-started example.
(1) items.py
The information to be crawled:

# -*- coding: utf-8 -*-
import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
(2) itcastspider.py
Write the spider:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import ItcastItem

# Create a spider class
class ItcastSpider(scrapy.Spider):
    # Spider name
    name = "itcast"
    # Domains the spider is allowed to crawl
    allowed_domains = ["itcast.cn"]
    # Starting URL of the spider
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml#"]

    def parse(self, response):
        teacher_list = response.xpath('//div[@class="li_txt"]')
        teacherItem = []
        # Traverse the node set
        for each in teacher_list:
            # Item object used to hold the data
            item = ItcastItem()
            # extract() converts the matched results to unicode strings;
            # without extract(), the result is a list of XPath selector objects
            name = each.xpath('./h3/text()').extract()
            title = each.xpath('./h4/text()').extract()
            info = each.xpath('./p/text()').extract()
            # GBK encoding here is Python 2 style, for the CSV export below
            item['name'] = name[0].encode("gbk")
            item['title'] = title[0].encode("gbk")
            item['info'] = info[0].encode("gbk")
            teacherItem.append(item)
        return teacherItem
Run scrapy crawl itcast -o itcast.csv to save the results as a .csv file.
Using the pipeline file pipelines.py
(1) Modify settings.py

# Register the class defined in the pipeline file
ITEM_PIPELINES = {
    'mySpider.pipelines.ItcastPipeline': 300,
}
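The number after each pipeline class (conventionally 0-1000) sets the order in which items pass through when several pipelines are enabled: lower values run first. A small sketch of that ordering; the CleanPipeline path here is hypothetical, added only to show two entries:

```python
# Example ITEM_PIPELINES with two entries (CleanPipeline is hypothetical)
ITEM_PIPELINES = {
    'mySpider.pipelines.CleanPipeline': 100,   # lower number: runs first
    'mySpider.pipelines.ItcastPipeline': 300,  # higher number: runs second
}

# Scrapy runs items through the pipelines sorted by their priority value:
ordered = [path for path, prio in sorted(ITEM_PIPELINES.items(),
                                         key=lambda kv: kv[1])]
print(ordered)
```

So an item would be cleaned first, then written out, regardless of the order the dict entries were typed in.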
(2) itcastspider.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import ItcastItem

# Create a spider class
class ItcastSpider(scrapy.Spider):
    # Spider name
    name = "itcast"
    # Domains the spider is allowed to crawl
    allowed_domains = ["itcast.cn"]
    # Actual starting URL of the spider
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml#aandroid"]

    def parse(self, response):
        # with open("teacher.html", "w") as f:
        #     f.write(response.body)
        # Match the node list of all teachers using the XPath support Scrapy provides
        teacher_list = response.xpath('//div[@class="li_txt"]')
        # Traverse the node set
        for each in teacher_list:
            # Item object used to hold the data
            item = ItcastItem()
            # extract() converts the matched results to unicode strings;
            # without extract(), the result is a list of XPath selector objects
            name = each.xpath('./h3/text()').extract()
            title = each.xpath('./h4/text()').extract()
            info = each.xpath('./p/text()').extract()
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            yield item
(3) pipelines.py
Save the data locally:

# -*- coding: utf-8 -*-
import json

class ItcastPipeline(object):
    # __init__ is optional; it is the class's initialization method
    def __init__(self):
        # Create the output file
        self.filename = open("teacher.json", "w")

    # process_item must be defined; it handles the item data
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(jsontext.encode("utf-8"))
        return item

    # close_spider is optional; it is called when the spider finishes
    def close_spider(self, spider):
        self.filename.close()
(4) items.py

# -*- coding: utf-8 -*-
import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()