Python crawler (6): Principles of the Scrapy framework


Scrapy framework

About Scrapy

  • Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It is widely used.

  • Thanks to the framework, users can implement a crawler easily by customizing or developing just a few modules to capture web page content and all kinds of images, which is very convenient.

  • Scrapy uses the Twisted ['twɪstɪd] asynchronous networking framework (its main competitor is Tornado) to handle network communication, which speeds up downloads without requiring us to implement an asynchronous framework ourselves. It also provides various middleware interfaces, so all kinds of requirements can be met flexibly.

Scrapy Architecture

  • Scrapy Engine (Engine): responsible for the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.

  • Scheduler: accepts the Requests sent by the Engine, sorts and enqueues them in a certain way, and returns them to the Engine when the Engine needs them.

  • Downloader: downloads all Requests sent by the Scrapy Engine (Engine) and returns the obtained Responses to the Engine, which hands them to the Spider for processing.

  • Spider: processes all Responses, analyzes and extracts data from them, obtains the data required by the Item fields, and submits URLs that need to be followed to the Engine, which sends them to the Scheduler again.

  • Item Pipeline (Pipeline): handles the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).

  • Downloader Middlewares (download middleware): components you can customize to extend the download function (see the sketch after this list).

  • Spider Middlewares (Spider middleware): components you can customize to extend and operate on the communication between the Engine and the Spider (for example, Responses entering the Spider and Requests going out from the Spider).
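
As a concrete illustration of the download middleware hook, here is a minimal sketch of a custom downloader middleware; the module path mySpider.middlewares and the User-Agent string are assumptions for illustration, not part of this tutorial's project.

# mySpider/middlewares.py (assumed path)
class CustomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Called for every Request before the Downloader fetches it.
        request.headers['User-Agent'] = 'Mozilla/5.0 (compatible; MyCrawler/1.0)'
        return None  # None lets the request continue through the middleware chain

# Enable it in settings.py; the number is the middleware's order (lower runs first):
# DOWNLOADER_MIDDLEWARES = {'mySpider.middlewares.CustomUserAgentMiddleware': 543}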

The Scrapy workflow, explained in plain language

Write the code and run the program...

Steps for creating a Scrapy Crawler

1. Create a project

scrapy startproject mySpider

scrapy.cfg: the project's configuration file
mySpider/: the project's Python module; the code will be referenced from here
mySpider/items.py: the project's item (target data) file
mySpider/pipelines.py: the project's pipeline file
mySpider/settings.py: the project's settings file
mySpider/spiders/: the directory that stores the spider code

2. Define the target (mySpider/items.py)

Decide what information you want to crawl, then define structured data fields in an Item to store the crawled data.

3. Create a crawler (spiders/xxxxSpider.py)
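
The template below can also be generated automatically with Scrapy's genspider command, run inside the project directory created above; a usage sketch with this tutorial's spider name and domain:

scrapy genspider itcast "itcast.cn"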

import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = (
        'http://www.itcast.cn/',
    )

    def parse(self, response):
        pass
  • name = "": The identification name of this crawler must be unique, and different names must be defined for different crawlers.

  • allow_domains = []Is the scope of the domain name to be searched, that is, the restricted area of the crawler. It requires that the crawler only crawls the webpage under the domain name, And the nonexistent URL will be ignored.

  • start_urls = (): The URL ancestor/list to be crawled. Crawlers start to capture data from here, so the data downloaded for the first time will start from these urls. Other sub-URLs are generated from these starting URLs.

  • parse(self, response): Resolution method. After each initial URL is downloaded, it is called. When called, it is passed into the Response object returned from each URL as a unique parameter. The main function is as follows:
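
A minimal sketch of both responsibilities inside parse(); the "next page" selector is an assumption for illustration, and the Item class is the one defined later in this article.

import scrapy
from mySpider.items import ItcastItem

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml"]

    def parse(self, response):
        # 1) Parse response.body and generate Items
        for each in response.xpath('//div[@class="li_txt"]'):
            item = ItcastItem()
            item['name'] = each.xpath('./h3/text()').extract_first()
            yield item  # handed to the Item Pipeline via the Engine

        # 2) Generate follow-up Requests (the selector below is hypothetical)
        next_page = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)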

4. Save data (pipelines.py)

Define how the data is saved in the pipeline file; it can be saved locally or in a database, as sketched below.
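
A minimal sketch of a pipeline that stores Items in a local SQLite database; the file name, table name, and columns are assumptions for illustration, not part of the tutorial.

import sqlite3

class SqlitePipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts
        self.conn = sqlite3.connect("teachers.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS teacher (name TEXT, title TEXT, info TEXT)")

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO teacher VALUES (?, ?, ?)",
            (item.get('name'), item.get('title'), item.get('info')))
        self.conn.commit()
        return item  # return the item so later pipelines can still see it

    def close_spider(self, spider):
        # Called once when the spider closes
        self.conn.close()

# Remember to register the pipeline in ITEM_PIPELINES in settings.py.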

Reminder

When running a Scrapy project for the first time, a "DLL load failed" error message may appear (typically on Windows); in that case you must install the pypiwin32 module.

Let's first write a simple getting-started example.

(1) items.py

Information to be crawled

# -*- coding: utf-8 -*-
import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()

(2) itcastspider.py

Write the crawler

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import ItcastItem

# Create a crawler class
class ItcastSpider(scrapy.Spider):
    # Crawler name
    name = "itcast"
    # Scope the crawler is permitted to crawl
    allowed_domains = ["itcast.cn"]
    # Starting URL of the crawler
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml#"]

    def parse(self, response):
        # Match the root node list of all teachers
        teacher_list = response.xpath('//div[@class="li_txt"]')
        teacherItem = []
        # Traverse the root node set
        for each in teacher_list:
            # Item object used to save the data
            item = ItcastItem()
            # extract() converts the matched result to a Unicode string;
            # without extract() the result is the xpath matching object
            name = each.xpath('./h3/text()').extract()
            title = each.xpath('./h4/text()').extract()
            info = each.xpath('./p/text()').extract()
            item['name'] = name[0].encode("gbk")
            item['title'] = title[0].encode("gbk")
            item['info'] = info[0].encode("gbk")
            teacherItem.append(item)
        return teacherItem

Run scrapy crawl itcast -o itcast.csv to save the output as a .csv file.
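
The feed exporter supports other formats as well; for example, to save the same data as JSON instead:

scrapy crawl itcast -o itcast.json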

Using the pipeline file pipelines.py

(1) Modify settings.py

# Register the class written in the pipeline file; the value (0-1000) is the
# pipeline's order, and lower values run first.
ITEM_PIPELINES = {
    'mySpider.pipelines.ItcastPipeline': 300,
}

(2) itcastspider.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import ItcastItem

# Create a crawler class
class ItcastSpider(scrapy.Spider):
    # Crawler name
    name = "itcast"
    # Scope the crawler is permitted to crawl
    allowed_domains = ["itcast.cn"]
    # Starting URL of the crawler
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml#aandroid"]

    def parse(self, response):
        # with open("teacher.html", "w") as f:
        #     f.write(response.body)
        # Match the root node list of all teachers using the xpath provided by Scrapy
        teacher_list = response.xpath('//div[@class="li_txt"]')
        # Traverse the root node set
        for each in teacher_list:
            # Item object used to save the data
            item = ItcastItem()
            # extract() converts the matched result to a Unicode string;
            # without extract() the result is the xpath matching object
            name = each.xpath('./h3/text()').extract()
            title = each.xpath('./h4/text()').extract()
            info = each.xpath('./p/text()').extract()
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            yield item

(3) pipelines.py

Save data locally

# -*- coding: utf-8 -*-
import json

class ItcastPipeline(object):
    # The __init__ method is optional; it is the class's initialization method
    def __init__(self):
        # Create the output file
        self.filename = open("teacher.json", "w")

    # The process_item method is required; it processes the item data
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(jsontext.encode("utf-8"))
        return item

    # The close_spider method is optional; it is called when the spider ends
    def close_spider(self, spider):
        self.filename.close()
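
The file above keeps the original Python 2 style (encoding each line before writing). Under Python 3, which is an assumption about your environment, the equivalent pipeline would open the file with an explicit encoding and write the string directly:

# -*- coding: utf-8 -*-
import json

class ItcastPipeline(object):
    def __init__(self):
        # Open with an explicit encoding instead of encoding each line by hand
        self.filename = open("teacher.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.filename.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.filename.close()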

(4) items.py

# -*- coding: utf-8 -*-
import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()

 
