Python crawler (6): Principles of the Scrapy framework
Scrapy framework
About Scrapy
Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It is widely used.
Thanks to the framework, users can implement a crawler simply by customizing and developing a few modules, making it very convenient to scrape web pages, images, and other content.
Scrapy is built on the Twisted asynchronous networking framework (whose main alternative is Tornado). Twisted handles the network communication and speeds up downloads, so you do not have to implement the asynchronous machinery yourself. Scrapy also exposes a variety of middleware interfaces, so many different requirements can be met flexibly.
Scrapy Architecture
Scrapy Engine (Engine): responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: accepts the Requests sent by the Engine, queues and orders them in a certain way, and returns them to the Engine when the Engine asks for them.
Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which hands them to the Spider for processing.
Spider: processes all Responses, analyzes and extracts data from them to fill the fields required by the Item, and submits follow-up URLs to the Engine, which puts them back into the Scheduler.
Item Pipeline (Pipeline): receives the Items extracted by the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).
Downloader Middlewares (download middleware): components you can use to customize and extend the download functionality.
Spider Middlewares (spider middleware): components you can use to customize and extend the communication between the Engine and the Spider (for example, the Responses entering the Spider and the Requests leaving it).
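The data flow among these components can be illustrated with a toy simulation in plain Python (no Scrapy involved; all function names and URLs below are made up for illustration):

```python
from collections import deque

def spider_parse(response):
    """Toy Spider: extract one 'item' from a fake response and
    possibly emit follow-up URLs."""
    items = [{"page": response}]
    next_urls = ["/page2"] if response == "/page1" else []
    return items, next_urls

def downloader(request):
    """Toy Downloader: pretend to fetch; the 'response' is just the URL."""
    return request

def engine(start_urls):
    scheduler = deque(start_urls)   # Scheduler: queues pending Requests
    pipeline = []                   # Item Pipeline: collects processed Items
    while scheduler:                # Engine loop: runs until no Requests remain
        request = scheduler.popleft()
        response = downloader(request)             # Downloader fetches
        items, next_urls = spider_parse(response)  # Spider parses the Response
        pipeline.extend(items)                     # Items go to the Pipeline
        scheduler.extend(next_urls)                # New URLs go back to the Scheduler
    return pipeline

print(engine(["/page1"]))  # [{'page': '/page1'}, {'page': '/page2'}]
```

The real Engine is asynchronous (built on Twisted), but the shape of the loop is the same: Requests circulate through the Scheduler and Downloader, while extracted Items flow out to the Pipeline.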
Scrapy's workflow in plain language
You write the code and run the program; the Engine then drives the loop: it takes the start URLs from the Spider, queues them in the Scheduler, passes each Request to the Downloader, and hands the returned Response back to the Spider. The Items the Spider extracts go to the Item Pipeline, any new URLs go back into the Scheduler, and the crawl ends when no Requests are left.
Steps for creating a Scrapy Crawler
1. Create a project
scrapy startproject mySpider
scrapy.cfg: the project configuration file
mySpider/: the project's Python module; your code lives here
mySpider/items.py: the project's item definitions
mySpider/pipelines.py: the project's pipelines file
mySpider/settings.py: the project's settings file
mySpider/spiders/: the directory that stores the spider code
2. Define the target (mySpider/items.py)
Decide what information you want to crawl, and define structured fields in an Item to hold the scraped data.
3. Create a spider (spiders/xxxxSpider.py)

import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = (
        'http://www.itcast.cn/',
    )

    def parse(self, response):
        pass
name = "": the spider's identifying name. It must be unique; different spiders must be given different names.
allowed_domains = []: the domain scope of the spider, i.e. its restricted area. The spider only crawls pages under these domains; URLs outside them are ignored.
start_urls = (): the tuple/list of URLs to be crawled first. The spider starts fetching from here, so the first downloads begin with these URLs; further URLs are generated from these starting ones.
parse(self, response): the parsing method. It is called once each initial URL finishes downloading, and receives the Response object returned for that URL as its only argument. Its two main jobs are: (1) parsing the response and extracting structured data (Items); (2) generating Request objects for further URLs to crawl.
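Scrapy inspects each object that parse() yields and routes it by type: Request objects go back to the Scheduler, while everything else is treated as scraped data and sent to the pipeline. A rough sketch of that dispatch logic, using a stub Request class rather than Scrapy's real internals:

```python
class Request:
    """Stub standing in for scrapy.Request; illustration only."""
    def __init__(self, url):
        self.url = url

def dispatch(parse_results):
    """Route what parse() yielded: Requests are scheduled for crawling,
    anything else is treated as an extracted item."""
    scheduled, items = [], []
    for obj in parse_results:
        if isinstance(obj, Request):
            scheduled.append(obj.url)
        else:
            items.append(obj)
    return scheduled, items

# parse() may yield a mix of items (e.g. dicts) and follow-up Requests:
results = [{"name": "teacher A"}, Request("http://www.itcast.cn/page2")]
scheduled, items = dispatch(results)
print(scheduled)  # ['http://www.itcast.cn/page2']
print(items)      # [{'name': 'teacher A'}]
```

This is why a single parse() method can both extract data and keep the crawl going: the engine separates the two kinds of output for you.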
4. Save data (pipelines.py)
Define how the data is saved in the pipelines file: either locally or to a database.
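As a rough sketch of the "save locally" idea, independent of Scrapy: each scraped item can be serialized as one JSON line, which is the approach the pipeline example later in this post takes. ensure_ascii=False keeps non-ASCII text (such as Chinese) readable in the output file:

```python
import json

def item_to_json_line(item):
    """Serialize one scraped item (any dict-like object) as a single
    JSON line, suitable for appending to a .json/.jsonl file."""
    return json.dumps(dict(item), ensure_ascii=False) + "\n"

line = item_to_json_line({"name": "老师", "title": "讲师"})
print(line, end="")  # {"name": "老师", "title": "讲师"}
```

A Scrapy pipeline would call a function like this from its process_item method and write the result to an open file.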
Reminder
If the first run of a Scrapy project fails with a "DLL load failed" error message, you must install the pypiwin32 module.
First, a simple getting-started example.
(1) items.py
The information to be crawled:

# -*- coding: utf-8 -*-
import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
(2) itcastspider.py
Write the spider:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import ItcastItem

# Create a spider class
class ItcastSpider(scrapy.Spider):
    # Spider name
    name = "itcast"
    # Domains the spider is allowed to crawl
    allowed_domains = ["itcast.cn"]
    # Starting URL of the spider
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml#"]

    def parse(self, response):
        teacher_list = response.xpath('//div[@class="li_txt"]')
        teacherItem = []
        # Traverse the node set
        for each in teacher_list:
            # Item object used to hold the data
            item = ItcastItem()
            # extract() converts the matched results to unicode strings;
            # without extract(), the result is a list of XPath selector objects
            name = each.xpath('./h3/text()').extract()
            title = each.xpath('./h4/text()').extract()
            info = each.xpath('./p/text()').extract()
            # GBK encoding here is Python 2 style, for the CSV export below
            item['name'] = name[0].encode("gbk")
            item['title'] = title[0].encode("gbk")
            item['info'] = info[0].encode("gbk")
            teacherItem.append(item)
        return teacherItem
Run scrapy crawl itcast -o itcast.csv to save the results as a .csv file.
Using the pipeline file pipelines.py
(1) Modify settings.py

# Register the class defined in the pipeline file
ITEM_PIPELINES = {
    'mySpider.pipelines.ItcastPipeline': 300,
}
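The number after each pipeline class (conventionally 0-1000) sets the order in which items pass through when several pipelines are enabled: lower values run first. A small sketch of that ordering; the CleanPipeline path here is hypothetical, added only to show two entries:

```python
# Example ITEM_PIPELINES with two entries (CleanPipeline is hypothetical)
ITEM_PIPELINES = {
    'mySpider.pipelines.CleanPipeline': 100,   # lower number: runs first
    'mySpider.pipelines.ItcastPipeline': 300,  # higher number: runs second
}

# Scrapy runs items through the pipelines sorted by their priority value:
ordered = [path for path, prio in sorted(ITEM_PIPELINES.items(),
                                         key=lambda kv: kv[1])]
print(ordered)
```

So an item would be cleaned first, then written out, regardless of the order the dict entries were typed in.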
(2) itcastspider.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import ItcastItem

# Create a spider class
class ItcastSpider(scrapy.Spider):
    # Spider name
    name = "itcast"
    # Domains the spider is allowed to crawl
    allowed_domains = ["itcast.cn"]
    # Actual starting URL of the spider
    start_urls = ["http://www.itcast.cn/channel/teacher.shtml#aandroid"]

    def parse(self, response):
        # with open("teacher.html", "w") as f:
        #     f.write(response.body)
        # Match the node list of all teachers using the XPath support Scrapy provides
        teacher_list = response.xpath('//div[@class="li_txt"]')
        # Traverse the node set
        for each in teacher_list:
            # Item object used to hold the data
            item = ItcastItem()
            # extract() converts the matched results to unicode strings;
            # without extract(), the result is a list of XPath selector objects
            name = each.xpath('./h3/text()').extract()
            title = each.xpath('./h4/text()').extract()
            info = each.xpath('./p/text()').extract()
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            yield item
(3) pipelines.py
Save the data locally:

# -*- coding: utf-8 -*-
import json

class ItcastPipeline(object):
    # __init__ is optional; it is the class's initialization method
    def __init__(self):
        # Create the output file
        self.filename = open("teacher.json", "w")

    # process_item must be defined; it handles the item data
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(jsontext.encode("utf-8"))
        return item

    # close_spider is optional; it is called when the spider finishes
    def close_spider(self, spider):
        self.filename.close()
(4) items.py

# -*- coding: utf-8 -*-
import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()