Scrapy Project Structure
```
scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
```
File descriptions:
- scrapy.cfg: the project's configuration file
- myproject/: the project's Python module; your code is imported from here
- myproject/items.py: the project's items file, defining the target data to extract
- myproject/pipelines.py: the project's pipelines file
- myproject/settings.py: the project's settings file
- myproject/spiders/: the directory where spider code is stored
Project Steps
Step 1: Define the Target
- Open the items.py file.
- Items define structured data fields to hold the scraped data. They work much like Python dicts, but provide additional protection against setting undeclared fields.
- Next, create a subclass of scrapy.Item and build the item model.
For example:
```python
import scrapy

class BeautyItem(scrapy.Item):
    # The field names were lost in the original; these are illustrative.
    name = scrapy.Field()
    url = scrapy.Field()
    image = scrapy.Field()
```
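The "additional protection" an Item gives over a plain dict can be illustrated without Scrapy at all. The sketch below is not Scrapy's actual implementation; SimpleItem, DemoItem, and its field names are invented stand-ins showing the idea that only declared fields may be set, so typos fail immediately:

```python
# Plain-Python sketch of the Item-over-dict protection idea
# (NOT Scrapy's implementation; class and field names are illustrative).

class SimpleItem(dict):
    fields = ()  # subclasses declare their allowed field names here

    def __setitem__(self, key, value):
        # Reject any key that was not declared up front.
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

class DemoItem(SimpleItem):
    fields = ("name", "url")  # illustrative field names

item = DemoItem()
item["name"] = "example"   # fine: declared field
try:
    item["nmae"] = "oops"  # typo in the key -> caught immediately
except KeyError as e:
    print("rejected:", e)
```

A plain dict would silently accept the misspelled key; the declared-fields check is what turns that silent bug into an immediate error.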
Step 2: Create the Spider
- Enter the following command in the project directory:

```
scrapy genspider spider_name "allowed_domains"
```

This creates a spider file named spider_name.py under the myproject/spiders/ directory, with the spider's scope set (allowed_domains limits the crawl to pages under that domain name).
- The spider_name.py file in that directory contains the following code by default:

```python
import scrapy

class Spider_name(scrapy.Spider):
    name = "spider_name"
    allowed_domains = ["allowed_domains"]
    start_urls = ['http://www.allowed_domains']

    def parse(self, response):
        pass
```
In fact, we could create the spider file and write this code by hand, but using the command saves the trouble of writing the boilerplate.
To create a spider, you must subclass scrapy.Spider and define two mandatory attributes and one method:
- name = "": the spider's identifying name. It must be unique; different spiders must have different names.
- allowed_domains = []: the domain scope of the crawl, i.e. the spider's constrained area. The spider only crawls pages under these domain names; URLs outside them are ignored. (Optional attribute.)
- start_urls = []: the list of URLs to start crawling from. The first pages downloaded are these URLs; further URLs are generated from what is extracted from these starting pages.
- parse(self, response): the parsing method. It is called when the download of each initial URL completes, with the Response object returned for that URL passed as its only argument. Its main jobs are:
  - parsing the returned page data (response.body) and extracting structured data (generating items)
  - generating the requests for the next pages to crawl
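The two jobs of parse() can be sketched without running Scrapy. In this schematic (not real Scrapy code), FakeResponse stands in for Scrapy's Response object, and the yielded dicts stand in for Items and follow-up Requests:

```python
# Schematic sketch of what a parse() method does (not real Scrapy code).

class FakeResponse:
    """Stand-in for scrapy's Response object."""
    def __init__(self, body, next_page=None):
        self.body = body            # raw page data
        self.next_page = next_page  # URL of the next page, if any

def parse(response):
    # Job 1: extract structured data from response.body
    # (here, trivially, one item per word).
    for word in response.body.split():
        yield {"word": word}  # stands in for yielding an Item
    # Job 2: generate the request for the next page, if there is one.
    if response.next_page:
        yield {"follow": response.next_page}  # stands in for a Request

results = list(parse(FakeResponse("hello world", next_page="http://www.allowed_domains/page2")))
print(results)
```

In a real spider both kinds of yielded objects go back to the Scrapy engine: items flow on to the pipelines, while requests are scheduled for download and fed to parse() again.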
Save Data
Scrapy offers four simple ways to save scraped data, using the -o flag to write output in a given format:
```
# JSON format
scrapy crawl spider_name -o file_name.json

# JSON Lines format
scrapy crawl spider_name -o file_name.jsonl

# CSV (comma-separated values; can be opened in Excel)
scrapy crawl spider_name -o file_name.csv

# XML format
scrapy crawl spider_name -o file_name.xml
```
Item Pipelines
When an item has been collected in the spider, it is passed to the Item Pipeline, whose components process it in the order they are defined.
Each item pipeline is a Python class that implements a simple method, deciding for example whether the item is discarded or stored. Typical applications of item pipelines include:
- validating scraped data (checking that the item contains certain fields, say a name field)
- checking for (and discarding) duplicates
- saving scraped results to a file or database
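Pipeline components only run if they are enabled in settings.py; the ITEM_PIPELINES setting also fixes the processing order. The values are integers from 0 to 1000, and lower values run first. The pipeline path here assumes the myproject layout from the structure above:

```python
# settings.py -- enable pipeline components and set their order.
# Lower values (0-1000) run earlier in the pipeline chain.
ITEM_PIPELINES = {
    "myproject.pipelines.SomethingPipeline": 300,
}
```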
Writing an item pipeline is simple: each item pipeline component is a standalone Python class in which the process_item() method must be implemented:
```python
import something

class SomethingPipeline(object):
    def __init__(self):
        # Optional: initialize parameters (e.g. open the output file), etc.
        # do something
        pass

    def process_item(self, item, spider):
        # item (Item object) - the item being scraped
        # spider (Spider object) - the spider that scraped the item
        # This method must be implemented; every item pipeline component
        # calls it. It must return an Item object; a discarded item is not
        # processed by later pipeline components.
        return item

    def open_spider(self, spider):
        # spider (Spider object) - the spider being opened
        # Optional: called when the spider is opened.
        pass

    def close_spider(self, spider):
        # spider (Spider object) - the spider being closed
        # Optional: called when the spider is closed.
        pass
```
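As a concrete case, here is a minimal sketch of the duplicates check listed among the typical applications above. In real Scrapy you would raise scrapy.exceptions.DropItem to discard an item; a local DropItem class is defined here (and plain dicts are used as items) so the example runs standalone, and the "name" field it keys on is an assumption:

```python
# Minimal duplicates-filter pipeline sketch (standalone, not real Scrapy).

class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""

class DuplicatesPipeline:
    def __init__(self):
        self.seen_names = set()  # names of items already processed

    def process_item(self, item, spider):
        name = item["name"]  # assumes each item carries a "name" field
        if name in self.seen_names:
            # Discarding: later pipeline components never see this item.
            raise DropItem(f"duplicate item found: {name!r}")
        self.seen_names.add(name)
        return item

pipeline = DuplicatesPipeline()
pipeline.process_item({"name": "a"}, spider=None)  # first time: passes through
try:
    pipeline.process_item({"name": "a"}, spider=None)  # second time: dropped
except DropItem as e:
    print("dropped:", e)
```

Because the engine calls process_item() on every collected item in order, keeping the seen-set inside the pipeline instance is enough to deduplicate a whole crawl.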