Python Crawler: The Scrapy Framework Structure


Scrapy Project Structure

    scrapy.cfg
    myproject/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            spider1.py
            spider2.py
            ...
File Description:
  • scrapy.cfg: the project's configuration file
  • myproject/: the project's Python module; the project's code is imported from here
  • myproject/items.py: the project's item (target data) definitions
  • myproject/pipelines.py: the project's pipelines file
  • myproject/settings.py: the project's settings file
  • myproject/spiders/: the directory where the spider code is stored

Project Steps

Step 1: Define the Target

  1. Open the items.py file.
  2. Item defines the structured data fields used to hold the crawled data. It works a bit like a Python dict, but provides extra protection against errors such as populating undeclared fields.
  3. Next, create a subclass of scrapy.Item and build the item model.

For example:

    import scrapy

    class BeautyItem(scrapy.Item):
        # field names here are illustrative; define one Field per piece of data to collect
        name = scrapy.Field()
        url = scrapy.Field()
        info = scrapy.Field()

Step 2: Create the Spider

    • Enter the following command in the project directory:

    scrapy genspider spider_name "allowed_domains"

  A spider file named spider_name.py will be created under the myproject/spiders/ directory, with the spider's scope filled in (allowed_domains, which limits crawling to that domain).

    • The spider_name.py file created in this directory contains the following code by default:

    import scrapy

    class SpiderNameSpider(scrapy.Spider):
        name = "spider_name"
        allowed_domains = ["allowed_domains"]
        start_urls = ['http://www.allowed_domains']

        def parse(self, response):
            pass

In fact, we could create the spider file and write this code by hand, but using the command saves the hassle of writing the boilerplate.

To create a spider, you must subclass scrapy.Spider and define two mandatory attributes and one method:

    • name = "": the spider's identifier; it must be unique, so different spiders must define different names.
    • allowed_domains = []: the domain scope of the spider, i.e. its constrained area; the spider only crawls pages under these domains, and URLs outside them are ignored (optional attribute).
    • start_urls = []: the list of URLs to crawl first. The spider starts crawling from here, so the first pages downloaded come from these URLs, and further URLs are generated from what is found on the start pages.
    • parse(self, response): the parsing method, called once the download of each initial URL completes, with the response object returned for that URL passed as its only argument. Its main jobs are the following (a concrete sketch is given after this list):
    1. Parse the returned page data (response.body) and extract structured data (generate Items).
    2. Generate Requests for the further pages that need to be crawled.
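For illustration only, here is a minimal sketch of what such a spider might look like once filled in. It reuses the hypothetical BeautyItem fields from the item example above; the domain, URLs, and CSS selectors are placeholders, not taken from any real site:

    import scrapy
    from myproject.items import BeautyItem  # hypothetical item from the earlier example

    class SpiderNameSpider(scrapy.Spider):
        name = "spider_name"
        allowed_domains = ["example.com"]             # placeholder domain
        start_urls = ["http://www.example.com/list"]  # placeholder start URL

        def parse(self, response):
            # 1. Parse the page data and extract structured data into Items
            for entry in response.css("div.entry"):   # placeholder selector
                item = BeautyItem()
                item["name"] = entry.css("h2::text").get()
                item["url"] = entry.css("a::attr(href)").get()
                item["info"] = entry.css("p::text").get()
                yield item

            # 2. Generate a request for the next page that needs crawling
            next_page = response.css("a.next::attr(href)").get()  # placeholder selector
            if next_page:
                yield response.follow(next_page, callback=self.parse)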

Step 3: Save the Data

Scrapy offers four simple ways to save the scraped data, using the -o option to output a file in the specified format. The commands are as follows:

    # JSON format
    scrapy crawl spider_name -o file_name.json

    # JSON Lines format
    scrapy crawl spider_name -o file_name.jsonl

    # CSV (comma-separated values, can be opened in Excel)
    scrapy crawl spider_name -o file_name.csv

    # XML format
    scrapy crawl spider_name -o file_name.xml

Item Pipelines

When an Item has been collected in the Spider, it is passed to the Item Pipeline, whose components process it in the order they are defined.

Each Item Pipeline component is a Python class that implements a few simple methods, for example deciding whether an item should be discarded or stored. Here are some typical uses of item pipelines:

    • Validating crawled data (checking that an item contains certain fields, such as a name field)
    • Checking for duplicates (and discarding them); a sketch combining this with the validation above follows this list
    • Saving crawl results to a file or database
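As a minimal sketch of the first two uses (the name field and the class name are hypothetical, matching the earlier item example), a pipeline that validates and de-duplicates items could look like this:

    from scrapy.exceptions import DropItem

    class DedupValidationPipeline(object):
        """Hypothetical pipeline: validates a required field and drops duplicates."""

        def __init__(self):
            self.seen_names = set()

        def process_item(self, item, spider):
            # Validation: drop items that are missing the (assumed) name field
            if not item.get("name"):
                raise DropItem("Missing name in %s" % item)
            # Duplicate check: drop items whose name has already been seen
            if item["name"] in self.seen_names:
                raise DropItem("Duplicate item found: %s" % item)
            self.seen_names.add(item["name"])
            return item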

  

Writing an item pipeline is simple. Each pipeline component is a standalone Python class in which the process_item() method must be implemented:

    import something

    class SomethingPipeline(object):
        def __init__(self):
            # Optional: initialize parameters (e.g. open the file to save to), etc.
            # do something
            pass

        def process_item(self, item, spider):
            # item (Item object) - the item being scraped
            # spider (Spider object) - the spider that scraped the item
            # This method must be implemented; every item pipeline component calls it.
            # It must return an Item object; a dropped item is not processed by
            # any later pipeline components.
            return item

        def open_spider(self, spider):
            # spider (Spider object) - the spider that was opened
            # Optional: called when the spider is opened
            pass

        def close_spider(self, spider):
            # spider (Spider object) - the spider that was closed
            # Optional: called when the spider is closed
            pass
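A pipeline only runs if it is enabled in settings.py. Assuming the project and class names used in the skeleton above, the setting would look roughly like this; the number controls the order in which pipelines run (lower values run first, typically in the 0-1000 range):

    # settings.py
    ITEM_PIPELINES = {
        "myproject.pipelines.SomethingPipeline": 300,
    }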

