Scrapy Project Structure
```
scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
```
File descriptions:
- scrapy.cfg: the project's configuration file
- myproject/: the project's Python module; your code is imported from here
- myproject/items.py: the project's items file, defining the target data to extract
- myproject/pipelines.py: the project's pipelines file
- myproject/settings.py: the project's settings file
- myproject/spiders/: the directory where spider code is stored
Project Steps
Step 1: Define the Target
- Open the items.py file.
- Items define structured data fields to hold the scraped data. They work much like Python dicts, but provide additional protection against setting undeclared fields.
- Next, create a subclass of scrapy.Item and build the item model.
For example:
```python
import scrapy

class BeautyItem(scrapy.Item):
    # The field names were lost in the original; these are illustrative.
    name = scrapy.Field()
    url = scrapy.Field()
    image = scrapy.Field()
```
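The "additional protection" an Item gives over a plain dict can be illustrated without Scrapy at all. The sketch below is not Scrapy's actual implementation; SimpleItem, DemoItem, and its field names are invented stand-ins showing the idea that only declared fields may be set, so typos fail immediately:

```python
# Plain-Python sketch of the Item-over-dict protection idea
# (NOT Scrapy's implementation; class and field names are illustrative).

class SimpleItem(dict):
    fields = ()  # subclasses declare their allowed field names here

    def __setitem__(self, key, value):
        # Reject any key that was not declared up front.
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

class DemoItem(SimpleItem):
    fields = ("name", "url")  # illustrative field names

item = DemoItem()
item["name"] = "example"   # fine: declared field
try:
    item["nmae"] = "oops"  # typo in the key -> caught immediately
except KeyError as e:
    print("rejected:", e)
```

A plain dict would silently accept the misspelled key; the declared-fields check is what turns that silent bug into an immediate error.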
Step 2: Create the Spider
- Enter the following command in the project directory:

```
scrapy genspider spider_name "allowed_domains"
```

This creates a spider file named spider_name.py under the myproject/spiders/ directory, with the spider's scope set (allowed_domains limits the crawl to pages under that domain name).
- The spider_name.py file in that directory contains the following code by default:

```python
import scrapy

class Spider_name(scrapy.Spider):
    name = "spider_name"
    allowed_domains = ["allowed_domains"]
    start_urls = ['http://www.allowed_domains']

    def parse(self, response):
        pass
```
In fact, we could create the spider file and write this code by hand, but using the command saves the trouble of writing the boilerplate.
To create a spider, you must subclass scrapy.Spider and define two mandatory attributes and one method:
- name = "": the spider's identifying name. It must be unique; different spiders must have different names.
- allowed_domains = []: the domain scope of the crawl, i.e. the spider's constrained area. The spider only crawls pages under these domain names; URLs outside them are ignored. (Optional attribute.)
- start_urls = []: the list of URLs to start crawling from. The first pages downloaded are these URLs; further URLs are generated from what is extracted from these starting pages.
- parse(self, response): the parsing method. It is called when the download of each initial URL completes, with the Response object returned for that URL passed as its only argument. Its main jobs are:
  - parsing the returned page data (response.body) and extracting structured data (generating items)
  - generating the requests for the next pages to crawl
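The two jobs of parse() can be sketched without running Scrapy. In this schematic (not real Scrapy code), FakeResponse stands in for Scrapy's Response object, and the yielded dicts stand in for Items and follow-up Requests:

```python
# Schematic sketch of what a parse() method does (not real Scrapy code).

class FakeResponse:
    """Stand-in for scrapy's Response object."""
    def __init__(self, body, next_page=None):
        self.body = body            # raw page data
        self.next_page = next_page  # URL of the next page, if any

def parse(response):
    # Job 1: extract structured data from response.body
    # (here, trivially, one item per word).
    for word in response.body.split():
        yield {"word": word}  # stands in for yielding an Item
    # Job 2: generate the request for the next page, if there is one.
    if response.next_page:
        yield {"follow": response.next_page}  # stands in for a Request

results = list(parse(FakeResponse("hello world", next_page="http://www.allowed_domains/page2")))
print(results)
```

In a real spider both kinds of yielded objects go back to the Scrapy engine: items flow on to the pipelines, while requests are scheduled for download and fed to parse() again.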
Save Data
Scrapy offers four simple ways to save scraped data, using the -o flag to write output in a given format:
```
# JSON format
scrapy crawl spider_name -o file_name.json

# JSON Lines format
scrapy crawl spider_name -o file_name.jsonl

# CSV (comma-separated values; can be opened in Excel)
scrapy crawl spider_name -o file_name.csv

# XML format
scrapy crawl spider_name -o file_name.xml
```
Item Pipelines
When an item has been collected in the spider, it is passed to the Item Pipeline, whose components process it in the order they are defined.
Each item pipeline is a Python class that implements a simple method, deciding for example whether the item is discarded or stored. Typical applications of item pipelines include:
- validating scraped data (checking that the item contains certain fields, say a name field)
- checking for (and discarding) duplicates
- saving scraped results to a file or database
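Pipeline components only run if they are enabled in settings.py; the ITEM_PIPELINES setting also fixes the processing order. The values are integers from 0 to 1000, and lower values run first. The pipeline path here assumes the myproject layout from the structure above:

```python
# settings.py -- enable pipeline components and set their order.
# Lower values (0-1000) run earlier in the pipeline chain.
ITEM_PIPELINES = {
    "myproject.pipelines.SomethingPipeline": 300,
}
```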
Writing an item pipeline is simple: each item pipeline component is a standalone Python class in which the process_item() method must be implemented:
```python
import something

class SomethingPipeline(object):
    def __init__(self):
        # Optional: initialize parameters (e.g. open the output file), etc.
        # do something
        pass

    def process_item(self, item, spider):
        # item (Item object) - the item being scraped
        # spider (Spider object) - the spider that scraped the item
        # This method must be implemented; every item pipeline component
        # calls it. It must return an Item object; a discarded item is not
        # processed by later pipeline components.
        return item

    def open_spider(self, spider):
        # spider (Spider object) - the spider being opened
        # Optional: called when the spider is opened.
        pass

    def close_spider(self, spider):
        # spider (Spider object) - the spider being closed
        # Optional: called when the spider is closed.
        pass
```
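As a concrete case, here is a minimal sketch of the duplicates check listed among the typical applications above. In real Scrapy you would raise scrapy.exceptions.DropItem to discard an item; a local DropItem class is defined here (and plain dicts are used as items) so the example runs standalone, and the "name" field it keys on is an assumption:

```python
# Minimal duplicates-filter pipeline sketch (standalone, not real Scrapy).

class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""

class DuplicatesPipeline:
    def __init__(self):
        self.seen_names = set()  # names of items already processed

    def process_item(self, item, spider):
        name = item["name"]  # assumes each item carries a "name" field
        if name in self.seen_names:
            # Discarding: later pipeline components never see this item.
            raise DropItem(f"duplicate item found: {name!r}")
        self.seen_names.add(name)
        return item

pipeline = DuplicatesPipeline()
pipeline.process_item({"name": "a"}, spider=None)  # first time: passes through
try:
    pipeline.process_item({"name": "a"}, spider=None)  # second time: dropped
except DropItem as e:
    print("dropped:", e)
```

Because the engine calls process_item() on every collected item in order, keeping the seen-set inside the pipeline instance is enough to deduplicate a whole crawl.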