Python Basic Crawler Framework: Scrapy


The example crawls: http://quotes.toscrape.com/page/1/.

I. Create a New Project

II. Define the Target

III. Make the Spider

def parse(self, response):
    l = ItemLoader(item=QuotesItem(), response=response)
    l.add_xpath('text', '//div/span/text()')
    l.add_xpath('author', '//div/small/text()')
    l.add_xpath('tags', '//div[@class="tags"]/a/text()')
    return l.load_item()
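This snippet assumes a QuotesItem with text, author, and tags fields, plus the import from scrapy.loader import ItemLoader in the spider. A minimal sketch of what items.py might contain (the field names come from the add_xpath() calls above; the rest is an assumption):

import scrapy

class QuotesItem(scrapy.Item):
    # Field names match the add_xpath() calls in parse() above.
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()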

Before moving on, let's clarify two concepts used below: input processors and output processors.

Built-in processors

Although you can use any callable as an input or output processor, Scrapy provides some commonly used processors out of the box. Some of them, such as MapCompose (commonly used as an input processor), compose the output of several functions executed in order to produce the final value.

Here are some of the built-in processors:

1. Identity

class scrapy.loader.processors.Identity

The simplest processor: it returns the original values unchanged, without any processing. It takes no arguments.
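For illustration, a minimal interactive example in the same style as the examples below:

>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['one', 'two', 'three'])
['one', 'two', 'three']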

2. TakeFirst

class scrapy.loader.processors.TakeFirst

Returns the first non-null/non-empty value from the received values. It is commonly used as the output processor for single-valued fields. It takes no arguments.

Examples are as follows:

>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'one', 'two', 'three'])
'one'
3. Join

class scrapy.loader.processors.Join(separator=u' ')

Returns the values joined with the separator, which defaults to a single space (u' '). It does not accept loader contexts.

When using the default separator, this processor is equivalent to the function:

u' '.join

Examples of Use:

>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['one', 'two', 'three'])
u'one two three'
>>> proc = Join('<br>')
>>> proc(['one', 'two', 'three'])
u'one<br>two<br>three'
4. Compose

class scrapy.loader.processors.Compose(*functions, **default_loader_context)

A processor built from the composition of the given functions. Each input value is passed to the first function; its output is passed to the second function, and so on, until the last function returns the output of the whole processor.

By default, processing stops when a None value is encountered. This behavior can be changed by passing stop_on_none=False.

Examples of Use:

>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: v[0], str.upper)
>>> proc(['hello', 'world'])
'HELLO'
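To illustrate the stop_on_none behavior mentioned above, a small sketch (the lambdas here are hypothetical, not from the original):

>>> from scrapy.loader.processors import Compose
>>> proc = Compose(lambda v: None, str.upper)
>>> proc(['hello'])          # first function returns None, processing stops
>>> proc = Compose(lambda v: None, str, stop_on_none=False)
>>> proc(['hello'])          # None is passed on to str()
'None'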

Each function can optionally receive a loader_context parameter.
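A sketch of how a function can read the loader context, assuming a made-up append_site function and site key (Compose passes its extra keyword arguments as the default loader context):

>>> from scrapy.loader.processors import Compose
>>> def append_site(values, loader_context):
...     site = loader_context.get('site', 'unknown')
...     return [v + ' @ ' + site for v in values]
...
>>> proc = Compose(append_site, site='quotes.toscrape.com')
>>> proc(['hello'])
['hello @ quotes.toscrape.com']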

5. MapCompose

class scrapy.loader.processors.MapCompose(*functions, **default_loader_context)

Similar to the Compose processor; the difference is in how the results are passed between functions internally:

    • The input value is iterated over, and each element is passed to the first function individually. The results of these calls are concatenated to form a new iterable, which is passed to the second function, and so on, until the last function. The outputs of the last function are concatenated to form the output of the processor.

    • Each function can return a value or a list of values; it can also return None, in which case that element is ignored and not passed on to the next function.

    • This processor therefore provides a convenient way to compose functions that each operate on a single value. It is commonly used as an input processor, because values extracted with extract() are a list of unicode strings.

The following example illustrates how the processor works:

>>> def filter_world(x):
...     return None if x == 'world' else x
...
>>> from scrapy.loader.processors import MapCompose
>>> proc = MapCompose(filter_world, unicode.upper)
>>> proc([u'hello', u'world', u'this', u'is', u'scrapy'])
[u'HELLO', u'THIS', u'IS', u'SCRAPY']

Like the Compose processor, it also accepts a loader context.

6. SelectJmes

class scrapy.loader.processors.SelectJmes(json_path)

Queries the value at the given JSON path and returns the result. Requires jmespath (https://github.com/jmespath/jmespath.py). It accepts one input at a time.

Example:

>>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("foo")  # for direct use on lists and dictionaries
>>> proc({'foo': 'bar'})
'bar'
>>> proc({'foo': {'bar': 'baz'}})
{'bar': 'baz'}

Used with JSON:

>>> import json
>>> proc_single_json_str = Compose(json.loads, SelectJmes("foo"))
>>> proc_single_json_str('{"foo": "bar"}')
u'bar'
>>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('foo')))
>>> proc_json_list('[{"foo":"bar"}, {"baz":"tar"}]')
[u'bar']
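To tie these processors back to the parse() example at the top, here is a minimal sketch of a loader subclass; the QuotesLoader name and the particular processor choices are assumptions, not from the original. Scrapy looks up field-specific processors via the <field>_in and <field>_out class attributes:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class QuotesLoader(ItemLoader):
    # Single-valued fields keep only the first non-empty value.
    default_output_processor = TakeFirst()
    # Input processor: strip whitespace from each extracted string.
    text_in = MapCompose(lambda s: s.strip())
    # Output processor: join the tag list into one comma-separated string.
    tags_out = Join(', ')

parse() would then build the loader with l = QuotesLoader(item=QuotesItem(), response=response) instead of the plain ItemLoader.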
