Scrapy – Requests and Responses

Requests and responses

Scrapy uses Request and Response objects for crawling web sites.

Typically, Request objects are generated in the spiders and passed across the system until they reach the downloader, which executes the request and returns a Response object that travels back to the spider that issued the request.

The paragraph above may sound abstract; anyone who has worked with the web should recognize the flow, and the simple diagram below makes it concrete.

Spider -> Request: create
Request -> Response: download the data
Response -> Spider: data

Both Request and Response have subclasses which add functionality not required in the base classes. These are described below in Request subclasses and Response subclasses.

Request objects

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

A Request object represents an HTTP request, which is usually generated in the spider and executed by the downloader, thus generating a Response.

Parameters:

- url (string) – the URL of this request.
- callback (callable) – the function that will be called with the response of this request (once downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a request doesn't specify a callback, the spider's parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
- method (string) – the HTTP method of this request. Defaults to 'GET'.
- meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow-copied.
- body (str or unicode) – the request body. If a unicode is passed, it is encoded to str using the encoding passed (which defaults to utf-8). If body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored will be a str (never unicode or None).
- headers (dict) – the headers of this request. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers). If None is passed as a value, the HTTP header will not be sent at all.

- cookies (dict or list) – the request cookies. These can be sent in two forms. Using a dict:

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})

Using a list:

request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])

The latter form allows customizing the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.

When some site returns cookies (in a response), those are stored in the cookies for that domain and will be sent again in future requests. That's the typical behaviour of any regular web browser. However, if, for some reason, you want to avoid merging with existing cookies, you can instruct Scrapy to do so by setting the dont_merge_cookies key to True in Request.meta.

Example of a request without merging cookies:

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})

For more information, see CookiesMiddleware.

- encoding (string) – the encoding of this request (defaults to 'utf-8'). This encoding will be used to percent-encode the URL and to convert the body to str (if given as unicode).
- priority (int) – the priority of this request (defaults to 0). The priority is used by the scheduler to define the order used to process requests. Requests with a higher priority value will execute earlier. Negative values are allowed in order to indicate relatively low priority.
- dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.

- errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as its first parameter. For more information, see Using errbacks to catch exceptions in request processing below.
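To make these constructor arguments concrete, here is a minimal sketch of a spider method that combines several of them; the URL, callback and errback names are illustrative placeholders, not part of the original text:

import scrapy

def start_requests(self):
    # hypothetical spider method; parse_api/handle_error are assumed callbacks
    yield scrapy.Request(
        url="http://www.example.com/api/items",
        callback=self.parse_api,                # called with the downloaded Response
        errback=self.handle_error,              # called if an exception is raised
        method="GET",                           # the default, shown for clarity
        headers={"Accept": "application/json"},
        meta={"page": 1},                       # arbitrary metadata, shallow-copied
        priority=10,                            # scheduled before priority-0 requests
        dont_filter=True,                       # bypass the duplicates filter
    )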

url
A string containing the URL of this request. Keep in mind that this attribute contains the escaped URL, so it can differ from the URL passed in the constructor.

This attribute is read-only. To change the URL of a Request use replace().

method
A string representing the HTTP method in the request. This is guaranteed to be uppercase. Example: "GET", "POST", "PUT", etc.

headers
A dictionary-like object that contains the request headers.

body
A str that contains the request body.

This attribute is read-only. To change the body of a Request use replace().

meta
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled.

For a list of special meta keys recognized by Scrapy, see Request.meta special keys.

This dict is shallow-copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.

copy()
Returns a new Request which is a copy of this Request. See also: Passing additional data to callback functions.

replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback])
Returns a Request object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Request.meta is copied by default (unless a new value is given in the meta argument). See also Passing additional data to callback functions.
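Since attributes such as url and body are read-only, replace() is the way to derive a modified copy of an existing request. A minimal sketch (the URL is a placeholder, not from the original text):

def parse(self, response):
    # build a copy of the request that produced this response, pointing at a
    # different URL; all other members (callback, meta, headers, ...) are kept
    yield response.request.replace(url="http://www.example.com/page/2")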

Passing additional data to callback functions

The callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.

Example:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback. You can use the Request.meta attribute for that.

Here's an example of how to pass an item using this mechanism, to populate different fields from different pages:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
Using Errbacks to catch exceptions in request processing

The errback of a request is a function that will be called when an exception is raised while processing it.

It receives a Twisted Failure instance as its first parameter and can be used to track connection establishment timeouts, DNS errors, etc.

Here's an example spider logging all errors and catching some specific errors if needed:

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from the HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
Request.meta Special Keys

The Request.meta attribute can contain any arbitrary data, but there are some special keys recognized by Scrapy and its built-in extensions.

Those are:

dont_redirect
dont_retry
handle_httpstatus_list
handle_httpstatus_all
dont_merge_cookies (see the cookies parameter of the Request constructor)
cookiejar
dont_cache
redirect_urls
bindaddress
dont_obey_robotstxt
download_timeout
download_maxsize
download_latency
proxy
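As a rough, hedged illustration (not part of the original text), some of these keys are set when the request is created, while others, like download_latency, are filled in by Scrapy and read back from response.meta:

import scrapy

def start_requests(self):
    yield scrapy.Request(
        "http://www.example.com/",
        meta={
            'dont_redirect': True,             # do not follow redirects for this request
            'handle_httpstatus_list': [404],   # let the callback receive 404 responses
            'download_timeout': 30,            # per-request timeout, in seconds
            'proxy': 'http://127.0.0.1:8118',  # route this request through a proxy
        },
        callback=self.parse_page,
    )

def parse_page(self, response):
    # download_latency becomes available once the response has been downloaded
    self.logger.info("Fetched %s in %.2f seconds",
                     response.url, response.meta['download_latency'])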
bindaddress

The IP of the outgoing IP address to use for performing the request.

download_timeout

The amount of time (in seconds) that the downloader will wait before timing out. See also: DOWNLOAD_TIMEOUT.

download_latency

The amount of time spent to fetch the response, since the request has been started, i.e. the HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.

Request subclasses

Here is the list of built-in Request subclasses. You can also subclass Request to implement your own custom functionality.

FormRequest objects
The FormRequest class extends the base Request with functionality for dealing with HTML forms. It uses lxml.html forms to pre-populate form fields with form data from Response objects.

class scrapy.http.FormRequest(url[, formdata, ...])

The FormRequest class adds a new argument to the constructor. The remaining arguments are the same as for the Request class and are not documented here.

Parameters:

- formdata (dict or iterable of tuples) – a dict (or iterable of (key, value) tuples) containing HTML form data which will be url-encoded and assigned to the body of the request.
FormRequest objects support the following class method, in addition to the standard Request methods:

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

Returns a new FormRequest object with its form field values pre-populated with those found in the HTML <form> element contained in the given response. For an example, see Using FormRequest.from_response() to simulate a user login.

The from_response() policy is, by default, to automatically simulate a click on any form control that looks clickable, such as an <input type="submit">. Even though this is quite convenient, and often the desired behaviour, sometimes it can cause problems which are hard to debug. For example, when working with forms that are filled and/or submitted using JavaScript, the default behaviour may not be the most appropriate. To disable this behaviour you can set the dont_click argument to True. Also, if you want to change the control clicked (instead of disabling it) you can use the clickdata argument.

Parameters:

- response (Response object) – the response containing the HTML form which will be used to pre-populate the form fields.
- formname (string) – if given, the form whose name attribute is set to this value will be used.
- formid (string) – if given, the form whose id attribute is set to this value will be used.
- formxpath (string) – if given, the first form that matches the XPath will be used.
- formcss (string) – if given, the first form that matches the CSS selector will be used.
- formnumber (integer) – the number of the form to use, when the response contains multiple forms. The first one (and also the default) is 0.
- formdata (dict) – fields to override in the form data. If a field was already present in the response <form> element, its value is overridden by the one passed in this parameter.
- clickdata (dict) – attributes to look up the control clicked. If it's not given, the form data will be submitted simulating a click on the first clickable element. In addition to HTML attributes, the control can be identified by its nr attribute, a zero-based index relative to the other submittable inputs inside the form.
- dont_click (boolean) – if True, the form data will be submitted without clicking on any element.
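As a hedged sketch of the clickdata and dont_click arguments described above (the field and button names are hypothetical, not from the original text):

import scrapy

def parse_search_page(self, response):
    # simulate a click on a specific control, identified here by its HTML
    # name attribute (hypothetical)
    yield scrapy.FormRequest.from_response(
        response,
        formdata={'q': 'scrapy'},
        clickdata={'name': 'search_button'},
        callback=self.parse_results,
    )
    # or submit the form data without clicking any control at all
    yield scrapy.FormRequest.from_response(
        response,
        formdata={'q': 'scrapy'},
        dont_click=True,
        callback=self.parse_results,
    )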

The other parameters of this class method are passed directly to the FormRequest constructor.

New in version 0.10.3: the formname parameter.
New in version 0.17: the formxpath parameter.
New in version 1.1.0: the formcss parameter.
New in version 1.1.0: the formid parameter.

Request usage examples

Using FormRequest to send data via HTTP POST

If you want to simulate an HTML form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
Using FormRequest.from_response() to simulate a user login

It is usual for web sites to provide pre-populated form fields through <input type="hidden"> elements, such as session-related data or authentication tokens (for login pages). When scraping, you'll want these fields to be automatically pre-populated and only a couple of them overridden, such as the user name and password. You can use the FormRequest.from_response() method for this job. Here's an example spider which uses it:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
Response objects

class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])
A Response object represents an HTTP response, which is usually downloaded (by the downloader) and fed to the spiders for processing.

Parameters:

- url (string) – the URL of this response.
- status (integer) – the HTTP status of the response. Defaults to 200.
- headers (dict) – the headers of this response. The dict values can be strings (for single-valued headers) or lists (for multi-valued headers).
- body (str) – the response body. It must be str, not unicode, unless you're using an encoding-aware Response subclass such as TextResponse.
- flags (list) – a list containing the initial values for the Response.flags attribute. If given, the list will be shallow-copied.
- request (Request object) – the initial value of the Response.request attribute. This represents the Request that generated this response.
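The constructor is rarely called by hand during a crawl, but building a response directly can be useful, for example when exercising a parse callback in a test. A minimal sketch using the encoding-aware TextResponse subclass (the URL and body are made up):

from scrapy.http import TextResponse

# a fabricated response for feeding a callback outside of a real crawl
fake_response = TextResponse(
    url="http://www.example.com/item/1",
    status=200,
    headers={"Content-Type": "text/html; charset=utf-8"},
    body=b"<html><body><h1>Item 1</h1></body></html>",
)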

url
A string containing the URL of the response.

This attribute is read-only. To change the URL of a Response use replace().

status
An integer representing the HTTP status of the response. Example: 200, 404.

headers
A dictionary-like object which contains the response headers. Values can be accessed using get() to return the first header value with the specified name, or getlist() to return all header values with the specified name. For example, this call will give you all cookies in the headers:

response.headers.getlist('Set-Cookie')
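Whereas get() returns only the first value for a header, for example:

response.headers.get('Content-Type')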

body
The body of this Response. Keep in mind that Response.body is always a bytes object. If you want the unicode version, use TextResponse.text (only available in TextResponse and subclasses).

This attribute is read-only. To change the body of a Response use replace().

request
The Request object that generated this response. This attribute is assigned in the Scrapy engine, after the response and the request have passed through all the downloader middlewares. In particular, this means that:

- HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection).
- Response.request.url doesn't always equal Response.url.
- This attribute is only available in spider code and in spider middlewares, but not in downloader middlewares (although you have the Request available there by other means) and in the response_downloaded signal handler.
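A small sketch of the distinction between Response.url and Response.request.url inside a spider callback (the redirect behaviour is as described above; the logging is illustrative):

def parse(self, response):
    # after an HTTP redirect, the URL of the assigned request can differ
    # from the final response URL
    if response.request.url != response.url:
        self.logger.info("Arrived at %s via %s",
                         response.url, response.request.url)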

meta
A shortcut to the Request.meta attribute of the Response.request object (i.e. self.request.meta).

Unlike the Response.request attribute, the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.

See also

The Request.meta attribute

flags
A list that contains the flags for this response. Flags are labels used for tagging responses. For example: 'cached', 'redirected', etc. They are shown on the string representation of the Response (the __str__ method), which is used by the engine for logging.

copy()
Returns a new Response which is a copy of this Response.

replace([url, status, headers, body, request, flags, cls])
Returns a Response object with the same members, except for those members given new values by whichever keyword arguments are specified. The attribute Response.meta is copied by default.

urljoin(url)
Constructs an absolute URL by combining the Response's URL with a possible relative URL.

This is a wrapper over urlparse.urljoin; it's merely an alias for making this call:

urlparse.urljoin(response.url, url)
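For instance, a relative link extracted from the page can be turned into an absolute URL for a follow-up request; a small sketch assuming `import scrapy` at module level and a hypothetical "next page" link selector:

def parse(self, response):
    # href may be a relative URL such as '/products?page=2'
    href = response.css('a.next::attr(href)').extract_first()
    if href:
        yield scrapy.Request(response.urljoin(href), callback=self.parse)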
Response subclasses

Here is the list of available built-in Response subclasses. You can also subclass the Response class to implement your own functionality.

TextResponse objects

class scrapy.http.TextResponse(url[, encoding[, ...]])

TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file.

TextResponse objects support a new constructor argument, in addition to the base Response objects. The remaining functionality is the same as for the Response class and is not documented here.

Parameters:

- encoding (string) – a string which contains the encoding to use for this response. If you create a TextResponse object with a unicode body, it will be encoded using this encoding (remember the body attribute is always a string). If encoding is None (the default), the encoding will be looked up in the response headers and body instead.
TextResponse objects support the following attributes in addition to the standard Response ones:

text
The response body, as unicode.

The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.

Note
unicode(response.body) is not a correct way to convert the response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.

encoding
A string with the encoding of this response. The encoding is resolved by trying the following mechanisms, in order:

1. the encoding passed in the constructor encoding argument
2. the encoding declared in the Content-Type HTTP header. If this encoding is not valid (i.e. unknown), it is ignored and the next resolution mechanism is tried.
3. the encoding declared in the response body. The TextResponse class doesn't provide any special functionality for this. However, the HtmlResponse and XmlResponse classes do.
4. the encoding inferred by looking at the response body. This is the more fragile method but also the last one tried.
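As a small, hedged illustration of the first rule (the constructor argument takes precedence over header and body detection); the URL and bytes here are arbitrary:

from scrapy.http import TextResponse

resp = TextResponse(
    url="http://www.example.com/legacy",
    body=b"\xe9\xe9\xe9",   # bytes that decode as 'ééé' in latin-1
    encoding="latin-1",      # overrides whatever the headers/body would declare
)
print(resp.encoding)  # the explicitly passed encoding
print(resp.text)      # body decoded using that encoding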

selector
A Selector instance using the response as target. The selector is instantiated lazily on first access.

TextResponse objects support the following methods in addition to the standard Response ones:

xpath(query)
A shortcut to TextResponse.selector.xpath(query):

response.xpath('//p')

css(query)
A shortcut to TextResponse.selector.css(query):

response.css('p')

body_as_unicode()
The same as text, but available as a method. This method is kept for backwards compatibility; please prefer response.text.

HtmlResponse objects

class scrapy.http.HtmlResponse(url[, ...])
The HtmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the HTML meta http-equiv attribute. See TextResponse.encoding.

XmlResponse objects

class scrapy.http.XmlResponse(url[, ...])
The XmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the XML declaration line. See TextResponse.encoding.
