Scrapy -- 04

Source: Internet
Author: User

The translated documentation on the official website is of good quality. What follows are my own notes and translation. T_T

Spider
  • class scrapy.spider.Spider # The official documentation describes several other spider classes, such as CrawlSpider, but this one already covers a lot of ground. On GitHub the usage ratio is roughly 30000 to 4300 in favor of Spider. If CrawlSpider turns out to be easier to use, please let me know.

  • Spider is the simplest spider. Every other spider must inherit from this class (including the other spiders bundled with Scrapy as well as your own). Spider does not provide any special functionality; it simply requests the given start_urls/start_requests and calls the spider's parse method on each resulting response.

    • closed(reason)

    • Called when the spider closes. This method provides a shortcut to calling signals.connect() for the spider_closed signal.
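      As an illustration, a minimal sketch of overriding closed() on a spider (the spider name and URL below are placeholders, not from the original text):

      import scrapy

      class MySpider(scrapy.Spider):
          name = "example"
          start_urls = ["http://www.example.com"]

          def parse(self, response):
              pass

          def closed(self, reason):
              # called once when the spider finishes; reason is a string such as 'finished'
              self.log("spider closed: %s" % reason)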

    • log(message[, level, component])

    • Logs a message via the scrapy.log.msg() helper. The spider's name attribute is automatically included in the log record. For more information, see Logging.
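      A small sketch of using it from inside a spider callback (this assumes the older scrapy.log API described here; newer Scrapy versions expose self.logger instead):

      from scrapy import log

      def parse(self, response):
          # the spider's name attribute is attached to the record automatically
          self.log("visited %s" % response.url, level=log.INFO)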

    • parse(response)

    • This is Scrapy's default method for processing a downloaded response when the Request does not specify a callback.

      parse is responsible for processing the response and returning the extracted data and/or follow-up URLs. The same requirements apply to the callbacks of any other Request the spider issues.

      This method, like any other Request callback, must return an iterable containing Request and/or Item objects.

      Parameters: response (Response) – the response to parse
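      As a sketch (the XPath expressions, field names and URLs below are made up, and plain dicts are only accepted as items in Scrapy >= 1.0), a parse method that returns both extracted data and follow-up Requests could look like this:

      import scrapy

      class BooksSpider(scrapy.Spider):
          name = "books"
          start_urls = ["http://www.example.com/"]

          def parse(self, response):
              # yield extracted data as items
              for title in response.xpath("//h2/text()").extract():
                  yield {"title": title}
              # yield follow-up Requests; Scrapy schedules them and calls the callback on each response
              for href in response.xpath("//a/@href").extract():
                  yield scrapy.Request(response.urljoin(href), callback=self.parse)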
    • make_requests_from_url(url)

    • Accepts a URL and returns a Request object for crawling. It is called by start_requests() to build the initial requests, and is also used to convert URLs into Requests.

      Unless overridden, the Request returned by this method uses parse() as its callback and has the dont_filter parameter enabled (see Request for details).
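      For reference, the default implementation is roughly equivalent to this sketch:

      def make_requests_from_url(self, url):
          # no explicit callback: Scrapy falls back to the spider's parse();
          # dont_filter=True means this request is not checked by the duplicates filter
          return scrapy.Request(url, dont_filter=True)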

    • start_requests() (A note when using this for login verification: use a throwaway account and be careful not to get banned!!)

    • This method must return an iterable object containing the first Requests the spider will use to crawl.

      Scrapy calls this method when the spider starts crawling and no particular URLs are specified. When URLs are specified, make_requests_from_url() is used instead to create the Requests. This method is called only once by Scrapy, so it is safe to implement it as a generator.

      The default implementation generates a Request for each URL in start_urls.

      Override this method if you want to change the Requests used to start crawling a site. For example, if you need to log in with a POST request at startup, you could write:

      def start_requests(self):
          return [scrapy.FormRequest("http://www.example.com/login",
                                     formdata={'user': 'john', 'pass': 'secret'},
                                     callback=self.logged_in)]

      def logged_in(self, response):
          # here you would extract links to follow and return Requests for
          # each of them, with another callback
          pass
    • start_urls

    • A list of URLs. When no particular URLs are specified, the spider starts crawling from this list, so the first pages fetched will come from it. Subsequent URLs are then extracted from the fetched data.

    • allowed_domains

    • Optional. A list of the domain names this spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domains are not in this list will not be followed. (Pay close attention to this, or you will be left wondering why nothing gets crawled; the DEBUG message is easy to miss. A combined skeleton appears after this list.)

    • name

    • A string that defines the name of this spider. The name is how Scrapy locates (and instantiates) the spider, so it must be unique. There is, however, nothing stopping you from instantiating multiple instances of the same spider. name is the most important spider attribute, and it is required.

      If the spider crawls a single website (a single domain), a common practice is to name the spider after that website (domain), with or without the TLD suffix. For example, a spider that crawls mywebsite.com would usually be named mywebsite.
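Putting the attributes above together, a minimal spider skeleton (the name, domain and URL here are placeholders) looks roughly like this:

import scrapy

class MywebsiteSpider(scrapy.Spider):
    # required; must be unique among your spiders
    name = "mywebsite"
    # requests to other domains are dropped when OffsiteMiddleware is enabled
    allowed_domains = ["mywebsite.com"]
    # crawling starts from these URLs when no other start is specified
    start_urls = ["http://www.mywebsite.com/"]

    def parse(self, response):
        self.log("visited %s" % response.url)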

The above is what scrapy.spider.Spider consists of. The parse method must return an iterable of Requests and Items. That it returns Items is easy to understand, but why return Requests? Because Scrapy calls the spider's parse method, iterates over the objects it returns, and executes the callback attached to each Request. For details on the Scrapy architecture, this blog post is recommended: http://blog.csdn.net/frylion/article/details/8558538.

With the features above we can do things such as recursive crawling. For example, like this:

# To be improved; the boss won't let us talk about the details T_T
def parse(self, response):
    for i in response.xpath(<xpath expression>):
        url = <chapter URL>
        yield scrapy.Request(url, callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.log("visited %s" % response.url)


In addition, if you need to capture some information on the parent page and other information on a child page, you can use the meta field of scrapy.Request: meta is a dictionary that is copied to response.meta of the response handled by the child callback. You add entries to meta in the parent parse function and read them back via response.meta in the child parse function. Example:

# This is not used at the moment...
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item

To be continued
