Introduction to Web Crawlers -- Scrapy


This article starts from practice: it shows how a web crawler works and then introduces a popular crawler framework, Scrapy.

1. The process of a web crawler

A so-called web crawler is a program that simulates a browser's behavior to visit websites and obtain web page information. Because it is a program, it can fetch pages far faster than any human clicking by hand :). This is useful whenever a large amount of web information is required.
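As a minimal illustration of that idea (a sketch written for this article, using Python 2's standard urllib2 module to match the code style used later; the User-Agent string is just an example), fetching one page while pretending to be a browser looks like this:

import urllib2

url = "http://www.cnblogs.com/rubinorth"
# send a browser-like User-Agent header so the site treats us like a normal browser
req = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})
page = urllib2.urlopen(req).read()   # page now holds the raw HTML source
print len(page)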

The process of crawling the web is: visit the original URL, get the returned page, extract new URLs from that page and put them into a queue of URLs to crawl, visit a new URL, and so on in a loop. Overall it is a breadth-first process, and of course the new URLs do not have to come only from the returned page.

A simple web crawler should include the following parts:

    1. A URL queue. Our crawler reads URLs from this queue and puts new URLs into it. The most important thing here is deduplication: checking whether a URL has already been seen. A simple hash set can do the job, but to save space (the number of URLs is often huge), the idea of a Bloom filter is generally used instead. The biggest difference between a Bloom filter and an ordinary hash table is that a Bloom filter needs only one bit per position to indicate whether an element exists, so it saves space. A Bloom filter has a small disadvantage: its accuracy is not one hundred percent. When checking whether an element is present, there is a very small chance that an element which was never added is reported as present (a false positive), but an element that was added will never be reported as absent (a minimal sketch of this idea follows after this list).
    2. A page fetching module. It needs to be able to imitate a browser and send requests.
    3. A page parsing module. What we crawl down is the page source; we can use regular expressions or other methods to extract the information we need.
    4. A new-URL generation module. It generates new URLs and puts them into the queue.
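To make the deduplication idea in item 1 concrete, here is a minimal Bloom-filter sketch (written for this article, not production code; the bit size, the number of hashes, and the has/put method names are chosen to match the pseudo-interface used in the crawler below):

import hashlib

class BloomFilter(object):
    def __init__(self, bit_size=1 << 20, num_hashes=3):
        self.bit_size = bit_size
        self.num_hashes = num_hashes
        self.bits = bytearray(bit_size // 8)   # one bit per position, packed into bytes

    def _positions(self, url):
        # derive several bit positions from salted md5 digests of the URL
        for i in range(self.num_hashes):
            digest = hashlib.md5((str(i) + url).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.bit_size

    def put(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def has(self, url):
        # may rarely say True for a URL that was never added (false positive),
        # but never says False for a URL that was added
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))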

So the simplest crawler can be written like this:

import Queue

# bloomfilter, crawl and deal_page stand for the modules described above
start_url = "http://www.cnblogs.com/rubinorth"
url_queue = Queue.Queue()        # URL queue
url_queue.put(start_url)
bloomfilter.put(start_url)

### keep looping until the queue is empty ###
while True:
    if url_queue.qsize() > 0:
        current_url = url_queue.get()     # URL at the head of the queue
        page = crawl(current_url)         # crawl is the page fetching module; page is the fetched source code
        next_urls = deal_page(page)       # deal_page is the page parsing module; next_urls are the new URLs
        for next_url in next_urls:
            if not bloomfilter.has(next_url):   # deduplication check
                bloomfilter.put(next_url)
                url_queue.put(next_url)
    else:
        break
2. Why Choose Scrapy

Scrapy is currently a fairly popular crawler framework. Its basic principle is the same as the crawler above, but it provides many convenient features.

First, a brief introduction to the relationship between Scrapy's modules and the flow of the entire framework. Time to bring out the classic Scrapy architecture diagram:

As the diagram shows, Scrapy contains the following modules:

    1. Scrapy Engine, the main engine. It manages the whole system and the data flow between components, and is also responsible for triggering events.
    2. Spider, our crawler. The main crawling code lives here, including initiating requests, processing returned pages, and so on.
    3. Spider middleware. One of the middlewares; its main job is to do some processing on the requests sent out by the spider.
    4. Scheduler. The URL queue mentioned above is managed by the scheduler: on one hand it receives requests sent by the spider and puts them into the queue, and on the other hand it takes requests off the queue and hands them to the downloader to fetch the pages.
    5. Downloader. It downloads the HTML source of the pages for subsequent parsing and information extraction.
    6. Downloader middleware. One of the middlewares. It runs both before and after a page is downloaded, and can be used to set the headers, cookies, and proxy IP of outgoing requests, as well as to handle some error responses.
    7. Item pipeline. After a page has been crawled and parsed, the subsequent work of storing the extracted information is done in the pipeline. Of course, this can also be done directly in the spider, but doing it in the pipeline keeps the whole project clearly structured.

Of the modules listed above, the spider and the pipeline need to be written by you, and the two middlewares can be added and written as needed (a minimal downloader-middleware sketch follows below).
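To make this concrete, here is a minimal downloader-middleware sketch (the class name and values are invented for illustration; this is not code from the original article). It sets a browser-like User-Agent on every outgoing request and shows where a proxy would go; it would be enabled via the DOWNLOADER_MIDDLEWARES setting in settings.py:

class CustomHeaderMiddleware(object):
    # downloader-middleware hook: called before each request is downloaded
    def process_request(self, request, spider):
        request.headers['User-Agent'] = 'Mozilla/5.0'        # pretend to be a browser
        # request.meta['proxy'] = 'http://127.0.0.1:8888'    # hypothetical proxy address, if you need one
        return None                                          # None means: continue handling this request normally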

A dry introduction can feel rather abstract, so let's use Scrapy to implement a simple crawler.

3. Implementing a crawler with Scrapy
scrapy startproject cnblog_project

After creating a Scrapy project with the above command, we first write the spider.
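For orientation, the command above generates roughly the following layout (the exact files vary a little between Scrapy versions; the spider file under spiders/ is created by hand and its name is up to you):

cnblog_project/
    scrapy.cfg                 # deployment configuration
    cnblog_project/
        __init__.py
        items.py               # item definitions (written below)
        pipelines.py           # item pipelines (written below)
        settings.py            # project settings (modified below)
        spiders/
            __init__.py
            cnblog_spider.py   # our spider goes here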

from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from cnblog_project.items import CnblogItem

class CnblogSpider(Spider):
    name = 'cnblog_spider'              # crawler name
    allowed_domains = ['cnblogs.com']   # allowed domains

    def __init__(self):
        self.start_urls = ['http://www.cnblogs.com/rubinorth']

    def start_requests(self):
        return [Request(url, callback=self.parse_page) for url in self.start_urls]

    # parse a crawled page and construct the request for the next page
    def parse_page(self, response):
        print "parse: " + response.url
        sel = Selector(response)
        item = CnblogItem()
        # extract the page content
        item['name'] = sel.xpath("//a[@id='Header1_HeaderTitle']/text()").extract()[0]
        yield item
        # request the next page
        new_url = get_new_url(response.body)   # parse new links from the source; you need to implement get_new_url yourself
        yield Request(new_url, callback=self.parse_page)

Above is a simple crawler. start_urls is the set of initial URLs (here only one); start_requests constructs requests from start_urls and hands them to the scheduler. In parse_page, response is the returned page; CnblogItem is an item class provided by Scrapy that makes it convenient to extract data from the source in a structured way, and yield item hands the item to the pipeline. yield Request(new_url, callback=self.parse_page) sends a new request and starts the next crawl.
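The get_new_url call above is left for you to implement. As one possible sketch (not from the original article; the XPath is hypothetical and must be adapted to the real "next page" link on the site), it could extract the link with a Selector:

from scrapy.selector import Selector

def get_new_url(body):
    sel = Selector(text=body)
    # hypothetical XPath: adjust it to whatever the real "next page" link looks like
    links = sel.xpath("//a[contains(text(), 'Next')]/@href").extract()
    return links[0] if links else None

If no link is found this returns None, so in a real spider you would only yield the next Request when a URL actually comes back.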
In items.py, you only need to write this:

import scrapy

class CnblogItem(scrapy.Item):
    name = scrapy.Field()

Next, we need to write pipelines.py:

class CnblogPipeline(object):
    def process_item(self, item, spider):
        print item['name']
        return item

Every pipeline must have this process_item method. Above, we simply print out the name. Returning the item allows for the possibility of having more than one pipeline: the returned item is passed on to the other pipelines for further processing.
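If you want the pipeline to actually store the data instead of just printing it, a minimal sketch (not from the original article; the file name items.jl is arbitrary) could write each item to a JSON-lines file:

import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open('items.jl', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item)) + '\n')   # dict(item) turns the Scrapy item into a plain dict
        return item

Like any pipeline, it only takes effect after being registered in ITEM_PIPELINES, which brings us to the settings.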
Finally, you only need to modify settings.py:

...
ITEM_PIPELINES = {
    'cnblog_project.pipelines.CnblogPipeline': 304,
}
...

That is, you need to enable your own pipeline in the settings.
Well, a simple crawler is now finished. Note that we did not use any middleware here, so there was no need to write our own.
Finally, run it from the command line:

scrapy crawl cnblog_spider
Resources

How to get started with Python crawlers

Please credit the source when reprinting: http://www.cnblogs.com/rubinorth/

