Using the Scrapy-redis framework to implement a distributed crawler in Python


Note: This article is based on http://www.111cn.net/sys/CentOS/63645.htm and http://www.cnblogs.com/kylinlin/p/5198233.html, with changes. Copyright belongs to Alex.shu and kylinlin.

1. Introduction: the Scrapy-redis framework

Scrapy-redis: a third-party, Redis-based distributed crawling component that works on top of Scrapy and gives crawlers distributed crawling capability. GitHub address: https://github.com/darkrho/scrapy-redis

MongoDB, MySQL, or other databases: choose the storage according to the kind of data and the specific requirements. Structured data can be stored in MySQL to save space; unstructured data such as free text can go into a non-relational database like MongoDB to improve access speed. If you need help choosing, there are plenty of SQL vs. NoSQL comparison articles to search for.

2. The principle of distributed crawling:

The way Scrapy-redis implements distribution is actually very simple in principle. For convenience of description, we call our core server the master, and the machines used to run the crawlers the slaves.

We know that when crawling web pages with the Scrapy framework, we first have to give it some start_urls. The crawler first visits the URLs in start_urls and then, according to our specific logic, crawls the elements inside them or follows them to second- and third-level pages. To make this distributed, we only need to do something special with start_urls.

We set up a Redis database on the master (note that this database is used only as a URL store; it does not hold the crawled data itself and should not be confused with the MongoDB or MySQL mentioned later), and create a separate list key for each type of site that needs to be crawled. The slaves obtain URLs by pointing their Scrapy-redis configuration at the master's address. The result is that, although there are multiple slaves, there is only one place where URLs are obtained: the Redis database on the master server.
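To make this concrete, here is a minimal sketch of a slave-side spider that reads its start URLs from such a Redis list, using the RedisSpider base class provided by scrapy-redis (the spider name, redis_key, and parsing logic are illustrative, not taken from the original project):

# myspider.py - illustrative sketch only
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    # the Redis list on the master that feeds this spider its start URLs
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # illustrative parsing: record the URL and title of each page
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }

New URLs are then pushed onto the master's list, for example with redis-cli: lpush myspider:start_urls http://www.dmoz.org/ (the URL is only an example). Every slave pointed at the master pops its work from that same list.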

Also, because of Scrapy-redis's own queue mechanism, the links obtained by the slaves do not conflict with each other. In this way, after each slave finishes its crawl task, the results are aggregated back to the server (at this point the data store is no longer Redis, but MongoDB, MySQL, or another database suited to the specific content).

The advantage of this approach is that the program is highly portable: as long as path issues are handled, moving a slave's program to another machine to run is basically a matter of copy and paste.

3. Implementation of the distributed crawler:

1. Use two machines, one running Win10 and one running CentOS 7 (see http://www.111cn.net/sys/CentOS/63645.htm for details), and deploy Scrapy on both machines to crawl one website in a distributed way.

2. The CentOS 7 machine, with IP address 192.168.1.112, is used as the master end of Redis; the Win10 machine acts as the slave.

3. When the master's crawler runs, it stores the extracted URLs (requests) in the Redis key "dmoz:requests", pulls requests from that key for downloading, and stores the crawled page content in another Redis key, "dmoz:items".

4. The slave takes the requests to be crawled from the master's Redis and, after downloading the web pages, sends the resulting items back to the master's Redis.

5. Repeat steps 3 and 4 until "dmoz:requests" in the master's Redis is empty, then write the data in "dmoz:items" from the master's Redis to MongoDB.

6. The master's Redis also holds a key "dmoz:dupefilter", which stores fingerprints of crawled URLs (the result of applying a hash function to each URL) and thereby prevents repeated crawls (one way to inspect these keys is sketched after this list).
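A quick way to sanity-check this flow is to inspect the three keys with redis-cli on the master while a crawl is running. A sketch, assuming the default scrapy-redis configuration (requests kept in a sorted set, items in a list, dupefilter fingerprints in a set):

redis-cli -h 192.168.1.112 -p 6379 zcard dmoz:requests    # pending requests in the queue
redis-cli -h 192.168.1.112 -p 6379 llen dmoz:items        # crawled items waiting to be processed
redis-cli -h 192.168.1.112 -p 6379 scard dmoz:dupefilter  # URL fingerprints already recorded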

4. Installation of the scrapy-redis framework:

Installing Redis (http://blog.fens.me/linux-redis-install/)

Installing Redis on Windows: https://github.com/rgl/redis/downloads

Download and install the latest release that matches your PC.

After the installation is complete, the command to run the Redis server is redis-server.exe in the installation directory, and the command to run the Redis client is redis-cli.exe in the same directory.

Installing Redis on CentOS 7

Run the command directly: yum install redis -y. The Redis server is started by default once the installation completes.

After installation, Redis does not accept remote connections by default; to allow them, modify the configuration file /etc/redis.conf:

# Comment out the bind directive: #bind 127.0.0.1

After modification, restart the Redis server

systemctl restart redis

In a CentOS 7 environment, the command to start the Redis server is systemctl start redis, and the command to start the client is redis-cli.

If you want to add an access password to Redis, modify the configuration file /etc/redis.conf:

# Uncomment the requirepass directive: requirepass redisredis   (redisredis is the password; remember to change it)

After the password has been added, the command to start the client becomes: redis-cli -a redisredis

Test whether you can log in remotely

In a Windows command window, change to the Redis installation directory and use the following command to connect remotely to the Redis on CentOS 7:

redis-cli -h 192.168.1.112 -p 6379

Test whether the master's Redis can be read from this machine, and whether data written here is visible on the remote machine. If both work, you can be confident that the Redis installation is complete.
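For example, one simple round-trip check (the key name test_key is just an example; add -a and your password if requirepass is configured):

On Windows, write a value into the master's Redis:

redis-cli -h 192.168.1.112 -p 6379 set test_key "hello from windows"

Then, on CentOS 7, read it back locally:

redis-cli get test_key

If the value comes back, remote access to the master's Redis is working.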

Installing and deploying Scrapy-redis

Install Scrapy-redis (https://github.com/rolando/scrapy-redis) with the command:

pip install scrapy-redis

Deploying Scrapy-redis:

Slave end: at the end of the settings.py file on Windows, add the following line:

REDIS_URL = 'redis://192.168.1.112:6379'

Master end: add the following two lines to the settings.py file on CentOS 7:

REDIS_HOST = 'localhost'
REDIS_PORT = 6379
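Note that REDIS_URL / REDIS_HOST / REDIS_PORT only tell scrapy-redis where Redis lives. The example project also switches Scrapy over to the scrapy-redis scheduler, dupefilter, and item pipeline; a minimal sketch of those extra settings.py lines, with the setting names taken from the scrapy-redis README (the pipeline priority 300 is just the conventional value):

# Route scheduling and deduplication through Redis instead of Scrapy's in-memory defaults
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the request queue and dupefilter in Redis between runs
SCHEDULER_PERSIST = True
# Push serialized items into the <spider>:items list in Redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}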

After the remote Redis address has been configured on Windows, start the two crawlers (there is no restriction on the order in which they start) and look at Redis from Windows. You can see that the crawler running on Windows really does get its requests from the remote Redis (since there is no Redis running locally).
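(The crawlers here are assumed to be the dmoz spider from the scrapy-redis example project, started on each machine from the project directory with: scrapy crawl dmoz)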

Seeing this confirms that the Scrapy-redis installation and configuration are complete.

Using redis-dump to export Redis data for viewing (optional)

Installing redis-dump (https://github.com/delano/redis-dump) on CentOS 7

yum -y install gcc ruby-devel rubygems compass gem

Modify the gem installation source (http://genepeng.com/index.php/346):

gem sources --remove https://rubygems.org/
gem sources -a https://ruby.taobao.org/
gem sources -l
gem install redis-dump -y

After running the dmoz spider from the example project, connect to Redis and you can see the three keys described above (dmoz:requests, dmoz:items, and dmoz:dupefilter), each with its corresponding value type.

Use the redis-dump command on CentOS 7 (redis-dump -u 127.0.0.1:6379 > db_full.json) to export the database and inspect the data stored there (only the first few entries of each key are extracted here).

The crawled content is stored in the "dmoz:items" key shown above.

Importing crawled data into MongoDB

When the crawl is finished, run process_items.py to read the items in "dmoz:items" from the master's Redis as JSON. If you want to store the items in MongoDB instead, modify the process_items.py file as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json

import redis
import pymongo


def main():
    # r = redis.Redis()
    r = redis.Redis(host='192.168.1.112', port=6379, db=0)
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client['dmoz']
    sheet = db['sheet']
    while True:
        # Process queue as FIFO, change 'blpop' to 'brpop' to process as LIFO
        source, data = r.blpop(["dmoz:items"])
        item = json.loads(data)
        sheet.insert(item)
        try:
            print u"Processing: %(name)s <%(link)s>" % item
        except KeyError:
            print u"Error processing: %r" % item


if __name__ == '__main__':
    main()
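A small caveat: the script above is Python 2 code (print statements and u'' literals), and pymongo's insert() method has been deprecated since pymongo 3.0 and removed in 4.0; on a newer stack you would use the print() function and call sheet.insert_one(item) instead.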

In fact, you can run process_items.py at the same time as the crawlers are running.

Note: if you re-run the crawler, remember to empty the Redis on the master first, because the "dmoz:dupefilter" key in the master's Redis is used to filter duplicate requests:

192.168.1.112:6379> flushdb

