Note: This article is based on http://www.111cn.net/sys/CentOS/63645.htm and http://www.cnblogs.com/kylinlin/p/5198233.html, with modifications. Copyright belongs to Alex.shu and kylinlin.
1. Introduction: the scrapy-redis framework
Scrapy-redis: a third-party, Redis-based component for Scrapy that gives crawlers distributed crawling capability. GitHub address: https://github.com/darkrho/scrapy-redis
MongoDB, MySQL, or other databases: different storage backends can be chosen according to the kind of data being collected. Structured data can go into MySQL to save space; unstructured data such as free text can go into a non-relational database like MongoDB to improve access speed. For the specific choice, there are plenty of SQL vs. NoSQL comparison articles to search for.
2. The principle behind the distributed crawl:
The way scrapy-redis implements distribution is actually very simple in principle. For convenience of description, we will call the core server the master, and the machines that run the crawlers the slaves.
When crawling web pages with the Scrapy framework, we first have to give it some start_urls. The crawler visits the URLs in start_urls first, and then, according to our specific logic, crawls the elements inside them or follows links to second- and third-level pages. To make this distributed, all we need to do is intervene at the start_urls stage.
We set up a Redis database on the master (note that this database is used only as a URL store; it does not hold the crawled data itself, and should not be confused with the MongoDB or MySQL mentioned later) and create a separate list key for each type of site that needs to be crawled. The slaves fetch URLs by pointing their scrapy-redis configuration at the master's address. The result is that, even with multiple slaves, there is only one place where URLs are handed out: the Redis database on the master server.
Also, because of scrapy-redis's own queue mechanism, the links the slaves receive do not conflict with one another. In this way, after each slave finishes its crawl task, the results are aggregated on the server (at this point the data store is no longer Redis, but a content database such as MongoDB or MySQL).
The advantage of this approach is that the program is highly portable: as long as path issues are handled, moving a slave's program to another machine to run is basically a matter of copy and paste.
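To make the idea concrete, here is a minimal sketch of what a slave-side spider looks like when it takes its start URLs from the master's Redis instead of a hard-coded start_urls list. The spider name 'myspider' and the key 'myspider:start_urls' are only illustrative, not part of this article's example project.

# slave side: a spider that reads its start URLs from the master's Redis
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = 'myspider'
    # the spider blocks on this Redis list and consumes URLs pushed by the master
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # the usual Scrapy parsing logic goes here
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}

# master side: seed the shared queue (run once, e.g. from a Python shell)
# import redis
# r = redis.Redis(host='192.168.1.112', port=6379)
# r.lpush('myspider:start_urls', 'http://www.dmoz.org/')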
3. Implementation of the distributed crawler:
1. Use two machines, one running Win10 and one running CentOS7 (see http://www.111cn.net/sys/CentOS/63645.htm for details), and deploy Scrapy on both machines to crawl a website in a distributed way.
2. The IP address of the CentOS7 machine is 192.168.1.112; it acts as the master side of Redis. The Win10 machine acts as the slave.
3. When the master's crawler runs, it pushes the URLs it extracts into a Redis key, "dmoz:requests", pops requests from that queue for downloading, and then stores the contents of the web pages in another Redis key, "dmoz:items".
4. A slave pops requests to be crawled from the master's Redis and, after downloading the web pages, sends the page contents back to the master's Redis.
5. Repeat steps 3 and 4 until the "dmoz:requests" queue in the master's Redis is empty, then write the "dmoz:items" data in the master's Redis into MongoDB.
6. The master's Redis also holds a key, "dmoz:dupefilter", which stores the fingerprints of crawled URLs (the result of applying a hash function to the URL) and prevents repeated crawls.
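To watch steps 3 to 6 happening, you can poll the three keys from any machine that can reach the master. The sketch below uses the redis-py client and the key names listed above; the Redis type used for the request queue differs between scrapy-redis versions, so the script checks it before measuring the length.

# progress_check.py -- a sketch that reports the size of the three keys on the master
import time
import redis

r = redis.Redis(host='192.168.1.112', port=6379, db=0)

while True:
    # "dmoz:requests" is the pending request queue (a list or a sorted set, depending on version)
    if r.type('dmoz:requests') in ('zset', b'zset'):
        pending = r.zcard('dmoz:requests')
    else:
        pending = r.llen('dmoz:requests')
    # "dmoz:items" is a list of serialized items, "dmoz:dupefilter" a set of URL fingerprints
    print('pending requests: %d, items: %d, seen urls: %d'
          % (pending, r.llen('dmoz:items'), r.scard('dmoz:dupefilter')))
    time.sleep(5)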
4. Installing the scrapy-redis framework:
Installing Redis (http://blog.fens.me/linux-redis-install/)
Installing Redis on Windows: https://github.com/rgl/redis/downloads
Select the latest release that matches your PC's architecture, download it, and install it.
After installation is complete:
Command to run the Redis server: redis-server.exe in the installation directory
Command to run the Redis client: redis-cli.exe in the installation directory
Installing Redis on CentOS7
Run the command directly: yum install redis -y. The Redis server is started by default once installation completes.
After installation, Redis does not accept remote connections by default; to allow them, modify the configuration file /etc/redis.conf:
Comment out the bind line: #bind 127.0.0.1
After modifying the file, restart the Redis server:
systemctl restart redis
The command to start the Redis server in a CentOS7 environment is systemctl start redis; the command to start the client is redis-cli.
If you want to add an access password to Redis, modify the configuration file /etc/redis.conf:
Uncomment the requirepass line and set a password: requirepass redisredis  (redisredis is the password here; remember to change it)
After the password has been added, the command to start the client becomes: redis-cli -a redisredis
Test whether you can log in remotely:
In a Windows command window, change into the Redis installation directory and use the following command to connect remotely to the CentOS7 Redis:
redis-cli -h 192.168.1.112 -p 6379
Then test whether data written to the master Redis on the CentOS7 machine can be read from the remote Windows machine.
If both work, the Redis installation is complete.
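Besides redis-cli, the same check can be done from Python with the redis-py client. A quick sketch (adjust the host, and pass password='redisredis' if you enabled requirepass above):

# remote_test.py -- run on the Windows machine to verify it can reach the master Redis
import redis

r = redis.Redis(host='192.168.1.112', port=6379, db=0)  # add password=... if requirepass is set
r.set('test_key', 'hello from the slave')
print(r.get('test_key'))   # should print the value that was just written
r.delete('test_key')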
Installing and deploying scrapy-redis
Command to install scrapy-redis (https://github.com/rolando/scrapy-redis):
pip install scrapy-redis
Deploying scrapy-redis:
Slave side: at the end of the settings.py file on Windows, add the following line:
REDIS_URL = 'redis://192.168.1.112:6379'
Master side: add the following two lines to the settings.py file on CentOS7:
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
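REDIS_URL, REDIS_HOST, and REDIS_PORT only tell scrapy-redis where the database lives. For requests to actually be scheduled and de-duplicated through Redis, settings.py on both sides also has to point Scrapy at the scrapy-redis scheduler, dupefilter, and item pipeline. A typical snippet, following the scrapy-redis README, looks like this:

# settings.py: route scheduling, de-duplication, and item collection through Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True   # keep the request queue and dupefilter in Redis between runs
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,   # pushes items into "<spider>:items"
}

# slave (Windows): point at the master's Redis
REDIS_URL = 'redis://192.168.1.112:6379'

# master (CentOS7): Redis runs locally
# REDIS_HOST = 'localhost'
# REDIS_PORT = 6379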
After the remote Redis address has been configured on Windows, start the two crawlers (there is no restriction on which is started first) and inspect Redis from Windows. You can see that the crawler running on Windows actually gets its requests from the remote Redis (since there is no local Redis).
This confirms that the scrapy-redis installation and configuration are complete.
Using redis-dump to export Redis data for viewing (optional)
Install redis-dump (https://github.com/delano/redis-dump) on CentOS7:
yum -y install gcc ruby-devel rubygems compass gem
Modify the gem installation source (http://genepeng.com/index.php/346):
gem sources --remove https://rubygems.org/
gem sources -a https://ruby.taobao.org/
gem sources -l
gem install redis-dump -y
After running the dmoz spider from the example project, connect to Redis and you can see the three keys that were generated, each with its corresponding Redis type.
Use the redis-dump command on CentOS7 (redis-dump -u 127.0.0.1:6379 > db_full.json) to export the database and view the data stored there (only the first few entries of each key are shown here).
Above is the content crawled into the "dmoz:items" key.
Importing crawled data into MongoDB
When the crawl is finished, run process_items.py to read the "dmoz:items" entries from the master's Redis as JSON. If you want to store the items in MongoDB instead, modify the process_items.py file as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json

import redis
import pymongo


def main():
    # r = redis.Redis()
    r = redis.Redis(host='192.168.1.112', port=6379, db=0)
    client = pymongo.MongoClient(host='localhost', port=27017)
    db = client['dmoz']
    sheet = db['sheet']
    while True:
        # process queue as FIFO, change 'blpop' to 'brpop' to process as LIFO
        source, data = r.blpop(["dmoz:items"])
        item = json.loads(data)
        sheet.insert(item)
        try:
            print u"Processing: %(name)s <%(link)s>" % item
        except KeyError:
            print u"Error procesing: %r" % item


if __name__ == '__main__':
    main()
In fact, you can run process_items.py while the crawlers are still running on the slave side.
Note: if you re-run the crawler, remember to empty the Redis database on the master first, because the "dmoz:dupefilter" key on the master is used to filter duplicate requests:
192.168.1.112:6379> flushdb
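If you only want to force a re-crawl without throwing away items that are still sitting in "dmoz:items", a gentler alternative (a sketch using redis-py) is to delete just the crawl-state keys instead of flushing the whole database:

# clear_crawl_state.py -- remove only the fingerprint set and the request queue
import redis

r = redis.Redis(host='192.168.1.112', port=6379, db=0)
r.delete('dmoz:dupefilter', 'dmoz:requests')   # scraped items in "dmoz:items" are kept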