Brief introduction
I used Scrapy to crawl company information from ITJuzi ("IT Orange") for analysis, to get a picture of the IT startup landscape. Previously I ran a single Scrapy instance (default concurrency of 10) and throttled the download speed to avoid an IP ban, so crawling more than 30,000 company records took over a day. Now I want to use a distributed crawler to improve efficiency.
Technical tools: Python 3.5, Scrapy, scrapy_redis, Redis, Docker 1.12, docker-compose, Kitematic, MySQL, SQLAlchemy
Preparatory work
Install Docker (click here for an introduction and installation instructions);
pip install scrapy scrapy_redis;
Writing the code
Code location
Analyze the page information:
I need the detail-page link of every company as well as the pagination button links;
the collected links are stored in one place and handed out to multiple spiders to crawl;
the multiple spiders share the links through a single Redis list (a conceptual sketch follows this list).
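To make the shared-list idea concrete, here is a minimal conceptual sketch (not the project code; scrapy_redis handles this for you) of a producer pushing links and several consumer spiders popping them from the same Redis list, using the redis package:

import redis

# producer and consumers all talk to the same Redis list, so whichever
# spider pops a link first is the one that crawls it
r = redis.StrictRedis(host='localhost', port=6379, db=0)

def store_links(links):
    # producer side: push newly discovered links onto the shared list
    for link in links:
        r.lpush('itjuzicrawler:start_urls', link)

def next_link():
    # consumer side: each spider pops (and removes) one link atomically
    raw = r.lpop('itjuzicrawler:start_urls')
    return raw.decode('utf-8') if raw else None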
Directory structure diagram:
juzi_spider.py
# coding: utf-8
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from itjuzi_dis.items import CompanyItem


class ITJuziSpider(RedisCrawlSpider):
    name = 'itjuzi_dis'
    allowed_domains = ['itjuzi.com']
    # start_urls = ['http://www.itjuzi.com/company/157']
    redis_key = 'itjuzicrawler:start_urls'
    rules = [
        # get the links of the listing pages
        Rule(link_extractor=LinkExtractor(allow=('/company\?page=\d+',))),
        # get the detail page of each company
        Rule(link_extractor=LinkExtractor(allow=('/company/\d+',)), callback='parse_item')
    ]

    def parse_item(self, response):
        soup = BeautifulSoup(response.body, 'lxml')
        # ... some processing code omitted ...
        return item
Description
The class inherits from RedisCrawlSpider, not CrawlSpider.
start_urls is replaced by the custom redis_key itjuzicrawler:start_urls; this key is the Redis list that holds all the links, and scrapy_redis pops links from it (removing them at the same time) with the Redis lpop command.
db_util.py
Uses SQLAlchemy as the ORM tool and automatically creates the table structure when it does not exist yet (a sketch follows).
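A minimal sketch of what such a db_util.py could look like; the table name, columns, and connection string below are illustrative assumptions, not the project's actual schema:

# db_util.py (illustrative sketch)
from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Company(Base):
    # hypothetical table definition; the real project has its own columns
    __tablename__ = 'company'
    id = Column(Integer, primary_key=True)
    name = Column(String(128))
    info = Column(Text)

# the connection string is an assumption; adjust user/password/host/db to your MySQL setup
engine = create_engine('mysql+pymysql://user:password@localhost:3306/itjuzi?charset=utf8')

# create_all() only creates the tables that do not exist yet
Base.metadata.create_all(engine)

Session = sessionmaker(bind=engine)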
middlewares.py
Adds a large pool of User-Agent strings; each request randomly picks one of them, to keep the site from blocking the crawler based on its User-Agent (a sketch follows).
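A minimal sketch of such a random User-Agent downloader middleware; the (truncated) USER_AGENT_LIST and the class name are illustrative:

# middlewares.py (illustrative sketch)
import random

# in the real file this pool is much larger
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware(object):
    """Set a randomly chosen User-Agent header on every outgoing request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)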
settings.py
Configures the middleware from middlewares.py, scrapy_redis, and the Redis connection information (a sketch follows).
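A minimal sketch of the relevant settings, assuming the middleware class sketched above lives in itjuzi_dis.middlewares and that Redis runs as the redis service defined in the docker-compose.yml below; the scrapy_redis option names follow its README:

# settings.py (illustrative excerpt)

# enable the random User-Agent downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'itjuzi_dis.middlewares.RandomUserAgentMiddleware': 543,
}

# let scrapy_redis schedule and deduplicate requests through Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER_PERSIST = True

# Redis connection; 'redis' is the hostname of the redis service in docker-compose
REDIS_HOST = 'redis'
REDIS_PORT = 6379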
Deployment
The "directory structure diagram" above also contains a Dockerfile and a docker-compose.yml:
Dockerfile
FROM python:3.5
ENV PATH /usr/local/bin:$PATH
ADD . /code
WORKDIR /code
RUN pip install -r requirements.txt
COPY spiders.py /usr/local/lib/python3.5/site-packages/scrapy_redis/
CMD /usr/local/bin/scrapy crawl itjuzi_dis
Description
Uses python:3.5 as the base image;
Puts /usr/local/bin on the PATH environment variable;
Adds the host directory to /code in the container and sets it as the working directory;
Installs the dependencies from requirements.txt;
Note in particular COPY spiders.py /usr/local/lib/python3.5/site-packages/scrapy_redis/: it copies the patched spiders.py from the host into the scrapy_redis installation directory inside the container. The value lpop returns from Redis is a str in Python 2 but bytes in Python 3, so scrapy_redis needs to be patched for this; line 84 of spiders.py has to be modified (see the sketch after this list);
Runs scrapy crawl itjuzi_dis as soon as the container starts.
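A minimal sketch of the kind of patch meant here, assuming an older scrapy_redis whose RedisMixin.next_request() passes the raw lpop result straight to make_requests_from_url() (the exact method and line number depend on your scrapy_redis version):

# spiders.py (patched copy that the Dockerfile ships into the container)
def next_request(self):
    """Return a request to be scheduled, or None."""
    url = self.server.lpop(self.redis_key)       # bytes under Python 3
    if url:
        if isinstance(url, bytes):
            url = url.decode('utf-8')            # the added line: bytes -> str
        return self.make_requests_from_url(url)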
Docker-compose.yml
version: '2'
services:
  spider:
    build: .
    volumes:
      - .:/code
    links:
      - redis
    depends_on:
      - redis
  redis:
    image: redis
    ports:
      - "6379:6379"
Description
Uses version 2 of the Compose file format;
Defines two services, spider and redis;
spider is built from the Dockerfile in the current directory by default; redis uses the redis:latest image and maps port 6379.
Start deployment
Start the containers:
docker-compose up    # create the containers from docker-compose.yml
docker-compose scale spider=4    # scale the spider service to 4 instances, all sharing the same redis
You can watch the containers being created and started in the Kitematic GUI tool;
With no start_urls yet, the 4 containerized crawlers just sit there waiting for work.
Now push the start_urls into Redis:
lpush itjuzicrawler:start_urls http://www.itjuzi.com/company
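Equivalently, the key can be seeded from Python with the redis package (assuming Redis is reachable on localhost:6379, as mapped by docker-compose):

import redis

# push the seed URL onto the shared start_urls list
r = redis.StrictRedis(host='localhost', port=6379, db=0)
r.lpush('itjuzicrawler:start_urls', 'http://www.itjuzi.com/company')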
All 4 crawlers spring into action and keep crawling until start_urls is exhausted.
That's it!