A simple distributed crawler with scrapy_redis and Docker


Brief introduction

I use Scrapy to crawl IT Orange (itjuzi.com) company information for analysis, to get a picture of the IT startup landscape. I had previously written a single-instance Scrapy crawler with its default of 10 threads, and throttled the download speed to avoid an IP ban; crawling the 30,000+ company records took more than a day. Now I want to use a distributed crawler to improve efficiency.

Technical tools: Python 3.5, Scrapy, scrapy_redis, Redis, Docker 1.12, docker-compose, Kitematic, MySQL, SQLAlchemy

Preparatory work

Install Docker (see the official documentation for installation instructions);

pip install scrapy scrapy_redis;

Writing the code

Analyze page Information:

I need to get the link to each company's detail page and the pagination links;

Store all the collected links in one place and hand them out to multiple spiders to crawl;

The spiders share a single list of links kept in Redis;

Directory structure diagram:


juzi_spider.py

# coding: utf-8

from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_redis.spiders import RedisCrawlSpider
from itjuzi_dis.items import CompanyItem


class ITjuziSpider(RedisCrawlSpider):
    name = 'itjuzi_dis'
    allowed_domains = ['itjuzi.com']
    # start_urls = ['http://www.itjuzi.com/company/157']
    redis_key = 'itjuzicrawler:start_urls'
    rules = [
        # get the link of each pagination page
        Rule(link_extractor=LinkExtractor(allow=(r'/company\?page=\d+',))),
        # get the detail page of each company
        Rule(link_extractor=LinkExtractor(allow=(r'/company/\d+',)), callback='parse_item')
    ]

    def parse_item(self, response):
        soup = BeautifulSoup(response.body, 'lxml')

        # ...
        # omit some processing code
        # ...
        return item
Description

The class inherits from RedisCrawlSpider, not CrawlSpider.

start_urls is replaced by the custom redis_key itjuzicrawler:start_urls; this is the Redis key under which all the links are stored, and scrapy_redis pops each link out of that list (and removes it) with Redis's lpop;
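A minimal illustration of that pop-and-remove behaviour, assuming a local Redis and the redis-py client (which scrapy_redis uses underneath):

import redis

r = redis.StrictRedis(host='localhost', port=6379)
r.lpush('itjuzicrawler:start_urls', 'http://www.itjuzi.com/company')

url = r.lpop('itjuzicrawler:start_urls')   # returns the link (as bytes under Python 3)
print(url)                                 # b'http://www.itjuzi.com/company'
print(r.llen('itjuzicrawler:start_urls'))  # 0 -- the link is gone once it has been popped

Because lpop removes the element, each start URL is consumed by exactly one spider.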

db_util.py

Uses SQLAlchemy as the ORM tool and creates the table structure automatically when it does not exist yet.
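A minimal sketch of what such a db_util.py can look like; the connection string, table name, and columns below are illustrative assumptions, not the author's exact schema:

# db_util.py (sketch)
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

# connection string is an assumption (pymysql driver, local MySQL database "itjuzi")
engine = create_engine('mysql+pymysql://user:password@localhost/itjuzi?charset=utf8')
Base = declarative_base()

class Company(Base):
    __tablename__ = 'company'
    id = Column(Integer, primary_key=True)
    name = Column(String(128))
    info = Column(String(1024))

# create_all only creates tables that do not exist yet, so it is safe to call on every start
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)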

middlewares.py

Adds a large pool of User-Agent strings; each request picks one at random, so the site cannot block the crawler by filtering on User-Agent.
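A minimal sketch of such a random User-Agent downloader middleware; the class name is an assumption and the real file carries many more User-Agent strings:

# middlewares.py (sketch)
import random

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/601.1.56 (KHTML, like Gecko) Version/9.0 Safari/601.1.56',
    'Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0',
]

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # pick a different User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)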

settings.py

Configures middlewares.py, scrapy_redis, and the Redis connection information.
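A sketch of the settings this involves, assuming the middleware class sketched above and that the Redis host is the redis service defined later in docker-compose.yml:

# settings.py (sketch)
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'              # let scrapy_redis schedule requests through Redis
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # deduplicate requests in Redis
SCHEDULER_PERSIST = True                                    # keep the queue in Redis between runs

REDIS_HOST = 'redis'                                        # hostname of the redis service in docker-compose.yml
REDIS_PORT = 6379

DOWNLOADER_MIDDLEWARES = {
    'itjuzi_dis.middlewares.RandomUserAgentMiddleware': 400,         # the middleware sketched above
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}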

Deployment

In the "directory structure diagram" above, Dockerfile and Docker-compose.yml

Dockerfile

FROM python:3.5
ENV PATH /usr/local/bin:$PATH
ADD . /code
WORKDIR /code
RUN pip install -r requirements.txt
COPY spiders.py /usr/local/lib/python3.5/site-packages/scrapy_redis
CMD /usr/local/bin/scrapy crawl itjuzi_dis
Description

Uses python:3.5 as the base image

Adds /usr/local/bin to the PATH environment variable

Adds the host directory into the container at /code and sets it as the working directory

Installs the dependencies listed in requirements.txt

In particular, COPY spiders.py /usr/local/lib/python3.5/site-packages/scrapy_redis copies spiders.py from the host into the scrapy_redis installation directory inside the container. This is needed because the value lpop gets from Redis is of type str under Python 2 but bytes under Python 3; the problem has to be patched inside scrapy_redis, and line 84 of spiders.py needs to be modified (see the sketch after this list);

Runs the crawl command scrapy crawl itjuzi_dis as soon as the container starts
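A sketch of the kind of patch applied to spiders.py; this is not a verbatim diff of scrapy_redis, and the exact code around line 84 differs between versions:

# inside scrapy_redis/spiders.py, where the next start URL is popped from Redis
def next_request(self):
    url = self.server.lpop(self.redis_key)
    if isinstance(url, bytes):
        url = url.decode('utf-8')   # Python 3: Redis returns bytes, Scrapy expects str
    if url:
        return self.make_requests_from_url(url)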

Docker-compose.yml

version: '2'
services:
  spider:
    build: .
    volumes:
      - .:/code
    links:
      - redis
    depends_on:
      - redis
  redis:
    image: redis
    ports:
      - "6379:6379"
Description

Uses version 2 of the Compose file format

Defines two services, spider and redis

spider is built by default from the Dockerfile in the current directory; redis is created from the redis:latest image and maps port 6379

Start deployment

Start the containers:

docker-compose up               # create the containers from docker-compose.yml
docker-compose scale spider=4   # scale the spider service to 4 instances, all sharing the same redis
The creation and running of the containers can be watched in the Kitematic GUI tool;


With no start_urls yet, the 4 crawler containers just sit there, hungry and thirsty for links.


Now push the start_urls into Redis:

lpush itjuzicrawler:start_urls http://www.itjuzi.com/company
All 4 crawlers spring into action and keep crawling until start_urls is empty.


That's it!
