Brief introduction
I used Scrapy to crawl company information from ITJuzi ("IT Orange") for analysis, to get a picture of the IT startup landscape. Previously I ran a single Scrapy instance (default concurrency of 10) and throttled the download speed to avoid an IP ban, so crawling more than 30,000 company records took over a day. Now I want to use a distributed crawler to improve efficiency.
Technical tools: Python 3.5, Scrapy, scrapy_redis, Redis, Docker 1.12, docker-compose, Kitematic, MySQL, SQLAlchemy
Preparatory work
Install Docker (click here for an introduction and installation instructions);
pip install scrapy scrapy_redis;
Writing the code
Code location
Analyze the page information:
I need the detail-page link of every company as well as the pagination button links;
the collected links are stored in one place and handed out to multiple spiders to crawl;
the multiple spiders share the links through a single Redis list (a conceptual sketch follows this list).
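To make the shared-list idea concrete, here is a minimal conceptual sketch (not the project code; scrapy_redis handles this for you) of a producer pushing links and several consumer spiders popping them from the same Redis list, using the redis package:

import redis

# producer and consumers all talk to the same Redis list, so whichever
# spider pops a link first is the one that crawls it
r = redis.StrictRedis(host='localhost', port=6379, db=0)

def store_links(links):
    # producer side: push newly discovered links onto the shared list
    for link in links:
        r.lpush('itjuzicrawler:start_urls', link)

def next_link():
    # consumer side: each spider pops (and removes) one link atomically
    raw = r.lpop('itjuzicrawler:start_urls')
    return raw.decode('utf-8') if raw else None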
Directory structure diagram:
juzi_spider.py
# coding: utf-8
from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from itjuzi_dis.items import CompanyItem


class ITJuziSpider(RedisCrawlSpider):
    name = 'itjuzi_dis'
    allowed_domains = ['itjuzi.com']
    # start_urls = ['http://www.itjuzi.com/company/157']
    redis_key = 'itjuzicrawler:start_urls'
    rules = [
        # get the links of the listing pages
        Rule(link_extractor=LinkExtractor(allow=('/company\?page=\d+',))),
        # get the detail page of each company
        Rule(link_extractor=LinkExtractor(allow=('/company/\d+',)), callback='parse_item')
    ]

    def parse_item(self, response):
        soup = BeautifulSoup(response.body, 'lxml')
        # ... some processing code omitted ...
        return item
Description
The class inherits from RedisCrawlSpider, not CrawlSpider.
start_urls is replaced by the custom redis_key itjuzicrawler:start_urls; this key is the Redis list that holds all the links, and scrapy_redis pops links from it (removing them at the same time) with the Redis lpop command.
db_util.py
Uses SQLAlchemy as the ORM tool and automatically creates the table structure when it does not exist yet (a sketch follows).
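A minimal sketch of what such a db_util.py could look like; the table name, columns, and connection string below are illustrative assumptions, not the project's actual schema:

# db_util.py (illustrative sketch)
from sqlalchemy import create_engine, Column, Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Company(Base):
    # hypothetical table definition; the real project has its own columns
    __tablename__ = 'company'
    id = Column(Integer, primary_key=True)
    name = Column(String(128))
    info = Column(Text)

# the connection string is an assumption; adjust user/password/host/db to your MySQL setup
engine = create_engine('mysql+pymysql://user:password@localhost:3306/itjuzi?charset=utf8')

# create_all() only creates the tables that do not exist yet
Base.metadata.create_all(engine)

Session = sessionmaker(bind=engine)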
middlewares.py
Adds a large pool of User-Agent strings; each request randomly picks one of them, to keep the site from blocking the crawler based on its User-Agent (a sketch follows).
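A minimal sketch of such a random User-Agent downloader middleware; the (truncated) USER_AGENT_LIST and the class name are illustrative:

# middlewares.py (illustrative sketch)
import random

# in the real file this pool is much larger
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware(object):
    """Set a randomly chosen User-Agent header on every outgoing request."""

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)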
settings.py
Configures the middleware from middlewares.py, scrapy_redis, and the Redis connection information (a sketch follows).
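A minimal sketch of the relevant settings, assuming the middleware class sketched above lives in itjuzi_dis.middlewares and that Redis runs as the redis service defined in the docker-compose.yml below; the scrapy_redis option names follow its README:

# settings.py (illustrative excerpt)

# enable the random User-Agent downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'itjuzi_dis.middlewares.RandomUserAgentMiddleware': 543,
}

# let scrapy_redis schedule and deduplicate requests through Redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER_PERSIST = True

# Redis connection; 'redis' is the hostname of the redis service in docker-compose
REDIS_HOST = 'redis'
REDIS_PORT = 6379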
Deployment
The "directory structure diagram" above also contains a Dockerfile and a docker-compose.yml:
Dockerfile
FROM python:3.5
ENV PATH /usr/local/bin:$PATH
ADD . /code
WORKDIR /code
RUN pip install -r requirements.txt
COPY spiders.py /usr/local/lib/python3.5/site-packages/scrapy_redis/
CMD /usr/local/bin/scrapy crawl itjuzi_dis
Description
Uses python:3.5 as the base image;
Puts /usr/local/bin on the PATH environment variable;
Adds the host directory to /code in the container and sets it as the working directory;
Installs the dependencies from requirements.txt;
Note in particular COPY spiders.py /usr/local/lib/python3.5/site-packages/scrapy_redis/: it copies the patched spiders.py from the host into the scrapy_redis installation directory inside the container. The value lpop returns from Redis is a str in Python 2 but bytes in Python 3, so scrapy_redis needs to be patched for this; line 84 of spiders.py has to be modified (see the sketch after this list);
Runs scrapy crawl itjuzi_dis as soon as the container starts.
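A minimal sketch of the kind of patch meant here, assuming an older scrapy_redis whose RedisMixin.next_request() passes the raw lpop result straight to make_requests_from_url() (the exact method and line number depend on your scrapy_redis version):

# spiders.py (patched copy that the Dockerfile ships into the container)
def next_request(self):
    """Return a request to be scheduled, or None."""
    url = self.server.lpop(self.redis_key)       # bytes under Python 3
    if url:
        if isinstance(url, bytes):
            url = url.decode('utf-8')            # the added line: bytes -> str
        return self.make_requests_from_url(url)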
Docker-compose.yml
version: '2'
services:
  spider:
    build: .
    volumes:
      - .:/code
    links:
      - redis
    depends_on:
      - redis
  redis:
    image: redis
    ports:
      - "6379:6379"
Description
Uses version 2 of the Compose file format;
Defines two services, spider and redis;
spider is built from the Dockerfile in the current directory by default; redis uses the redis:latest image and maps port 6379.
Start deployment
Start the containers:
docker-compose up    # create the containers from docker-compose.yml
docker-compose scale spider=4    # scale the spider service to 4 instances, all sharing the same redis
You can watch the containers being created and started in the Kitematic GUI tool;
With no start_urls yet, the 4 containerized crawlers just sit there waiting for work.
Now push the start_urls into Redis:
lpush itjuzicrawler:start_urls http://www.itjuzi.com/company
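Equivalently, the key can be seeded from Python with the redis package (assuming Redis is reachable on localhost:6379, as mapped by docker-compose):

import redis

# push the seed URL onto the shared start_urls list
r = redis.StrictRedis(host='localhost', port=6379, db=0)
r.lpush('itjuzicrawler:start_urls', 'http://www.itjuzi.com/company')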
All 4 crawlers spring into action and keep crawling until start_urls is exhausted.
That's it!