Build a distributed crawler cluster using Docker swarm

Source: Internet
Author: User
Tags: gpg, docker run, docker swarm

This article is reproduced from the public account "Not Heard Code".

During crawler development, you have probably needed to deploy a crawler on multiple servers. How do you do it today? SSH into each server, pull the code with Git, and run it? And when the code changes, log in to the servers one by one to update them?

Sometimes a crawler only needs to run on one server, and sometimes it needs to run on 200. How do you switch quickly? Log in to the servers one at a time? Or be clever and set a flag in Redis, so that only crawlers on servers matching the flag actually run?

Crawler A is already deployed on all servers, and now there is a crawler B. Do you have to log in to every server and deploy all over again?

If that is what you do, you will wish you had read this article earlier. After reading it, you will be able to:

    • Deploy a new crawler to 50 servers in 2 minutes:

docker build -t localhost:8003/spider:0.01 .
docker push localhost:8003/spider:0.01
docker service create --name spider --replicas 50 --network host

    • Scale the crawler from 50 servers to 500 servers in 30 seconds:

docker service scale spider=500

    • Shut down the crawler on all servers in bulk in 30 seconds:

docker service scale spider=0

    • Update the crawler on all machines in bulk in 1 minute:

docker build -t localhost:8003/spider:0.02 .
docker push localhost:8003/spider:0.02
docker service update --image spider

This article does not teach Docker itself, so make sure you know the Docker basics before reading on.

What is Docker swarm?

Docker Swarm is the cluster management module that ships with Docker. It can create and manage Docker clusters.

Environment setup

This article uses 3 Ubuntu 18.04 servers for the demonstration. The three servers are arranged as follows:

    • master:

    • slave-1:

    • slave-2:

Docker Swarm is a module built into Docker, so first install Docker on all 3 servers. Once Docker is installed, every remaining operation is done through Docker.

Installing Docker on the master

Install Docker on the master server by running the following commands in order:

apt-get update
apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL | sudo apt-key add -
add-apt-repository "deb [arch=amd64] bionic stable"
apt-get update
apt-get install -y docker-ce

Create the manager node

A Docker Swarm cluster requires a manager node. Initialize the master server as the cluster's manager node by running the following command:

docker swarm init

When it finishes, it prints its result, which includes a command:

docker swarm join --token SWMTKN-1-0hqsajb64iynkg8ocp8uruktii5esuo4qiaxmqw2pddnkls9av-dfj7nf1x3vr5qcj4cqiusu4pv

This command must be run on every slave node, so record it now.

When initialization is complete, you have a Docker cluster with only 1 server. Run the following command:

docker node ls

It shows the current status of the cluster.

Create a private registry (optional)

Creating a private registry is not required. It is needed when the project's Docker image may contain company secrets and cannot be uploaded to the public Docker Hub platform. If your image can be published on Docker Hub, or you already have a private registry, you can use that directly and skip this section and the next one.

The private registry is itself a Docker image, so pull it first:

docker pull registry:latest


Now start the private source:

docker run -d -p 8003:5000 --name registry -v /tmp/registry:/tmp/registry registry:latest


The start command publishes the registry on port 8003, so the private registry is reachable on port 8003 of the master server.

Note: a private registry set up this way uses plain HTTP and has no authentication. If it is reachable from the public network, use a firewall with an IP whitelist to keep the data secure.

Allow Docker to trust the HTTP private registry (optional)

If you set up your own private registry with the commands in the previous section, you need to configure Docker to trust it, because by default Docker does not allow plain-HTTP registries.

Configure Docker with the following command:

echo '{ "insecure-registries":[""] }' >> /etc/docker/daemon.json

Then use the following command to restart Docker.

systemctl restart docker

After the restart is complete, the manager node is configured.
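
The echo command above appends to /etc/docker/daemon.json, so if that file already has content the result is no longer valid JSON. A minimal Python sketch of merging the setting into an existing configuration instead is shown here. The function name and the address 192.0.2.10:8003 are placeholders for illustration, not values from this article.

```python
import json


def add_insecure_registry(config_text: str, registry: str) -> str:
    """Return daemon.json text with `registry` listed under insecure-registries.

    Existing keys are preserved and the registry is added only once.
    """
    # An empty or whitespace-only file is treated as an empty configuration.
    config = json.loads(config_text) if config_text.strip() else {}
    registries = config.setdefault("insecure-registries", [])
    if registry not in registries:
        registries.append(registry)
    return json.dumps(config, indent=2)


# "192.0.2.10:8003" is a placeholder address, not a value from the article.
print(add_insecure_registry('{"debug": true}', "192.0.2.10:8003"))
```

The returned text would still have to be written back to /etc/docker/daemon.json, followed by a Docker restart as described above.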

Creating a child node initialization script

A slave server needs only three things:

    1. Install Docker

    2. Join the cluster

    3. Trust the private registry

After that, everything is managed by Docker Swarm, and you never need to SSH into the server again.

To simplify the work, you can write a shell script and run it in batches. Create a file on the slave-1 and slave-2 servers with the following contents:

apt-get update
apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL | sudo apt-key add -
add-apt-repository "deb [arch=amd64] bionic stable"
apt-get update
apt-get install -y docker-ce
echo '{ "insecure-registries":[""] }' >> /etc/docker/daemon.json
systemctl restart docker
docker swarm join --token SWMTKN-1-0hqsajb64iynkg8ocp8uruktii5esuo4qiaxmqw2pddnkls9av-dfj7nf1x3vr5qcj4cqiusu4pv

Make this file executable and run it:

chmod +x

Once the script has finished running, you can log out of the SSH sessions on slave-1 and slave-2. You will not need to log in again.

Go back to the master server and execute the following command to confirm that the cluster now has 3 nodes:

docker node ls

The cluster now has 3 nodes.
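
The same check can be scripted. The sketch below assumes the node list was produced with docker node ls --format "{{.Hostname}} {{.Status}}", which prints one hostname and status per line. The function name and the sample text are illustrative only.

```python
from typing import List


def ready_nodes(listing: str) -> List[str]:
    """Return hostnames whose status is Ready.

    `listing` is expected to hold one "<hostname> <status>" pair per line,
    as printed by: docker node ls --format "{{.Hostname}} {{.Status}}"
    """
    hosts = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "Ready":
            hosts.append(parts[0])
    return hosts


# Illustrative sample text, not captured from a real cluster.
sample = "master Ready\nslave-1 Ready\nslave-2 Down\n"
print(ready_nodes(sample))
```

Comparing the length of the returned list with the expected node count (3 in this article) gives a simple health check.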

At this point the most complex and tedious part is over. What remains is to enjoy the convenience of Docker Swarm.

Create a test program

Set up a test Redis

To simulate a running distributed crawler, first use Docker to set up a temporary Redis service.

Execute the following command on the master server:

docker run -d --name redis -p 7891:6379 redis --requirepass "KingnameISHandSome8877"

This Redis is exposed on port 7891, its password is KingnameISHandSome8877, and its IP is the master server's IP address.

Writing a test program

Write a simple Python program:

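import time

import redis

# host is left empty here; set it to the master server's IP address.
client = redis.Redis(host='', port=7891, password='KingnameISHandSome8877')

while True:
    # Pop the next item from the head of the list; None means it is empty.
    data = client.lpop('example:swarm:spider')
    if not data:
        break
    print(f'The data I just fetched is: {data.decode()}')
    time.sleep(10)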

This Python program reads one number from Redis every 10 seconds and prints it.
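
The test program needs something in the list to consume. A minimal sketch for seeding it is shown here, assuming the redis-py client and its rpush method. The helper names seed_queue and connect and the default count of 100 are illustrative, not part of the original article.

```python
def seed_queue(client, key="example:swarm:spider", count=100):
    """Push the integers 0..count-1 onto the Redis list the crawler pops from.

    `client` is any object with a redis-py style rpush(name, *values) method.
    Returns the new length of the list, as reported by the client.
    """
    return client.rpush(key, *range(count))


def connect(host, port=7891, password="KingnameISHandSome8877"):
    """Create a redis-py client for the test Redis.

    The import is local so seed_queue can be exercised without redis installed.
    """
    import redis
    return redis.Redis(host=host, port=port, password=password)
```

Calling seed_queue(connect(host)) with the master server's IP as host fills the list. Because the test program exits as soon as the list is empty, it makes sense to seed the list before starting any crawler.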

Writing Dockerfile

Write a Dockerfile that builds our own image on top of the python:3.6 image:

FROM python:3.6
LABEL maintainer='[email protected]'
USER root
ENV PYTHONUNBUFFERED=0
ENV PYTHONIOENCODING=utf-8
RUN python3 -m pip install redis
COPY spider.py spider.py
CMD python3 spider.py

Build the image

After writing the Dockerfile, run the following command to build the image:

docker build -t localhost:8003/spider:0.01 .

Note that because this image will be uploaded to the private registry so that the slave nodes can download it, its name must follow the format localhost:8003/<custom-name>:<version>. The custom name and version can be changed to suit your situation. In this article's example the program simulates a crawler, so it is named spider, and because this is the first build, the version is 0.01.
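
To make the naming format concrete, here is a small sketch that splits such an image name into its registry, name, and version parts and rejects names that do not follow the format. ImageRef and parse_image_ref are hypothetical helper names used only for illustration.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    name: str
    version: str


def parse_image_ref(ref: str) -> ImageRef:
    """Split "registry:port/name:version" into its three parts."""
    registry, sep, rest = ref.partition("/")
    name, colon, version = rest.rpartition(":")
    if not (sep and colon and registry and name and version):
        raise ValueError(f"expected registry/name:version, got {ref!r}")
    return ImageRef(registry, name, version)


print(parse_image_ref("localhost:8003/spider:0.01"))
```

A check like this can catch a missing registry prefix or version before the image is built and pushed.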


Uploading images to a private source

Once the image is built, upload it to the private registry with the following command:

docker push localhost:8003/spider:0.01

Remember these build and push commands. You need both of them every time you update the code.
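
Since the two commands are repeated for every release, they can be generated from a version number. The sketch below assumes the version grows by 0.01 per build, as in the 0.01 and 0.02 tags used in this article. The function names are illustrative, and the commands are only printed, not executed.

```python
from decimal import Decimal
from typing import List


def next_version(version: str) -> str:
    """Bump a two-decimal version string by 0.01, e.g. "0.01" -> "0.02"."""
    return str(Decimal(version) + Decimal("0.01"))


def release_commands(registry: str, name: str, version: str) -> List[List[str]]:
    """Return the build and push commands for one image version."""
    image = f"{registry}/{name}:{version}"
    return [
        ["docker", "build", "-t", image, "."],
        ["docker", "push", image],
    ]


for cmd in release_commands("localhost:8003", "spider", next_version("0.01")):
    print(" ".join(cmd))
```

Each inner list can be handed to subprocess.run(cmd, check=True) if the commands should actually run.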

Create a service

Docker Swarm runs services, so you need to create one with the docker service command.

docker service create --name spider --network host

This command creates a service named spider. By default it runs 1 container.

You can of course run many containers by adding the --replicas parameter. For example, to create a service with 50 containers:

docker service create --name spider --replicas 50 --network host

However, the first version of the code usually has plenty of bugs, so it is better to start with 1 container, watch the logs, and scale out only after confirming there are no problems.
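
One way to follow that advice is to grow the service in stages instead of jumping straight to the final size. The sketch below doubles the replica count at each stage. The doubling policy and the name scale_steps are assumptions made for illustration, not something Docker Swarm prescribes.

```python
from typing import List


def scale_steps(target: int) -> List[int]:
    """Return replica counts that double from 1 up to `target`."""
    if target < 1:
        raise ValueError("target must be at least 1")
    steps, current = [], 1
    while current < target:
        steps.append(current)
        current *= 2
    steps.append(target)
    return steps


# Print the scale command for each stage; check the logs between stages.
for replicas in scale_steps(50):
    print(f"docker service scale spider={replicas}")
```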

Back to the default of 1 container: it may be running on any of the three machines. Inspect it by executing the following command:

docker service ps spider

View node logs

The output of that command includes the ID of the running container, rusps0ofwids in this run. Use it with the following command to follow the log:

docker service logs -f <container ID>

The log of this container is then tracked continuously.

Horizontal Scaling

Right now only 1 server is running a container. To run the crawler on 3 servers, a single command is enough:

docker service scale spider=3

Looking at the crawler again, you will find that each of the three machines now runs one container.

Logging in to the slave-1 machine confirms that a container really is running there. It was assigned automatically by Docker Swarm.

Now use the following command to forcibly stop Docker on slave-1, and see what happens:

systemctl stop docker

Go back to the master server and check the crawler again.

As you can see, once Docker Swarm detects that slave-1 has dropped out, it automatically starts the task on another machine, so that 3 tasks are always running. In this example, Docker Swarm ends up running 2 spider containers on the master machine.
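
The effect can be pictured with a toy model: tasks from the failed node are moved to whichever surviving node currently has the fewest tasks. This is only an illustration of the outcome described above, not Swarm's actual scheduling algorithm, and the function name reschedule is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def reschedule(assignments: List[str], failed: str, nodes: List[str]) -> Dict[str, int]:
    """Move tasks off `failed` onto the least-loaded surviving nodes.

    `assignments` lists the node each task runs on. Ties are broken by the
    order of `nodes`. This is a toy model, not Swarm's actual scheduler.
    """
    survivors = [n for n in nodes if n != failed]
    if not survivors:
        raise ValueError("no surviving nodes")
    load = Counter({n: 0 for n in survivors})
    orphaned = 0
    for node in assignments:
        if node == failed:
            orphaned += 1
        else:
            load[node] += 1
    # Place each orphaned task on the node with the fewest tasks so far.
    for _ in range(orphaned):
        target = min(survivors, key=lambda n: load[n])
        load[target] += 1
    return dict(load)


nodes = ["master", "slave-1", "slave-2"]
print(reschedule(["master", "slave-1", "slave-2"], "slave-1", nodes))
```

With one task per node and slave-1 failing, the model leaves 2 tasks on master and 1 on slave-2, which matches the result seen in this example.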

If the machines are powerful enough, you can even run several containers on each of the 3 machines:

docker service scale spider=10

This starts 10 containers running the crawler, each isolated from the others.

What if you want to stop all the crawlers? One command is enough:

docker service scale spider=0

All the crawlers then stop.

View logs for multiple containers at the same time

What if you want to see the logs of all containers at once? The following command shows the most recent 20 log lines of every container:

docker service ps spider | grep Running | awk '{print $1}' | xargs -I {} docker service logs --tail 20 {}

The logs are displayed one container after another.
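
The grep and awk steps of that pipeline can also be written in Python. The sketch below assumes the task list was produced with docker service ps spider --format "{{.ID}} {{.CurrentState}}", so each line starts with the task ID followed by a state such as "Running 2 minutes ago". The function name and the sample IDs are made up for illustration.

```python
from typing import List


def running_task_ids(listing: str) -> List[str]:
    """Return the IDs of tasks whose current state starts with "Running".

    `listing` holds one "<id> <current state>" entry per line, as printed by:
    docker service ps spider --format "{{.ID}} {{.CurrentState}}"
    """
    ids = []
    for line in listing.splitlines():
        task_id, _, state = line.partition(" ")
        if state.startswith("Running"):
            ids.append(task_id)
    return ids


# Illustrative sample text; the IDs are made up.
sample = (
    "aaa111 Running 2 minutes ago\n"
    "bbb222 Shutdown 5 minutes ago\n"
    "ccc333 Running 10 seconds ago\n"
)
print(running_task_ids(sample))
```

Each returned ID can then be passed to docker service logs --tail 20, as in the shell pipeline.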

Update crawler

When you modify the code, the crawler needs to be updated.

First change the code, rebuild the image, and push the new image to the private registry.

Next, update the image used by the service. There are two ways to do this. The first is to stop all crawlers before updating:

docker service scale spider=0
docker service update --image spider
docker service scale spider=3

The second is to run the update command directly:

docker service update --image spider

The difference is that when the update command is run directly, the running containers are updated one at a time.
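
The two strategies can be summarized as command sequences. In the sketch below the image is passed as the argument to --image, followed by the service name, which is the general form of docker service update. The function name, its parameters, and the placeholder image name are assumptions for illustration.

```python
from typing import List


def update_commands(image: str, service: str = "spider",
                    replicas: int = 3, stop_first: bool = False) -> List[str]:
    """Return the commands for one of the two update strategies."""
    update = f"docker service update --image {image} {service}"
    if not stop_first:
        # Direct update: running containers are replaced one at a time.
        return [update]
    # Stop everything, switch the image, then scale back up.
    return [
        f"docker service scale {service}=0",
        update,
        f"docker service scale {service}={replicas}",
    ]


# "<registry>/spider:0.02" is a placeholder; use your own registry address.
for command in update_commands("<registry>/spider:0.02", stop_first=True):
    print(command)
```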


You can do more with Docker Swarm

This article uses a simulated crawler as its example, but any program that can run in batches can run on Docker Swarm. Whether the instances communicate through Redis or Celery, or do not communicate at all, Docker Swarm fits as long as the program can be run in bulk.

Several different services can run within the same swarm cluster without affecting one another. You really can build a Docker Swarm cluster once and then leave the servers alone: every later operation only needs to be run on the server where the manager node resides.
