etcd Cluster FAQ

Preface

The previous article covered common cluster operations in etcd, mainly the problems you are likely to run into; after all, from the operations (God's-eye) perspective you always see the problem first and then recover from it.


For a cluster, the routine work is handling process crashes, physical machine downtime, data migration and backup, scaling down, and so on. The rest of day-to-day operation is little more than handling these common problems.


Backup and recovery

Strictly speaking, etcd is just a store, but it is a store in a distributed environment that maintains strong consistency: every write goes through the leader, the leader replicates the instruction to the followers, and the write succeeds only after a majority of the nodes acknowledge it. In a three-node cluster, for example, a write is committed once two of the three nodes have it.

Therefore, when backing up the data, any node can be backed up.

1. Configure a scheduled backup task

The scheduled task runs the backup script at 2 a.m. every day, keeps only seven days of backups, and writes the data to a fixed directory. The script mainly uses etcdctl to take the backup, as follows:

[root@docker-ce python]# cat backup.sh
#!/bin/bash
# date-stamped backup directory name
date_time=$(date +%y%m%d)
# back up the etcd data directory with etcdctl (v2 API)
etcdctl backup --data-dir /etcd/ --backup-dir /python/etcdbak/${date_time}
# keep only the last seven days of backups
find /python/etcdbak/ -ctime +7 -exec rm -r {} \;

Then set up the cron job.

Both standard output and error output are redirected, to prevent cron from sending mail and thereby steadily consuming inodes.
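
A crontab entry along these lines implements this schedule (the script path /python/backup.sh is an assumption based on the directory shown above):

# run the backup at 02:00 every day; discard all output so cron sends no mail
0 2 * * * /bin/bash /python/backup.sh >/dev/null 2>&1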

2. Data Recovery

When you need to recover data, follow these steps:

Package the backup data and send it to the host to be restored.

Decompress it and run etcd.

When starting etcd, in addition to specifying the data directory, the --force-new-cluster parameter must be used; otherwise errors such as a cluster ID mismatch will be reported.
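
A rough sketch of these steps, assuming a day's backup directory (180209 here) is packaged on the backup host and restored into /etcd1 on 192.168.1.222 (paths, archive name, and host are assumptions):

# on the backup host: package one day's backup and copy it over
tar czf etcdbak.tar.gz -C /python/etcdbak/180209 .
scp etcdbak.tar.gz root@192.168.1.222:/tmp/
# on the restore host: unpack into an empty data directory and start etcd from it
mkdir -p /etcd1 && tar xzf /tmp/etcdbak.tar.gz -C /etcd1
etcd --name docker-ce --data-dir /etcd1 --force-new-cluster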

3. About the data storage layout

After startup, etcd lays out files under the data directory in a fixed structure.

The backup directory has a similar, but smaller, structure.

Comparing the two, the db file and the tmp file are dropped in the backup: the tmp file mainly holds uncommitted data records, and the dropped db file holds the cluster membership and related metadata.
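
For reference, the two layouts look roughly like this on an etcd v3.x node backed up with the v2 etcdctl backup command (generic file names, shown as an illustration rather than captured output):

# data directory, e.g. /etcd/
member/snap/db                   # v3 backend database
member/snap/<term>-<index>.snap  # v2 snapshot files
member/wal/<seq>-<index>.wal     # write-ahead log
member/wal/<n>.tmp               # preallocated WAL file

# backup directory produced by `etcdctl backup`
member/snap/<term>-<index>.snap
member/wal/<seq>-<index>.wal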

4. Expanding a single node into a cluster

After the backup has been decompressed, start the etcd process; pay attention to the parameters used, as follows:

[root@docker-ce etcd]# etcd --name docker-ce --data-dir /etcd1 \
  --initial-advertise-peer-urls http://192.168.1.222:2380 \
  --listen-peer-urls http://192.168.1.222:2380 \
  --listen-client-urls http://192.168.1.222:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.1.222:2379 \
  --initial-cluster-token etcd-cluster \
  --initial-cluster centos=http://192.168.1.22:2380,docker-ce=http://192.168.1.222:2380 \
  --force-new-cluster

Next, add the new member's information to the cluster.
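
This is typically done with etcdctl member add; a sketch using the member name and peer URL from the command below:

etcdctl member add docker1 http://192.168.1.32:2380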

Then start the etcd process on the new machine:

[root@docker1 /]# etcd --name docker1 --data-dir /etcd \
  --initial-advertise-peer-urls http://192.168.1.32:2380 \
  --listen-peer-urls http://192.168.1.32:2380 \
  --listen-client-urls http://192.168.1.32:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.1.32:2379 \
  --initial-cluster-token etcd-cluster \
  --initial-cluster docker-ce=http://192.168.1.222:2380,docker1=http://192.168.1.32:2380 \
  --initial-cluster-state existing

Pay attention to updating these parameters; otherwise errors such as cluster ID mismatch or peer URL mismatch will be reported.


Problems that may arise

1. Clocks out of sync

When the clocks drift apart, the following errors appear:

2018-02-09 05:45:37.636506 W | rafthttp: the clock difference against peer 5d951def1d1ebd99 is too high [8h0m2.595609129s > 1s]
2018-02-09 05:45:37.717527 W | rafthttp: the clock difference against peer f83aa3ff91a96c2f is too high [8h0m2.52274509s > 1s]

The fix is to synchronize the time across the nodes.
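
For example, a one-off sync against a public NTP pool (keeping ntpd or chronyd running is the better long-term fix):

# run on every etcd node; stop ntpd first if it is already running
ntpdate pool.ntp.org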

2. Cluster ID mismatch

The main cause is that the old data directory was not deleted, which leads to the cluster ID mismatch. Delete the data directory and then rejoin the node.
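
A minimal sketch of the fix on the node that fails to join, reusing the paths from the examples above (assumed values):

# remove the stale data directory on the failing node
rm -rf /etcd
# then start etcd again with the same join flags as in section 4,
# including --initial-cluster-state existing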

3. Errors after deleting the data directory

2018-02-07 22:05:58.539721 I | raft: e0f5fe608dbc732d became follower at term 11
2018-02-07 22:05:58.539833 C | raft: tocommit is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

goroutine [running]:
github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc4201730e0, 0x559ecf0e5ebc, 0x5d, 0xc420121400, 0x2, 0x2)
    /builddir/build/BUILD/etcd-1e1dbb23924672c6cd72c62ee0db2b45f778da71/Godeps/_workspace/src/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15e
github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc42021a380, 0x19)
    /builddir/build/BUILD/etcd-1e1dbb23924672c6cd72c62ee0db2b45f778da71/src/github.com/coreos/etcd/raft/log.go:191 +0x15e
github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc42022c1e0, 0x8, 0xe0f5fe608dbc732d, 0x5d951def1d1ebd99, 0xb, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /builddir/build/BUILD/etcd-1e1dbb23924672c6cd72c62ee0db2b45f778da71/src/github.com/coreos/etcd/raft/raft.go:1100 +0x56
github.com/coreos/etcd/raft.stepFollower(0xc42022c1e0, 0x8, 0xe0f5fe608dbc732d, 0x5d951def1d1ebd99, 0xb, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /builddir/build/BUILD/etcd-1e1dbb23924672c6cd72c62ee0db2b45f778da71/src/github.com/coreos/etcd/raft/raft.go:1046 +0x2b5
github.com/coreos/etcd/raft.(*raft).Step(0xc42022c1e0, 0x8, 0xe0f5fe608dbc732d, 0x5d951def1d1ebd99, 0xb, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /builddir/build/BUILD/etcd-1e1dbb23924672c6cd72c62ee0db2b45f778da71/src/github.com/coreos/etcd/raft/raft.go:778 +0x10f9
github.com/coreos/etcd/raft.(*node).run(0xc420354000, 0xc42022c1e0)
    /builddir/build/BUILD/etcd-1e1dbb23924672c6cd72c62ee0db2b45f778da71/src/github.com/coreos/etcd/raft/node.go:323 +0x67d
created by github.com/coreos/etcd/raft.RestartNode
    /builddir/build/BUILD/etcd-1e1dbb23924672c6cd72c62ee0db2b45f778da71/src/github.com/coreos/etcd/raft/node.go:223 +0x340

The fix is to add the node back into the cluster as a new member; if it is started directly after its data directory has been deleted, the error above occurs because the node's snapshot and WAL files can no longer be found.
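
A sketch of that procedure with etcdctl, reusing the member name and peer URL from the earlier examples (assumed values):

# on a healthy member: drop the old member record and register the node again
etcdctl member list
etcdctl member remove <old-member-id>
etcdctl member add docker1 http://192.168.1.32:2380
# on docker1: start etcd with an empty data directory and --initial-cluster-state existing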

To be continued...



