ZooKeeper Study Notes: File System Layout and Formats

Source: Internet
Author: User

This article describes how ZooKeeper snapshot files and transaction log files are stored in the file system.

Writing the transaction log is a critical step in transaction processing, so it is highly recommended to keep the log on a dedicated disk. Snapshots do not need a dedicated disk, because they are written lazily by a background thread.

The snapshot path is specified by the dataDir parameter, and the transaction log path by the dataLogDir parameter. Start with the transaction log directory: if you list its contents, you will see a subdirectory named version-2. Logs and snapshots currently have only one format in use; keeping each format version in its own directory isolates the data by version and makes migrating to a later format easier.
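As a minimal sketch of how the two parameters relate, the following Python snippet reads a zoo.cfg-style text and derives the two version-2 directories. The function name and the example paths (/var/lib/zookeeper, /disk2/zookeeper) are illustrative assumptions, not taken from the article; the fallback of dataLogDir to dataDir when it is not set follows ZooKeeper's documented behaviour.

```python
from pathlib import PurePosixPath

def storage_dirs(cfg_text: str) -> dict:
    """Return the snapshot and transaction log directories implied by a zoo.cfg-style text."""
    props = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    data_dir = props["dataDir"]
    # When dataLogDir is not set, transaction logs are written under dataDir too.
    log_dir = props.get("dataLogDir", data_dir)
    return {
        "snapshots": str(PurePosixPath(data_dir) / "version-2"),
        "txn_logs": str(PurePosixPath(log_dir) / "version-2"),
    }

# Hypothetical configuration: snapshots and logs on different disks.
example_cfg = """
tickTime=2000
dataDir=/var/lib/zookeeper
dataLogDir=/disk2/zookeeper
"""
print(storage_dirs(example_cfg))
```

Putting dataLogDir on its own device, as in this hypothetical configuration, is what the recommendation about a dedicated disk amounts to in practice.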

Transaction log

After running a few tests, the directory contains only two transaction log files:


-rw-r--r-- 1 breed 67108880 Jun 5 22:12 log.100000001
-rw-r--r-- 1 breed 67108880 Jul 21:37 log.200000001

Two things stand out. First, the files are fairly large, about 64 MB each, even though very few operations were performed. Second, the file name suffix is a large number.

ZK pre-allocates files in fairly large chunks to avoid the overhead of updating file metadata every time a write makes the file grow. If you dump one of these files in hexadecimal, you will see that, apart from some binary data at the beginning, it consists of null characters (\0). As the server keeps running, these null characters are gradually overwritten with actual log data.
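To get a feel for how much of a pre-allocated file is still padding, a rough Python sketch can count the trailing null bytes. The function name and the synthetic sample are assumptions made for illustration, and the result is only an estimate, since real log data could itself end in zero bytes.

```python
def split_preallocated(data: bytes) -> tuple:
    """Return (bytes_in_use, padding_bytes) for the contents of a pre-allocated file.

    Trailing NUL bytes are treated as unused padding. This is only an
    estimate: a real record could itself end in zero bytes.
    """
    used = len(data.rstrip(b"\x00"))
    return used, len(data) - used

# Synthetic stand-in for a log file: a little data followed by zero padding.
sample = b"\x01\x02\x03record-bytes" + b"\x00" * 1000
used, padding = split_preallocated(sample)
print(f"in use: {used} bytes, padding: {padding} bytes")
```

For a real file, the same function could be applied to the bytes read from the log. Note that the listed size of 67108880 bytes is 64 * 1024 * 1024 plus 16 bytes, which fits a large pre-allocated block preceded by a small amount of data.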

The number in the file name suffix is a zxid: the zxid of the first transaction in that log file, written in hexadecimal. Naming the files this way makes recovery easy and lets the server find the right log quickly. Hexadecimal is used because it makes the epoch and the counter easy to read off: the epoch is the high 32 bits of the zxid and the counter is the low 32 bits. So the first file belongs to epoch 1 and the second to epoch 2.
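The split into epoch and counter can be sketched in a few lines of Python. The function name is an assumption for illustration; the file names are the ones from the listing above.

```python
def parse_zxid_suffix(file_name: str) -> tuple:
    """Split the hexadecimal zxid suffix of a log or snapshot file name into (epoch, counter)."""
    _, _, suffix = file_name.rpartition(".")
    zxid = int(suffix, 16)
    epoch = zxid >> 32            # high 32 bits
    counter = zxid & 0xFFFFFFFF   # low 32 bits
    return epoch, counter

for name in ("log.100000001", "log.200000001"):
    epoch, counter = parse_zxid_suffix(name)
    print(f"{name}: epoch={epoch}, counter={counter}")
```

Because the counter occupies exactly eight hexadecimal digits, everything to the left of the last eight digits of the suffix is the epoch, which is why the epoch can be read directly from the file name.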

Being able to see what is inside the files is very useful when you need to track down a problem. Developers have spent a lot of time investigating why ZK "lost" znode data, only to find by looking at the transaction log that a client had deleted it.

We can view the second log file with the following command:

    java -cp $ZK_LIBS org.apache.zookeeper.server.LogFormatter version-2/log.200000001

The output is as follows:


7/15/13 ... session 0x13 ... XX cxid 0x0 zxid 0x200000001 createSession 30000
7/15/13 ... session 0x13 ... XX cxid 0x2 zxid 0x200000002 create
'/test, #22746573746 ...
7/15/13 ... session 0x13 ... XX cxid 0x3 zxid 0x200000003 create
'/test/c1, #6368696c ...
7/15/13 ... session 0x13 ... XX cxid 0x4 zxid 0x200000004 create
'/test/c2, #6368696c ...
7/15/13 ... session 0x13 ... XX cxid 0x5 zxid 0x200000005 create
'/test/c3, #6368696c ...
7/15/13 ... session 0x13 ... XX cxid 0x0 zxid 0x200000006 closeSession null

Each transaction is printed in human-readable form. Because only operations that change state are recorded as transactions, read operations do not appear in the log.
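When the formatted log is long, it can help to filter it by operation, for example to check whether a client deleted a znode. The following Python sketch pulls the zxid and operation name out of lines shaped like the output above. The regular expression, the function names, and the abridged sample are assumptions for illustration; the exact line format may differ between ZooKeeper versions.

```python
import re

# Matches the "zxid 0x... <operation>" part of a formatted log line.
ZXID_OP = re.compile(r"zxid (0x[0-9a-fA-F]+) (\w+)")

def summarize(formatted_log: str) -> list:
    """Return (zxid, operation) pairs for every line that describes a transaction."""
    result = []
    for line in formatted_log.splitlines():
        match = ZXID_OP.search(line)
        if match:
            result.append((int(match.group(1), 16), match.group(2)))
    return result

def operations_of(formatted_log: str, operation: str) -> list:
    """Return the zxids (as hex strings) of all transactions of one kind, e.g. "delete"."""
    return [hex(zxid) for zxid, op in summarize(formatted_log) if op == operation]

# Abridged lines in the style of the output above (elisions kept as "...").
sample = """\
7/15/13 ... session 0x13 ... cxid 0x2 zxid 0x200000002 create
'/test, #22746573746 ...
7/15/13 ... session 0x13 ... cxid 0x0 zxid 0x200000006 closeSession null
"""
print(operations_of(sample, "create"))
```

Continuation lines that carry only the path and data are skipped because they contain no zxid field.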

Snapshot

Snapshot files follow a naming pattern similar to that of transaction logs. Here is an example listing of snapshot files:


-rw-r--r-- 1 breed 296 Jun 5 07:49 snapshot.0
-rw-r--r-- 1 breed 415 Jul 21:33 snapshot.100000009

Snapshot files are not pre-allocated, so their size reflects the actual amount of data. The suffix is the current zxid at the moment the snapshot started. As mentioned in the previous article, a snapshot file is actually fuzzy: the snapshot data is only correct after the corresponding transaction log has been replayed over it. To recover the data, the transaction logs must be replayed starting at the zxid in the snapshot's file name suffix, or earlier.
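The selection of files needed for a recovery can be sketched as follows. This is a simplified illustration under stated assumptions, not ZooKeeper's own recovery code: the function names are hypothetical, and the sketch takes the newest snapshot at face value without checking that it was written completely.

```python
def zxid_of(file_name: str) -> int:
    """Parse the hexadecimal zxid suffix of a log.* or snapshot.* file name."""
    return int(file_name.rpartition(".")[2], 16)

def files_for_recovery(file_names: list) -> tuple:
    """Pick the newest snapshot and the logs that must be replayed on top of it.

    A log is needed if it starts after the snapshot zxid, and so is the
    last log that starts at or before it, because that log may still
    contain transactions newer than the snapshot.
    """
    snapshots = sorted((n for n in file_names if n.startswith("snapshot.")), key=zxid_of)
    logs = sorted((n for n in file_names if n.startswith("log.")), key=zxid_of)
    if not snapshots:
        return None, logs  # no snapshot: replay every log from the beginning
    newest = snapshots[-1]
    snap_zxid = zxid_of(newest)
    earlier = [n for n in logs if zxid_of(n) <= snap_zxid]
    later = [n for n in logs if zxid_of(n) > snap_zxid]
    return newest, earlier[-1:] + later

# File names taken from the listings in this article.
names = ["log.100000001", "log.200000001", "snapshot.0", "snapshot.100000009"]
print(files_for_recovery(names))
```

With the files from this article, both logs are selected: log.100000001 starts before snapshot.100000009 and may contain later transactions from epoch 1, and log.200000001 starts after it.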

The snapshot files are stored in binary form, and there is another tool to parse the snapshot files:

    java -cp $ZK_LIBS org.apache.zookeeper.server.SnapshotFormatter version-2/snapshot.100000009


The output is as follows:
----
/
cZxid = 0x00000000000000
ctime = Wed Dec 16:00:00 PST 1969
mZxid = 0x00000000000000
mtime = Wed Dec 16:00:00 PST 1969
pZxid = 0x00000100000002
cversion = 1
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x00000000000000
dataLength = 0
----
/sasd
cZxid = 0x00000100000002
ctime = Wed Jun 07:50:56 PDT 2013
mZxid = 0x00000100000002
mtime = Wed Jun 07:50:56 PDT 2013
pZxid = 0x00000100000002
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x00000000000000
dataLength = 3
----

Only the metadata of each znode is dumped. This lets an administrator find out things such as when a znode was changed and which znodes are consuming a lot of memory. Unfortunately, the znode data and ACLs are not printed. When tracking down a problem, remember that the information in a snapshot has to be combined with the information in the corresponding log files.
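To find the znodes that take up the most space, the dump can be parsed into a dictionary and sorted by data length. The Python sketch below assumes the block layout shown above ("----" separators, a path line, then "key = value" lines); the function names and the abridged sample are illustrative, and keys are lower-cased so that differences in capitalization do not matter.

```python
def parse_snapshot_dump(text: str) -> dict:
    """Parse "----"-separated blocks of "key = value" lines into {path: {key: value}}."""
    nodes = {}
    for block in text.split("----"):
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
        if not lines or not lines[0].startswith("/"):
            continue  # not a znode block
        fields = {}
        for line in lines[1:]:
            key, sep, value = line.partition(" = ")
            if sep:
                fields[key.lower()] = value  # lower-case keys to tolerate casing differences
        nodes[lines[0]] = fields
    return nodes

def largest_znodes(nodes: dict, top: int = 5) -> list:
    """Return (path, data length) pairs sorted by data length, largest first."""
    sizes = [(path, int(f.get("datalength", 0))) for path, f in nodes.items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]

# Abridged version of the dump above.
dump = """\
----
/
cZxid = 0x00000000000000
dataLength = 0
----
/sasd
cZxid = 0x00000100000002
dataLength = 3
----
"""
print(largest_znodes(parse_snapshot_dump(dump)))
```

The same dictionary can be used to look at mZxid or mtime when the question is when a znode was last changed.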

Epoch File

In addition, two small files help make the ZK state persistent. These are the two epoch files, acceptedEpoch and currentEpoch. They record the epoch numbers that this server process has seen and participated in. Although they contain no application-level data, they are important for data consistency, so do not leave them out when you back up your data files.

Using ZK's data

Standalone mode and cluster mode store data in the same way. As mentioned above, the correct data is obtained by merging snapshots and logs. You can therefore copy the log files and snapshot files to another machine, such as your laptop, put them into the clean data directory of a standalone server, and start it; the server will then reflect the state of the server the files came from. This lets you inspect a state close to that of the production environment. It also means that taking a backup is simply a matter of copying files. If you choose this approach, keep a few things in mind. First, ZK is replicated, so the data is redundant: when you take a backup, you only need to back up the data of one server.
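Copying the files can be sketched in Python as follows. The function names are hypothetical, and the sketch simply copies whatever log, snapshot, and epoch files it finds in the given source directories into an empty target directory (typically the version-2 subdirectory of a clean data directory). It makes no attempt to coordinate with a running server, so files copied from a live server may still be changing.

```python
import shutil
from pathlib import Path

def is_zk_state_file(name: str) -> bool:
    """True for the files discussed above: logs, snapshots and the two epoch files."""
    return (name.startswith(("log.", "snapshot."))
            or name in ("acceptedEpoch", "currentEpoch"))

def copy_zk_state(source_dirs: list, target_dir: Path) -> list:
    """Copy ZooKeeper state files from the source directories into an empty target directory."""
    target_dir.mkdir(parents=True, exist_ok=True)
    if any(target_dir.iterdir()):
        raise ValueError(f"{target_dir} is not empty")
    copied = []
    for source in source_dirs:
        for src in sorted(Path(source).iterdir()):
            if src.is_file() and is_zk_state_file(src.name):
                shutil.copy2(src, target_dir / src.name)  # keeps timestamps
                copied.append(src.name)
    return copied
```

If snapshots and transaction logs live in different directories, both version-2 directories can be passed in source_dirs so that everything ends up in one place for the standalone server.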

Keep in mind that when a ZK server acknowledges a transaction, it promises to remember that state from then on. If you restore a server from an old backup, you make it break that promise. This is not a big problem if you have just suffered a total loss of data, but if you put old data onto one server of a working cluster, it can cause other servers to lose state as well.

If you need to recover all or most of the servers, the best approach is to take the most up-to-date state you have (the latest data from the surviving servers) and copy it into the data directory of each server before starting it.
