"Distributed" Zookeeper data and storage

I. Preface

Before analyzing how ZooKeeper processes requests, this post looks at how ZooKeeper stores its underlying data. Data storage is divided into in-memory data storage and on-disk data storage.

II. Data and storage

  2.1 Memory Data

ZooKeeper's data model is a tree. The in-memory database holds the contents of the whole tree, including every node path, the node data, and the ACL information, and ZooKeeper periodically persists this data to disk.

  1. DataTree

DataTree is the core of the in-memory data store: a tree structure that represents one complete copy of the data in memory. DataTree contains no business logic related to networking, client connections, or request processing; it is an independent component.

  2. DataNode

DataNode is the smallest unit of data storage. Besides holding the node's data content, ACL list, and node state, it also records a reference to its parent node and the list of its child nodes, and it provides an interface for operating on the child list.
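
The relationship between DataTree and DataNode can be illustrated with a minimal Python sketch. It is not ZooKeeper's Java code: the class and method names below are hypothetical, the flat path-to-node map is an assumption made for illustration, and node state, ACL checks, and watches are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass(eq=False)
class DataNode:
    """Hypothetical node: data, ACL, a parent reference, and child names."""
    data: bytes = b""
    acl: tuple = ()
    parent: Optional["DataNode"] = None
    children: Set[str] = field(default_factory=set)


class DataTree:
    """Hypothetical tree that keeps every node in a map keyed by full path."""

    def __init__(self) -> None:
        self.nodes: Dict[str, DataNode] = {"/": DataNode()}

    def create(self, path: str, data: bytes) -> None:
        parent_path, _, name = path.rpartition("/")
        parent = self.nodes[parent_path or "/"]  # KeyError if parent is missing
        if path in self.nodes:
            raise KeyError(f"node already exists: {path}")
        self.nodes[path] = DataNode(data=data, parent=parent)
        parent.children.add(name)

    def set_data(self, path: str, data: bytes) -> None:
        self.nodes[path].data = data

    def delete(self, path: str) -> None:
        node = self.nodes[path]
        if node.children:
            raise ValueError(f"node has children: {path}")
        name = path.rpartition("/")[2]
        node.parent.children.discard(name)
        del self.nodes[path]
```

Keeping the tree free of networking and session logic, as the text describes, is what makes a structure like this easy to serialize as a whole.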

  3. ZKDatabase

ZKDatabase is ZooKeeper's in-memory database. It manages all ZooKeeper sessions, the DataTree storage, and the transaction log. ZKDatabase periodically dumps snapshot data to disk, and when ZooKeeper starts, it restores the complete in-memory database from the transaction log and snapshot files on disk.

  2.2 Transaction log

  1. File storage

When configuring a ZooKeeper cluster, you need to set the dataDir directory, which by default also stores the transaction log files. You can instead give the transaction log its own storage directory with dataLogDir. If dataLogDir is set to /home/admin/zkdata/zk_log, ZooKeeper creates a subdirectory named version-2 under it at run time. The subdirectory name identifies the version of the transaction log format used by the current ZooKeeper; if a later ZooKeeper release changes the transaction log format, the directory name changes as well. A series of files of identical size (64 MB) is generated under the version-2 subdirectory.

  2. Log format

After configuring the log file directory and starting ZooKeeper, perform the following operations:

(1) Create a /test_log node with an initial value of v1.

(2) Update the data of the /test_log node to v2.

(3) Create a /test_log/c node with an initial value of v1.

(4) Delete the /test_log/c node.

After these four operations, a log file is generated under the /log/version-2/ directory; in the author's case it is log.cec.

Copy zookeeper-3.4.6.jar and slf4j-api-1.6.1.jar from the ZooKeeper distribution into the /log/version-2 directory and open the log.cec file with the following command.

java -classpath ./zookeeper-3.4.6.jar:./slf4j-api-1.6.1.jar org.apache.zookeeper.server.LogFormatter log.cec

  

ZooKeeper Transactional Log File with dbid 0 txnlog format version 2. This is the file header, which mainly carries the dbid of the transaction log and the version number of the log format.

... session 0x159 ... 0xcec createSession 30000. This represents a client session creation operation.

... session 0x159 ... 0xced create '/test_log,... This represents the creation of the /test_log node with data content #7631 (v1).

... session 0x159 ... 0xcee setData '/test_log,... This indicates that the data of the /test_log node was set, and the content is #7632 (v2).

... session 0x159 ... 0xcef create '/test_log/c,... This represents the creation of the node /test_log/c.

... session 0x159 ... 0xcf0 delete '/test_log/c. This represents the deletion of the node /test_log/c.
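
Two details of this output are easy to check by hand: the data fields are hexadecimal encodings of the node data, and the file name log.cec carries the ZXID of the first transaction in the file (0xcec). The short Python sketch below shows both; the function names are hypothetical helpers written for this post and are not part of ZooKeeper.

```python
def decode_log_data(field: str) -> str:
    """Turn a '#'-prefixed hex field such as '#7631' into readable text."""
    if not field.startswith("#"):
        raise ValueError("expected a '#'-prefixed hex string")
    return bytes.fromhex(field[1:]).decode("utf-8")


def log_file_name(first_zxid: int) -> str:
    """Build a transaction log file name from the first ZXID it contains."""
    return f"log.{first_zxid:x}"
```

For example, decode_log_data("#7631") yields "v1", because 0x76 is the character v and 0x31 is the character 1.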

  3. Log writing

FileTxnLog maintains the external interface to the transaction log, including writing and reading it. Writing a transaction to the log can be broadly divided into the following six steps.

(1) Determine whether a writable transaction log file exists. When the ZooKeeper server has just started and is about to write its first transaction, or when the previous transaction log file has been filled, the server is not associated with any log file. Therefore, before writing, ZooKeeper first checks whether the FileTxnLog component is already associated with a writable transaction log file. If not, it creates a transaction log file that uses the ZXID of the current transaction as its suffix, builds the file header, writes the header to the file immediately, and adds the file's stream to the streamsToFlush collection. This collection records the file streams whose data must be forced to disk.

(2) Determine whether the transaction log file needs to be expanded (pre-allocated). ZooKeeper pre-allocates disk space. When fewer than 4096 bytes (4 KB) remain in the current transaction log file, the file is extended by 65536 KB (64 MB) beyond its existing size, and the new space is filled with zero bytes.

(3) Serialize the transaction. Both the transaction header and the transaction body are serialized. The transaction body can be a session creation transaction, a node creation transaction, a node deletion transaction, a node data update transaction, and so on.

(4) Generate the checksum. To ensure the integrity of the log file and the accuracy of its data, ZooKeeper computes a checksum before writing the transaction to the file.

(5) Write to the transaction log file stream. The serialized transaction header, transaction body, and checksum are written to the file stream. Because the stream is buffered, the data has not necessarily reached the disk at this point.

(6) Flush the transaction log to disk. Because of the buffering in step 5, the data is not written to the disk file in real time, so the buffered data must be forced to disk.
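
Steps 2 and 4 through 6 above can be sketched in Python as follows. This is a simplified illustration, not ZooKeeper's actual on-disk format: the frame layout (checksum, length, payload) and all function names are assumptions made for this example, and Adler-32 stands in for the checksum.

```python
import os
import struct
import zlib

PREALLOC_SIZE = 64 * 1024 * 1024   # 65536 KB, as described in step 2
PREALLOC_THRESHOLD = 4096          # expand when fewer bytes than this remain


def padded_size(current_position: int, file_size: int) -> int:
    """Step 2: return the file size after the pre-allocation check."""
    if file_size - current_position < PREALLOC_THRESHOLD:
        return file_size + PREALLOC_SIZE
    return file_size


def frame_transaction(header: bytes, body: bytes) -> bytes:
    """Step 4: prefix the serialized transaction with checksum and length."""
    payload = header + body
    checksum = zlib.adler32(payload) & 0xFFFFFFFF
    return struct.pack(">QI", checksum, len(payload)) + payload


def read_frame(frame: bytes) -> bytes:
    """Verify the checksum of a frame and return its payload."""
    checksum, length = struct.unpack_from(">QI", frame)
    payload = frame[12:12 + length]
    if zlib.adler32(payload) & 0xFFFFFFFF != checksum:
        raise ValueError("checksum mismatch")
    return payload


def append_and_sync(path: str, frame: bytes) -> None:
    """Steps 5 and 6: write to a buffered stream, then force it to disk."""
    with open(path, "ab") as stream:
        stream.write(frame)          # may still sit in the user-space buffer
        stream.flush()               # hand the bytes to the operating system
        os.fsync(stream.fileno())    # ask the OS to write them to the device
```

Separating the buffered write from the forced flush is what allows several transactions to be flushed together instead of paying for one disk sync per write.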

  4. Log truncation

While ZooKeeper is running, a non-leader machine may record a transaction ID larger than the leader's, which is an illegal state. ZooKeeper requires every machine to stay consistent with the leader's data, so the leader sends a TRUNC command to that machine to request log truncation. After the learner receives the command, it deletes all transaction log files that contain that transaction ID or larger ones.
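
As a rough illustration of which files a truncation touches, the Python sketch below selects the log files that contain a given ZXID or start after it, relying on the fact that each file name carries the first ZXID in the file. The function name is hypothetical, and the sketch follows the description in this post rather than ZooKeeper's source code.

```python
from typing import List


def files_affected_by_truncation(file_names: List[str], trunc_zxid: int) -> List[str]:
    """Return log files that contain trunc_zxid or begin after it."""
    # Each name looks like "log.<hex zxid>"; sort the files by starting ZXID.
    starts = sorted((int(name.split(".", 1)[1], 16), name) for name in file_names)

    # The file containing trunc_zxid is the last one that starts at or before it.
    containing = None
    for index, (start, _) in enumerate(starts):
        if start <= trunc_zxid:
            containing = index

    first = 0 if containing is None else containing
    return [name for _, name in starts[first:]]
```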

  2.3 Snapshot (data snapshot)

Data snapshots are a core mechanism of ZooKeeper data storage. A snapshot records the full in-memory data of a ZooKeeper server at a point in time and writes it to a specified disk file.

  1. File storage

As with transaction log files, the directory for ZooKeeper snapshot files is configurable, through the dataDir property. If dataDir is set to /home/admin/zkdata/zk_data, ZooKeeper creates a version-2 directory under it at run time; the directory name identifies the version of the snapshot data format used by the current ZooKeeper. A series of snapshot files is generated there while ZooKeeper runs.
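
The way the two settings map to directories can be summarized in a few lines of Python. This is only a sketch: the function name and the dictionary-based configuration are hypothetical, and it assumes that the transaction log falls back to dataDir when dataLogDir is not set.

```python
from pathlib import PurePosixPath
from typing import Dict


def storage_dirs(config: Dict[str, str]) -> Dict[str, str]:
    """Resolve where snapshots and transaction logs are stored."""
    data_dir = PurePosixPath(config["dataDir"])
    # Without a dedicated dataLogDir, the logs share dataDir with the snapshots.
    log_dir = PurePosixPath(config.get("dataLogDir", config["dataDir"]))
    return {
        "snapshots": str(data_dir / "version-2"),
        "txn_logs": str(log_dir / "version-2"),
    }
```

Putting the transaction log on its own directory, ideally on its own disk, keeps log writes from competing with snapshot writes.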

  2. Data snapshots

FileSnap maintains the external interface to snapshot data, including writing and reading it. Writing the in-memory database to a snapshot file is essentially a serialization process. For each client transaction, ZooKeeper records it in the transaction log and also applies the change to the in-memory database. After a certain number of transactions have been logged, ZooKeeper dumps the full contents of the in-memory database to a local file; this is a data snapshot. The steps are as follows.

(1) Determine whether a snapshot is needed. After each transaction is logged, ZooKeeper checks whether a snapshot is currently required. Because snapshotting affects the performance of the machine, ZooKeeper tries to avoid having all machines in the cluster take a snapshot at the same time, so it uses a "past half, plus random" strategy: a snapshot is triggered once the number of logged transactions exceeds half of snapCount plus a random offset.

(2) Switch the transaction log file. The current transaction log is considered full, and a new transaction log file must be created.

(3) Create an asynchronous snapshot thread. A separate thread takes the snapshot so that the main ZooKeeper thread is not affected.

(4) Get the full data and session information. The DataTree and the session information are obtained from ZKDatabase.

(5) Generate the snapshot file name. ZooKeeper derives the snapshot file name from the largest ZXID committed so far.

(6) Serialize the data. The file header is serialized first, then the session information and the DataTree are serialized in turn, and finally a checksum is generated and everything is written to the snapshot file.
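
Steps 1 and 5 can be sketched in Python as follows. The function names are hypothetical, and the trigger condition assumes that a snapshot is taken once the number of logged transactions exceeds half of the configured count plus a random offset below that half; treat the exact bounds as an assumption of this sketch.

```python
import random


def new_rand_roll(snap_count: int, rng=random) -> int:
    """Pick a random offset in the range [0, snap_count // 2)."""
    return rng.randrange(snap_count // 2)


def should_snapshot(log_count: int, snap_count: int, rand_roll: int) -> bool:
    """Step 1: snapshot once log_count passes half of snap_count plus the offset."""
    return log_count > snap_count // 2 + rand_roll


def snapshot_file_name(last_committed_zxid: int) -> str:
    """Step 5: name the snapshot after the largest committed ZXID, in hex."""
    return f"snapshot.{last_committed_zxid:x}"
```

Because every server draws its own offset, servers that log the same transactions still tend to take their snapshots at different moments.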

  2.4 Initialization

During ZooKeeper server startup, data initialization runs first and loads the data files stored on disk into the server's memory.

  1. Initialization process

ZooKeeper's data initialization process is as follows.

Data initialization is the process of loading data from disk. It consists of two parts: loading snapshot data from the snapshot files, and correcting that data based on the transaction log.

(1) Initialize FileTxnSnapLog. FileTxnSnapLog is ZooKeeper's access layer for transaction logs and snapshot data. It connects the upper-layer business logic with the underlying data storage, which contains both the transaction log and the snapshot data, so FileTxnSnapLog wraps FileTxnLog and FileSnap.

(2) Initialize ZKDatabase. The DataTree is built first, and the FileTxnSnapLog is handed to ZKDatabase so that the in-memory database can access the transaction log and snapshot data. While ZKDatabase is initialized, DataTree performs its own initialization, such as creating the three default nodes /, /zookeeper, and /zookeeper/quota.

(3) Create the PlayBackListener. It mainly receives callbacks while transactions are applied. Later in data recovery there is a transaction correction phase, during which ZooKeeper calls back the PlayBackListener to make the corresponding data corrections.

(4) Process the snapshot files. Data can now be recovered from disk, starting with the snapshot files.

(5) Get the latest 100 snapshot files. The most recent snapshot file contains the most up-to-date full data.

(6) Parse the snapshot files. The snapshot files are parsed one by one: each is deserialized to produce the DataTree and sessionsWithTimeouts, and its checksum is verified to confirm that the file is correct. Of the 100 snapshot files, usually only the most recent one is parsed, provided it passes the correctness check. Only if the most recent file is unusable are the remaining files parsed in turn. If a complete DataTree and sessionsWithTimeouts cannot be recovered after all 100 snapshot files have been parsed, the server fails to start.

(7) Get the latest ZXID. The ZXID of the snapshot, zxid_for_snap, is parsed from the snapshot's file name. It represents the moment at which ZooKeeper began taking that snapshot.

(8) Process the transaction log. At this point the server's memory holds a nearly complete copy of the data, and the incremental changes are now applied from the transaction log.

(9) Get all transactions committed after zxid_for_snap. The latest ZXID of the snapshot data is already known from step 7, so only the transactions in the transaction log whose ZXID is larger than it need to be fetched.

(10) Apply the transactions. The transactions with a ZXID larger than zxid_for_snap are applied one by one to the DataTree and sessionsWithTimeouts recovered from the snapshot file. Each time a transaction is applied to the in-memory database, ZooKeeper also calls back the PlayBackListener, which converts the transaction record into a proposal and saves it in ZKDatabase's committedLog so that followers can synchronize quickly.

(11) Get the latest ZXID. After all transactions have been applied to the in-memory database, data initialization is essentially complete, and the ZXID is read once more to identify the largest transaction ID that was committed during the server's previous run.

(12) Check the epoch. The epoch identifies the current leader term. When cluster machines communicate with each other, they carry the epoch to ensure that they are in the same leader term. After the data is loaded, ZooKeeper parses the leader term of the transaction, epochOfZxid, from the ZXID obtained in step 11. It also reads the last recorded epoch values from the currentEpoch and acceptedEpoch files on disk for verification.
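
The recovery flow in steps 5 through 12 can be condensed into a Python sketch. Everything here is a simplification for illustration: snapshots are modeled as (zxid, tree) pairs where a tree of None stands in for a file that fails its checksum, transactions are plain tuples, a list plays the role of the committedLog filled by the PlayBackListener, and the function names are hypothetical. The one ZooKeeper-specific fact it relies on is that the epoch occupies the high 32 bits of a 64-bit ZXID.

```python
from typing import Dict, List, Optional, Tuple

Snapshot = Tuple[int, Optional[Dict[str, bytes]]]
Txn = Tuple[int, str, str, Optional[bytes]]  # (zxid, operation, path, data)


def recover(snapshots: List[Snapshot], txn_log: List[Txn], max_snapshots: int = 100):
    """Load the newest valid snapshot, then replay later transactions."""
    newest_first = sorted(snapshots, key=lambda s: s[0], reverse=True)
    for zxid_for_snap, tree in newest_first[:max_snapshots]:
        if tree is not None:  # stands in for a successful checksum check
            break
    else:
        raise RuntimeError("no valid snapshot found, server cannot start")

    tree = dict(tree)
    last_zxid = zxid_for_snap
    committed_log: List[int] = []
    for zxid, operation, path, data in txn_log:
        if zxid <= zxid_for_snap:
            continue  # already reflected in the snapshot
        if operation in ("create", "setData"):
            tree[path] = data
        elif operation == "delete":
            tree.pop(path, None)
        committed_log.append(zxid)  # the listener callback would record a proposal
        last_zxid = zxid
    return tree, last_zxid, committed_log


def epoch_of_zxid(zxid: int) -> int:
    """The epoch is stored in the high 32 bits of the ZXID."""
    return zxid >> 32
```

Replaying the four operations from the log format example on top of a snapshot taken at 0xced leaves only /test_log with the value v2, which matches the final state of the tree.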

  2.5 Data Synchronization

After the cluster completes leader election, each learner registers with the leader. Once registration is complete, data synchronization begins: the leader synchronizes to the learner the transaction requests that have not yet been committed on the learner. The general process is as follows.

(1) Get the learner's state. In the final phase of learner registration, the learner sends an ACKEPOCH packet to the leader, and the leader parses the learner's currentEpoch and lastZxid from the packet.

(2) Initialize data synchronization. The leader first extracts the proposal cache queue for transaction requests, proposals, from the ZooKeeper in-memory database, and then initializes three ZXID values: peerLastZxid (the last ZXID processed by the learner), minCommittedLog (the smallest ZXID in the leader's proposal cache queue committedLog), and maxCommittedLog (the largest ZXID in the leader's proposal cache queue committedLog).

Cluster data synchronization is usually divided into four types: direct differential synchronization (DIFF), rollback followed by differential synchronization (TRUNC+DIFF), rollback-only synchronization (TRUNC), and full synchronization (SNAP). During the initialization phase, the leader defaults to full synchronization, and it then determines the final synchronization method based on the data differences between the leader and the learner.

  • Direct differential synchronization (DIFF; peerLastZxid lies between minCommittedLog and maxCommittedLog). The leader first sends a DIFF instruction to the learner to notify it that differential synchronization is starting, and then sends the learner the proposals it is missing. For each proposal, the leader sends a proposal content packet followed by a commit instruction packet.

  • Rollback followed by differential synchronization (TRUNC+DIFF; a former leader had written a transaction to its local transaction log but had not successfully initiated the proposal process). When the leader finds that a learner holds a transaction record that the leader itself does not have, it requires the learner to roll back to the ZXID that exists on the leader and is closest to peerLastZxid, and then performs differential synchronization.

  • Rollback-only synchronization (TRUNC; peerLastZxid is greater than maxCommittedLog). The leader requires the learner to roll back to the transaction whose ZXID is maxCommittedLog.

  • Full synchronization (SNAP; peerLastZxid is less than minCommittedLog, or peerLastZxid is not equal to lastProcessedZxid). The leader cannot synchronize the learner directly from the proposal cache queue, so only full synchronization is possible: the leader sends its full in-memory data to the learner. The leader first sends a SNAP instruction to notify the learner that full synchronization is about to begin, then fetches all data nodes and session timeout records from the in-memory database, serializes them, and transmits them to the learner. After receiving the data, the learner deserializes it and loads it into its in-memory database.
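
The choice among the four synchronization types can be written as a small decision function. The Python sketch below only mirrors the conditions listed above: the function name is hypothetical, the committed log is reduced to a sorted list of ZXIDs, and the rule that a ZXID inside the range but absent from the leader's log leads to TRUNC+DIFF is an interpretation of the description, not ZooKeeper's actual code.

```python
from typing import List


def choose_sync_mode(peer_last_zxid: int, committed_zxids: List[int]) -> str:
    """Pick a synchronization type from the learner's last ZXID.

    committed_zxids is the leader's proposal cache, sorted in ascending order.
    """
    if not committed_zxids:
        return "SNAP"  # no proposal cache to synchronize from

    min_committed_log = committed_zxids[0]
    max_committed_log = committed_zxids[-1]

    if peer_last_zxid > max_committed_log:
        return "TRUNC"  # the learner is ahead and must roll back
    if peer_last_zxid < min_committed_log:
        return "SNAP"   # the learner is too far behind the cache
    if peer_last_zxid in committed_zxids:
        return "DIFF"   # send only the proposals after peer_last_zxid
    return "TRUNC+DIFF"  # the learner holds a transaction the leader lacks
```

With a cache of [5, 6, 8], a learner at 6 receives DIFF, a learner at 7 receives TRUNC+DIFF, a learner at 9 receives TRUNC, and a learner at 3 receives SNAP.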

III. Summary

This post covered ZooKeeper's data and storage, including in-memory data, the transaction log, snapshot data, initialization, and data synchronization. This concludes the theoretical study of ZooKeeper; source code analysis comes next. Thanks to all readers for following along.
