1. Anti-entropy
Cassandra borrows its distributed architecture from Amazon's Dynamo and its data storage model from Google's Bigtable. Its approach to data consistency is therefore closely tied to those systems, and the anti-entropy mechanism is one expression of that connection. Anti-entropy, like the gossip protocol, is an algorithm based on epidemic theory; it is mainly used to ensure that data on different nodes is brought up to the latest version.
To understand anti-entropy you first need to understand the Merkle tree. In Cassandra each data item can be expressed as a (key, value) pair, with keys evenly distributed over a 2^n key space (for example, the key can be a SHA-1 hash value). To synchronize data, each of the two nodes builds a Merkle tree over its data set. A Merkle tree is a binary tree. A leaf at the lowest level can be, for example, the XOR of 16 keys, and each parent node is the XOR of its two children. When comparing, the two nodes first exchange the root of the tree; if the roots are equal, no further comparison is needed. Otherwise they compare the left and right subtrees in turn. Cassandra uses this comparison to determine whether the data on two nodes is consistent; if it is not, the out-of-date node is updated according to the timestamps in the data records.
The Merkle tree in Cassandra differs somewhat from the one in Amazon's Dynamo. Cassandra requires each column family to have its own Merkle tree. The tree is created as a snapshot during a major compaction, and it lives only as long as it takes to send it to the neighboring nodes, which reduces disk I/O. After each update the anti-entropy algorithm runs: it computes a checksum over the database and compares it with the checksums of the other nodes.
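The top-down Merkle tree comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cassandra's implementation: the helper names (`build_tree`, `diff_leaves`, `leaf_index`) are hypothetical, each leaf is the XOR of SHA-1 digests of the (key, value) items that fall into its key range (an assumption made so that a changed value is detected), and each parent is the XOR of its two children, following the description in the text.

```python
import hashlib

def _sha1_int(text: str) -> int:
    """SHA-1 digest of `text` as a 160-bit integer."""
    return int.from_bytes(hashlib.sha1(text.encode()).digest(), "big")

def leaf_index(key: str, depth: int) -> int:
    """Leaf (key range) that a key falls into: the top `depth` bits of its hash."""
    return _sha1_int(key) >> (160 - depth)

def build_tree(data: dict, depth: int) -> list:
    """Return the tree as a list of levels: levels[0] is [root], levels[-1] the leaves."""
    leaves = [0] * (2 ** depth)
    for key, value in data.items():
        # Assumption: digest key and value together so value changes are visible.
        leaves[leaf_index(key, depth)] ^= _sha1_int(f"{key}\x00{value}")
    levels = [leaves]
    while len(levels[0]) > 1:
        below = levels[0]
        # Each parent is the XOR of its two children, as in the text.
        levels.insert(0, [below[i] ^ below[i + 1] for i in range(0, len(below), 2)])
    return levels

def diff_leaves(a: list, b: list, level: int = 0, index: int = 0) -> list:
    """Compare two trees top-down; return indices of leaves that differ."""
    if a[level][index] == b[level][index]:
        return []                      # equal subtree: no further comparison
    if level == len(a) - 1:
        return [index]                 # reached a differing leaf
    return (diff_leaves(a, b, level + 1, 2 * index)
            + diff_leaves(a, b, level + 1, 2 * index + 1))

# Two replicas that differ in a single value.
node_a = {f"user{i}": f"v{i}" for i in range(100)}
node_b = dict(node_a, user42="stale")
ranges_to_sync = diff_leaves(build_tree(node_a, 4), build_tree(node_b, 4))
print(ranges_to_sync == [leaf_index("user42", 4)])  # prints True
```

Only the key ranges returned by `diff_leaves` would need to be exchanged, which is where the saving in network traffic comes from. Note that a plain XOR parent can in principle hide two child differences that cancel each other out; a production design would more likely hash the concatenated children.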
If the checksums differ, the data is exchanged. This requires a time window that gives the other nodes a chance to receive the most recent update, so that the system does not run anti-entropy operations unnecessarily. Anti-entropy solves the data consistency problem in Cassandra to a large extent, but the strategy has its own problems. When the difference between the two data sets is small, the Merkle tree reduces network transfer overhead. However, both participating nodes must traverse all of their data items to compute the tree, so the computational overhead (or I/O overhead, if the data has to be read from disk) is large and may affect the service the server provides to clients. This is the main reason why some large companies have abandoned Cassandra.
2. Read repair
A coordinator (read proxy) can send two types of read request to a replica: direct read requests and read repair requests. The number of replicas to read is set by the user when issuing the read. For example, with ONE only a single replica is read; with QUORUM the result is returned to the client once more than half of the replicas have been read and agree. Read repair checks and repairs all replicas after the result has been sent back to the user, ensuring that all replicas remain consistent.
The user specifies a consistency level when requesting data from Cassandra. The coordinator of the read request reads from nodes according to that consistency level and compares the results. If they are consistent, it returns the value as is; if not, it picks the most recent data based on the timestamp and returns that to the user.
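The timestamp comparison performed by the coordinator can be sketched as follows. This is a minimal illustration, not Cassandra's code: the function name `reconcile` and the shape of the responses (a mapping from replica name to a `(value, timestamp)` pair) are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Versioned = Tuple[str, int]  # (value, client-supplied timestamp)

def reconcile(responses: Dict[str, Versioned]) -> Tuple[Versioned, List[str]]:
    """Pick the newest version among the replica responses.

    Returns the winning (value, timestamp) pair and the names of replicas
    that returned something else, i.e. the candidates for repair.
    Raises ValueError if `responses` is empty.
    """
    newest = max(responses.values(), key=lambda version: version[1])
    stale = sorted(node for node, version in responses.items() if version != newest)
    return newest, stale

# Two replicas were read; n1 still holds an older write.
result, needs_repair = reconcile({"n1": ("blue", 10), "n2": ("green", 12)})
print(result)        # ('green', 12)
print(needs_repair)  # ['n1']
```

The value in `result` is what the client receives; the replicas listed in `needs_repair` are the ones read repair would bring up to date.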
Although the result has already been returned to the user, Cassandra still needs to keep the data in the database consistent. In the background it checks all replicas of the data involved and synchronizes those that are out of date; this is the repair step of the read repair mechanism. For example, suppose a cluster has a replication factor of 3 (three copies of each piece of data) and the read consistency level is 2, meaning each read consults two of the three replicas. In this case, as shown in Figure 1, Cassandra reads two replicas, determines which of the two holds the latest data, and returns it to the user; read repair then fixes the third, unread replica so that all three replicas are consistent.
Figure 1 Cassandra read repair mechanism (dashed lines show the read repair process)
3. Hinted handoff
When a write request arrives at Cassandra, the node responsible for that data may for some reason be unable to satisfy the user-specified replication factor. The write then becomes troublesome, because it could be lost as a result of the node failure. To solve this problem Cassandra, like some other distributed systems, provides the hinted handoff mechanism. When a write cannot satisfy the replication factor because the responsible node is unavailable, the data is written to another node and success is returned to the user; when the responsible node resumes service, Cassandra writes the data held on the other node back to it. Hinted handoff keeps Cassandra always available for writes and reduces inconsistency after the target node resumes service. When the user's consistency level is set to ANY, the write is considered successful even if only a hint has been recorded.
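The mechanism just described can be modeled with a small in-memory sketch. Everything here is hypothetical and simplified for illustration (the `Node` class, `write`, and `replay_hints` are not Cassandra APIs): hints are kept apart from readable data, and they are replayed to the target once it is back up.

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.data = {}    # replica data that can serve reads
        self.hints = []   # (target_name, key, value) kept for down nodes; never read

def write(key, value, replicas, hint_holder) -> int:
    """Write to every live replica and record a hint for each one that is down.

    Returns the number of replicas that acknowledged the write; the caller
    would compare this with the requested consistency level.
    """
    acks = 0
    for node in replicas:
        if node.up:
            node.data[key] = value
            acks += 1
        else:
            hint_holder.hints.append((node.name, key, value))
    return acks

def replay_hints(hint_holder, recovered) -> None:
    """Deliver the hints addressed to `recovered` and drop them from the holder."""
    remaining = []
    for target, key, value in hint_holder.hints:
        if target == recovered.name and recovered.up:
            recovered.data[key] = value
        else:
            remaining.append((target, key, value))
    hint_holder.hints = remaining

n1, n2, n3 = Node("N1"), Node("N2"), Node("N3")
n1.up = False
acks = write("A", "v1", replicas=[n1, n2], hint_holder=n3)
print(acks, "A" in n3.data)   # 1 False
n1.up = True
replay_hints(n3, n1)
print(n1.data)                # {'A': 'v1'}
```

The sketch ignores timestamps, so a replayed hint could overwrite a newer value; a real system would compare timestamps during replay and would typically throttle the replay to avoid overwhelming the recovering node.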
For example, according to the placement rules key A is written first to node N1 and then replicated to N2. If N1 is down and writing to N2 satisfies the consistency level, the row mutation for key A is wrapped with a header carrying the hint (which identifies N1 as the target) and written to a randomly chosen node N3; this copy cannot be read. Meanwhile a copy of the data is replicated to N2 as usual, and that copy can serve reads. If writing to N2 does not satisfy the write consistency level, the write fails. When N1 recovers, the messages with hint headers that should have gone to N1 are written back to it.
Hinted handoff is used in many distributed systems to keep writes consistent, and it is regarded as a well-considered design for keeping data durable. The same pattern appears in other distributed computing models, such as the Java Message Service (JMS). In a durable "guaranteed delivery" JMS queue, if a message cannot be delivered to the recipient, JMS waits for a given time and then resends it until it is successfully received. In real systems, however, both JMS reliable delivery and Cassandra's hinted handoff share a problem: if a node has been offline for some time, a large number of hints will have accumulated on other nodes, and once the node comes back online those requests are all sent to it at once, which is hard to bear for a fragile node that has only just resumed service.
4. Distributed delete
Many operations that are simple on a single machine stop being simple in a distributed environment, and delete is one of them. On a single machine a delete simply removes the data from disk; in a distributed system things are quite different.
The difficulty of distributed deletion is this: if replica node A of an object is currently offline while the other replica nodes delete the object, then when A comes back online it does not know that the data has been deleted. It will try to restore the object on the other replica nodes, which undoes the deletion. The distributed deletion mechanism exists to solve this problem. Deleting a column does not remove the original column directly; it inserts a tombstone for that column. The tombstone is recorded as a modification to the column family in the Memtable and the SSTable. Its content is the time at which the delete request was executed, measured as the local time of the storage node that accepted the client's request; this is called the local delete time. The local delete time must be distinguished from the timestamp. Every modification to a column family carries a timestamp, which can be understood as the modification time of the column and is supplied by the client, whereas the local delete time exists only when the distributed deletion mechanism is used. Because a deleted column is not removed from disk immediately, the system consumes more disk space. This calls for a garbage collection mechanism that periodically removes the columns marked with tombstones; in Cassandra, garbage collection is performed during compaction.
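A small sketch may help show why a tombstone prevents a returning replica from resurrecting deleted data. The names below (`Cell`, `delete`, `merge`, `compact`) and the `grace_seconds` parameter are hypothetical and chosen for illustration; the source only states that tombstones are collected during compaction, so the grace period is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class Cell:
    timestamp: int                            # client-supplied modification time
    value: Optional[str] = None               # None when the cell is a tombstone
    local_delete_time: Optional[int] = None   # node-local time, set only on tombstones

    @property
    def is_tombstone(self) -> bool:
        return self.local_delete_time is not None

def delete(store: Dict[str, Cell], column: str, timestamp: int, now: int) -> None:
    """Deleting inserts a tombstone instead of removing the column."""
    store[column] = Cell(timestamp=timestamp, local_delete_time=now)

def merge(a: Dict[str, Cell], b: Dict[str, Cell]) -> Dict[str, Cell]:
    """Reconcile two replicas: for each column the newer timestamp wins."""
    merged = dict(a)
    for column, cell in b.items():
        if column not in merged or cell.timestamp > merged[column].timestamp:
            merged[column] = cell
    return merged

def compact(store: Dict[str, Cell], now: int, grace_seconds: int) -> Dict[str, Cell]:
    """Drop tombstones whose local delete time is at least `grace_seconds` old."""
    return {
        column: cell for column, cell in store.items()
        if not (cell.is_tombstone and now - cell.local_delete_time >= grace_seconds)
    }

# Replica A was offline during the delete and still holds the old value.
replica_a = {"email": Cell(timestamp=1, value="x@example.com")}
replica_b = {"email": Cell(timestamp=1, value="x@example.com")}
delete(replica_b, "email", timestamp=2, now=1000)

merged = merge(replica_a, replica_b)
print(merged["email"].is_tombstone)                  # True: the delete survives
print(compact(merged, now=1010, grace_seconds=10))   # {}
```

Because the tombstone carries a newer timestamp than the stale value, the merge keeps the delete. The tombstone can only be dropped safely once every replica has seen it, which is why the sketch waits for a grace period before compaction removes it.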
Cassandra policies for maintaining data consistency