1. Background
Cassandra uses a distributed hash table (DHT) to determine which nodes store a data object. In the DHT, both the storage nodes and the data objects are assigned a token. Tokens can only take values within a certain range; for example, if MD5 is used, the range is [0, 2^128 - 1]. Storage nodes and objects are arranged into a ring by token value, with the largest token followed by the smallest: for MD5, the token after 2^128 - 1 is 0. Cassandra distributes data with the following algorithm:
First, each storage node is assigned a random token (this involves the data partitioning policy), which represents its position on the DHT ring;
Then, the user specifies a key (row key) for the data object; Cassandra hashes this key to obtain a token, which determines the object's position on the DHT ring;
Next, the data object is stored on the node whose token is the smallest token on the ring that is larger than the object's token;
Finally, the data object is replicated to another N-1 nodes according to the backup policy specified at configuration time (this involves the network topology policy), so that N copies of the object exist in the network.
Therefore, each storage node is at least responsible for the data objects located on the ring between itself and its predecessor node, and these objects are replicated to the same set of nodes. The region between any two points on the DHT ring is called a range, so each storage node stores the range between itself and its predecessor.
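The lookup described above can be sketched with the following Java code. This is a minimal illustration rather than Cassandra's implementation; the MD5 token computation and the TreeMap-based ring are assumptions made for the example:

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of the DHT ring lookup described above: tokens are MD5 values in
// [0, 2^128 - 1], and an object is stored on the first node at or after its token,
// wrapping around to the smallest token on the ring.
public class RingLookup {
    // token -> node name, kept sorted so the ring can be walked in token order
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    public void addNode(String nodeName, BigInteger token) {
        ring.put(token, nodeName);
    }

    public static BigInteger md5Token(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
        return new BigInteger(1, digest); // non-negative 128-bit token
    }

    public String primaryNodeFor(String key) throws Exception {
        BigInteger token = md5Token(key);
        Map.Entry<BigInteger, String> owner = ring.ceilingEntry(token);
        // no node with a larger token: wrap around to the smallest token on the ring
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }
}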
Because Cassandra replicates data in units of ranges, each node must periodically check the nodes that hold the same ranges it does to detect any inconsistency. This involves the data consistency policy.
In addition, Cassandra writes faster than it reads, which is a consequence of its storage policy.
This article summarizes the various policies used in Cassandra, including the data partitioning, replica placement (backup), network topology, scheduling, data consistency, and storage policies.
2. Data Partitioning Policy (Partitioner)
The partitioner determines which node stores each key/value pair. It assigns a token to each Cassandra node according to the chosen policy; a token is also computed for every key/value pair, which maps the pair to the corresponding node.
The following partitioners are provided:
org.apache.cassandra.dht.RandomPartitioner:
Distributes key/value pairs evenly across the nodes based on the MD5 hash of the key. Because keys are stored out of order, this partitioner cannot support range queries over keys (the contrast with ordered keys is sketched after this list).
org.apache.cassandra.dht.ByteOrderedPartitioner (BOP):
Stores key/value pairs across the nodes sorted by the raw bytes of the key. This partitioner allows data to be scanned in key order, but it may cause load imbalance.
org.apache.cassandra.dht.OrderPreservingPartitioner:
A deprecated predecessor of BOP that only supports keys that are UTF-8 encoded strings.
org.apache.cassandra.dht.CollatingOrderPreservingPartitioner:
Supports key ordering according to the en_US locale collation.
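The difference between hashed and byte-ordered tokens can be illustrated with the following Java sketch; the key values and helper names are ours, not Cassandra APIs:

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch contrasting the two token styles above: a hashed token scatters adjacent
// keys across the ring, so key-range scans are not supported, while a byte-ordered
// token preserves key order at the risk of load imbalance.
public class TokenStyles {
    static BigInteger hashedToken(String key) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, md5);
    }

    static BigInteger byteOrderedToken(String key) {
        return new BigInteger(1, key.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // "user001" and "user002" are adjacent under byte ordering,
        // but their MD5 tokens are effectively unrelated.
        System.out.println(hashedToken("user001"));
        System.out.println(hashedToken("user002"));
        System.out.println(byteOrderedToken("user001"));
        System.out.println(byteOrderedToken("user002"));
    }
}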
3. Backup Policy (Replica Placement Policy)
To ensure reliability, N copies of each piece of data are generally written. One copy is written to the node determined by the data partitioning policy; a backup policy is needed to decide where the remaining N-1 copies are stored.
SimpleStrategy (formerly called RackUnawareStrategy, corresponding to org.apache.cassandra.locator.RackUnawareStrategy):
Ignores data centers: starting from the position of the primary token, the next N nodes in ascending token order are taken as the replicas (a sketch of this rule follows this list).
OldNetworkTopologyStrategy (formerly called RackAwareStrategy, corresponding to org.apache.cassandra.locator.RackAwareStrategy):
Takes the data center into account: N-1 replicas are stored on different racks within the data center that owns the primary token, and one replica is stored on a node in another data center. This strategy is especially suited to multi-data-center deployments and improves reliability at the cost of performance (data latency).
NetworkTopologyStrategy (formerly called DatacenterShardStrategy, corresponding to org.apache.cassandra.locator.DatacenterShardStrategy):
Requires a replica placement properties file that defines the number of replicas in each data center; the replica counts of all data centers should add up to the keyspace's replication factor.
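As a concrete illustration of the SimpleStrategy rule, here is a minimal Java sketch; the ring representation is an assumption for the example, not Cassandra's code:

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeMap;

// Sketch of SimpleStrategy-style placement: starting from the node that owns the
// object's token, walk the ring in increasing token order and take N distinct
// nodes as replicas, ignoring data centers and racks.
public class SimpleReplicaPlacement {
    static List<String> replicasFor(TreeMap<BigInteger, String> ring, BigInteger objectToken, int n) {
        List<String> replicas = new ArrayList<>();
        // nodes at or after the object's token, then wrap around to the start of the ring
        Collection<String> tail = ring.tailMap(objectToken, true).values();
        Collection<String> head = ring.headMap(objectToken, false).values();
        for (String node : tail) {
            if (replicas.size() < n) replicas.add(node);
        }
        for (String node : head) {
            if (replicas.size() < n) replicas.add(node);
        }
        return replicas;
    }
}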
4. Network Topology Policy
This policy is mainly used to compute the relative distance between hosts, telling Cassandra about the network topology so that user requests can be routed more efficiently.
org.apache.cassandra.locator.SimpleSnitch:
Treats proximity on the Cassandra ring as the logical distance between hosts.
org.apache.cassandra.locator.RackInferringSnitch:
Relative distance is determined by rack and data center, which are inferred from the third and second octets of the node's IP address respectively: if two nodes' IP addresses share the first three octets, they are considered to be in the same rack (all nodes in the same rack are equidistant); if they share the first two octets, they are considered to be in the same data center (all nodes in the same data center are equidistant). A sketch of this inference follows this list.
org.apache.cassandra.locator.PropertyFileSnitch:
Relative distance is also determined by rack and data center, but they are configured explicitly in the cassandra-topology.properties file.
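The octet-based inference can be sketched in Java as follows; this is a toy illustration that simply compares octets, not Cassandra's snitch code:

// Following the description above: nodes sharing the first three octets of their
// IPv4 address are treated as being in the same rack, and nodes sharing the first
// two octets as being in the same data center.
public class OctetTopology {
    static boolean sameRack(String ipA, String ipB) {
        String[] a = ipA.split("\\."), b = ipB.split("\\.");
        return a[0].equals(b[0]) && a[1].equals(b[1]) && a[2].equals(b[2]);
    }

    static boolean sameDataCenter(String ipA, String ipB) {
        String[] a = ipA.split("\\."), b = ipB.split("\\.");
        return a[0].equals(b[0]) && a[1].equals(b[1]);
    }

    public static void main(String[] args) {
        System.out.println(sameRack("10.20.30.1", "10.20.30.7"));        // true: same rack
        System.out.println(sameDataCenter("10.20.30.1", "10.20.40.1"));  // true: same data center
        System.out.println(sameRack("10.20.30.1", "10.20.40.1"));        // false: different racks
    }
}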
5. Scheduling Policy
A policy used to schedule user requests to different nodes.
org.apache.cassandra.scheduler.NoScheduler: no scheduling is performed.
org.apache.cassandra.scheduler.RoundRobinScheduler: user requests with different request_scheduler_id values are placed into different queues on the node, and the queues are served by round-robin polling.
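A minimal Java sketch of this round-robin idea (our own illustration, not Cassandra's scheduler implementation):

import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Requests are grouped into per-id queues keyed by request_scheduler_id, and the
// scheduler serves one request from each non-empty queue per round.
public class RoundRobinSketch {
    private final Map<String, Queue<Runnable>> queues = new LinkedHashMap<>();

    public synchronized void submit(String schedulerId, Runnable request) {
        queues.computeIfAbsent(schedulerId, id -> new ArrayDeque<>()).add(request);
    }

    // Take at most one request from each queue per round.
    public synchronized void runOneRound() {
        for (Queue<Runnable> q : queues.values()) {
            Runnable r = q.poll();
            if (r != null) r.run();
        }
    }
}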
6. Consistency Policy
6.1 Consistency Level
Cassandra adopts eventual consistency. Eventual consistency means that, in a distributed system, the replicas of a data object may be inconsistent for a short period of time, but after a while they eventually converge to the same state.
One feature of Cassandra is that users can specify a consistency level for each read/insert/delete operation. The Cassandra API currently supports the following consistency levels:
ZERO: meaningful only for insert and delete operations. The coordinating node sends the modification to all replica nodes but does not wait for any acknowledgement, so no consistency is guaranteed.
ONE: for an insert or delete, the coordinating node ensures that the modification has been written to the commit log and memtable of at least one storage node; for a read, it returns the result as soon as it obtains the data from one storage node.
QUORUM: assuming the data object has N replicas, an insert or delete must be written to at least N/2 + 1 storage nodes; a read queries N/2 + 1 storage nodes and returns the data with the latest timestamp.
ALL: for an insert or delete, the coordinating node returns success to the client only after all N nodes (N being the replication factor) have acknowledged the write; if any node fails to respond, the operation fails. For a read, all N nodes are queried and the data with the latest timestamp is returned; likewise, if any node does not return data, the read fails.
Note: Cassandra's default read/write mode is W(QUORUM)/R(QUORUM). In fact, as long as W + R > N (where N is the number of replicas), the sets of nodes written and read overlap and the result is strongly consistent; if W + R <= N, the result is only weakly consistent. (W is the number of nodes written, R the number of nodes read.)
If the QUORUM level is selected for both reads and writes, every read sees the latest write. In addition, versions after Cassandra 0.6 support the ANY level for insert and delete operations, meaning the data only needs to be written to some storage node. Unlike ONE, ANY counts a write to a hinted handoff node as a success, whereas ONE requires the write to reach one of the final target nodes.
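A quick arithmetic check of the W + R > N rule, as a minimal Java sketch (the helper names are ours, not Cassandra's):

// N = number of replicas, W = nodes that must acknowledge a write,
// R = nodes consulted on a read.
public class QuorumMath {
    static int quorum(int n) { return n / 2 + 1; }

    static boolean stronglyConsistent(int w, int r, int n) { return w + r > n; }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);                                 // 3/2 + 1 = 2
        System.out.println(stronglyConsistent(q, q, n));   // QUORUM/QUORUM: 2 + 2 > 3 -> true
        System.out.println(stronglyConsistent(1, 1, n));   // ONE/ONE: 1 + 1 > 3 -> false
    }
}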
6.2 Maintaining Eventual Consistency
Cassandra maintains eventual consistency through four techniques: anti-entropy, read repair, hinted handoff, and distributed deletion.
(1) Anti-entropy
This is a synchronization mechanism between replicas: nodes periodically check the consistency of the data objects they share, using Merkle trees to detect inconsistencies;
(2) Read repair
When a client reads an object, a consistency check on that object is triggered;
Example:
When reading the data for key A, the system reads all replicas of key A; if any inconsistency is found, it is repaired.
If the read consistency level is ONE, the replica closest to the client is returned immediately and read repair is then executed in the background, which means the first read may not return the latest data. If the read consistency level is QUORUM, a copy is returned to the client once more than half of the replicas have been read and found consistent, while the consistency check and repair of the remaining nodes run in the background. If the read consistency level is the highest, ALL, a consistent copy is returned to the client only after read repair has completed. This mechanism helps shrink the time window of eventual consistency.
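The idea can be sketched as follows in Java; the Versioned and Replica types and the repair step are assumptions made for the example, not Cassandra classes:

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Read every replica, return the copy with the newest timestamp, and write that
// copy back to any replica holding a stale version.
public class ReadRepairSketch {
    record Versioned(String value, long timestamp) {}

    interface Replica {
        Versioned read(String key);
        void write(String key, Versioned v);
    }

    static Versioned readWithRepair(String key, List<Replica> replicas) {
        Map<Replica, Versioned> versions = new HashMap<>();
        for (Replica r : replicas) versions.put(r, r.read(key));

        // the copy with the latest timestamp wins
        Versioned newest = versions.values().stream()
                .max(Comparator.comparingLong(Versioned::timestamp))
                .orElseThrow();

        // repair stale replicas (Cassandra can do this in the background)
        versions.forEach((replica, v) -> {
            if (v.timestamp() < newest.timestamp()) replica.write(key, newest);
        });
        return newest;
    }
}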
(3) Hinted handoff
For a write operation, if one of the target nodes is offline, the object is first handed to another node together with a hint; when the target node comes back online, the relay node hands the object over to it;
Example:
Key A should be written to node N1 according to the placement rules and then replicated to N2. If N1 is down but writing to N2 satisfies the ConsistencyLevel requirement, the RowMutation for key A is wrapped with a header carrying hint information (including the identity of the target N1) and written to a randomly chosen node N3; that copy is not readable. Meanwhile, a copy of the data is written normally to N2, which is readable. If writing to N2 does not satisfy the write consistency requirement, the write fails. After N1 recovers, the data carrying N1's hint header is written back to N1.
(4) Distributed deletion
Deletion on a single host is simple: the data is removed from disk directly. For a distributed system, the difficulty is this: if replica node A of an object is offline while the other replica nodes delete the object, then when A comes back online it does not know the data has been deleted, and the object will be restored on the other replica nodes, making the deletion ineffective. Cassandra's solution is that a node does not delete a data object immediately; instead it marks the object with a deletion hint (a tombstone) and periodically garbage-collects objects carrying this mark. Until garbage collection happens, the mark remains, which gives the other nodes a chance to learn of it through the other consistency mechanisms described above. Cassandra thus cleverly solves the problem by turning a delete operation into an insert operation.
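A minimal Java sketch of the "delete as insert" idea; the Column structure and field names here are assumptions for illustration, not Cassandra's internal representation:

// Instead of removing data, a deletion marker with a timestamp is inserted; reads
// treat marked data as absent, and the marker is only purged during a later
// garbage-collection (compaction) pass.
public class DeleteAsInsert {
    static class Column {
        final String value;
        final long timestamp;
        final boolean deleted;   // the deletion mark (tombstone) flag
        Column(String value, long timestamp, boolean deleted) {
            this.value = value; this.timestamp = timestamp; this.deleted = deleted;
        }
    }

    // A delete is just another write that carries a newer timestamp and the deleted flag.
    static Column delete(long now) {
        return new Column(null, now, true);
    }

    // The newest version wins; if it is a deletion marker, the key is treated as absent.
    static Column resolve(Column a, Column b) {
        return (a.timestamp >= b.timestamp) ? a : b;
    }
}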
7. Storage Policy
Cassandra's storage mechanism borrows from the design of Bigtable and uses memtables and SSTables. Like a relational database, Cassandra records a log before writing data, called the commit log (database commit logs are classified into undo-log, redo-log, and undo-redo-log; because Cassandra uses timestamps to distinguish new data from old and never overwrites existing data, it needs no undo operation, so its commit log is a redo log). The data is then written to the memtable of the corresponding column family, where it is kept sorted by key. The memtable is an in-memory structure; once certain conditions are met it is flushed to disk in a batch and stored as an SSTable. This mechanism works like a write-back cache: its advantage is that random write I/O is turned into sequential write I/O, reducing the pressure that heavy write loads put on the storage system. Once written, an SSTable is immutable and can only be read; the next memtable flush goes to a new SSTable file. For Cassandra, therefore, writes can be considered purely sequential; there are no random writes.
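The write path can be sketched as follows; this is a minimal Java illustration under simplifying assumptions (plain text files, a size-based flush threshold), not Cassandra's actual implementation:

import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Append the mutation to a commit log, apply it to an in-memory memtable sorted by
// key, and flush the memtable sequentially to an immutable SSTable-like file once
// it grows past a threshold.
public class WritePathSketch {
    private final FileWriter commitLog;
    private SortedMap<String, String> memtable = new TreeMap<>();
    private final int flushThreshold;
    private int sstableCount = 0;

    public WritePathSketch(String commitLogPath, int flushThreshold) throws IOException {
        this.commitLog = new FileWriter(commitLogPath, true);
        this.flushThreshold = flushThreshold;
    }

    public void write(String key, String value) throws IOException {
        commitLog.write(key + "=" + value + "\n");   // 1. log first (redo log)
        commitLog.flush();
        memtable.put(key, value);                    // 2. then update the sorted memtable
        if (memtable.size() >= flushThreshold) flushToSSTable();
    }

    private void flushToSSTable() throws IOException {
        // 3. sequentially write the sorted contents to a new, immutable file
        try (FileWriter sstable = new FileWriter("sstable-" + (sstableCount++) + ".data")) {
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                sstable.write(e.getKey() + "=" + e.getValue() + "\n");
            }
        }
        memtable = new TreeMap<>();                  // start a fresh memtable
    }
}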
Because SSTables cannot be modified, a column family usually corresponds to multiple SSTables. If every SSTable had to be scanned on each read, the workload would grow considerably, so Cassandra uses Bloom filters to cut down unnecessary SSTable scans: several hash functions map each key into a bitmap, which makes it possible to determine quickly whether a given SSTable can contain the key.
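A minimal Bloom filter sketch in Java (the two hash functions here are arbitrary choices for illustration, not the ones Cassandra uses):

import java.util.BitSet;

// A Bloom filter may report false positives but never false negatives, so a
// "false" answer means the SSTable can safely be skipped.
public class BloomFilterSketch {
    private final BitSet bits;
    private final int size;

    public BloomFilterSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int[] positions(String key) {
        int h1 = Math.floorMod(key.hashCode(), size);
        int h2 = Math.floorMod(key.hashCode() * 31 + 17, size);
        return new int[] { h1, h2 };
    }

    public void add(String key) {
        for (int p : positions(key)) bits.set(p);
    }

    public boolean mightContain(String key) {
        for (int p : positions(key)) {
            if (!bits.get(p)) return false;  // definitely not in this SSTable
        }
        return true;                         // possibly present, must read the SSTable
    }
}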
To reduce SSTable overhead, Cassandra periodically performs compaction. Simply put, compaction merges multiple SSTables of the same column family into one. In Cassandra, compaction mainly accomplishes the following tasks:
(1) Garbage collection: Cassandra does not delete data in place, so disk usage keeps growing; compaction removes the data that has been marked as deleted;
(2) Merging SSTables: compaction combines multiple SSTable files into one (the merged files include the index files, data files, and Bloom filter files) to improve the efficiency of read operations;
(3) Generating a Merkle tree: during the merge, a Merkle tree of the data in the column family is generated, which is used for comparison with other storage nodes and for data repair.