Various strategies in Cassandra




http://dongxicheng.org/nosql/cassandra-strategy/


1. Background information

Cassandra uses a distributed hash table (DHT) to determine which node stores a data object. In a DHT, both the storage nodes and the data objects are assigned a token. Tokens take values within a fixed range; for example, if MD5 is used as the token, the range is [0, 2^128-1]. Storage nodes and objects are arranged in a ring ordered by token, with the largest token followed by the smallest: for MD5, the token after 2^128-1 is 0. Cassandra distributes data with the following algorithm:

First, each storage node is assigned a random token (this involves the data partitioning strategy), which represents its position on the DHT ring;

Next, the user assigns a key (the row key) to the data object; Cassandra computes a hash of that key as the object's token, which determines the object's position on the DHT ring;

Finally, the data object is stored by the node whose token is the smallest token on the ring that is larger than the object's token;

The data object is also backed up to N-1 other nodes according to the replication strategy the user specifies at configuration time (this involves the network topology strategy), so there are N copies of the object in the network in total.

Therefore, each storage node is responsible, at a minimum, for the data objects on the ring that fall between it and its predecessor node, and these objects are also replicated to other nodes. We call the area between any two points on the DHT ring a range, so each storage node stores the range between itself and its predecessor.
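
To make the placement algorithm above concrete, the following is a minimal sketch, in Java, of deriving an MD5 token from a row key and finding the successor node on the ring. It is illustrative only, not Cassandra's code; the node names are invented, and deriving each node's token from its name simply stands in for the random token assignment described above.

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.Map;
    import java.util.TreeMap;

    public class TokenRingSketch {
        // Token of a row key: its MD5 digest interpreted as a non-negative integer in [0, 2^128 - 1].
        static BigInteger token(String key) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
            return new BigInteger(1, digest);
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical nodes placed on the ring; the token is the map key, sorted ascending.
            TreeMap<BigInteger, String> ring = new TreeMap<>();
            for (String node : new String[]{"node-A", "node-B", "node-C"}) {
                ring.put(token(node), node);
            }

            BigInteger keyToken = token("row-key-42");
            // The object is stored by the node with the smallest token larger than the key's token;
            // if no such node exists, wrap around to the node with the smallest token on the ring.
            Map.Entry<BigInteger, String> successor = ring.higherEntry(keyToken);
            String primary = (successor != null ? successor : ring.firstEntry()).getValue();
            System.out.println("token=" + keyToken + " -> primary node=" + primary);
        }
    }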

Because Cassandra replicates data by range, each node must periodically compare the ranges it holds with the other nodes that hold the same ranges and check for inconsistencies; this involves the data consistency strategy.

In addition, one of Cassandra's characteristics is that writes are faster than reads, thanks to its storage strategy.

This article summarizes the various strategies used in Cassandra, including the data partitioning strategy, data backup strategy, network topology strategy, data consistency strategy, and storage strategy.

2. Data partitioning strategy (Partitioner)

The partitioner decides which node stores a key/value pair, based on the key. The partitioner assigns a token to each Cassandra node according to some strategy, and each key/value pair is assigned to the corresponding node after a calculation on the key.

The following distribution strategies are available:

org.apache.cassandra.dht.RandomPartitioner:

Key/value pairs are spread evenly across the nodes according to the MD5 hash of the key. Because the keys end up unordered, this strategy cannot support range queries over keys (see the sketch following this list of partitioners).

org.apache.cassandra.dht.ByteOrderedPartitioner (BOP):

Key/value pairs are stored on the nodes sorted by the raw bytes of the key. This partitioner allows the user to scan data in key order, but it can cause load imbalance.

org.apache.cassandra.dht.OrderPreservingPartitioner:

This strategy is an obsolete variant of BOP that supports only UTF-8 encoded string keys.

org.apache.cassandra.dht.CollatingOrderPreservingPartitioner:

This strategy sorts keys according to the en_US locale collation.
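
As a small illustration of the difference between the first two partitioners, the sketch below computes an MD5-style token and a byte-order-style token for a few keys of equal length; only the byte-order tokens preserve the key order, which is why range scans over keys work with BOP but not with RandomPartitioner. The keys are made up and this is not Cassandra code.

    import java.math.BigInteger;
    import java.security.MessageDigest;

    public class PartitionerOrderSketch {
        public static void main(String[] args) throws Exception {
            String[] keys = {"user:0001", "user:0002", "user:0003"};
            for (String k : keys) {
                byte[] md5 = MessageDigest.getInstance("MD5").digest(k.getBytes("UTF-8"));
                BigInteger randomToken = new BigInteger(1, md5);                    // RandomPartitioner-style token
                BigInteger byteOrderToken = new BigInteger(1, k.getBytes("UTF-8")); // BOP-style token (raw key bytes)
                System.out.println(k + "  md5Token=" + randomToken + "  byteToken=" + byteOrderToken);
            }
            // The byteToken values increase with the keys (the keys here have equal length),
            // so contiguous key ranges map to contiguous token ranges; the md5Token values do not.
        }
    }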

3. Backup strategy (replica placement strategy)

To ensure reliability, N copies of the data are generally written. One copy is written to the node determined by the data partitioning strategy; where the remaining N-1 copies are stored is decided by the backup (replica placement) strategy.

SimpleStrategy (formerly RackUnawareStrategy, corresponding to org.apache.cassandra.locator.RackUnawareStrategy):

Data centers are not taken into account: starting from the node that owns the primary token, N nodes are taken in order of increasing token around the ring and a replica is placed on each.

OldNetworkTopologyStrategy (formerly RackAwareStrategy, corresponding to org.apache.cassandra.locator.RackAwareStrategy):

Data centers are taken into account: N-1 replicas are stored on different racks in the data center where the primary token resides, and one replica is stored on a node in a different data center. This strategy is particularly suitable for multi-data-center deployments and improves system reliability at the cost of some performance (data latency).

NetworkTopologyStrategy (formerly DataCenterShardStrategy, corresponding to org.apache.cassandra.locator.DataCenterShardStrategy):

This strategy requires a properties file that defines the number of replicas in each data center; the per-data-center counts should sum to the keyspace's replication factor.
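
The sketch below illustrates the idea just described: walk the ring from the key's token and pick nodes until every data center has its configured number of replicas. It is a simplified sketch using assumed node names, tokens, and per-data-center counts, not the actual DataCenterShardStrategy implementation (which also considers racks).

    import java.math.BigInteger;
    import java.util.*;

    public class PerDcReplicaSketch {
        // Walk the ring starting at the key's token and place replicas until each
        // data center has received its configured share.
        static List<String> replicasFor(BigInteger keyToken,
                                        TreeMap<BigInteger, String> ring,
                                        Map<String, String> nodeDc,
                                        Map<String, Integer> wantedPerDc) {
            Map<String, Integer> placed = new HashMap<>();
            List<String> replicas = new ArrayList<>();
            List<String> order = new ArrayList<>(ring.tailMap(keyToken).values()); // from keyToken onward...
            order.addAll(ring.headMap(keyToken).values());                         // ...then wrap around
            for (String node : order) {
                String dc = nodeDc.get(node);
                int have = placed.getOrDefault(dc, 0);
                if (have < wantedPerDc.getOrDefault(dc, 0)) {
                    replicas.add(node);
                    placed.put(dc, have + 1);
                }
            }
            return replicas;
        }

        public static void main(String[] args) {
            TreeMap<BigInteger, String> ring = new TreeMap<>();
            Map<String, String> nodeDc = new HashMap<>();
            String[][] nodes = {{"10", "n1", "DC1"}, {"30", "n2", "DC2"}, {"50", "n3", "DC1"}, {"70", "n4", "DC2"}};
            for (String[] n : nodes) { ring.put(new BigInteger(n[0]), n[1]); nodeDc.put(n[1], n[2]); }
            Map<String, Integer> wanted = Map.of("DC1", 2, "DC2", 1); // per-DC counts summing to replication factor 3
            System.out.println(replicasFor(BigInteger.valueOf(40), ring, nodeDc, wanted)); // [n3, n4, n1]
        }
    }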

4. Network topology strategy (Snitch)

This strategy is primarily used to calculate the relative distances between hosts, which in turn tells Cassandra about your network topology so that user requests can be routed more efficiently.

org.apache.cassandra.locator.SimpleSnitch:

The logical distance between hosts on the Cassandra ring is used as their relative distance.

org.apache.cassandra.locator.RackInferringSnitch:

The relative distance is determined by rack and data center, which correspond to the third and second octets of the IP address respectively. That is, if the first three octets of two nodes' IP addresses are the same, they are considered to be in the same rack; if the first two octets are the same, they are considered to be in the same data center (a sketch of this octet comparison follows the snitch list below).

org.apache.cassandra.locator.PropertyFileSnitch:

The relative distances are determined by rack and data center, and they are set in the configuration file cassandra-topology.properties.
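
The following sketch shows the octet comparison described for RackInferringSnitch. It is only an illustration of the rule stated above, with made-up IP addresses, not the snitch's actual code.

    public class OctetSnitchSketch {
        // Same first three octets => same rack; same first two octets => same data center.
        static String relativeLocation(String ipA, String ipB) {
            String[] a = ipA.split("\\.");
            String[] b = ipB.split("\\.");
            if (a[0].equals(b[0]) && a[1].equals(b[1]) && a[2].equals(b[2])) return "same rack";
            if (a[0].equals(b[0]) && a[1].equals(b[1])) return "same data center";
            return "different data centers";
        }

        public static void main(String[] args) {
            System.out.println(relativeLocation("192.168.1.10", "192.168.1.20")); // same rack
            System.out.println(relativeLocation("192.168.1.10", "192.168.2.20")); // same data center
            System.out.println(relativeLocation("192.168.1.10", "10.0.0.1"));     // different data centers
        }
    }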

5. Scheduling Policy

The strategy that determines how user requests are dispatched to queues on a node.

org.apache.cassandra.scheduler.NoScheduler: no scheduler.

org.apache.cassandra.scheduler.RoundRobinScheduler: requests with different request_scheduler_id values are placed into different queues on the node and served in round-robin fashion.
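
A rough sketch of the round-robin idea follows: one queue per request_scheduler_id, polled in turn, serving one request from each non-empty queue per round. The ids and requests are invented; this is not the actual RoundRobinScheduler code.

    import java.util.*;

    public class RoundRobinSketch {
        public static void main(String[] args) {
            // One queue per request_scheduler_id (for example, per keyspace).
            Map<String, Queue<String>> queues = new LinkedHashMap<>();
            queues.put("ks1", new ArrayDeque<>(List.of("ks1-req1", "ks1-req2", "ks1-req3")));
            queues.put("ks2", new ArrayDeque<>(List.of("ks2-req1")));

            // Poll the queues in turn, serving one request from each non-empty queue per round.
            boolean served = true;
            while (served) {
                served = false;
                for (Queue<String> q : queues.values()) {
                    String req = q.poll();
                    if (req != null) {
                        System.out.println("serving " + req);
                        served = true;
                    }
                }
            }
        }
    }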

6. Consistency Policy

6.1 Consistency levels

Cassandra uses eventual consistency: multiple replicas of a data object in a distributed system may be inconsistent for a short period of time, but after some time the replicas eventually converge to a consistent state.

One feature of Cassandra is that it allows the user to specify a consistency level for each read/insert/delete operation. The Cassandra API currently supports the following consistency levels:

ZERO: only meaningful for insert or delete operations. The node responsible for the operation sends the modification to all replica nodes but does not wait for any node to acknowledge it, so no consistency is guaranteed.

ONE: for an insert or delete operation, the coordinating node guarantees that the modification has been written to the commit log and memtable of at least one storage node; for a read operation, it returns the result as soon as it has obtained the data from one storage node.

QUORUM: assume the data object has N replicas. For insert or delete operations, the write is guaranteed to reach at least N/2+1 storage nodes; for read operations, N/2+1 storage nodes are queried and the data with the latest timestamp is returned.

ALL: for an insert or delete operation, the coordinating node guarantees that all N nodes (N is the replication factor) have performed the insert or delete before returning a success acknowledgement to the client; if any node fails, the operation fails. For a read operation, all N nodes are queried and the data with the latest timestamp is returned; likewise, if any node fails to return data, the read is considered to have failed.

Note: Cassandra's default read/write mode is W(QUORUM)/R(QUORUM). In fact, as long as W + R > N (where N is the number of replicas, W the number of nodes written, and R the number of nodes read), the write set and read set overlap and the system is strongly consistent; if W + R <= N, consistency is weak.

If the user chooses the QUORUM level for both reads and writes, every read is guaranteed to see the most recent change. In addition, Cassandra 0.6 supports an ANY level for insert and delete operations, which guarantees only that the data has been written to some storage node. Unlike ONE, ANY counts a write to a hinted-handoff node as a success, whereas ONE requires the write to reach the final destination node.
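
As a tiny illustration of the W + R > N rule above, the sketch below checks whether a given read/write combination forces the read set and write set to overlap; it is a toy calculation, not Cassandra code.

    public class QuorumOverlapSketch {
        // Strong consistency requires the write set and read set to overlap: W + R > N.
        static boolean isStrong(int w, int r, int n) {
            return w + r > n;
        }

        public static void main(String[] args) {
            int n = 3;                  // number of replicas
            int quorum = n / 2 + 1;     // quorum = 2 for N = 3
            System.out.println(isStrong(quorum, quorum, n)); // QUORUM/QUORUM: 2 + 2 > 3 -> true (strong)
            System.out.println(isStrong(1, 1, n));           // ONE/ONE: 1 + 1 > 3 -> false (weak)
        }
    }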

6.2 Maintaining eventual consistency

Cassandra uses four techniques to maintain eventual consistency of the data: anti-entropy, read repair, hinted handoff, and distributed deletion.

(1) Anti-entropy

This is a synchronization mechanism between replicas: nodes periodically check the consistency of their data objects with each other, using Merkle trees to detect inconsistencies;

(2) Read Repair

When the client reads an object, it triggers a consistency check on the object;

Example:

When key A is read, the system reads all replicas of key A's data, and if inconsistencies are found, they are repaired.

If the read consistency level is ONE, one replica is returned to the client immediately and read repair then runs in the background; this means the data returned by the first read may not be the latest. If the read consistency level is QUORUM, a result is returned to the client after more than half of the replicas have been read and found consistent, and the consistency check and repair of the remaining nodes continues in the background. If the read consistency level is ALL, a result is returned to the client only after all replicas have been read and any repairs completed. Clearly, this mechanism helps shrink the window of eventual consistency.
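
The following sketch captures the read-repair idea in the example above: collect the replicas' versions, return the one with the newest timestamp, and overwrite any replica that holds older data. Node names, timestamps, and values are made up; this is not Cassandra's implementation.

    import java.util.*;

    public class ReadRepairSketch {
        record Version(long timestamp, String value) {}

        public static void main(String[] args) {
            // Hypothetical replica states for key A on three nodes.
            Map<String, Version> replicas = new HashMap<>();
            replicas.put("N1", new Version(100, "old"));
            replicas.put("N2", new Version(200, "new"));
            replicas.put("N3", new Version(200, "new"));

            // Return the version with the latest timestamp to the client.
            Version latest = replicas.values().stream()
                    .max(Comparator.comparingLong(Version::timestamp)).orElseThrow();
            System.out.println("return to client: " + latest.value());

            // Repair: overwrite any replica that is older than the latest version.
            replicas.replaceAll((node, v) -> v.timestamp() < latest.timestamp() ? latest : v);
            System.out.println("after repair: " + replicas);
        }
    }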

(3) Hinted handoff

For a write operation, if one of the target nodes is offline, the object is first handed to a relay node; when the target node comes back online, the relay node hands the object over to it;

Example:

According to the placement rules, key A's primary write node is N1, and the data is also replicated to N2. If N1 is down and writing to N2 still satisfies the ConsistencyLevel requirement, the RowMutation for key A is wrapped with a header carrying hint information (recording that the data is destined for N1) and written to some other node N3; this hinted copy is not readable. Meanwhile, the data is copied to N2 as normal, and that copy can serve reads. If writing to N2 does not satisfy the write consistency requirement, the write fails. When N1 recovers, the data with the hint header that should originally have been written to N1 is written back to N1.
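
A simplified sketch of this flow follows: while N1 is down, the mutation is parked on another node together with a hint naming N1, and it is replayed to N1 once N1 comes back. Node names and values are made up; this is not the actual hinted-handoff code.

    import java.util.*;

    public class HintedHandoffSketch {
        record HintedMutation(String intendedTarget, String rowKey, String value) {}

        public static void main(String[] args) {
            boolean n1Up = false;
            List<HintedMutation> hintsOnN3 = new ArrayList<>(); // hinted copies parked on N3 (not readable)
            Map<String, String> n1 = new HashMap<>();           // N1's local data
            Map<String, String> n2 = new HashMap<>();           // N2's local data

            // Write key A: N1 is down, so N2 takes a normal (readable) copy
            // and N3 keeps a copy carrying a hint that names N1.
            n2.put("A", "v1");
            if (!n1Up) {
                hintsOnN3.add(new HintedMutation("N1", "A", "v1"));
            }

            // N1 comes back: replay the hints that were intended for it.
            n1Up = true;
            for (HintedMutation h : hintsOnN3) {
                if (h.intendedTarget().equals("N1")) {
                    n1.put(h.rowKey(), h.value());
                }
            }
            System.out.println("N1 after replay: " + n1 + ", N2: " + n2);
        }
    }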

(4) Distributed deletion

Deleting data on a single machine is simple: the data can be removed from disk directly. Distributed deletion is harder. The difficulty is this: if one replica node of an object is offline while the other replica nodes delete the object, then when that node comes back online it does not know the data has been deleted, so it will try to restore the object on the other replica nodes, which defeats the delete operation. Cassandra's solution is not to delete a data object immediately, but to mark it with a deletion marker and periodically garbage-collect the objects that carry the marker. The marker exists until garbage collection, which gives other nodes the opportunity to learn about the deletion through the other consistency mechanisms. Cassandra neatly solves the problem by turning the delete operation into an insert operation.
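
The sketch below illustrates "delete as insert": a delete writes a timestamped deletion marker (tombstone) instead of removing the row, reads treat the marker as "not found", and a later garbage-collection pass finally drops markers older than a grace period. The timestamps and the grace period are made-up values; this is only a toy model of the mechanism.

    import java.util.*;

    public class TombstoneSketch {
        record Cell(long timestamp, String value, boolean tombstone) {}

        public static void main(String[] args) {
            Map<String, Cell> store = new HashMap<>();
            store.put("A", new Cell(100, "v1", false));

            // Distributed delete of key A: insert a tombstone rather than erasing the data,
            // so replicas that missed the delete can still learn about it later.
            store.put("A", new Cell(200, null, true));

            Cell c = store.get("A");
            System.out.println(c.tombstone() ? "A is deleted" : "A = " + c.value());

            // Periodic garbage collection finally removes tombstones older than a grace period.
            long now = 100_000, gcGrace = 10_000;
            store.values().removeIf(cell -> cell.tombstone() && now - cell.timestamp() > gcGrace);
            System.out.println("after garbage collection: " + store);
        }
    }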

7. Storage strategy

Cassandra's storage mechanism borrows from BigTable's design, using memtables and SSTables. Like a relational database, Cassandra records a log before writing data, called the commit log. (Database commit logs can be classified as undo logs, redo logs, or undo-redo logs. Because Cassandra uses timestamps to distinguish old and new data and never overwrites existing data, it does not need undo operations, so its commit log is a redo log.) The data is then written to the memtable of the corresponding column family, where it is kept sorted by key. A memtable is an in-memory structure; once certain conditions are met, it is flushed to disk in a batch and stored as an SSTable. This mechanism is equivalent to a write-back cache: it turns random writes into sequential writes, reducing the pressure that a large volume of writes puts on the storage system. Once an SSTable has been written it cannot be changed, only read; the next memtable flush produces a new SSTable file. So for Cassandra you can think of there being only sequential writes and no random writes.
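
The write path described above can be summarized in a few lines. The sketch below appends to a commit log, updates a key-sorted in-memory memtable, and flushes the memtable into an immutable "SSTable" once it grows past a threshold; the threshold and data are invented, and the whole thing is only a toy model of the mechanism.

    import java.util.*;

    public class WritePathSketch {
        static final int FLUSH_THRESHOLD = 3;
        static List<String> commitLog = new ArrayList<>();                    // redo log, for crash recovery
        static TreeMap<String, String> memtable = new TreeMap<>();            // in memory, sorted by key
        static List<SortedMap<String, String>> sstables = new ArrayList<>();  // immutable once written

        static void write(String key, String value) {
            commitLog.add(key + "=" + value);          // 1. log the mutation first
            memtable.put(key, value);                  // 2. apply it to the memtable
            if (memtable.size() >= FLUSH_THRESHOLD) {  // 3. flush: one sequential write of a new SSTable
                sstables.add(Collections.unmodifiableSortedMap(new TreeMap<>(memtable)));
                memtable = new TreeMap<>();
            }
        }

        public static void main(String[] args) {
            write("b", "2"); write("a", "1"); write("c", "3"); write("d", "4");
            System.out.println("sstables=" + sstables + " memtable=" + memtable);
        }
    }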

SSTables are immutable, and in general a column family may correspond to multiple SSTables, so scanning every SSTable on each read would greatly increase the workload. To avoid unnecessary SSTable scans, Cassandra uses a Bloom filter: the key is mapped by multiple hash functions into a bitmap, which allows a quick check of whether the key may be present in a given SSTable.
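
A toy Bloom filter is sketched below to show the idea: several hash functions map a key into a bitmap, a "false" answer means the key is definitely not in the SSTable, and a "true" answer means it might be. The hash functions and sizes here are arbitrary choices for illustration, not the ones Cassandra uses.

    import java.util.BitSet;

    public class BloomFilterSketch {
        private final BitSet bits = new BitSet(1024);

        // Three bit positions per key, derived from two simple hashes (toy choices).
        private int[] positions(String key) {
            int h1 = key.hashCode();
            int h2 = h1 * 31 + 17;
            return new int[]{Math.floorMod(h1, 1024), Math.floorMod(h2, 1024), Math.floorMod(h1 ^ h2, 1024)};
        }

        void add(String key) {
            for (int p : positions(key)) bits.set(p);
        }

        boolean mightContain(String key) {
            for (int p : positions(key)) if (!bits.get(p)) return false;
            return true;
        }

        public static void main(String[] args) {
            BloomFilterSketch filter = new BloomFilterSketch();
            filter.add("row-1");
            System.out.println(filter.mightContain("row-1"));   // true
            System.out.println(filter.mightContain("row-999")); // almost certainly false: skip this SSTable
        }
    }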

To reduce the cost of having a large number of SSTables, Cassandra periodically performs compaction. Simply put, compaction merges multiple SSTables of the same column family into one. In Cassandra, compaction's main tasks are:

(1) Garbage collection: Cassandra does not delete data directly, so disk usage keeps growing; compaction actually removes the data that has been marked as deleted;

(2) Merging SSTables: compaction merges multiple SSTables into one (the merged output includes the index file, data file, and Bloom filter file) to improve the efficiency of read operations (a sketch of this merging follows the list);

(3) Generating a Merkle tree: during the merge, a Merkle tree of the column family's data is generated, which is used to compare and repair data with other storage nodes.
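
The merging and garbage-collection parts of compaction can be sketched as follows: several key-sorted SSTables of one column family are combined, only the newest version of each key is kept, and rows whose newest version is a tombstone are dropped. The data is invented, and index, Bloom filter, and Merkle tree generation are omitted.

    import java.util.*;

    public class CompactionSketch {
        record Cell(long timestamp, String value, boolean tombstone) {}

        public static void main(String[] args) {
            // Two SSTables of the same column family, each sorted by key.
            SortedMap<String, Cell> sst1 = new TreeMap<>(Map.of(
                    "a", new Cell(100, "v1", false), "b", new Cell(100, "v1", false)));
            SortedMap<String, Cell> sst2 = new TreeMap<>(Map.of(
                    "a", new Cell(200, "v2", false), "b", new Cell(300, null, true)));
            List<SortedMap<String, Cell>> sstables = List.of(sst1, sst2);

            // Merge: keep only the newest version of each key.
            SortedMap<String, Cell> merged = new TreeMap<>();
            for (SortedMap<String, Cell> table : sstables) {
                for (Map.Entry<String, Cell> e : table.entrySet()) {
                    merged.merge(e.getKey(), e.getValue(),
                            (oldCell, newCell) -> newCell.timestamp() >= oldCell.timestamp() ? newCell : oldCell);
                }
            }

            // Garbage collection: drop keys whose newest version is a tombstone.
            merged.values().removeIf(Cell::tombstone);
            System.out.println(merged); // only key "a" survives, with its newest value
        }
    }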

8. References

(1) "Cassandra consistency": http://www.thoss.org.cn/mediawiki/index.php/Cassandra_consistency

(2) "Cassandra data storage structure and data read and write": http://www.oschina.net/question/12_11855

(3) "Cassandra storage mechanism": http://www.ningoo.net/html/2010/cassandra_storage.html

(4) "Cassandra v.s. HBase": http://blog.sina.com.cn/s/blog_633f4ab20100r9nm.html

Original article; when reproducing, please credit: reproduced from Dong's blog.

Article link: http://dongxicheng.org/nosql/cassandra-strategy/

Author: Dong. About the author: http://dongxicheng.org/about/

This blog's article collections: http://dongxicheng.org/recommend/
