About Data Partitioning in Cassandra
Data Partitioning in Cassandra
Original
When you start a Cassandra cluster, you must choose how the data will be divided across the nodes in the cluster. This is done by choosing a partitioner for the cluster.
Translation
When you start a Cassandra cluster, you must choose how the data is distributed among the nodes. The cluster's data distribution is determined by selecting a "partitioner".
Original
In Cassandra, the total data managed by the cluster is represented as a circular space or ring. The ring is divided up into ranges equal to the number of nodes, with each node being responsible for one or more ranges of the overall data. Before a node can join the ring, it must be assigned a token. The token determines the node's position on the ring and the range of data it is responsible for.
Translation
In Cassandra, all the data managed by the cluster is represented as a ring. The ring is divided into ranges equal to the number of nodes, and each node is responsible for one or more ranges of the overall data. Before a node joins the ring, it must be assigned a token, which determines the node's position on the ring and the range of data it is responsible for.
Original
Column family data is partitioned across the nodes based on the row key. To determine the node where the first replica of a row will live, the ring is walked clockwise until it locates the node with a token value greater than that of the row key. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). With the nodes sorted in token order, the last node is considered the predecessor of the first node; hence the ring representation.
Translation
Column family data is distributed to the nodes based on the row key. To determine the node where the first replica of a row will live, the ring is walked clockwise until a node is found whose token value is greater than that of the row key. Each node is responsible for the region of the ring between itself (inclusive) and its predecessor (exclusive). With the nodes sorted in token order, the last node is considered the predecessor of the first node, so the nodes form a ring.
Original
For example, consider a simple four-node cluster where all of the row keys managed by the cluster are numbers in the range of 0 to 100. Each node is assigned a token that represents a point in this range. In this simple example, the token values are 0, 25, 50, and 75. The first node, the one with token 0, is responsible for the wrapping range (75-0). The node with the lowest token also accepts row keys less than the lowest token and greater than the highest token.
Translation
For example, consider a simple four-node cluster where all of the row keys managed by the cluster are numbers in the range 0 to 100. Each node is assigned a token that represents a point in this range; in this simple example, the token values are 0, 25, 50, and 75. The first node, the one with token 0, is responsible for the wrapping range (75-0). The node with the lowest token also accepts row keys less than the lowest token and greater than the highest token.
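A minimal Python sketch of this example may make the lookup concrete. The node names are invented for the illustration; the lookup walks the ring clockwise to the first token at or above the row key, and wraps around to the lowest-token node for keys above the highest token.

from bisect import bisect_left

# The four-node example ring: token -> node.
ring = {0: "node1", 25: "node2", 50: "node3", 75: "node4"}
tokens = sorted(ring)

def find_node(row_key):
    """Walk clockwise to the first node whose token is >= the row key."""
    i = bisect_left(tokens, row_key)
    if i == len(tokens):      # key is above the highest token: wrap around
        i = 0
    return ring[tokens[i]]

print(find_node(18))   # node2 (token 25)
print(find_node(90))   # node1 (the wrapping range 75-0)
print(find_node(0))    # node1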
About Partitioning in Multi-Data Center Clusters
Data Partitioning in a Multi-Data Center Cluster
Original
In multi-data center deployments, replica placement is calculated per data center when using the NetworkTopologyStrategy replica placement strategy. In each data center (or replication group), the first replica for a particular row is determined by the token value assigned to a node. Additional replicas in the same data center are placed by walking the ring clockwise until it reaches the first node in another rack.
Translation
In a multi-data center deployment, when the NetworkTopologyStrategy replica placement strategy is used, replica placement is calculated per data center. In each data center (or replication group), the first replica of a row is determined by the token value assigned to a node. Additional replicas in the same data center are placed by walking the ring clockwise until the first node in another rack is reached.
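The Python sketch below shows the idea; the ring layout and helper are made up for this illustration and simplify what NetworkTopologyStrategy actually does. Within one data center, it starts at the node that owns the row's token and walks the ring clockwise, only taking nodes whose rack has not been used yet, until the desired number of replicas is reached.

# One data center, six nodes spread over three racks: token -> (node, rack).
ring = {
    0:  ("n1", "rack1"), 17: ("n2", "rack2"), 34: ("n3", "rack1"),
    51: ("n4", "rack3"), 68: ("n5", "rack2"), 85: ("n6", "rack3"),
}
tokens = sorted(ring)

def place_replicas(primary_token, replication_factor):
    """Primary replica on the owning node, then walk clockwise to new racks."""
    start = tokens.index(primary_token)
    replicas, racks_used = [], set()
    for step in range(len(tokens)):
        node, rack = ring[tokens[(start + step) % len(tokens)]]
        if step == 0 or rack not in racks_used:
            replicas.append(node)
            racks_used.add(rack)
        if len(replicas) == replication_factor:
            break
    return replicas

print(place_replicas(17, 3))   # ['n2', 'n3', 'n4'] -- one replica per rack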
Original
If you do not calculate partitioner tokens so that the data ranges are evenly distributed for each data center, you could end up with uneven data distribution within a data center.
Translation
If you do not calculate partitioner tokens so that the data ranges are evenly distributed for each data center, the data distribution within a data center may end up uneven.
Original
The goal is to ensure that the nodes for each data center have token assignments that evenly divide the overall range. Otherwise, you can end up with nodes in each data center that own a disproportionate number of row keys. Each data center should be partitioned as if it were its own distinct ring; however, token assignments within the entire cluster cannot conflict with each other (each node must have a unique token). See Calculating Tokens for a Multi-Data Center Cluster for strategies on how to generate tokens for multi-data center clusters.
Translation
The goal is to ensure that the nodes in each data center are assigned tokens that evenly divide the overall range; otherwise, the nodes in each data center may own a disproportionate number of row keys. Each data center should be partitioned as if it were its own distinct ring, but the token assignments within the entire cluster must not conflict with each other (each node must have a unique token). See the Calculating Tokens for a Multi-Data Center Cluster section for strategies on how to generate tokens for multi-data center clusters.
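As an illustration of one such strategy (a sketch only; the referenced section describes the recommended approaches), you could compute evenly spaced tokens for each data center independently and then shift each data center's tokens by a small offset so that no two nodes in the cluster end up with the same token:

RING_SIZE = 2 ** 127   # hash range used by the random partitioner

def dc_tokens(nodes_in_dc, dc_index, offset=100):
    """Evenly spaced tokens for one data center, shifted per DC to stay unique."""
    return [(i * RING_SIZE // nodes_in_dc + dc_index * offset) % RING_SIZE
            for i in range(nodes_in_dc)]

# Two data centers of three nodes each: each DC evenly divides the full range,
# and the second DC's tokens are offset by 100 so no token is duplicated.
print(dc_tokens(3, dc_index=0))
print(dc_tokens(3, dc_index=1))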
Understanding the Partitioner Types
Understanding Partitioner Types
Original
Unlike almost every other configuration choice in Cassandra, the partitioner may not be changed without reloading all of your data. It is important to choose and configure the correct partitioner before initializing your cluster.
Cassandra offers a number of partitioners out of the box, but the random partitioner is the best choice for most Cassandra deployments.
Translation
In Cassandra, the partitioner is unlike almost every other configuration option: it cannot be changed without reloading all of your data, so it is important to correctly select and configure the partitioner before initializing the cluster.
Cassandra provides a number of partitioners out of the box, but the random partitioner is the best choice for most Cassandra deployments.
Original
About the Random Partitioner
The random partitioner is the default partitioning strategy for a Cassandra cluster, and in almost all cases is the right choice.
Translation
The Random Partitioner
The random partitioner is the default partitioning strategy of a Cassandra cluster, and in almost all cases it is the right choice.
Original
Random partitioning uses consistent hashing to determine which node will store a particular row. Unlike naive modulus-by-node-count, consistent hashing ensures that when nodes are added to the cluster, the minimum possible set of data is affected.
Translation
Random partitioning uses consistent hashing to determine which node will store a particular row. Unlike naive modulus-by-node-count, consistent hashing ensures that when nodes are added to the cluster, the smallest possible set of data is affected.
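A small experiment (Python; the keys and cluster sizes are arbitrary) illustrates the difference: placing keys by hash modulo the node count reshuffles most keys when a fifth node is added, while consistent hashing only moves the keys that fall into the new node's slice of the ring.

import hashlib

RING_SIZE = 2 ** 127

def md5_token(key):
    """MD5 hash of a row key, reduced to the 0..2**127 token range (simplified)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def owner_modulus(key, node_count):
    # Naive placement: hash modulo the number of nodes.
    return md5_token(key) % node_count

def owner_consistent(key, tokens):
    # Consistent hashing: the first token at or above the key's hash owns it.
    t = md5_token(key)
    for tok in sorted(tokens):
        if t <= tok:
            return tok
    return min(tokens)   # wrap around to the lowest token

keys = ["user%d" % i for i in range(10000)]
four = [i * RING_SIZE // 4 for i in range(4)]
five = sorted(four + [RING_SIZE // 8])   # add one node between two existing tokens

moved_mod = sum(owner_modulus(k, 4) != owner_modulus(k, 5) for k in keys)
moved_ch = sum(owner_consistent(k, four) != owner_consistent(k, five) for k in keys)
print(moved_mod, moved_ch)   # roughly 80% of the keys move vs. roughly 12%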
Original
To distribute the data evenly across the number of nodes, a hashing algorithm creates an MD5 hash value of the row key. The possible range of hash values is from 0 to 2**127. Each node in the cluster is assigned a token that represents a hash value within this range. A node then owns the rows with a hash value less than its token number. For single data center deployments, tokens are calculated by dividing the hash range by the number of nodes in the cluster. For multi-data center deployments, tokens are calculated per data center (the hash range should be evenly divided for the nodes in each replication group).
Translation
To distribute data evenly across the nodes of the cluster, a hashing algorithm creates an MD5 hash value of the row key. The hash value ranges from 0 to 2**127. Each node in the cluster is assigned a token that represents a hash value within this range; a node then owns the rows whose hash value is less than its token. For a single data center deployment, tokens are calculated by dividing the hash range by the number of nodes in the cluster. For multi-data center deployments, tokens are calculated per data center (the hash range should be evenly divided among the nodes in each replication group).
Original
The primary benefit of this approach is that once your tokens are set appropriately, data from all of your column families is evenly distributed across the cluster with no further effort. For example, one column family could be using user names as the row key and another column family timestamps, but the row keys from each individual column family are still spread evenly. This also means that read and write requests to the cluster will also be evenly distributed.
Translation
The main benefit of this approach is that, once the tokens are set appropriately, the data of all column families is evenly distributed across the nodes of the entire cluster with no further effort. For example, one column family could use user names as the row key and another could use timestamps, yet the row keys of each individual column family are still spread evenly. This also means that read and write requests to the cluster will be evenly distributed.
Original
Another benefit of using random partitioning is the simplification of load balancing a cluster. Because each part of the hash range will receive an equal number of rows on average, it is easier to correctly assign tokens to new nodes.
Translation
Another advantage of using random partitioning is that it simplifies load balancing of the cluster. Because each part of the hash range will on average receive an equal number of rows, it is easier to correctly assign tokens to new nodes.
Original
About Ordered Partitioners
Using an ordered partitioner ensures that row keys are stored in sorted order. Unless absolutely required by your application, DataStax strongly recommends choosing the random partitioner over an ordered partitioner.
Translation
Ordered Partitioners
Using an ordered partitioner ensures that row keys are stored in sorted order. Unless your application absolutely requires this, DataStax strongly recommends choosing the random partitioner over an ordered partitioner.
Note: The ordered partitioner content below is not translated here; you can read the original yourself, or come back to it later.
Untranslated Original Text
Using an ordered partitioner allows range scans over rows, meaning you can scan rows as though you were moving a cursor through a traditional index. For example, if your application has user names as the row key, you can scan rows for users whose names fall between Jake and Joe. This type of query would not be possible with randomly partitioned row keys, since the keys are stored in the order of their MD5 hash (not sequentially).
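A small Python illustration (the user names are made up) of why this fails under the random partitioner: sorting keys by their MD5 hash scatters names that are adjacent alphabetically, so there is no contiguous range to scan.

import hashlib

names = ["jake", "jane", "jim", "joe", "zara"]

print(sorted(names))  # lexical order: 'jake'..'joe' is a contiguous range
print(sorted(names, key=lambda n: hashlib.md5(n.encode()).hexdigest()))
# MD5 (storage) order: the same names end up scattered, so a range scan
# from 'jake' to 'joe' is not possible.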
Although having the ability to do range scans on rows sounds like a desirable feature of ordered partitioners, there are ways to achieve the same functionality using column family indexes. Most applications can be designed with a data model that supports ordered queries as slices over a set of columns rather than range scans over a set of rows.
Using an ordered partitioner is not recommended for the following reasons:
- Sequential writes can cause hot spots. If your application tends to write or update a sequential block of rows at a time, then the writes will not be distributed across the cluster; they will all go to one node. This is frequently a problem for applications dealing with timestamped data.
- More administrative overhead to load balance the cluster. An ordered partitioner requires administrators to manually calculate token ranges based on their estimates of the row key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
- Uneven load balancing for multiple column families. If your application has multiple column families, chances are that those column families have different row keys and different distributions of data. An ordered partitioner that is balanced for one column family may cause hot spots and uneven distribution for another column family in the same cluster.
There are three choices of built-in ordered partitioners that come with Cassandra. Note that the order-preserving partitioners provided are deprecated as of Cassandra 0.7 in favor of the ByteOrderedPartitioner:
- ByteOrderedPartitioner - Row keys are stored in order of their raw bytes rather than converting them to encoded strings. Tokens are calculated by looking at the actual values of your row key data and using a hexadecimal representation of the leading character(s) in a key. For example, if you wanted to partition rows alphabetically, you could assign an A token using its hexadecimal representation of 41 (see the short sketch after this list).
- OrderPreservingPartitioner - Row keys are stored in order based on the UTF-8 encoded value of the row keys. Requires row keys to be UTF-8 encoded strings.
- CollatingOrderPreservingPartitioner - Row keys are stored in order based on the United States English locale (en_US). Also requires row keys to be UTF-8 encoded strings.
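As a quick way to check the hexadecimal value of a leading character when planning such alphabetical split points, a Python one-liner suffices (a sketch only; consult the partitioner documentation for the exact token format it expects):

# Hexadecimal representation of the leading character of a row key,
# e.g. 'A' -> 41 for an alphabetical split under the ByteOrderedPartitioner.
print(format(ord("A"), "x"))   # 41
print("B".encode().hex())      # 42 -- the next alphabetical split point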