Here, we start to build a Cassandra cluster.
I. About tokens
The token is a very important concept in Cassandra: it is the attribute Cassandra uses to balance load across the nodes in the cluster. Cassandra has several token allocation policies; we recommend the default partitioner, RandomPartitioner. Under this policy, a token is an integer between 0 and 2^127 (which also means that, in theory, a Cassandra cluster can hold up to 2^127 nodes). The exponent is 127 because an MD5 hash always outputs 128 bits; dropping the sign bit leaves 127 bits.
When you insert data, Cassandra computes the MD5 hash of the key to obtain a 127-bit number, and then compares this number with the token of each node to decide which node stores the data. It selects the node according to the following rules:
1. The data is stored on the node whose token is the smallest token greater than the hash value of the key;
2. If the hash value of the key is greater than the largest token, the data is stored on the node with the smallest token.
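The two rules above can be sketched in Java as a lookup in a sorted map. This is only an illustration, not Cassandra's actual code: the class name TokenRingSketch, the method names, and the use of abs() to drop the sign of the MD5 value are assumptions made for this example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the node-selection rules; not Cassandra's real implementation.
class TokenRingSketch {
    // token -> node address, kept sorted by token
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(BigInteger token, String address) {
        ring.put(token, address);
    }

    // MD5 of the key as a non-negative integer (assumption: the sign is dropped with abs()).
    static BigInteger hash(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Assumes the ring contains at least one node.
    String nodeFor(BigInteger keyHash) {
        // Rule 1: the node with the smallest token greater than the hash.
        Map.Entry<BigInteger, String> entry = ring.higherEntry(keyHash);
        // Rule 2: past the largest token, wrap around to the smallest token.
        if (entry == null) {
            entry = ring.firstEntry();
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        TokenRingSketch ring = new TokenRingSketch();
        ring.addNode(BigInteger.ZERO, "192.168.20.1");
        ring.addNode(BigInteger.ONE.shiftLeft(126), "192.168.20.2"); // 2^126, half of the ring
        System.out.println(ring.nodeFor(hash("some-key")));
    }
}
```

A TreeMap is used here because higherEntry and firstEntry map directly onto rule 1 and rule 2.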
The configuration file cassandra.yaml has a setting named initial_token, which holds the token of the node. When this value is left blank, Cassandra assigns a token to the node automatically, according to the following rules:
1. If the node is configured to join an existing cluster, Cassandra allocates the most balanced token it can, based on the tokens already present in the cluster. The token given to a newly added node takes over about half of the load of one existing node. When several choices are equally balanced, Cassandra picks the token that takes over the most data currently stored.
2. If the node is not configured to join an existing cluster, Cassandra treats it as a pilot node and assigns it a fixed token. Therefore, if you configure two nodes as standalone nodes and only then try to cluster them, duplicate tokens are reported. The best approach is to configure only the first node as the pilot node, and to configure every other node to join the cluster from the start (as described later); in this way, each of them automatically obtains a balanced token.
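The halving effect mentioned in rule 1 can be illustrated with a little arithmetic. The sketch below assumes that the new token is simply the midpoint of the token range owned by one existing node; that is an assumption made for illustration, and the class and method names are hypothetical rather than part of Cassandra.

```java
import java.math.BigInteger;

// Hypothetical sketch: split one node's token range in half on a ring of size 2^127.
class TokenMidpointSketch {
    static final BigInteger RING_SIZE = BigInteger.ONE.shiftLeft(127); // 2^127

    // Midpoint of the range that starts after `left` and ends at `right`,
    // walking in the direction of increasing tokens and wrapping around at 2^127.
    static BigInteger midpoint(BigInteger left, BigInteger right) {
        BigInteger width = right.subtract(left).mod(RING_SIZE);
        if (width.signum() == 0) {
            width = RING_SIZE; // a single node owns the whole ring
        }
        return left.add(width.shiftRight(1)).mod(RING_SIZE);
    }

    public static void main(String[] args) {
        // A single existing node with token 0: the new node lands halfway around the ring.
        System.out.println(midpoint(BigInteger.ZERO, BigInteger.ZERO)); // prints 2^126
    }
}
```

With one existing node at token 0, the new node receives token 2^126 and each node then owns half of the ring.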
The token in the configuration file is used only when the node is started for the first time; the value is then written to the system files. On later starts, the token is no longer read from the configuration file. Therefore, if you find that two nodes have the same token, you cannot fix it by editing the token in the configuration file. The correct way is to delete all files in the data folder you configured for the node (this also removes the data stored on that node) and restart the service.
Tokens directly affect the load on the nodes in the cluster, so we should try to keep the token range of each node balanced. If you find that the load is unbalanced, you can change the tokens manually: calculate an evenly spaced token for each node, and then use nodetool to apply it. The following Java method calculates the tokens. The parameter is the number of nodes.
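// Requires: import java.math.BigInteger;
// Prints one evenly spaced token per node: i * 2^127 / nodesNum, computed exactly.
public static void calToken(int nodesNum) {
    BigInteger ringSize = BigInteger.valueOf(2).pow(127);
    for (int i = 0; i < nodesNum; i++) {
        System.out.println(ringSize.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(nodesNum)));
    }
}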
After obtaining the tokens, use nodetool to assign them manually. For example, to change the token of the node at 192.168.20.1 to 56713727820156410000000000000000000000:
nodetool.bat -h 192.168.20.1 move 56713727820156410000000000000000000000
II. Modify the configuration file
With tokens covered, let's look at how to configure the cluster. The changes are again concentrated in the cassandra.yaml file, mainly in the following attributes:
cluster_name: the cluster name. All nodes in the same cluster must use the same value.
seeds: the seed nodes. Cassandra is a peer-to-peer distributed database with no central node. For a node to find the cluster, you must give it the IP address of at least one node that is already in the cluster. From that node it can discover all the other nodes in the cluster.
listen_address: the IP address that other nodes use to communicate with this node. Set it to the IP address of this host. If it is set to localhost or 127.0.0.1, other nodes may not be able to reach this node.
rpc_address: the address on which this node listens for client connections. If it is set to 0.0.0.0, clients on any machine can connect to this node; if it is set to localhost, only clients on the same machine can connect.
Here we use two machines A and B as examples. Their IP addresses are 192.168.20.1 and 192.168.20.2, respectively.
The configuration of A is as follows:
cluster_name: 'firstcluster'
seeds: "192.168.20.1,192.168.20.2"
listen_address: 192.168.20.1
rpc_address: 0.0.0.0
B's configuration is as follows:
cluster_name: 'firstcluster'
seeds: "192.168.20.2,192.168.20.1"
listen_address: 192.168.20.2
rpc_address: 0.0.0.0
In fact, the seeds settings of A and B do not need to be identical. It is enough for one of them to list the IP address of the other, so that they can find each other. Likewise, if we want to add another node C, we only need to put the IP address of A or B in the seeds of C; the seeds of A and B do not need to change. However, the official website recommends using the same seeds on every node, because this is more robust. If C lists only the IP address of A and A goes down, C can no longer find B either. For that reason, list as many of the cluster's nodes as is practical in the seeds of every node.
III. Start the cluster
Once the machines are started, and provided the configuration above is correct, they detect each other automatically and form the cluster. To check that the cluster was formed correctly, run the ring command from the bin directory:
nodetool.bat -h localhost ring
If the cluster was formed, information about every machine in it is printed. DC is the data center the node belongs to, Status shows whether the node is online, State shows whether the node's database is in a normal state, Load is the size of the data currently stored on the node, and Owns is the expected share of the data calculated from the token.
IV. Some common operations
Data in the cluster is distributed to the appropriate node automatically. If we want to keep multiple copies of the data, we can set the replication_factor attribute, which is the number of copies kept of each data record. This attribute used to be set in the configuration file, but recent versions of Cassandra no longer allow that, and you must change it from the client instead. Go to the bin directory and start the client, cassandra-cli:
connect localhost/9160;
use demo;
update keyspace demo with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:3};
This statement first changes placement_strategy from the default NetworkTopologyStrategy to SimpleStrategy, after which the replication_factor attribute can be set.
After the change, each data record is automatically stored as multiple copies on different nodes. The first copy is stored on the node selected by the token rules, and each further copy is stored on the node with the next higher token after the node holding the previous copy.
However, this synchronization does not complete immediately. If you want to see the copies right away, run the repair command on every node:
nodetool.bat -h <ip> repair
Then run the ring command again; the load on all nodes should have increased.
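The way copies are placed on successive nodes under SimpleStrategy can be sketched in Java as a walk along the tokens in increasing order. This is a simplified illustration under the assumptions stated in the comments; the class ReplicaPlacementSketch and the method replicasFor are hypothetical names, not Cassandra's API.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of replica placement: first copy on the token owner,
// further copies on the following nodes in token order, wrapping around the ring.
class ReplicaPlacementSketch {
    static List<String> replicasFor(TreeMap<BigInteger, String> ring,
                                    BigInteger keyHash,
                                    int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) {
            return replicas;
        }
        // Owner of the first copy: smallest token greater than the hash, else the smallest token.
        BigInteger token = ring.higherKey(keyHash);
        if (token == null) {
            token = ring.firstKey();
        }
        // Assumption: there can never be more copies than nodes.
        int copies = Math.min(replicationFactor, ring.size());
        for (int i = 0; i < copies; i++) {
            replicas.add(ring.get(token));
            token = ring.higherKey(token); // move to the next higher token
            if (token == null) {
                token = ring.firstKey();   // wrap around to the smallest token
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<BigInteger, String> ring = new TreeMap<>();
        ring.put(BigInteger.valueOf(0), "A");
        ring.put(BigInteger.valueOf(100), "B");
        ring.put(BigInteger.valueOf(200), "C");
        System.out.println(replicasFor(ring, BigInteger.valueOf(150), 2)); // prints [C, A]
    }
}
```

The small tokens 0, 100 and 200 are used only to keep the example readable; real tokens lie between 0 and 2^127.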
When a new node is added to the cluster, it takes over part of the token range, but the data stored earlier may not be redistributed according to the new tokens. (This applies when the cluster allocates the token automatically. If you use the move command to reassign tokens manually, the data is redistributed.) If you want the data to be redistributed, run the repair command on the new node after it has joined, and it will automatically pull in the data that belongs to it. At that point, the old nodes still hold some data that no longer belongs to them. We can use the cleanup command to remove this redundant data:
nodetool.bat -h <ip> cleanup
Note: Confirm that the new machine is working normally before you run this command.