When a Cassandra cluster is first set up, it usually needs to be seeded with initialization data: for example, all the blog posts on a blogging site, all the page information on a data-analysis site, or all the product information on an e-commerce site. This initialization data is often very large, so importing it directly through the Thrift API (the standard client interface to Cassandra) is not practical. Facebook instead imported large amounts of data into Cassandra using the Binary Memtable mechanism.
We have Hadoop and Cassandra installed on the same cluster. Assume the data to be initialized can be exported to a flat text (txt) file and uploaded to HDFS. Each machine is both a Cassandra node and a Hadoop slave, and each slave runs one Reduce task.
To import this large amount of data into the cluster, I can see two possible solutions.
Solution 1: Use Binary Memtable
1. Run a MapReduce job
In the Mapper, partition the imported data by key.
In the Reducer, perform the following operations in the configure phase:
1. Initialize Cassandra's message service and Gossip service.
2. Create the Cassandra file directories.
3. Disable Cassandra's compaction so the import is not interrupted by background merges.
4. Wait one ring delay so that the ring and gossip state can propagate.
In the reduce phase, perform the following operations:
1. Create the ColumnFamily corresponding to each key.
2. Create a RowMutation message.
3. Send the message to every node in the cluster that should receive the data.
In the Reducer, perform the following operations in the close phase:
1. Wait until all messages in the message service have been sent.
2. Shut down Cassandra's message service and Gossip service.
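The Mapper's only job in this scheme is to route each row key to the reducer (and hence the node) responsible for it. A minimal, self-contained sketch of that hash-based routing in plain Java (the class and method names are illustrative, not Cassandra's or Hadoop's actual API):

```java
// Illustrative hash partitioner: routes each row key to one of
// numReduces reducers, mimicking Hadoop's default HashPartitioner.
public class KeyPartitioner {
    private final int numReduces;

    public KeyPartitioner(int numReduces) {
        this.numReduces = numReduces;
    }

    // Same formula Hadoop's HashPartitioner uses: mask off the sign
    // bit, then take the hash modulo the number of reduce tasks.
    public int partitionFor(String rowKey) {
        return (rowKey.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    public static void main(String[] args) {
        KeyPartitioner p = new KeyPartitioner(4);
        for (String k : new String[] {"blog:1", "blog:2", "product:42"}) {
            System.out.println(k + " -> reducer " + p.partitionFor(k));
        }
    }
}
```

Because the mapping is deterministic, all mutations for a given key land in the same reducer, which then builds and sends the RowMutation messages as described above.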
2. Start the Cassandra Cluster
After Cassandra starts, manually run a major compaction to merge the large number of SSTable files generated during the import.
Solution 2: Generate the SSTable files yourself
1. Start the Cassandra Cluster
After the cluster is started, make sure that the ring of the entire cluster has been established.
2. Run a MapReduce job
In the Mapper, perform the following operations in the configure phase:
1. Randomly connect to one Cassandra machine.
2. Obtain the token map of the Cassandra cluster.
In the Mapper, perform the following operations in the map stage:
1. Partition the data according to the node address that owns each key (looked up in the token map).
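Finding the node address for a key means hashing the key onto the token ring fetched in the configure phase. A self-contained sketch of that lookup using RandomPartitioner-style MD5 tokens (the ring contents and class name here are made-up examples, not Cassandra's actual API):

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

// Illustrative token-ring lookup: given the cluster's token map,
// find which node owns a given row key.
public class TokenRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    public void addNode(BigInteger token, String address) {
        ring.put(token, address);
    }

    // MD5-hash the key into a token, RandomPartitioner-style.
    public static BigInteger tokenFor(String rowKey) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(rowKey.getBytes("UTF-8"));
        return new BigInteger(digest).abs();
    }

    // The owner is the first node whose token is >= the key's token,
    // wrapping around to the smallest token if none is larger.
    public String ownerOf(String rowKey) throws Exception {
        Map.Entry<BigInteger, String> entry = ring.ceilingEntry(tokenFor(rowKey));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) throws Exception {
        TokenRing ring = new TokenRing();
        // Example tokens; a real job would read these from the cluster.
        ring.addNode(new BigInteger("100000000000000000000000000000000000000"), "10.0.0.1");
        ring.addNode(new BigInteger("200000000000000000000000000000000000000"), "10.0.0.2");
        ring.addNode(new BigInteger("300000000000000000000000000000000000000"), "10.0.0.3");
        System.out.println("blog:1 -> " + ring.ownerOf("blog:1"));
    }
}
```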
Secondary sort
1. Group the data by node address and key.
2. Within each group, sort the data in ascending key order.
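The secondary sort amounts to ordering records by a composite (node address, key) pair, so each reducer receives one node's rows grouped together and in ascending key order. A plain-Java sketch of that comparator, outside Hadoop and with illustrative names:

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative composite key for the secondary sort: records are
// ordered by node address first, then by row key, so that all rows
// destined for one node arrive grouped and in ascending key order.
public class NodeKeyPair {
    final String nodeAddress;
    final String rowKey;

    NodeKeyPair(String nodeAddress, String rowKey) {
        this.nodeAddress = nodeAddress;
        this.rowKey = rowKey;
    }

    // Sort comparator: node address first, then row key.
    static final Comparator<NodeKeyPair> SORT =
            Comparator.comparing((NodeKeyPair p) -> p.nodeAddress)
                      .thenComparing(p -> p.rowKey);

    public static void main(String[] args) {
        NodeKeyPair[] records = {
            new NodeKeyPair("10.0.0.2", "blog:9"),
            new NodeKeyPair("10.0.0.1", "blog:5"),
            new NodeKeyPair("10.0.0.2", "blog:1"),
            new NodeKeyPair("10.0.0.1", "blog:2"),
        };
        Arrays.sort(records, SORT);
        for (NodeKeyPair r : records) {
            System.out.println(r.nodeAddress + " " + r.rowKey);
        }
    }
}
```

In a real Hadoop job this ordering would be implemented with a custom partitioner, sort comparator, and grouping comparator rather than an in-memory sort.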
In the Reducer, perform the following operations in the configure phase:
1. Create an SSTableWriter instance for each ColumnFamily.
In the reduce phase, perform the following operations:
1. Create the ColumnFamily corresponding to each key.
2. Call the SSTableWriter.append() method to write the data into the target SSTable file (append() expects rows in ascending key order, which the secondary sort guarantees).
In the Reducer, perform the following operations in the close phase:
1. Call the SSTableWriter.closeAndOpenReader() method for each ColumnFamily.
2. scp the generated SSTable files to the data directory of each Cassandra node.
3. Restart the Cassandra cluster.
Original article title: Thoughts on a Cassandra cluster data initialization solution
Link: http://www.cnblogs.com/gpcuster/archive/2010/07/03/1770452.html