Designing a Data Initialization Solution for a Cassandra Cluster

Source: Internet
Author: User
Article Directory
    • 1. Run the MapReduce job
    • 2. Start the Cassandra cluster
    • 1. Start the Cassandra cluster
    • 2. Run the MapReduce job
    • 3. Restart the Cassandra cluster

When a Cassandra cluster is put into service, it usually has to be loaded with initial data: for example, all the blog posts on a blog website, all the webpage information on a data analysis website, or all the product information on an e-commerce website. The initial data volume is usually very large, so importing it directly through the Thrift API is not practical (see "Talk About the Cassandra Client"). Facebook used the BinaryMemtable method instead (see "Using binary memtable to import a large amount of data to Cassandra").

We installed Hadoop and Cassandra on the same cluster. Assume that the data to be initialized can be exported to flat files (txt files) and then uploaded to HDFS. Each machine is both a Cassandra node and a Hadoop slave, and each slave runs one reduce task.

In order to import a large amount of data into the cluster, I personally think there are two solutions.

 

Solution 1: Use BinaryMemtable

1. Run the MapReduce job

In the mapper, partition the imported data by key.
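As a rough illustration of partitioning by key, the sketch below hashes each row key and maps it onto a fixed number of reducers. The function names, the choice of MD5, and the sample records are assumptions made for this example; a real job would rely on whatever partitioner the Hadoop job is configured with.

```python
import hashlib
from collections import defaultdict


def partition_for_key(key: str, num_reducers: int) -> int:
    """Map a row key to a reducer index with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_reducers


def partition_records(records, num_reducers):
    """Group (key, value) records by the reducer they would be sent to."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_for_key(key, num_reducers)].append((key, value))
    return dict(buckets)


# Hypothetical sample data: row key -> payload.
records = [("blog:1", "first post"), ("blog:2", "second post"), ("blog:1", "edit")]
buckets = partition_records(records, num_reducers=3)
```

Because the hash is stable, every record with the same key lands in the same bucket, so one reducer sees all the data for a given row.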

In the reducer, perform the following operations in the configure phase:

1. Initialize Cassandra's messaging service and gossip service.
2. Create the Cassandra file directories.
3. Disable Cassandra's compaction.
4. Wait for the ring delay period, so that gossip has time to propagate the ring state.

In reduce, perform the following operations:

1. Create the ColumnFamily corresponding to each key.
2. Create a RowMutation message.
3. Send the message to all nodes in the cluster that should hold the data.
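To make "all nodes that should hold the data" more concrete, here is a minimal sketch of picking target nodes on a token ring. It assumes a simple placement rule (the owner of the token plus the next nodes clockwise); the function name, the ring layout, and the small token values are hypothetical, and this is not Cassandra's own replication code.

```python
import bisect


def replicas_for_token(ring, token, replication_factor):
    """Return the nodes that should receive a row with this token.

    `ring` is a list of (token, node) pairs sorted by token. The first replica
    is the node with the smallest token >= the row token (wrapping around),
    and the remaining replicas are the next nodes clockwise.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical three-node ring with small tokens for readability.
ring = [(100, "node-a"), (200, "node-b"), (300, "node-c")]
targets = replicas_for_token(ring, token=150, replication_factor=2)
```

In this sketch the reducer would send the same mutation message once to each node in `targets`.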

In the reducer, perform the following operations in the close phase:

1. Wait until all messages in the messaging service have been sent.
2. Shut down Cassandra's messaging service and gossip service.

2. Start the Cassandra cluster

After Cassandra has started, manually trigger a compaction to merge the large number of SSTable files generated earlier.
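Conceptually, merging many sorted files into one is a k-way merge. The sketch below shows that idea only; it is not Cassandra's compaction implementation, and the `(key, timestamp, value)` row layout and function name are assumptions for illustration.

```python
import heapq


def merge_sorted_runs(runs):
    """Merge several key-sorted runs into one, keeping the newest value per key.

    Each run is a list of (key, timestamp, value) tuples sorted by key.
    """
    merged = []
    for key, timestamp, value in heapq.merge(*runs, key=lambda row: row[0]):
        if merged and merged[-1][0] == key:
            # Same key seen in another run: keep the entry with the newer timestamp.
            if timestamp > merged[-1][1]:
                merged[-1] = (key, timestamp, value)
        else:
            merged.append((key, timestamp, value))
    return merged


run_a = [("k1", 1, "a1"), ("k3", 1, "a3")]
run_b = [("k1", 2, "b1"), ("k2", 1, "b2")]
compacted = merge_sorted_runs([run_a, run_b])
```

Since every input is already sorted, the merge reads each run sequentially and never has to sort the full data set again.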

         

Solution 2: Generate the SSTable files yourself

1. Start the Cassandra cluster

After the cluster has started, make sure that the ring of the entire cluster has been established.

2. Run the MapReduce job

In the mapper, perform the following operations in the configure phase:

1. Connect to a randomly chosen Cassandra machine.
2. Obtain the token map of the Cassandra cluster.

In the mapper, perform the following operation in the map phase:

1. Partition the data by the node address that corresponds to each data key.
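The map step can be sketched as a lookup of each key in the token map, followed by emitting a composite output key. Everything named here is an assumption for illustration: the MD5-based `token_for_key` stands in for the cluster's partitioner, and `RING_SIZE`, the addresses, and the token values are hypothetical rather than fetched from a real cluster.

```python
import bisect
import hashlib


def token_for_key(key: str, ring_size: int) -> int:
    """Stand-in for the cluster's partitioner: hash the key onto the token space."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % ring_size


def node_for_token(token_map, token):
    """Find the node owning `token`; `token_map` is a token-sorted list of (token, address)."""
    tokens = [t for t, _ in token_map]
    index = bisect.bisect_left(tokens, token) % len(token_map)
    return token_map[index][1]


def map_record(token_map, ring_size, key, value):
    """Emit a composite (node address, row key) map output key with the value."""
    node = node_for_token(token_map, token_for_key(key, ring_size))
    return (node, key), value


# Hypothetical token map, as it might look after being fetched in the configure phase.
RING_SIZE = 2 ** 127
token_map = [
    (RING_SIZE // 4, "10.0.0.1"),
    (RING_SIZE // 2, "10.0.0.2"),
    (RING_SIZE - 1, "10.0.0.3"),
]
composite_key, value = map_record(token_map, RING_SIZE, "blog:1", "first post")
```

Carrying the node address in the output key lets the job route every row to the reducer that builds files for that node.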

Secondary sorting:

1. Group the data by node address and key.
2. Sort the data within each group in ascending order of key.
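One way to read the secondary sort is: order records by a composite (node address, key), then hand each node's records to a reducer as one group. The sketch below shows that reading in plain Python; the `((node, key), value)` record shape and the sample addresses are assumptions for illustration, and in Hadoop the same effect would come from the job's sort and grouping comparators.

```python
from itertools import groupby


def secondary_sort(map_output):
    """Sort by (node address, row key), then group the records by node address.

    `map_output` is an iterable of ((node, key), value) pairs.
    """
    ordered = sorted(map_output, key=lambda item: item[0])
    grouped = {}
    for node, items in groupby(ordered, key=lambda item: item[0][0]):
        grouped[node] = [(key, value) for (_, key), value in items]
    return grouped


map_output = [
    (("10.0.0.2", "k9"), "v9"),
    (("10.0.0.1", "k5"), "v5"),
    (("10.0.0.2", "k3"), "v3"),
    (("10.0.0.1", "k1"), "v1"),
]
per_node = secondary_sort(map_output)
```

Each value in `per_node` is a list of rows for one node, already in ascending key order.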

In the reducer, perform the following operation in the configure phase:

1. Create an SSTableWriter instance for each ColumnFamily.

In reduce, perform the following operations:

1. Create the ColumnFamily corresponding to each key.
2. Call the SSTableWriter.append() method to write the data to the specified SSTable file.

In the reducer, perform the following operations in the close phase:

1. Call the SSTableWriter.closeAndOpenReader() method for each ColumnFamily.
2. Use scp to copy the generated SSTable files to Cassandra's data directory.
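SSTable files store rows in sorted order, which is why the reducer needs its input sorted before appending. The sketch below models that constraint with a hypothetical `SortedRunWriter` class; it is a stand-in written for this example and does not reproduce the real SSTableWriter API or file format.

```python
class SortedRunWriter:
    """Hypothetical stand-in for an SSTable writer that only accepts ascending keys."""

    def __init__(self):
        self.rows = []
        self.closed = False

    def append(self, key, columns):
        if self.closed:
            raise RuntimeError("writer is already closed")
        if self.rows and key <= self.rows[-1][0]:
            raise ValueError(
                f"key {key!r} is not greater than previous key {self.rows[-1][0]!r}"
            )
        self.rows.append((key, columns))

    def close(self):
        """Seal the run and return its rows; no further appends are allowed."""
        self.closed = True
        return list(self.rows)


def reduce_group(sorted_rows):
    """Write one node's key-sorted rows through a writer, as the reducer would."""
    writer = SortedRunWriter()
    for key, columns in sorted_rows:
        writer.append(key, columns)
    return writer.close()


written = reduce_group([("k1", {"title": "a"}), ("k2", {"title": "b"})])
```

Feeding the writer unsorted keys raises an error in this sketch, which mirrors why the secondary sort has to run before the reduce step.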

3. Restart the Cassandra cluster

                     

More about Cassandra: http://www.cnblogs.com/gpcuster/tag/Cassandra/
