Integrating Cassandra with Hadoop MapReduce

Source: Internet
Author: User
Tags: cassandra, hadoop, mapreduce

When you see this title, you will probably ask: what exactly does this integration mean?

In my opinion, the so-called integration means that we can write a MapReduce program that reads data from HDFS and inserts it into Cassandra, or one that reads data directly from Cassandra and performs the corresponding computation on it.

Reading data from HDFS and inserting it into Cassandra

For this type, follow these steps.

1. Upload the data to be inserted into Cassandra to HDFS.

2. Start the mapreduce program.

This type of integration actually has little to do with Cassandra itself: we simply run an ordinary MapReduce program and insert the computed results into Cassandra on the map or reduce side. That's all.
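As a rough illustration of the insert step, the following is a minimal sketch of writing one result into Cassandra 0.6 through the Thrift API. It assumes the Cassandra 0.6 Thrift classes are on the classpath and a node is listening on localhost:9160; the keyspace, column family, and values are all hypothetical, and in a real job the insert call would live inside reduce():

```java
// Sketch only: hypothetical keyspace/column family, no error handling.
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class CassandraSink {
    public static void main(String[] args) throws Exception {
        // Connect to one Cassandra 0.6 node over Thrift.
        TSocket socket = new TSocket("localhost", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
        socket.open();
        // In a real MapReduce job, this would run once per output record
        // in the map() or reduce() method.
        ColumnPath path = new ColumnPath("MemberCF").setColumn("City".getBytes("UTF-8"));
        client.insert("Keyspace1", "row-key-1", path,
                "Hangzhou".getBytes("UTF-8"),
                System.currentTimeMillis(), ConsistencyLevel.ONE);
        socket.close();
    }
}
```

In practice you would open the connection once in the task's setup (configure) method and reuse it, rather than per record.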

Reading data directly from Cassandra and performing the corresponding computation

This capability was added in Cassandra 0.6.x: MapReduce can read its input directly from Cassandra, which effectively implements a full table scan over a Cassandra column family.

The procedure is as follows:

1. Specify the keyspace, column family, SlicePredicate, and other Cassandra-related parameters used by the MapReduce job. (For details on these concepts, see "Big Talk: Cassandra Data Model" and "Talking about the Cassandra Client".)

2. Start the mapreduce program.
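The two steps above can be sketched with the stock Cassandra 0.6 Hadoop support (package org.apache.cassandra.hadoop, which requires Hadoop 0.20.x). The keyspace and column names below are hypothetical, and mapper/reducer setup is elided:

```java
// Sketch: job setup for scanning a Cassandra 0.6 column family.
import java.util.Arrays;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraScanJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "cassandra-full-scan");
        // Read input splits directly from Cassandra instead of HDFS.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        // Step 1: keyspace, column family, and the columns to fetch.
        ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "MemberCF");
        SlicePredicate predicate = new SlicePredicate().setColumn_names(
                Arrays.asList("City".getBytes("UTF-8")));
        ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
        // Set mapper/reducer classes as in any MapReduce job, then:
        // Step 2: start the MapReduce program.
        job.waitForCompletion(true);
    }
}
```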

There are several differences between this type of integration and reading data from HDFS:

1. Different input sources: the former reads its input data from HDFS, while the latter reads it directly from Cassandra.

2. Different Hadoop versions: the former can use any version of Hadoop, while the latter can only use Hadoop 0.20.x.

Integrating Hadoop 0.19.x with Cassandra 0.6.x

Cassandra 0.6.x implements integration with Hadoop 0.20.x by default, so we cannot use it directly with Hadoop 0.19.x.

So the first thing we need to do is modify Cassandra's source code to provide a version that can be used with Hadoop 0.19.x.

To perform this test, follow these steps:

1. Download the modified code.

2. Specify the following in the MapReduce job (note that these classes are in the package com.alibaba.dw.cassandra.hadoop):

 
    ConfigHelper.setColumnFamily(conf, keyspace, memberCF,
            "/home/admin/apache-cassandra-0.6.1/conf");
    SlicePredicate predicate = new SlicePredicate().setColumn_names(
            Arrays.asList("City".getBytes(UTF8), "Employees_count".getBytes(UTF8)));
    ConfigHelper.setSlicePredicate(conf, predicate);
    ConfigHelper.setRangeBatchSize(conf, 512);
    ConfigHelper.setSuperColumn(conf, "MemberInfo");

3. Make sure that on every machine running the MapReduce job, the storage-conf.xml file exists at the path specified in the job configuration.

4. Run the mapreduce program.

Existing Problems and Improvements

In actual use, we will find this error message on the map side:

 
    java.lang.RuntimeException: TimedOutException()
        at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:125)
        at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:164)
        at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:1)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
        at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader.next(ColumnFamilyRecordReader.java:224)
        at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader.next(ColumnFamilyRecordReader.java:1)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
    Caused by: TimedOutException()
        at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
        at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
        at com.alibaba.dw.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:108)
        ... 11 more

 

The reason for this problem is that a read from Cassandra through the Thrift API timed out.

So we can optimize this code to add the missing error handling (for example, retrying failed requests) and improve the program's availability.
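One simple form of such error handling is to wrap the Thrift call in a retry loop. This is a minimal sketch of the idea, not part of the original patch; the class and method names are mine:

```java
import java.util.concurrent.Callable;

// Minimal retry wrapper: re-attempts a call that may fail transiently,
// e.g. a get_range_slices that hits TimedOutException.
public class Retry {
    public static <T> T withRetries(Callable<T> call, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call(); // success: return immediately
            } catch (Exception e) {
                last = e; // timed out or failed: remember the error and retry
            }
        }
        throw last; // all attempts failed: surface the last error
    }
}
```

A production version would also back off between attempts and only retry errors that are actually transient.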

Another way is to modify the Cassandra configuration and increase the RPC timeout.
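In Cassandra 0.6 this is the RpcTimeoutInMillis setting in storage-conf.xml; the value below is only illustrative:

```xml
<!-- storage-conf.xml: raise the Thrift RPC timeout so slow
     get_range_slices calls are less likely to time out -->
<RpcTimeoutInMillis>30000</RpcTimeoutInMillis>
```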

 

For more articles about Cassandra, see: http://www.cnblogs.com/gpcuster/tag/Cassandra/
