Writing to Multiple HBase Tables from MapReduce and Spark: A Summary


By Syn良子. Source: http://www.cnblogs.com/cssdongl — please credit the source when reposting.

We all know that when using MapReduce or Spark to write to a known HBase table, you declare the following directly in the driver class:

job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);

MapReduce then writes Puts directly to the context in the mapper or reducer, while Spark builds an RDD of Puts and saves it through PairRDDFunctions.saveAsHadoopDataset.

But we often run into requirements where, depending on the input data, the processed results must be written to more than one HBase table, or where the table name is not known in advance and has to be constructed from a field in the data.

Because the table name is unknown, TableOutputFormat.OUTPUT_TABLE cannot be set. Even so, this requirement is easy to implement. Below I summarize the MapReduce and Spark implementations in turn (you will find they end up the same).

I. MapReduce writing to multiple HBase tables

Add the following code to the main method of the MR job:

job.setOutputFormatClass(MultiTableOutputFormat.class);

You can then construct the table name from the relevant fields in the mapper or reducer and write (tableName, Put) pairs to the context, thereby writing to multiple HBase tables.
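To make the routing concrete, here is a minimal plain-Java sketch of the idea: derive the destination table name from a field of each record, then emit it as the output key alongside the Put. The record format, class name, and "events_" naming scheme are invented for illustration; only the (tableName, Put) routing pattern comes from MultiTableOutputFormat's contract, and the HBase classes are left as comments so the logic runs standalone.

```java
// Sketch of the per-record routing used with MultiTableOutputFormat:
// the mapper/reducer emits (tableName, put) pairs instead of writing
// to one fixed table. The "category|rowkey|value" record format and
// the "events_" prefix are hypothetical.
public class TableRouter {

    // Derive the destination table name from the first field of the record.
    public static String tableFor(String record) {
        String category = record.split("\\|")[0];
        return "events_" + category; // hypothetical naming scheme
    }

    public static void main(String[] args) {
        String record = "click|user42|3";
        // In a real mapper you would then write:
        //   context.write(new ImmutableBytesWritable(Bytes.toBytes(tableName)), put);
        System.out.println(tableFor(record)); // prints events_click
    }
}
```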

II. Spark writing to multiple HBase tables

Here I will use a Spark Streaming program I tested to write to multiple HBase tables directly. On to the code:

// Imports reconstructed; PropertiesUtil is the author's own config helper.
import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SparkStreamingWriteToHbase {
  def main(args: Array[String]): Unit = {
    var masterUrl = "yarn-client"
    if (args.length > 0) {
      masterUrl = args(0)
    }
    val conf = new SparkConf().setAppName("Write to several tables of Hbase").setMaster(masterUrl)
    val ssc = new StreamingContext(conf, Seconds(5))

    val topics = Set("app_events")
    val brokers = PropertiesUtil.getValue("broker_address")
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "serializer.class" -> "kafka.serializer.StringEncoder")

    val hbaseTableSuffix = "_clickcounts"
    val hConf = HBaseConfiguration.create()
    val zookeeper = PropertiesUtil.getValue("zookeeper_address")
    hConf.set(HConstants.ZOOKEEPER_QUORUM, zookeeper)
    val jobConf = new JobConf(hConf, this.getClass)

    val kafkaDStreams = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    val appUserClicks = kafkaDStreams.flatMap(rdd => {
      val data = JSONObject.fromObject(rdd._2)
      Some(data)
    }).map { jsonLine =>
      val key = jsonLine.getString("appId") + "_" + jsonLine.getString("uid")
      val value = jsonLine.getString("click_count")
      (key, value)
    }

    val result = appUserClicks.map { item =>
      val rowKey = item._1
      val value = item._2
      convertToHbasePut(rowKey, value, hbaseTableSuffix)
    }

    result.foreachRDD { rdd =>
      rdd.saveAsNewAPIHadoopFile("", classOf[ImmutableBytesWritable], classOf[Put],
        classOf[MultiTableOutputFormat], jobConf)
    }

    ssc.start()
    ssc.awaitTermination()
  }

  def convertToHbasePut(key: String, value: String, tableNameSuffix: String): (ImmutableBytesWritable, Put) = {
    val rowKey = key
    val tableName = rowKey.split("_")(0) + tableNameSuffix
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("count"), Bytes.toBytes(value))
    (new ImmutableBytesWritable(Bytes.toBytes(tableName)), put)
  }
}

Briefly: the Spark Streaming job reads JSON data from Kafka, and the appId field is used to construct the table name, distinguishing the different HBase tables. Finally the RDD is written to those tables with saveAsNewAPIHadoopFile.
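The table-name derivation in convertToHbasePut is just string manipulation on the row key, so it can be checked in isolation. A plain-Java mirror of that logic (HBase's Put and ImmutableBytesWritable are elided so it runs standalone; the class and method names are mine):

```java
// Plain-Java mirror of the row-key / table-name logic in convertToHbasePut:
// rowKey = appId + "_" + uid, and the table name is the appId prefix of the
// row key plus the "_clickcounts" suffix.
public class RowKeyDemo {

    static String rowKey(String appId, String uid) {
        return appId + "_" + uid;
    }

    static String tableName(String rowKey, String suffix) {
        // split("_")(0) in the Scala code: take everything before the first "_"
        return rowKey.split("_")[0] + suffix;
    }

    public static void main(String[] args) {
        String key = rowKey("app1", "u42");
        System.out.println(key);                            // app1_u42
        System.out.println(tableName(key, "_clickcounts")); // app1_clickcounts
    }
}
```

Note that this scheme assumes appId itself contains no underscore; otherwise the split would truncate it.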

Stepping into saveAsNewAPIHadoopFile, you will find it is no different from the MapReduce configuration, as follows:

def saveAsNewAPIHadoopFile(
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    conf: Configuration = self.context.hadoopConfiguration)
{
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  val job = new NewAPIHadoopJob(hadoopConf)
  job.setOutputKeyClass(keyClass)
  job.setOutputValueClass(valueClass)
  job.setOutputFormatClass(outputFormatClass)
  job.getConfiguration.set("mapred.output.dir", path)
  saveAsNewAPIHadoopDataset(job.getConfiguration)
}

The first parameter of this method is the output path; since we are writing to HBase here, it is passed in empty. The remaining parameters are keyClass, valueClass, outputFormatClass, and the jobConf.

The outputFormatClass must be MultiTableOutputFormat to guarantee that multiple tables can be written. And one more point to note: the HBase tables you are writing to must be created first.
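Since MultiTableOutputFormat does not create tables for you, each target table and its column family must exist before the job runs; for the example above (assuming an appId of app1 and the info family used in the code), that can be done in the HBase shell:

```
hbase> create 'app1_clickcounts', 'info'
```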

