Writing to Multiple HBase Tables from MapReduce and Spark: A Summary


By Syn良子. Source: http://www.cnblogs.com/cssdongl — please credit the source when reposting.

We all know that when using MapReduce or Spark to write to a known HBase table, you declare the following directly in the driver class:

job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);

MapReduce then writes Puts directly to the context in the mapper or reducer, while Spark builds an RDD of Puts and saves it through PairRDDFunctions.saveAsHadoopDataset.

But we often run into requirements where, depending on the input data, the processed results must be written to more than one HBase table, or where the table name is not known in advance and has to be constructed from a field in the data.

Because the table name is unknown, TableOutputFormat.OUTPUT_TABLE cannot be set. Even so, this requirement is easy to implement. Below I summarize the MapReduce and Spark implementations in turn (you will find they end up the same).

I. MapReduce writing to multiple HBase tables

Add the following code to the main method of the MR job:

job.setOutputFormatClass(MultiTableOutputFormat.class);

You can then construct the table name from the relevant fields in the mapper or reducer and write (tableName, Put) pairs to the context, thereby writing to multiple HBase tables.
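To make the routing concrete, here is a minimal plain-Java sketch of the idea: derive the destination table name from a field of each record, then emit it as the output key alongside the Put. The record format, class name, and "events_" naming scheme are invented for illustration; only the (tableName, Put) routing pattern comes from MultiTableOutputFormat's contract, and the HBase classes are left as comments so the logic runs standalone.

```java
// Sketch of the per-record routing used with MultiTableOutputFormat:
// the mapper/reducer emits (tableName, put) pairs instead of writing
// to one fixed table. The "category|rowkey|value" record format and
// the "events_" prefix are hypothetical.
public class TableRouter {

    // Derive the destination table name from the first field of the record.
    public static String tableFor(String record) {
        String category = record.split("\\|")[0];
        return "events_" + category; // hypothetical naming scheme
    }

    public static void main(String[] args) {
        String record = "click|user42|3";
        // In a real mapper you would then write:
        //   context.write(new ImmutableBytesWritable(Bytes.toBytes(tableName)), put);
        System.out.println(tableFor(record)); // prints events_click
    }
}
```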

II. Spark writing to multiple HBase tables

Here I will use a Spark Streaming program I tested to write to multiple HBase tables directly. On to the code:

// Imports reconstructed; PropertiesUtil is the author's own config helper.
import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.MultiTableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SparkStreamingWriteToHbase {
  def main(args: Array[String]): Unit = {
    var masterUrl = "yarn-client"
    if (args.length > 0) {
      masterUrl = args(0)
    }
    val conf = new SparkConf().setAppName("Write to several tables of Hbase").setMaster(masterUrl)
    val ssc = new StreamingContext(conf, Seconds(5))

    val topics = Set("app_events")
    val brokers = PropertiesUtil.getValue("broker_address")
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "serializer.class" -> "kafka.serializer.StringEncoder")

    val hbaseTableSuffix = "_clickcounts"
    val hConf = HBaseConfiguration.create()
    val zookeeper = PropertiesUtil.getValue("zookeeper_address")
    hConf.set(HConstants.ZOOKEEPER_QUORUM, zookeeper)
    val jobConf = new JobConf(hConf, this.getClass)

    val kafkaDStreams = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    val appUserClicks = kafkaDStreams.flatMap(rdd => {
      val data = JSONObject.fromObject(rdd._2)
      Some(data)
    }).map { jsonLine =>
      val key = jsonLine.getString("appId") + "_" + jsonLine.getString("uid")
      val value = jsonLine.getString("click_count")
      (key, value)
    }

    val result = appUserClicks.map { item =>
      val rowKey = item._1
      val value = item._2
      convertToHbasePut(rowKey, value, hbaseTableSuffix)
    }

    result.foreachRDD { rdd =>
      rdd.saveAsNewAPIHadoopFile("", classOf[ImmutableBytesWritable], classOf[Put],
        classOf[MultiTableOutputFormat], jobConf)
    }

    ssc.start()
    ssc.awaitTermination()
  }

  def convertToHbasePut(key: String, value: String, tableNameSuffix: String): (ImmutableBytesWritable, Put) = {
    val rowKey = key
    val tableName = rowKey.split("_")(0) + tableNameSuffix
    val put = new Put(Bytes.toBytes(rowKey))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("count"), Bytes.toBytes(value))
    (new ImmutableBytesWritable(Bytes.toBytes(tableName)), put)
  }
}

Briefly: the Spark Streaming job reads JSON data from Kafka, and the appId field is used to construct the table name, distinguishing the different HBase tables. Finally the RDD is written to those tables with saveAsNewAPIHadoopFile.
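The table-name derivation in convertToHbasePut is just string manipulation on the row key, so it can be checked in isolation. A plain-Java mirror of that logic (HBase's Put and ImmutableBytesWritable are elided so it runs standalone; the class and method names are mine):

```java
// Plain-Java mirror of the row-key / table-name logic in convertToHbasePut:
// rowKey = appId + "_" + uid, and the table name is the appId prefix of the
// row key plus the "_clickcounts" suffix.
public class RowKeyDemo {

    static String rowKey(String appId, String uid) {
        return appId + "_" + uid;
    }

    static String tableName(String rowKey, String suffix) {
        // split("_")(0) in the Scala code: take everything before the first "_"
        return rowKey.split("_")[0] + suffix;
    }

    public static void main(String[] args) {
        String key = rowKey("app1", "u42");
        System.out.println(key);                            // app1_u42
        System.out.println(tableName(key, "_clickcounts")); // app1_clickcounts
    }
}
```

Note that this scheme assumes appId itself contains no underscore; otherwise the split would truncate it.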

Stepping into saveAsNewAPIHadoopFile, you will find it is no different from the MapReduce configuration, as follows:

def saveAsNewAPIHadoopFile(
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    conf: Configuration = self.context.hadoopConfiguration)
{
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  val job = new NewAPIHadoopJob(hadoopConf)
  job.setOutputKeyClass(keyClass)
  job.setOutputValueClass(valueClass)
  job.setOutputFormatClass(outputFormatClass)
  job.getConfiguration.set("mapred.output.dir", path)
  saveAsNewAPIHadoopDataset(job.getConfiguration)
}

The first parameter of this method is the output path; since we are writing to HBase here, it is passed in empty. The remaining parameters are keyClass, valueClass, outputFormatClass, and the jobConf.

The outputFormatClass must be MultiTableOutputFormat to guarantee that multiple tables can be written. And one more point to note: the HBase tables you are writing to must be created first.
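Since MultiTableOutputFormat does not create tables for you, each target table and its column family must exist before the job runs; for the example above (assuming an appId of app1 and the info family used in the code), that can be done in the HBase shell:

```
hbase> create 'app1_clickcounts', 'info'
```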

