Spark: reading multiple HBase tables into a single RDD

Environment: Spark 1.5.0, HBase 1.0.0.
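
The MapReduce integration classes used below (TableInputFormat and friends) ship in the HBase server artifact for this release. As a rough sketch of the build, assuming sbt (the exact artifact split is an assumption, not something stated in the original article):

// build.sbt (sketch): Spark core plus the HBase client and server artifacts;
// hbase-server provides org.apache.hadoop.hbase.mapreduce.TableInputFormat in HBase 1.0
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.5.0" % "provided",
  "org.apache.hbase"  % "hbase-client" % "1.0.0",
  "org.apache.hbase"  % "hbase-server" % "1.0.0"
)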

Scenario: HBase stores the behaviour data in one table per day, and the data for an arbitrary time period must be combined into a single RDD for subsequent computation.

Attempt 1: look for an API that reads several tables in one pass. The closest match is MultiTableInputFormat, which works well in MapReduce, but I could not find a way to use it when reading HBase into an RDD.
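
For reference only, here is a minimal sketch of how MultiTableInputFormat might be handed to newAPIHadoopRDD by serializing one Scan per table. This is not the path taken in the rest of the article, and the constants and helpers used here (MultiTableInputFormat.SCANS, Scan.SCAN_ATTRIBUTES_TABLE_NAME, TableMapReduceUtil.convertScanToString) are taken from the HBase 1.0 MapReduce API rather than from the original text, so treat it as an untested assumption:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.mapreduce.{MultiTableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes

// sc is a SparkContext, created as in the complete example further below.
// Build one serialized Scan per daily table; each Scan carries its table name as an attribute.
val scanStrings = Seq("behaviour_test_20160118", "behaviour_test_20160119").map { table =>
  val scan = new Scan()
  scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes(table))
  TableMapReduceUtil.convertScanToString(scan)
}

val conf = HBaseConfiguration.create()
conf.setStrings(MultiTableInputFormat.SCANS, scanStrings: _*)

// A single RDD covering every table listed above
val multiRdd = sc.newAPIHadoopRDD(conf, classOf[MultiTableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])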

Attempt 2: generate an RDD for each table and merge them with union. The code logic is as follows:

var totalRDD = xxx // read the first table
for { // loop over the remaining tables and merge each one into totalRDD
  val sRDD = xxx
  totalRDD.union(sRDD)
}

When this code ran on the cluster, totalRDD did not hold the expected union result: RDD.union returns a new RDD rather than modifying the one it is called on, so merely declaring totalRDD as a var is not enough.
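
As a minimal sketch of what the fix would look like (readTable, firstDate and remainingDates are hypothetical placeholders standing in for the per-table read shown in the complete code below), the result of union has to be assigned back:

var totalRDD = readTable(firstDate)   // readTable: hypothetical helper wrapping sc.newAPIHadoopRDD
for (date <- remainingDates) {
  val sRDD = readTable(date)
  totalRDD = totalRDD.union(sRDD)     // union returns a new RDD, so assign it back to totalRDD
}

Even with the reassignment, merging one table at a time chains the unions together, while SparkContext.union combines the whole list in a single step, which is what the next attempt does.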

Attempt 3: the same idea as attempt 2, but use SparkContext.union to merge all the RDDs at once. The code logic is as follows:

var rddSet: xxx = Set() // create the list of RDDs
dateSet.foreach(date => { // put the RDD of every table into the list
  val sRDD = xxx
  rddSet += sRDD
})
val totalRDD = sc.union(rddSet.toSeq) // merge all the RDDs in the list

The complete code is as follows:

import java.text.SimpleDateFormat
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import scala.collection.mutable.Set

/**
 * Time handling helpers
 */
object Htime {
  /**
   * Build the list of dates between the start and end dates (inclusive).
   * For example, for the start/end dates 20160118 and 20160120,
   * the date list is (20160118, 20160119, 20160120).
   *
   * @param sDate start date
   * @param eDate end date
   * @return date list
   */
  def getDateSet(sDate: String, eDate: String): Set[String] = {
    // The date list to be generated
    var dateSet: Set[String] = Set()

    // Date format
    val sdf = new SimpleDateFormat("yyyyMMdd")

    // Convert the start and end dates to milliseconds using the format above
    val sDate_ms = sdf.parse(sDate).getTime
    val eDate_ms = sdf.parse(eDate).getTime

    // Number of milliseconds in a day, used for the iteration below
    val day_ms = 24 * 60 * 60 * 1000

    // Generate the date list one day at a time
    var tm = sDate_ms
    while (tm <= eDate_ms) {
      val dateStr = sdf.format(tm)
      dateSet += dateStr
      tm = tm + day_ms
    }

    // Return the date list
    dateSet
  }
}
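
A quick check of the helper, using the dates from the comment above:

// Returns the set 20160118, 20160119, 20160120 (the mutable Set is unordered)
val dates = Htime.getDateSet("20160118", "20160120")
println(dates)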

/**
 * Read behaviour data from HBase and compute the group type
 */
object Classify {
  /**
   * @param args command-line arguments: the first is the start date of the
   *             behaviour data and the second is the end date, e.g. 20160118
   */
  def main(args: Array[String]) {
    // Exactly two command-line arguments are required
    if (args.length != 2) {
      System.err.println("Wrong number of arguments")
      System.err.println("Usage: Classify <start date> <end date>")
      System.exit(1)
    }

    // Start and end dates of the behaviour data, taken from the command line
    val startDate = args(0)
    val endDate = args(1)

    // Build the date list from the start and end dates
    // For example, for 20160118 and 20160120 the date list is (20160118, 20160119, 20160120)
    val dateSet = Htime.getDateSet(startDate, endDate)

    // Spark context
    val sparkConf = new SparkConf().setAppName("Classify")
    val sc = new SparkContext(sparkConf)

    // Initialize the HBase configuration
    val conf = HBaseConfiguration.create()

    // Read one RDD per date in the list into a Set, then merge them into a single RDD with SparkContext.union()
    var rddSet: Set[RDD[(ImmutableBytesWritable, Result)]] = Set()
    dateSet.foreach(date => {
      conf.set(TableInputFormat.INPUT_TABLE, "behaviour_test_" + date) // table name for this date
      val bRdd: RDD[(ImmutableBytesWritable, Result)] = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
        classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
        classOf[org.apache.hadoop.hbase.client.Result])
      rddSet += bRdd
    })

    val behavRdd = sc.union(rddSet.toSeq)

    behavRdd.collect().foreach(println)
  }
}
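
The pairs in behavRdd are (row key, Result). As a hedged illustration of the "subsequent computation" step, the values could be pulled out along these lines; the column family "cf" and qualifier "behaviour" are made-up placeholders, not from the original article:

import org.apache.hadoop.hbase.util.Bytes

// Continue from behavRdd in the listing above: extract the row key and one column value per row
val rows = behavRdd.map { case (key, result) =>
  val rowKey = Bytes.toString(key.copyBytes())
  val value  = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("behaviour")))
  (rowKey, value)
}
rows.take(10).foreach(println)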
