Operating HBase from Spark


After seven years of development, HBase finally released version 1.0.0 at the end of February this year. This release offers some exciting features and, without sacrificing stability, introduces a new API. Although 1.0.0 remains compatible with the older API, you should familiarize yourself with the new API as early as possible and understand how to combine it with the currently popular Spark to read and write data. Given how little material there is, at home or abroad, on the new HBase 1.0.0 API, hence this article.

This article is divided into two parts. The first part covers basic CRUD operations with the new HBase API; the second part explains how to write RDDs from Spark into HBase tables and, conversely, how to load HBase tables into Spark as RDDs.

Environment configuration

To avoid unnecessary hassle from version inconsistencies, both the API and the HBase environment used here are version 1.0.0. HBase runs in standalone mode; distributed mode is used in much the same way, only the HBaseConfiguration needs to be adjusted.
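For reference, here is a minimal sketch of what that configuration difference looks like. The hostnames are placeholders, not from the original setup: in standalone mode the ZooKeeper quorum is just the single machine, while in fully distributed mode it lists the ZooKeeper ensemble.

import org.apache.hadoop.hbase.HBaseConfiguration

// Standalone mode: the quorum is the single host running HBase and ZooKeeper
val standaloneConf = HBaseConfiguration.create()
standaloneConf.set("hbase.zookeeper.quorum", "master")

// Distributed mode: only the quorum (and the client port, if non-default) changes.
// "zk1,zk2,zk3" is a placeholder for your ZooKeeper ensemble.
val clusterConf = HBaseConfiguration.create()
clusterConf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")
clusterConf.set("hbase.zookeeper.property.clientPort", "2181")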

In the development environment, use SBT to load the dependencies:

name := "SparkLearn"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.0.0"
libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.0.0"
libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.0.0"

CRUD Operations for HBase

The new API adds Connection; HBaseAdmin becomes Admin and HTable becomes Table, and both Admin and Table can only be obtained through a Connection. Creating a Connection is a heavyweight operation, and because Connection is thread-safe, it is recommended to use a singleton. Its factory method requires an HBaseConfiguration.

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("hbase.zookeeper.quorum", "master")

// Creating a Connection is heavyweight; it is thread-safe and is the entry point for operating HBase
val conn = ConnectionFactory.createConnection(conf)
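Since a singleton is recommended but not shown above, here is a minimal sketch under that assumption; the object name HBaseConn is illustrative, not part of the original code.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// One Connection per JVM, created lazily and shared; Connection is thread-safe.
object HBaseConn {
  private val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  conf.set("hbase.zookeeper.quorum", "master")

  lazy val connection: Connection = ConnectionFactory.createConnection(conf)
}

// Usage: obtain lightweight Table/Admin instances from the shared Connection as needed
// val table = HBaseConn.connection.getTable(TableName.valueOf("user"))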

Create a table

Use Admin to create and delete tables.

val userTable = TableName.valueOf("user")

// Create the "user" table
val tableDescr = new HTableDescriptor(userTable)
tableDescr.addFamily(new HColumnDescriptor("basic".getBytes))

val admin = conn.getAdmin
println("Creating table `user`.")
if (admin.tableExists(userTable)) {
  admin.disableTable(userTable)
  admin.deleteTable(userTable)
}
admin.createTable(tableDescr)
println("Done!")

Insert, query, scan, delete operations

Operations on HBase first need an action object such as Put, Get, or Delete, which is then passed to the corresponding method on Table.

try {
  // Get the "user" table
  val table = conn.getTable(userTable)

  try {
    // Prepare to insert a row with row key "id001"
    val p = new Put("id001".getBytes)
    // Specify column and value for the Put operation (the previous Put.add method is deprecated)
    p.addColumn("basic".getBytes, "name".getBytes, "wuchong".getBytes)
    // Submit
    table.put(p)

    // Query a row
    val g = new Get("id001".getBytes)
    val result = table.get(g)
    val value = Bytes.toString(result.getValue("basic".getBytes, "name".getBytes))
    println("GET id001: " + value)

    // Scan the data
    val s = new Scan()
    s.addColumn("basic".getBytes, "name".getBytes)
    val scanner = table.getScanner(s)

    try {
      for (r <- scanner) {
        println("Found row: " + r)
        println("Found value: " + Bytes.toString(r.getValue("basic".getBytes, "name".getBytes)))
      }
    } finally {
      // Make sure the scanner is closed
      scanner.close()
    }

    // Delete a row, operating in a similar way to Put
    val d = new Delete("id001".getBytes)
    d.addColumn("basic".getBytes, "name".getBytes)
    table.delete(d)

  } finally {
    if (table != null) table.close()
  }

} finally {
  conn.close()
}
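The snippet above omits its imports; the following is a sketch of the ones it relies on, assuming the 1.0.0 client declared in the SBT file. The JavaConversions import is what allows the for-comprehension to iterate the Java ResultScanner.

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete, Get, Put, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConversions._  // lets `for (r <- scanner)` iterate the Java ResultScanner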

Writing to HBase from Spark

The first task is to write data into HBase, for which we need PairRDDFunctions.saveAsHadoopDataset. Because HBase is not a file system, the saveAsHadoopFile method is of no use here.

def saveAsHadoopDataset(conf: JobConf): Unit
Output the RDD to any Hadoop-supported storage system, using a Hadoop JobConf object for that storage system.

This method takes a JobConf as a parameter, which is similar to a configuration object; in it we mainly need to specify the output format and the output table name.

Step 1: Create a JobConf.

// Define the HBase configuration
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("hbase.zookeeper.quorum", "master")

// Specify the output format and the output table name
val jobConf = new JobConf(conf, this.getClass)
jobConf.setOutputFormat(classOf[TableOutputFormat])
jobConf.set(TableOutputFormat.OUTPUT_TABLE, "user")
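One point worth noting: saveAsHadoopDataset goes through the old MapReduce API (JobConf), so the TableOutputFormat referenced here should be the org.apache.hadoop.hbase.mapred one. A sketch of the imports this write path assumes:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat  // mapred (old API) variant, matching JobConf
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf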

Step 2: RDD-to-table schema mapping
An HBase table schema generally looks like this:

row     cf:col_1    cf:col_2

In Spark we manipulate RDDs of tuples, for example (1,"lilei",14), (2,"hanmei",18). We need to convert RDD[(uid:Int, name:String, age:Int)] into RDD[(ImmutableBytesWritable, Put)]. So we define a convert function to do this conversion:

def convert(triple: (Int, String, Int)) = {
  val p = new Put(Bytes.toBytes(triple._1))
  p.addColumn(Bytes.toBytes("basic"), Bytes.toBytes("name"), Bytes.toBytes(triple._2))
  p.addColumn(Bytes.toBytes("basic"), Bytes.toBytes("age"), Bytes.toBytes(triple._3))
  (new ImmutableBytesWritable, p)
}

Step 3: Read the RDD and convert it

// Read the RDD data from somewhere and convert it
val rawData = List((1, "lilei", 14), (2, "hanmei", 18), (3, "someone", 38))
val localData = sc.parallelize(rawData).map(convert)

Step 4: Write to HBase using the saveAsHadoopDataset method

localData.saveAsHadoopDataset(jobConf)

Reading from HBase

To read from HBase, we mainly use the newAPIHadoopRDD API provided by SparkContext to load the contents of a table into Spark as an RDD.

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("hbase.zookeeper.quorum", "master")

// Set the name of the table to query
conf.set(TableInputFormat.INPUT_TABLE, "user")

val usersRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])

val count = usersRDD.count()
println("Users RDD Count: " + count)
usersRDD.cache()

// Traverse the output
usersRDD.foreach { case (_, result) =>
  val key = Bytes.toInt(result.getRow)
  val name = Bytes.toString(result.getValue("basic".getBytes, "name".getBytes))
  val age = Bytes.toInt(result.getValue("basic".getBytes, "age".getBytes))
  println("Row key: " + key + " Name: " + name + " Age: " + age)
}
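The read path, by contrast, uses the new MapReduce API, so TableInputFormat here is the org.apache.hadoop.hbase.mapreduce one. As a follow-up usage sketch (the variable names below are illustrative, not from the original code), the (ImmutableBytesWritable, Result) pairs can be mapped into a plain RDD of tuples for ordinary Spark processing:

import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

// Turn the (ImmutableBytesWritable, Result) pairs into (id, name, age) tuples
val users = usersRDD.map { case (_, result) =>
  val id   = Bytes.toInt(result.getRow)
  val name = Bytes.toString(result.getValue("basic".getBytes, "name".getBytes))
  val age  = Bytes.toInt(result.getValue("basic".getBytes, "age".getBytes))
  (id, name, age)
}
users.take(3).foreach(println)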
