After seven years of development, HBase finally released version 1.0.0 at the end of February this year. This release offers some exciting features and, without sacrificing stability, introduces a brand-new API. Although 1.0.0 remains compatible with the old API, you should familiarize yourself with the new API as early as possible, and understand how to use it together with the currently popular Spark to read and write data. Given how little material there is, at home or abroad, about the new HBase 1.0.0 API, this article was written.
This article is divided into two parts. The first part explains how to use the new HBase API to perform basic CRUD operations; the second explains how to write RDDs in Spark into HBase tables and, conversely, how to load HBase tables into Spark as RDDs.
Environment configuration
To avoid unnecessary hassles caused by version inconsistencies, both the API and the HBase environment used here are version 1.0.0. HBase runs in standalone mode; distributed mode is used in essentially the same way, only the HBaseConfiguration needs to be adjusted.
The development environment uses SBT to load the dependencies:
Name: = "Sparklearn" Version: = "1.0" scalaversion: = "2.10.4" librarydependencies + = "Org.apache.spark" percent "Spark-core"% "1 .3.0 "Librarydependencies + =" org.apache.hbase "%" hbase-client "%" 1.0.0 "librarydependencies + =" Org.apache.hbase "%" Hbase-common "%" 1.0.0 "librarydependencies + =" org.apache.hbase "%" hbase-server "%" 1.0.0 "
CRUD Operations for HBase
The new API adds Connection; HBaseAdmin becomes Admin and HTable becomes Table, and both Admin and Table can only be obtained from a Connection. Creating a Connection is a heavyweight operation, and because a Connection is thread-safe, it is recommended to use a singleton. Its factory method requires an HBaseConfiguration.
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("hbase.zookeeper.quorum", "master")

// Creating a Connection is heavyweight work; it is thread-safe and is the entry point for operating HBase
val conn = ConnectionFactory.createConnection(conf)
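Since a singleton Connection is recommended, a minimal sketch of one way to hold it in Scala follows (the object name HBaseConn is an illustrative assumption, not from the original):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// A Scala object is initialized only once, so it can serve as a simple singleton holder
object HBaseConn {
  private val conf = HBaseConfiguration.create()
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  conf.set("hbase.zookeeper.quorum", "master")

  // Heavyweight to create, but thread-safe, so it can be shared by the whole application
  lazy val connection: Connection = ConnectionFactory.createConnection(conf)
}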
Create a table
Use Admin to create and delete tables:
val userTable = TableName.valueOf("user")

// Create the user table
val tableDescr = new HTableDescriptor(userTable)
tableDescr.addFamily(new HColumnDescriptor("basic".getBytes))
println("Creating table `user`.")
if (admin.tableExists(userTable)) {
  admin.disableTable(userTable)
  admin.deleteTable(userTable)
}
admin.createTable(tableDescr)
println("Done!")
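The admin handle used above is obtained from the Connection; a minimal sketch, assuming the conn created earlier:

// Admin can only be obtained from a Connection
val admin = conn.getAdmin
// ... use admin ...
// call admin.close() when it is no longer needed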
Insert, query, scan, delete operations
Operations on HBase first require creating an operation object such as Put, Get, or Delete, and then calling the corresponding method on the Table:
try {
  // Get the user table
  val table = conn.getTable(userTable)

  try {
    // Prepare to insert a row with key "id001"
    val p = new Put("id001".getBytes)
    // Specify column and value for the Put operation (the previous Put.add method was deprecated)
    p.addColumn("basic".getBytes, "name".getBytes, "wuchong".getBytes)
    // Submit
    table.put(p)

    // Query a row
    val g = new Get("id001".getBytes)
    val result = table.get(g)
    val value = Bytes.toString(result.getValue("basic".getBytes, "name".getBytes))
    println("GET id001: " + value)

    // Scan data
    val s = new Scan()
    s.addColumn("basic".getBytes, "name".getBytes)
    val scanner = table.getScanner(s)
    try {
      for (r <- scanner) {
        println("Found row: " + r)
        println("Found value: " + Bytes.toString(r.getValue("basic".getBytes, "name".getBytes)))
      }
    } finally {
      // Make sure the scanner is closed
      scanner.close()
    }

    // Delete a row, operating in a similar way to Put
    val d = new Delete("id001".getBytes)
    d.addColumn("basic".getBytes, "name".getBytes)
    table.delete(d)

  } finally {
    if (table != null) table.close()
  }

} finally {
  conn.close()
}
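For reference, the examples above assume roughly the following imports (an assumption, since the original does not list them); in particular, iterating over the ResultScanner with a Scala for-comprehension relies on the Java collection conversions:

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete, Get, Put, Scan}
import org.apache.hadoop.hbase.util.Bytes
// Needed so that "for (r <- scanner)" can iterate over the Java ResultScanner
import scala.collection.JavaConversions._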
Spark Operations on HBase

Write to HBase
The first step is writing data to HBase, for which we need to use PairRDDFunctions.saveAsHadoopDataset. Because HBase is not a file system, the saveAsHadoopFile method is of no use here.
def saveAsHadoopDataset(conf: JobConf): Unit
Output the RDD to any Hadoop-supported storage system, using a Hadoop JobConf object for that storage system
This method takes a JobConf as its parameter, similar to a configuration object; it mainly needs to specify the output format and the output table name.
Step 1: We first need to create a JobConf.
// Define the HBase configuration
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("hbase.zookeeper.quorum", "master")

// Specify the output format and the output table name
val jobConf = new JobConf(conf, this.getClass)
jobConf.setOutputFormat(classOf[TableOutputFormat])
jobConf.set(TableOutputFormat.OUTPUT_TABLE, "user")
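One point worth noting as an assumption about the surrounding imports: saveAsHadoopDataset works with the old MapReduce API, so the JobConf and the TableOutputFormat referenced here are the ones from the mapred packages (which is presumably also why hbase-server appears among the SBT dependencies above, since that artifact ships the Table*Format classes in HBase 1.0.0). A minimal sketch of the relevant imports:

// saveAsHadoopDataset takes a JobConf (old MapReduce API), so the old-API classes are used here
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.mapred.JobConf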
Step 2: RDD-to-table schema mapping
A table schema in HBase generally looks like this:
row cf:col_1 cf:col_2
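For example, with the sample data used in the following steps, rows in the user table would look like this (an illustration based on the basic column family defined earlier):

row    basic:name    basic:age
1      lilei         14
2      hanmei        18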
In Spark we manipulate RDDs of tuples, such as (1, "lilei", 14) and (2, "hanmei", 18). We need to convert RDD[(uid: Int, name: String, age: Int)] into RDD[(ImmutableBytesWritable, Put)], so we define a convert function to do this conversion:
def convert(triple: (Int, String, Int)) = {
  val p = new Put(Bytes.toBytes(triple._1))
  p.addColumn(Bytes.toBytes("basic"), Bytes.toBytes("name"), Bytes.toBytes(triple._2))
  p.addColumn(Bytes.toBytes("basic"), Bytes.toBytes("age"), Bytes.toBytes(triple._3))
  (new ImmutableBytesWritable, p)
}
Step 3: Read the RDD data and convert it
// Read RDD data from somewhere and convert it
val rawData = List((1, "lilei", 14), (2, "hanmei", 18), (3, "someone", 38))
val localData = sc.parallelize(rawData).map(convert)
Step 4: Write to HBase using the saveAsHadoopDataset method
localData.saveAsHadoopDataset(jobConf)
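Putting the four steps together, a minimal end-to-end write job might look like the sketch below (the object name SparkWriteHBase and the SparkConf/SparkContext setup are illustrative assumptions, not taken from the original):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative object name; not part of the original article
object SparkWriteHBase {

  def convert(triple: (Int, String, Int)) = {
    val p = new Put(Bytes.toBytes(triple._1))
    p.addColumn(Bytes.toBytes("basic"), Bytes.toBytes("name"), Bytes.toBytes(triple._2))
    p.addColumn(Bytes.toBytes("basic"), Bytes.toBytes("age"), Bytes.toBytes(triple._3))
    (new ImmutableBytesWritable, p)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkWriteHBase"))

    // Step 1: HBase configuration plus a JobConf with the output format and table name
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("hbase.zookeeper.quorum", "master")
    val jobConf = new JobConf(conf, this.getClass)
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "user")

    // Steps 2-3: build the RDD and convert tuples into (ImmutableBytesWritable, Put)
    val rawData = List((1, "lilei", 14), (2, "hanmei", 18))
    val localData = sc.parallelize(rawData).map(convert)

    // Step 4: write to HBase
    localData.saveAsHadoopDataset(jobConf)

    sc.stop()
  }
}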
Read HBase
To read from HBase in Spark, we mainly use the newAPIHadoopRDD API provided by SparkContext to load the contents of a table into Spark as an RDD.
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.property.clientPort", "2181")
conf.set("hbase.zookeeper.quorum", "master")

// Set the table name to query
conf.set(TableInputFormat.INPUT_TABLE, "user")

val usersRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
val count = usersRDD.count()
println("Users RDD Count: " + count)
usersRDD.cache()

// Traverse the RDD and print each row
usersRDD.foreach { case (_, result) =>
  val key = Bytes.toInt(result.getRow)
  val name = Bytes.toString(result.getValue("basic".getBytes, "name".getBytes))
  val age = Bytes.toInt(result.getValue("basic".getBytes, "age".getBytes))
  println("Row key: " + key + " Name: " + name + " Age: " + age)
}
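Note that newAPIHadoopRDD expects the new MapReduce API, so the TableInputFormat used here is the one from the mapreduce package; the read example assumes imports along these lines (an assumption, since the original does not list them). Also, Bytes.toInt(result.getRow) only works because the write example stored the row key as an Int via Bytes.toBytes.

// newAPIHadoopRDD uses the new MapReduce API, hence the mapreduce package
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes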