Operating HBase with Spark


Spark is a computational framework. In the Spark environment it can operate not only on local files and HDFS files, but also on HBase.

In many enterprises the data source is HBase, which means reading HBase data is a common need. In this article we practice manipulating HBase from the Spark shell.

First, the environment:

Hadoop 2.2.0

HBase 0.96.2-hadoop2, r1581096

Spark 1.0.0

This article assumes the environment is already set up: a Spark on Hadoop cluster has been built.

Pay attention to the compatibility between Hadoop 2.2.0 and the HBase version; HBase 0.96.2 is used here.

Second, the principle

Operating HBase from Spark is in fact based on the same principle as operating HBase from the Java client:

Both Scala and Java are JVM-based languages. Once the HBase classes are on the classpath you can invoke them directly; other JVM frameworks work similarly.

The similarity: both connect to the HMaster as a client and then use HBase's API to manipulate HBase.

The difference: Spark can expose the data read from HBase as an RDD, and therefore take advantage of Spark's parallel computation.

Third, Practice

1. First check the dependent jar packages. If HBase's jar packages are not on spark-shell's classpath, you need to add them. Setup: add SPARK_CLASSPATH=/home/victor/software/hbase/lib/* to spark-env.sh, then start bin/spark-shell again. After startup completes and the worker has registered successfully, the HBase classes can be imported. A quick classpath check is sketched below.
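A quick way to confirm the HBase jars are really visible from spark-shell is to load one of the client classes by name. A minimal sketch (the class is imported properly in the steps that follow):

// Throws ClassNotFoundException if the HBase jars are not on the classpath
Class.forName("org.apache.hadoop.hbase.client.HBaseAdmin")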
2. Operating HBase

2.1 HBase data

In HBase there is a scores table with 2 column families (cf), course and grade. The data looks like the following:
hbase(main):001:0> scan 'scores'
ROW          COLUMN+CELL
 jim         column=course:art, timestamp=1404142440676, value=67
 jim         column=course:math, timestamp=1404142434405, value=77
 jim         column=grade:, timestamp=1404142422653, value=3
 tom         column=course:art, timestamp=1404142407018, value=88
 tom         column=course:math, timestamp=1404142398986, value=97
 tom         column=grade:, timestamp=1404142383206, value=5
 shengli     column=course:art, timestamp=1404142468266, value=17
 shengli     column=course:math, timestamp=1404142461952, value=27
 shengli     column=grade:, timestamp=1404142452157, value=8
3 row(s) in 0.3230 seconds
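For reference, rows like these could also be written from the same shell with the plain HBase client API. This is only a hedged sketch, not part of the original walkthrough; it uses HTable and Put from the 0.96 client and assumes hbase-site.xml is on the classpath:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}

val conf = HBaseConfiguration.create()          // picks up hbase-site.xml from the classpath
val table = new HTable(conf, "scores")          // client handle for the scores table
val p = new Put("jim".getBytes)                 // row key
p.add("course".getBytes, "art".getBytes, "67".getBytes)   // cf=course, qualifier=art, value=67
table.put(p)
table.close()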

2.2 Initializing connection parameters
scala> import org.apache.spark._
import org.apache.spark._

scala> import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.rdd.NewHadoopRDD

scala> import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.conf.Configuration

scala> import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.HBaseConfiguration

scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

scala> val configuration = HBaseConfiguration.create()   // initialize the configuration
configuration: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hbase-default.xml, hbase-site.xml

scala> configuration.set("hbase.zookeeper.property.clientPort", "2181")   // set the ZooKeeper client port

scala> configuration.set("hbase.zookeeper.quorum", "localhost")   // set the ZooKeeper quorum

scala> configuration.set("hbase.master", "localhost:60000")   // set the HBase master

scala> configuration.addResource("/home/victor/software/hbase/conf/hbase-site.xml")   // load HBase's own configuration file
scala> configuration.set(TableInputFormat.INPUT_TABLE, "scores")   // set the table to read
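Besides INPUT_TABLE, TableInputFormat accepts further properties that narrow the scan before the data ever reaches Spark. A hedged sketch (constant names as found in the HBase 0.96 mapreduce API; verify against your version):

// Optional: restrict what the underlying scan returns
configuration.set(TableInputFormat.SCAN_COLUMN_FAMILY, "course")   // read only the course column family
configuration.set(TableInputFormat.SCAN_ROW_START, "jim")          // start the scan at this row key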
scala> import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HBaseAdmin
scala> val hadmin = new HBaseAdmin(configuration)   // instantiate the HBase admin
2014-07-01 00:39:24,649 INFO [main] zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0xc7eea5, quorum=localhost:2181, baseZNode=/hbase
2014-07-01 00:39:24,707 INFO [main] zookeeper.RecoverableZooKeeper (RecoverableZooKeeper.java:<init>) - Process identifier=hconnection-0xc7eea5 connecting to ZooKeeper ensemble=localhost:2181
2014-07-01 00:39:24,753 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(966)) - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2014-07-01 00:39:24,755 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(849)) - Socket connection established to localhost/127.0.0.1:2181, initiating session
2014-07-01 00:39:24,938 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1207)) - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x146ed61c4ef0015, negotiated timeout = 40000
hadmin: org.apache.hadoop.hbase.client.HBaseAdmin = org.apache.hadoop.hbase.client.HBaseAdmin@...
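The hadmin handle can be used for administrative checks before reading any data. A small hedged sketch using HBaseAdmin methods from the same client API:

hadmin.tableExists("scores")                    // should be true for the table scanned above
hadmin.listTables().map(_.getNameAsString)      // names of all user tables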

Next, use the Hadoop API to create an RDD:
scala> val hrdd = sc.newAPIHadoopRDD(configuration, classOf[TableInputFormat],
     |   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
     |   classOf[org.apache.hadoop.hbase.client.Result])
2014-07-01 00:51:06,683 WARN [main] util.SizeEstimator (Logging.scala:logWarning) - Failed to check whether UseCompressedOops is set; assuming yes
2014-07-01 00:51:06,936 INFO [main] storage.MemoryStore (Logging.scala:logInfo) - ensureFreeSpace(85877) called with curMem=0, maxMem=308910489
2014-07-01 00:51:06,946 INFO [main] storage.MemoryStore (Logging.scala:logInfo) - Block broadcast_0 stored as values to memory (estimated size 83.9 KB, free 294.5 MB)
hrdd: org.apache.spark.rdd.RDD[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = NewHadoopRDD[0] at newAPIHadoopRDD at <console>:22
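hrdd behaves like any other RDD, so a couple of quick checks are possible right away. A hedged sketch (TableInputFormat normally creates one input partition per HBase region):

hrdd.partitions.size                 // number of partitions, typically one per region of the scores table
hrdd.map(_._2.size).reduce(_ + _)    // total number of cells across all rows (Result.size is the cell count)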

Version one (on the latest versions the following code may not work; see version two below). Read a record: here we take one entry. You can see the format matches the NewHadoopRDD we configured: the key is an ImmutableBytesWritable and the value is an HBase Result.
scala> hrdd take 1
2014-07-01 00:51:50,371 INFO [main] spark.SparkContext (Logging.scala:logInfo) - Starting job: take at <console>:25
2014-07-01 00:51:50,423 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Got job 0 (take at <console>:25) with 1 output partitions (allowLocal=true)
2014-07-01 00:51:50,425 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Final stage: Stage 0 (take at <console>:25)
2014-07-01 00:51:50,426 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Parents of final stage: List()
2014-07-01 00:51:50,477 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Missing parents: List()
2014-07-01 00:51:50,478 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Computing the requested partition locally
2014-07-01 00:51:50,509 INFO [Local computation of job 0] rdd.NewHadoopRDD (Logging.scala:logInfo) - Input split: localhost:,
2014-07-01 00:51:50,894 INFO [main] spark.SparkContext (Logging.scala:logInfo) - Job finished: take at <console>:25, took 0.522612687 s
res5: Array[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = Array((6a 69 6d,keyvalues={jim/course:art/1404142440676/Put/vlen=2/mvcc=0, jim/course:math/1404142434405/Put/vlen=2/mvcc=0, jim/grade:/1404142422653/Put/vlen=1/mvcc=0}))

Get the Result object:
scala> val res = hrdd.take(1)
2014-07-01 01:09:13,486 INFO [main] spark.SparkContext (Logging.scala:logInfo) - Starting job: take at <console>:24
2014-07-01 01:09:13,487 INFO [spark-akka.actor.default-dispatcher-15] scheduler.DAGScheduler (Logging.scala:logInfo) - Got job 4 (take at <console>:24) with 1 output partitions (allowLocal=true)
2014-07-01 01:09:13,487 INFO [spark-akka.actor.default-dispatcher-15] scheduler.DAGScheduler (Logging.scala:logInfo) - Final stage: Stage 4 (take at <console>:24)
2014-07-01 01:09:13,487 INFO [spark-akka.actor.default-dispatcher-15] scheduler.DAGScheduler (Logging.scala:logInfo) - Parents of final stage: List()
2014-07-01 01:09:13,488 INFO [spark-akka.actor.default-dispatcher-15] scheduler.DAGScheduler (Logging.scala:logInfo) - Missing parents: List()
2014-07-01 01:09:13,488 INFO [spark-akka.actor.default-dispatcher-15] scheduler.DAGScheduler (Logging.scala:logInfo) - Computing the requested partition locally
2014-07-01 01:09:13,488 INFO [Local computation of job 4] rdd.NewHadoopRDD (Logging.scala:logInfo) - Input split: localhost:,
2014-07-01 01:09:13,504 INFO [main] spark.SparkContext (Logging.scala:logInfo) - Job finished: take at <console>:24, took 0.018069267 s
res: Array[(org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result)] = Array((6a 69 6d,keyvalues={jim/course:art/1404142440676/Put/vlen=2/mvcc=0, jim/course:math/1404142434405/Put/vlen=2/mvcc=0, jim/grade:/1404142422653/Put/vlen=1/mvcc=0}))

scala> res(0)
res33: (org.apache.hadoop.hbase.io.ImmutableBytesWritable, org.apache.hadoop.hbase.client.Result) = (6a 69 6d,keyvalues={jim/course:art/1404142440676/Put/vlen=2/mvcc=0, jim/course:math/1404142434405/Put/vlen=2/mvcc=0, jim/grade:/1404142422653/Put/vlen=1/mvcc=0})

scala> res(0)._2
res34: org.apache.hadoop.hbase.client.Result = keyvalues={jim/course:art/1404142440676/Put/vlen=2/mvcc=0, jim/course:math/1404142434405/Put/vlen=2/mvcc=0, jim/grade:/1404142422653/Put/vlen=1/mvcc=0}

scala> val rs = res(0)._2
rs: org.apache.hadoop.hbase.client.Result = keyvalues={jim/course:art/1404142440676/Put/vlen=2/mvcc=0, jim/course:math/1404142434405/Put/vlen=2/mvcc=0, jim/grade:/1404142422653/Put/vlen=1/mvcc=0}

scala> rs.    (press Tab to list the methods available on Result)
asInstanceOf             cellScanner              containsColumn           containsEmptyColumn
containsNonEmptyColumn   copyFrom                 getColumn                getColumnCells
getColumnLatest          getColumnLatestCell      getExists                getFamilyMap
getMap                   getNoVersionMap          getRow                   getValue
getValueAsByteBuffer     isEmpty                  isInstanceOf             list
listCells                loadValue                raw                      rawCells
setExists                size                     toString                 value
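If only one specific cell of the Result is needed, there is no need to iterate at all. A hedged sketch using Result.getValue(family, qualifier), which returns the bytes of the newest matching cell:

val art = new String(rs.getValue("course".getBytes, "art".getBytes))   // for the jim row above this should be "67"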

Iterate over this record and take out the value of each cell:
scala> val kv_array = rs.raw
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
kv_array: Array[org.apache.hadoop.hbase.KeyValue] = Array(jim/course:art/1404142440676/Put/vlen=2/mvcc=0, jim/course:math/1404142434405/Put/vlen=2/mvcc=0, jim/grade:/1404142422653/Put/vlen=1/mvcc=0)

Traversing Records
scala> for (keyValue <- kv_array) println("rowkey:" + new String(keyValue.getRow) + " cf:" + new String(keyValue.getFamily()) + " column:" + new String(keyValue.getQualifier) + " value:" + new String(keyValue.getValue()))
warning: there were 4 deprecation warning(s); re-run with -deprecation for details
rowkey:jim cf:course column:art value:67
rowkey:jim cf:course column:math value:77
rowkey:jim cf:grade column: value:3
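The deprecation warnings above come from raw, getRow, getFamily, getQualifier and getValue, which this HBase version deprecates in favour of the Cell API. The same traversal can be written with rawCells and CellUtil; a hedged sketch:

import org.apache.hadoop.hbase.CellUtil

for (cell <- rs.rawCells())
  println("rowkey:" + new String(CellUtil.cloneRow(cell)) +
          " cf:" + new String(CellUtil.cloneFamily(cell)) +
          " column:" + new String(CellUtil.cloneQualifier(cell)) +
          " value:" + new String(CellUtil.cloneValue(cell)))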

Count the number of records:
scala> hrdd.count
2014-07-01 01:26:03,133 INFO [main] spark.SparkContext (Logging.scala:logInfo) - Starting job: count at <console>:25
2014-07-01 01:26:03,134 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Got job 5 (count at <console>:25) with 1 output partitions (allowLocal=false)
2014-07-01 01:26:03,134 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Final stage: Stage 5 (count at <console>:25)
2014-07-01 01:26:03,134 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Parents of final stage: List()
2014-07-01 01:26:03,135 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Missing parents: List()
2014-07-01 01:26:03,166 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Submitting Stage 5 (NewHadoopRDD[0] at newAPIHadoopRDD at <console>:22), which has no missing parents
2014-07-01 01:26:03,397 INFO [spark-akka.actor.default-dispatcher-16] scheduler.DAGScheduler (Logging.scala:logInfo) - Submitting 1 missing tasks from Stage 5 (NewHadoopRDD[0] at newAPIHadoopRDD at <console>:22)
2014-07-01 01:26:03,401 INFO [spark-akka.actor.default-dispatcher-16] scheduler.TaskSchedulerImpl (Logging.scala:logInfo) - Adding task set 5.0 with 1 tasks
2014-07-01 01:26:03,427 INFO [spark-akka.actor.default-dispatcher-16] scheduler.FairSchedulableBuilder (Logging.scala:logInfo) - Added task set TaskSet_5 tasks to pool default
2014-07-01 01:26:03,439 INFO [spark-akka.actor.default-dispatcher-5] scheduler.TaskSetManager (Logging.scala:logInfo) - Starting task 5.0:0 as TID 0 on executor 0: 192.168.2.105 (PROCESS_LOCAL)
2014-07-01 01:26:03,469 INFO [spark-akka.actor.default-dispatcher-5] scheduler.TaskSetManager (Logging.scala:logInfo) - Serialized task 5.0:0 as 1305 bytes in 7 ms
2014-07-01 01:26:11,015 INFO [Result resolver thread-0] scheduler.TaskSetManager (Logging.scala:logInfo) - Finished TID 0 in 7568 ms on 192.168.2.105 (progress: 1/1)
2014-07-01 01:26:11,017 INFO [Result resolver thread-0] scheduler.TaskSchedulerImpl (Logging.scala:logInfo) - Removed TaskSet 5.0, whose tasks have all completed, from pool default
2014-07-01 01:26:11,036 INFO [spark-akka.actor.default-dispatcher-4] scheduler.DAGScheduler (Logging.scala:logInfo) - Completed ResultTask(5, 0)
2014-07-01 01:26:11,057 INFO [spark-akka.actor.default-dispatcher-4] scheduler.DAGScheduler (Logging.scala:logInfo) - Stage 5 (count at <console>:25) finished in 7.605 s
2014-07-01 01:26:11,067 INFO [main] spark.SparkContext (Logging.scala:logInfo) - Job finished: count at <console>:25, took 7.933270634 s
res71: Long = 3
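Because hrdd is a normal RDD, the rest of Spark's API applies as well. For example, a hedged sketch that counts only the rows that actually contain a course:math column (containsColumn is part of the Result API):

hrdd.filter { case (_, result) =>
  result.containsColumn("course".getBytes, "math".getBytes)   // keep rows that have a course:math cell
}.count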

Version two:
import scala.collection.JavaConverters._   // needed for .asScala on the java.util.List returned by getColumn

hrdd.map(tuple => tuple._2)
    .map(result => (result.getRow, result.getColumn("course".getBytes(), "art".getBytes())))
    .map(row => {
      (row._1.map(_.toChar).mkString,
       row._2.asScala.reduceLeft {
         (a, b) => if (a.getTimestamp > b.getTimestamp) a else b
       }.getValue.map(_.toChar).mkString)
    })
    .take(10)

This yields, for each row, the row key together with the latest value of the course:art column.
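On newer client versions where getColumn and KeyValue are deprecated, roughly the same result can be produced with the Cell-based calls instead. A hedged sketch (getColumnLatestCell and CellUtil as in the 0.96+ client API; verify against your version):

import org.apache.hadoop.hbase.CellUtil

hrdd.map(_._2)
    .map(result => (
      new String(result.getRow),
      Option(result.getColumnLatestCell("course".getBytes, "art".getBytes))   // null when the column is absent
        .map(cell => new String(CellUtil.cloneValue(cell)))
        .getOrElse("")
    ))
    .take(10)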


Fourth, Summary

Operating HBase from Spark follows the same general process as the Java HBase client: connect to the HMaster as a client, then use the Java API to manipulate HBase.

Spark simply adds the RDD abstraction on top and benefits from Scala's concise syntax, which improves programming efficiency.


--eof--
Original article. When reproducing, please cite the source: http://blog.csdn.net/oopsoom/article/details/36071323
