Spark Streaming is a micro-batch stream processing framework built on top of Spark. HBase and Spark Streaming make good companions, because HBase can provide the following benefits alongside Spark Streaming:
A place to fetch reference data or profile data on the fly
A place to store counts or aggregates in a way that supports Spark Streaming's promise of processing each record only once
The HBase-Spark module's integration points with Spark Streaming are similar to its regular Spark integration points, in that the following operations are available directly on a Spark Streaming DStream.
bulkPut
For massively parallel sending of puts to HBase
bulkDelete
For massively parallel sending of deletes to HBase
bulkGet
For massively parallel sending of gets to HBase to create a new RDD
mapPartition
To run a Spark map function with a Connection object, allowing full access to HBase
hBaseRDD
To simplify a distributed scan to create an RDD
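As an illustration of how one of these operations is invoked on a DStream, here is a minimal sketch of a bulk delete. It assumes the hbase-spark module is on the classpath and that its HBaseDStreamFunctions implicits expose hbaseBulkDelete with the (context, table, conversion function, batch size) parameters shown; check this against the version you use. The names StreamingDeletes, toDelete, and deleteRowKeys are hypothetical, and the batch size of 4 is an arbitrary illustrative value.

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Delete
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.HBaseDStreamFunctions._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper object, not part of the hbase-spark module.
object StreamingDeletes {
  // Builds a Delete that removes the whole row identified by rowKey.
  def toDelete(rowKey: Array[Byte]): Delete = new Delete(rowKey)

  // For every micro-batch, converts each row key into a Delete and
  // sends the deletes to HBase from the executors.
  // The last argument (4) is an assumed batch size: how many deletes
  // are grouped into one call to the table.
  def deleteRowKeys(rowKeys: DStream[Array[Byte]],
                    hbaseContext: HBaseContext,
                    table: TableName): Unit =
    rowKeys.hbaseBulkDelete(hbaseContext, table, toDelete _, 4)
}
```

Keeping the conversion in a small named function makes it possible to check it without a running cluster, while the DStream call itself only runs once the StreamingContext has been started.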
Example bulkPut with DStream
The following is an example of bulkPut with DStreams. It is very close in feel to the RDD bulk put.
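import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.spark.HBaseDStreamFunctions._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import scala.collection.mutable

val sc = new SparkContext("local", "test")
// HBaseConfiguration.create() replaces the deprecated constructor.
val config = HBaseConfiguration.create()
val hbaseContext = new HBaseContext(sc, config)
val ssc = new StreamingContext(sc, Milliseconds(200))

val rdd1 = ...
val rdd2 = ...
// Each record is (row key, Array of (column family, qualifier, value)).
val queue = mutable.Queue[RDD[(Array[Byte], Array[(Array[Byte],
    Array[Byte], Array[Byte])])]]()
queue += rdd1
queue += rdd2
val dStream = ssc.queueStream(queue)

dStream.hbaseBulkPut(
  hbaseContext,
  TableName.valueOf(tableName),
  (putRecord) => {
    // Build one Put per record and add every cell to it.
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) => put.addColumn(putValue._1, putValue._2, putValue._3))
    put
  })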
The hbaseBulkPut function takes three inputs: the hbaseContext, which carries the broadcast configuration information that links us to the HBase Connections in the executors; the name of the table we are putting data into; and a function that converts a record in the DStream into an HBase Put object.
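The third input, the record-to-Put conversion, can also be written as a named function, which makes it easier to test on its own. The following is a minimal sketch under the assumption that records have the same shape as in the example above, that is, a row key paired with an array of (column family, qualifier, value) triples. The names PutConversion, ColumnValue, toPut, and sample are hypothetical, as are the sample row key and column names.

```scala
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical helper object, not part of the hbase-spark module.
object PutConversion {
  // One value to write: (column family, qualifier, value), all raw bytes.
  type ColumnValue = (Array[Byte], Array[Byte], Array[Byte])

  // Converts one DStream record (row key plus its column values)
  // into a single HBase Put for that row.
  def toPut(record: (Array[Byte], Array[ColumnValue])): Put = {
    val (rowKey, values) = record
    val put = new Put(rowKey)
    values.foreach { case (family, qualifier, value) =>
      put.addColumn(family, qualifier, value)
    }
    put
  }

  // Illustrative record: row "row1" with one value in column cf:a.
  val sample: (Array[Byte], Array[ColumnValue]) =
    (Bytes.toBytes("row1"),
     Array((Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("1"))))
}
```

With this in place, the call could be written as dStream.hbaseBulkPut(hbaseContext, TableName.valueOf(tableName), PutConversion.toPut _), assuming the DStream's element type matches the record type above.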