Store offset with HBase


Offset Management for Apache Kafka with Apache Spark Streaming

June 21, 2017

By Guru Medasani, Jordan Hambleton

Categories: CDH, Kafka, Spark


An ingest pattern that we commonly see being adopted at Cloudera customers is Apache Spark Streaming applications which read data from Kafka. Streaming data continuously from Kafka has many benefits, such as the capability to gather insights faster. However, users must take into consideration the management of Kafka offsets in order to recover their streaming application from failures. In this post, we provide an overview of offset management and the following topics:

  • Storing offsets in external data stores
      • Checkpoints
      • HBase
      • ZooKeeper
      • Kafka
  • Not managing offsets


Overview of Offset Management

Spark Streaming integration with Kafka allows users to read messages from a single Kafka topic or from multiple Kafka topics. A Kafka topic receives messages across a distributed set of partitions where they are stored. Each partition maintains the messages it has received in sequential order, where each message is identified by an offset, also known as a position. Developers can use offsets in their application to control the position from which their Spark Streaming job reads, but doing so requires offset management.



Managing offsets is most beneficial for achieving data continuity over the lifecycle of the stream process. For example, upon shutting down the stream application or after an unexpected failure, offset ranges will be lost unless they are persisted in a non-volatile data store. Further, without the offsets of the partitions being read, the Spark Streaming job will not be able to continue processing data from where it last left off.


The above diagram depicts the general flow for managing offsets in your Spark Streaming application. Offsets can be managed in several ways, but generally follow this common sequence of steps.

1. Upon initialization of the Direct DStream, a map of offsets for each topic's partitions can be specified to tell the Direct DStream where to start reading from in each partition. The offsets specified are read from the same location that step 4 below writes to.
2. The batch of messages can then be read and processed.
3. After processing, the results can be stored, as well as the offsets. The dotted line around the store results and commit offsets actions simply highlights a sequence of steps that users may want to review further if a special scenario with stricter delivery semantics is required. This could include a review of idempotent operations, or storing the results together with their offsets in an atomic operation.
4. Lastly, any external durable data store such as HBase, Kafka, HDFS, or ZooKeeper is used to keep track of which messages have already been processed.
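The sequence above can be sketched without Spark or Kafka at all. The following is a minimal illustration only: `OffsetStore`, `Topic`, and `runBatch` are hypothetical names, the store is an in-memory map standing in for a durable data store, and "processing" is just upper-casing each message.

```scala
import scala.collection.mutable

object OffsetFlowSketch {
  // Hypothetical in-memory stand-in for a durable offset store (HBase, ZooKeeper, ...).
  final class OffsetStore {
    private val committed = mutable.Map.empty[Int, Long]
    def load(): Map[Int, Long] = committed.toMap
    def commit(offsets: Map[Int, Long]): Unit = committed ++= offsets
  }

  // Hypothetical stand-in for a topic: partition id -> messages in offset order.
  type Topic = Map[Int, Vector[String]]

  /** Runs one batch and returns the processed results. */
  def runBatch(topic: Topic, store: OffsetStore, maxPerPartition: Int): Vector[String] = {
    // Step 1: start each partition from its stored offset, or from 0 if none is stored.
    val from = store.load()
    val ranges: Map[Int, (Long, Long)] = topic.map { case (partition, msgs) =>
      val start = from.getOrElse(partition, 0L)
      val until = math.min(start + maxPerPartition, msgs.length.toLong)
      (partition, (start, until))
    }
    // Step 2: read and process the batch (here: upper-case each message).
    val results = ranges.toVector.sortBy(_._1).flatMap { case (partition, (start, until)) =>
      topic(partition).slice(start.toInt, until.toInt).map(_.toUpperCase)
    }
    // Step 3 would store `results` here.
    // Step 4: record the exclusive end of each range as the next starting offset.
    store.commit(ranges.map { case (partition, (_, until)) => (partition, until) })
    results
  }
}
```

Calling `runBatch` repeatedly against the same `OffsetStore` resumes each partition where the previous call stopped, which is the continuity that a durable store would provide across restarts.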


Different scenarios can be incorporated into the above steps depending upon business requirements. Spark's programmatic flexibility gives users fine-grained control to store offsets before or after periodic phases of processing. Consider an application where the following is occurring: a Spark Streaming application is reading messages from Kafka, performing a lookup against HBase data to enrich or transform the messages, and then posting the enriched messages to another topic or a separate system (e.g. another messaging system, back to HBase, Solr, a DBMS, etc.). In this case, we consider the messages processed only when they are successfully posted to the secondary system.
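One way to picture the enrich-and-post scenario is a handler that advances the stored offset only after the post succeeds. This is a sketch under stated assumptions, not the authors' implementation: `Committed`, `handleBatch`, the `lookup` map (standing in for the HBase lookup), and the `post` function (standing in for the secondary system) are all hypothetical.

```scala
import scala.util.{Failure, Success, Try}

object CommitAfterPostSketch {
  // Hypothetical holder for one partition's last committed offset.
  final class Committed(var offset: Long)

  /** Enriches a batch, posts it, and commits `untilOffset` only if the post succeeded. */
  def handleBatch(
      messages: Seq[String],
      untilOffset: Long,
      lookup: Map[String, String],     // stand-in for the HBase lookup
      post: Seq[String] => Try[Unit],  // stand-in for the secondary system
      committed: Committed): Boolean = {
    val enriched = messages.map(m => m + "|" + lookup.getOrElse(m, "unknown"))
    post(enriched) match {
      case Success(_) =>
        committed.offset = untilOffset // the messages now count as processed
        true
      case Failure(_) =>
        false // offset is left untouched, so the batch would be read again
    }
  }
}
```

Because a failed post leaves the offset unchanged, the same messages may be posted more than once after a retry, so the secondary system would need to tolerate duplicates.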


Storing Offsets Externally

In this section, we explore different options for persisting offsets externally in a durable data store.


For the approaches mentioned in this section, if using the spark-streaming-kafka-0-10 library, we recommend that users set enable.auto.commit to false. This configuration is only applicable to this version, and setting enable.auto.commit to true means that offsets are committed automatically with a frequency controlled by the config auto.commit.interval.ms. In Spark Streaming, setting this to true commits the offsets to Kafka automatically when messages are read from Kafka, which doesn't necessarily mean that Spark has finished processing those messages. To enable precise control over committing offsets, set the Kafka parameter enable.auto.commit to false and follow one of the options below.
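As an illustration, a Kafka parameter map with automatic commits disabled might look like the following. The broker address and group ID are placeholders, and the deserializers are given as class-name strings so that the snippet has no dependency on the Kafka client library.

```scala
object KafkaParamsSketch {
  // Hypothetical broker address and group id; replace with your own values.
  val kafkaParams: Map[String, Object] = Map(
    "bootstrap.servers" -> "broker1:9092",
    "group.id" -> "example-streaming-group",
    "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
    // Disable automatic commits so offsets are stored only after processing.
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )
}
```

A map of this shape is what would be passed as the Kafka parameters when creating the direct stream.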


Spark Streaming Checkpoints

Enabling Spark Streaming's checkpoint is the simplest method for storing offsets, as it is readily available within Spark's framework. Streaming checkpoints are purposely designed to save the state of the application, in our case to HDFS, so that it can be recovered upon failure.



Checkpointing the Kafka stream causes the offset ranges to be stored in the checkpoint. If there is a failure, the Spark Streaming application can begin reading the messages from the checkpointed offset ranges. However, Spark Streaming checkpoints are not recoverable across applications or Spark upgrades and hence are not very reliable, especially if you are using this mechanism for a critical production application. We do not recommend managing offsets via Spark checkpoints.


Storing Offsets in HBase

HBase can be used as an external data store to preserve offset ranges in a reliable fashion. Storing offset ranges externally gives Spark Streaming applications the ability to restart and replay messages from any point in time, as long as the messages are still retained in Kafka.



With HBase's generic design, the application is able to leverage the row key and column structure to store offset ranges for multiple Spark Streaming applications and Kafka topics within the same table. In this example, each entry written to the table can be uniquely distinguished by a row key containing the topic name, the consumer group ID, and the Spark Streaming batchTime.milliseconds. Although batchTime.milliseconds isn't required, it does provide insight into historical batches and the offsets which were processed. New records accumulate in the table, which in the design below is configured to expire them automatically after 30 days (the TTL of 2592000 seconds in the DDL). Below are the HBase table DDL and structure.



DDL


create 'stream_kafka_offsets', {NAME=>'offsets', TTL=>2592000}




RowKey Layout


row:              <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS>
column family:    offsets
qualifier:        <PARTITION_ID>
value:            <OFFSET_ID>
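The row key layout above can be modeled with a small helper that builds and parses keys. This is only a sketch: `OffsetRowKey`, `Key`, `build`, and `parse` are hypothetical names, and the parser assumes that neither the topic name nor the group ID contains a colon.

```scala
import scala.util.Try

object OffsetRowKey {
  final case class Key(topic: String, groupId: String, batchTimeMs: Long)

  /** Formats a key as <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS>. */
  def build(key: Key): String =
    s"${key.topic}:${key.groupId}:${key.batchTimeMs}"

  /** Parses a row key; returns None if it does not have exactly three valid parts. */
  def parse(rowKey: String): Option[Key] =
    rowKey.split(":", -1) match {
      case Array(topic, groupId, ts) => Try(Key(topic, groupId, ts.toLong)).toOption
      case _                         => None
    }
}
```

Keeping the batch time as the last component means all rows for one topic and consumer group share a common prefix, so a prefix scan over `<TOPIC_NAME>:<GROUP_ID>:` would return that pair's batch history.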



For each batch of messages, the saveOffsets() function is used to persist the last read offsets for a given Kafka topic in HBase.


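import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.Time
// OffsetRange as provided by the spark-streaming-kafka-0-10 integration
import org.apache.spark.streaming.kafka010.OffsetRange

/*
Save offsets for each batch into HBase
*/
def saveOffsets(TOPIC_NAME: String, GROUP_ID: String, offsetRanges: Array[OffsetRange],
                hbaseTableName: String, batchTime: Time): Unit = {
  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.addResource("src/main/resources/hbase-site.xml")
  val conn = ConnectionFactory.createConnection(hbaseConf)
  try {
    val table = conn.getTable(TableName.valueOf(hbaseTableName))
    // Row key layout: <TOPIC_NAME>:<GROUP_ID>:<EPOCH_BATCHTIME_MS>
    val rowKey = TOPIC_NAME + ":" + GROUP_ID + ":" + String.valueOf(batchTime.milliseconds)
    val put = new Put(Bytes.toBytes(rowKey))
    // One column per partition, holding the offset the next batch should start from
    for (offset <- offsetRanges) {
      put.addColumn(Bytes.toBytes("offsets"), Bytes.toBytes(offset.partition.toString),
        Bytes.toBytes(offset.untilOffset.toString))
    }
    table.put(put)
    table.close()
  } finally {
    conn.close() // release the connection even if the write fails
  }
}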


At the beginning of the streaming job, the getLastCommittedOffsets() function is used to read from HBase the Kafka topic offsets that were last processed when the Spark Streaming application stopped. The function handles the following common scenarios when returning Kafka topic partition offsets.

Case 1: The streaming job is started for the first time. The function queries ZooKeeper to find the number of partitions in the given topic. It then returns '0' as the offset for all the topic partitions.

Case 2: A long-running streaming job has been stopped and new partitions have been added to the Kafka topic. The function queries ZooKeeper to find the current number of partitions in the given topic. For all the old topic partitions, offsets are set to the latest offsets found in HBase. For all the new topic partitions, it returns '0' as the offset.

Case 3: A long-running streaming job has been stopped and there are no changes to the topic partitions. In this case, the latest offsets found in HBase are returned as the offsets for each topic partition.
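The three cases reduce to one rule: use the stored offset where one exists, and 0 otherwise. The sketch below shows only that decision logic, with the ZooKeeper and HBase lookups replaced by plain arguments; `StartingOffsets` and `resolve` are hypothetical names, not part of the article's getLastCommittedOffsets() function.

```scala
object StartingOffsets {
  /**
   * @param partitionCount current number of partitions in the topic
   *                       (obtained from ZooKeeper in the article)
   * @param stored         latest offsets found in HBase, keyed by partition id;
   *                       empty when the job has never run
   * @return the starting offset for every current partition
   */
  def resolve(partitionCount: Int, stored: Map[Int, Long]): Map[Int, Long] =
    (0 until partitionCount).map { partition =>
      // Known partitions resume from HBase; partitions with no entry start at 0.
      partition -> stored.getOrElse(partition, 0L)
    }.toMap
}
```

An empty `stored` map gives Case 1, a map covering only some partitions gives Case 2, and a map covering every partition gives Case 3.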

When new partitions are added to a topic after the streaming application has started, only messages from the topic partitions that were detected at the start of the streaming application are ingested. For the streaming job to read messages from the newly added topic partitions, the job has to be restarted.


