Spark Streaming: Implementing Exactly-once Semantics


Source: http://www.cnblogs.com/cssdongl — please credit the source when reprinting.

Translated from: http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/

I came across this article while looking up information on the topic and found it quite good. Although it covers the older Spark 1.3 APIs, it is still a useful reference, so I translated it in my spare time according to my own understanding. Corrections are welcome wherever the translation falls short.

The new release, Apache Spark 1.3, includes new RDD and DStream implementations for reading data from Apache Kafka. As the main author of these features, I would like to explain their implementation and usage. You may be interested because you can benefit from the following:

1> More even use of Spark cluster resources when consuming from Kafka
2> Control over message delivery semantics
3> Delivery guarantees without relying on a write-ahead log in HDFS
4> Access to message metadata


I assume you are familiar with the Spark Streaming documentation and the Kafka documentation. All code samples are in Scala, but much of the API is Java-friendly as well.

Basic Usage

The new Kafka RDD and DStream APIs live in the spark-streaming-kafka module.

SBT Dependency

libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.3.0"

Maven dependencies:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-kafka_2.10</artifactId>
  <version>1.3.0</version>
</dependency>

To read data from Kafka in a Spark Streaming job, use KafkaUtils.createDirectStream:
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf, Seconds(60))

// hostname:port for Kafka brokers, not Zookeeper
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
val topics = Set("sometopic", "anothertopic")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

The call to createDirectStream returns a stream of tuples formed from the key and value of each Kafka message. Its return type is InputDStream[(K, V)], where K and V in this case are both String. The subclass implementation of that return type is DirectKafkaInputDStream. The createDirectStream method also has overloads that let you access message metadata and specify an exact starting offset per topic and partition.

If you want to read Kafka data from a non-streaming Spark job, use KafkaUtils.createRDD:

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val sc = new SparkContext(new SparkConf)

// hostname:port for Kafka brokers, not Zookeeper
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
val offsetRanges = Array(
  OffsetRange("sometopic", 0, 110, 220),
  OffsetRange("sometopic", 1, 100, 313),
  OffsetRange("anothertopic", 0, 456, 789))

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

The call to createRDD returns a single RDD of (key, value) tuples, one per Kafka message in the specified batch of offset ranges. Its return type is RDD[(K, V)], and the subclass implementation is KafkaRDD. The createRDD method also has overloads that let you access message metadata and specify the current leader for each topic and partition.

Implementation
DirectKafkaInputDStream is a stream of batches. Each batch corresponds to a KafkaRDD, and each partition of a KafkaRDD corresponds to an OffsetRange. Most of this implementation is private, but it is still useful to understand.

OffsetRange
OffsetRange represents the lower and upper bounds of a particular sequence of messages in a given Kafka topic and partition. Its data structure looks like this:

Offsetrange ("Visits", 2, 300, 310)

This line identifies the 10 messages from offset 300 (inclusive) to offset 310 (exclusive) in partition 2 of the "visits" topic. Note that it does not actually contain the contents of the messages; it is just a way of identifying the range.

Also note that because Kafka ordering is defined only on a per-partition basis, the following line

Offsetrange ("Visits", 3, 300, 310)

may refer to messages from a completely different time period, even though the offsets are the same as above, because the partition is different.

KafkaRDD
Recall that an RDD is defined by the following (a simplified sketch follows this list):
1> a method for dividing the work into partitions (getPartitions)
2> a method for doing the work for a given partition (compute)
3> a list of parent RDDs; KafkaRDD is an input rather than a transformation, so it has no parents
4> optionally, a Partitioner defining how keys are hashed; KafkaRDD does not define one
5> optionally, a list of preferred hosts for a given partition, in order to push computation to where the data lives (getPreferredLocations)
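
For reference, here is a simplified sketch of those members. The names follow org.apache.spark.rdd.RDD, but this is an abbreviation for illustration, not the real class definition:

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// abbreviated sketch of the RDD members listed above
abstract class SketchRDD[T] {
  protected def getPartitions: Array[Partition]                             // 1> divide the work into partitions
  def compute(split: Partition, context: TaskContext): Iterator[T]          // 2> do the work for one partition
  protected def getDependencies: Seq[Dependency[_]]                         // 3> parent RDDs; empty for an input RDD
  val partitioner: Option[Partitioner] = None                               // 4> optional partitioner for keys
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil  // 5> optional data locality hints
}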


The KafkaRDD constructor takes an array of OffsetRanges and a map of the current leader host and port for every topic and partition. The reason for separating out the leader information is to allow the KafkaUtils.createRDD convenience method to call the KafkaRDD constructor without you needing to know the leaders. In that case, createRDD uses the list of hosts specified in metadata.broker.list as the initial contact points for calling the necessary Kafka metadata APIs to look up the leaders. That initial lookup happens only once, in the Spark driver process.

The getPartitions method of KafkaRDD takes each OffsetRange in the array and turns it into an RDD partition by adding the leader's host and port information. The important thing to notice here is that there is a 1:1 correspondence between Kafka partitions and RDD partitions. This means the degree of Spark parallelism (at least for reading messages) is directly tied to the degree of Kafka parallelism.
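
As a small illustration of that 1:1 mapping (reusing the sc and kafkaParams from the createRDD example above; the offsets are made up):

// three OffsetRanges, covering partitions 0, 1 and 2 of "visits", produce an RDD with exactly three partitions
val ranges = Array(
  OffsetRange("visits", 0, 0, 100),
  OffsetRange("visits", 1, 0, 100),
  OffsetRange("visits", 2, 0, 100))

val visitsRdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, ranges)

assert(visitsRdd.partitions.length == ranges.length)  // one Spark partition per Kafka OffsetRange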


The getPreferredLocations method uses the Kafka leader for a given partition as the preferred host. I do not run my Spark executors on the same hosts as Kafka, so if you do, let me know how that works out for you.

The compute method runs in the Spark executor processes. It uses the Kafka SimpleConsumer to connect to the leader for the given topic and partition, then issues repeated fetch requests to read messages for the specified range of offsets.

Each message is converted using the messageHandler argument of the constructor. messageHandler is a user-defined function from the Kafka MessageAndMetadata type to an arbitrary result type, defaulting to a tuple of key and value. In most cases it is more efficient to access topic and offset metadata on a per-partition basis (see the discussion of HasOffsetRanges below), but if you really need to associate each message with its offset, you can do so.
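
As a sketch of that last point (reusing the ssc and kafkaParams from the basic usage example; the starting offsets here are placeholders), a messageHandler can keep each message's offset alongside its key and value:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// placeholder starting offsets; in practice these would come from your own offset store
val fromOffsets = Map(TopicAndPartition("sometopic", 0) -> 0L)

val streamWithOffsets = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (Long, String, String)](
  ssc, kafkaParams, fromOffsets,
  // the messageHandler runs once per message, on the executors
  (mmd: MessageAndMetadata[String, String]) => (mmd.offset, mmd.key, mmd.message))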

The key point about compute is that, because the offset ranges are defined in advance on the driver and are read directly from Kafka by the executors, the messages returned by a particular KafkaRDD are deterministic. There is therefore no important state kept on the executors, and no notion of committing read offsets to Apache ZooKeeper, as there is with prior solutions that use the Kafka high-level consumer.

Because the compute operation is deterministic, it is generally safe to retry a task if it fails. For example, if the Kafka leader is lost, the compute method sleeps for the amount of time defined by the Kafka parameter refresh.leader.backoff.ms, then the task fails and the normal Spark task retry mechanism handles it. On attempts after the first, the new leader lookup logic runs as part of the executor compute method.

DirectKafkaInputDStream

If you have existing code that obtains and manages offsets, the KafkaRDD returned by KafkaUtils.createRDD is usable for batch jobs. In most cases, however, you will probably use KafkaUtils.createDirectStream, which returns a DirectKafkaInputDStream. Similar to an RDD, a DStream is defined by:
1> a list of parent DStreams; again, this is an input DStream, not a transformation, so it has no parents
2> the time interval at which the stream generates batches; this stream uses the interval defined on the streaming context
3> a method for generating an RDD for a given time interval (compute)

The compute method runs on the driver. It connects to the leader for each topic and partition, not to read messages but to get the latest available offsets. It then defines a KafkaRDD with offset ranges spanning from the end point of the previous batch up to the latest leader offsets.

To define the starting point of the very first batch, you can either specify exact offsets per TopicAndPartition, or use the Kafka parameter auto.offset.reset, which may be set to largest or smallest (it defaults to largest). For rate limiting, you can use the Spark configuration variable spark.streaming.kafka.maxRatePerPartition to set the maximum number of messages per partition per batch.
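
As a rough sketch of those two knobs (the rate value is illustrative, and the broker list is the same placeholder used earlier):

import org.apache.spark.SparkConf

// cap the number of messages read per partition per batch (Spark configuration)
val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

// with no explicit starting offsets, auto.offset.reset controls where the first batch begins
val kafkaParams = Map(
  "metadata.broker.list" -> "localhost:9092,anotherhost:9092",
  "auto.offset.reset" -> "smallest")  // start from the beginning of the retained log; the default is "largest"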

Once the KafkaRDD for a given time interval is defined, it executes exactly as described above for the batch usage case. Unlike prior Kafka DStream implementations, there are no long-running receiver tasks occupying a core per stream regardless of message volume. For our use cases at Kixer, it is common to have important but low-volume topics in the same job as high-volume topics. With the direct stream, the low-volume partitions result in smaller tasks that finish quickly and free up the nodes to process other partitions in the batch. Keeping the topics logically separate while still balancing cluster usage is a considerable win.

The significant difference from the batch usage case is that there is some important state that varies over time, namely the offset ranges generated at each time interval. Executor or Kafka leader failures are not a big deal, as discussed above, but if the driver fails, offset ranges will be lost unless they are stored somewhere. I will discuss this in more detail under delivery semantics below, but in short you have three choices:
1> If you don't care about lost or duplicated messages, don't worry about it; just restart the stream from the earliest or latest offset
2> Checkpoint the stream, in which case the offset ranges (not the messages, just the offset range definitions) are stored in the checkpoint
3> Store the offset ranges yourself, and provide the correct starting offsets when restarting the stream

Again, no consumer offsets are stored in ZooKeeper. If you want interoperability with existing Kafka monitoring tools that talk to ZooKeeper directly, you will need to store offsets into ZooKeeper yourself (this does not mean ZooKeeper needs to be your system of record for offsets; you can just duplicate them there).

Note that because Kafka is being treated as a durable store of messages, not a transient network source, you do not need to duplicate messages into HDFS for error recovery. This design does have some implications, however. The first is that you cannot read messages that no longer exist in Kafka, so make sure your retention is sufficient. The second is that you cannot read messages that do not yet exist in Kafka. In other words, the consumers on the executors do not poll for new messages; the driver simply checks with the leaders at every batch interval, so there is some inherent latency.

HasOffsetRanges

Another implementation detail is the public interface HasOffsetRanges, with a single method returning an array of OffsetRange. KafkaRDD implements this interface, allowing you to obtain topic and offset information on a per-partition basis.

val stream = KafkaUtils.createDirectStream(...)
...
stream.foreachRDD { rdd =>
  // Cast the rdd to an interface that lets us get a collection of offset ranges
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.mapPartitionsWithIndex { (i, iter) =>
    // index to get the correct offset range for the rdd partition we're working on
    val osr: OffsetRange = offsets(i)

    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
    ...

The reason for using this indirection layer is that the static types used by DStream methods like foreachRDD and transform refer only to RDD, not to the underlying implementation type (which, in this case, is private). Because the DStream returned by createDirectStream generates batches of KafkaRDD, you can safely cast it to HasOffsetRanges. Also note that, because of the 1:1 correspondence between offset ranges and RDD partitions, the index of an RDD partition corresponds to the index into the array returned by offsetRanges.
Delivery semantics

First, understand the Kafka documentation on delivery semantics. If you have already read it, go read it again. In short: consumer delivery semantics are up to you, not Kafka.

Second, understand that Spark does not guarantee exactly-once semantics for output actions. When the Spark Streaming guide talks about at-least-once, it only means that a given item in an RDD is included in a calculated value once, in a purely functional sense. Any output operation with side effects (that is, anything you do in foreachRDD to save the result) may be repeated, because any stage of the process can fail and be retried.


Third, understand that Spark checkpoints may not be recoverable, for example when you need to change the application code in order to restart the stream. This situation may improve in version 1.4, but be aware that it is an issue. I have been bitten by it before, and you may be too. Anywhere I mention "checkpoint the stream" as an option, consider the risk involved. Also note that any windowed transformations are going to rely on checkpointing anyway.

Finally, I will repeat that any semantics beyond at-most-once require sufficient log retention in Kafka. If you are seeing something like OffsetOutOfRangeException, it is probably because you under-provisioned Kafka storage, not because of an error in Spark or Kafka.

Given all of that, how do you obtain the equivalent of the semantics you want?

At-most-once
This can be useful in cases where you are sending results to something that is not a system of record, you do not want duplicates, and it is not worth the hassle of ensuring that messages are never lost. An example is sending summary statistics over UDP, since it is an unreliable protocol to begin with.

To get at-most-once semantics, do all of the following:
1> Set spark.task.maxFailures to 1, so the job dies as soon as a task fails
2> Make sure spark.speculation is false (the default), so multiple copies of a task are not run speculatively
3> When the job dies, start the stream back up with the Kafka parameter auto.offset.reset set to largest, so it skips to the current end of the log

This will mean you lose messages on restart, but at least you should not get replays. Test carefully if it actually matters to you that a message is never repeated; because it is not a common use case, I have not provided sample code for it.
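
Taken together, those settings amount to roughly the following configuration sketch (the values and broker list are illustrative, not a complete job):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "1")   // a single task failure kills the job
  .set("spark.speculation", "false")    // the default; no speculative duplicate tasks

val kafkaParams = Map(
  "metadata.broker.list" -> "localhost:9092",
  "auto.offset.reset" -> "largest")     // on restart, skip to the current end of the log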

At-least-once
Here you may get duplicate messages, but you will not lose messages. An example is sending internal email alerts for relatively rare occurrences in the stream: getting duplicate urgent alerts within a short time frame is much better than not getting them at all.

Your basic options are:
1> Checkpoint the stream, or
2> Set auto.offset.reset to smallest and restart the job. This replays the whole log from the beginning of your retention, so you had better keep retention relatively short, or be sure you are OK with handling duplicate messages.

Checkpointing the stream is the basis of the next option, so see its sample code there.

Exactly-once using Idempotent writes

Idempotent writes make duplicate messages safe, turning at-least-once into the equivalent of exactly-once. The typical way of doing this is by having a unique key of some kind (either embedded in the message, or using topic/partition/offset as the key) and storing the results according to that key. Relying on a per-message unique key means this is useful for transforming or filtering individually valuable messages, but not necessarily for aggregations over multiple messages.

There is a complete sample of this idea in IdempotentExample.scala. It uses Postgres for the sake of consistency with the next example, but any storage system that allows unique keys could be used.

stream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // make sure connection pool is set up on the executor before writing
    SetupJdbc(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)

    iter.foreach { case (key, msg) =>
      DB.autoCommit { implicit session =>
        // the unique key for idempotency is just the text of the message itself, for example purposes
        sql"insert into idem_data(msg) values (${msg})".update.apply
      }
    }
  }
}

In the case of a failure, the output action above can safely be retried. Checkpointing the stream ensures that the offset ranges are saved as they are generated. Checkpointing is accomplished in the usual way: define the configuration of the streaming context (ssc) and the setup of the stream in a function, call

ssc.checkpoint(checkpointDir)

and then return the ssc. See the streaming guide for more details.
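A sketch of that pattern, with a placeholder checkpoint directory and the stream setup elided:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // placeholder path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf, Seconds(60))
  // ... define the direct stream and its output actions here ...
  ssc.checkpoint(checkpointDir)
  ssc
}

// on restart, recover the offset ranges (and the rest of the DStream graph) from the checkpoint if one exists
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()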
Exactly-once using Transactional writes
For data stores that support transactions, saving offsets in the same transaction as the results keeps the two in sync even in failure situations. If you are careful to detect repeated or skipped offset ranges, rolling back the transaction prevents duplicated or lost messages from affecting your results. This gives the equivalent of exactly-once semantics, and is straightforward to use even for aggregations.
TransactionalExample.scala is a complete Spark job implementing this idea. It uses Postgres, but any data store with transactional semantics could be used.
The first important point is that the stream is started using the last successfully committed offsets as the beginning point. This allows for failure recovery:

// begin from the offsets committed to the database
val fromOffsets = DB.readOnly { implicit session =>
  sql"select topic, part, off from txn_offsets".
    map { resultSet =>
      (TopicAndPartition(resultSet.string(1), resultSet.int(2)), resultSet.long(3))
    }.list.apply().toMap
}

val stream: InputDStream[Long] = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, Long](
  ssc, kafkaParams, fromOffsets,
  // we're just going to count messages, we don't care about the contents, so convert each message to a 1
  (mmd: MessageAndMetadata[String, String]) => 1L)

For the very first run of the job, the table can be pre-loaded with appropriate starting offsets.

As mentioned in the discussion of HasOffsetRanges, the example accesses the offset ranges on a per-partition basis, using mapPartitionsWithIndex. Note that mapPartitionsWithIndex is a transformation, and there is no equivalent foreachPartitionWithIndex action. RDD transformations are generally lazy, so unless you add an output action of some kind, Spark will never schedule the job to do anything. Calling foreach on the resulting RDD with an empty body is sufficient to force it. Also notice that some iterator methods, such as map, are lazy. If you are setting up transient state, such as a network or database connection, the connection may already be closed by the time the map is fully evaluated. In that case, be sure to use a method like foreach, which eagerly consumes the iterator.

rdd.mapPartitionsWithIndex { (i, iter) =>
  // set up some connection

  iter.foreach { msg =>
    // use the connection
  }

  // close the connection

  Iterator.empty
}.foreach {
  // Without an action, the job won't get scheduled, so empty foreach to force it
  // This is a little awkward, but there is no foreachPartitionWithIndex method on RDDs
  (_: Nothing) => ()
}

The final point to note in the example is that it is important to ensure that saving the results and saving the offsets either both succeed or both fail. Storing the offsets should fail if the previously committed offset does not equal the beginning of the current offset range; this prevents gaps or duplicates. Kafka semantics ensure there are no gaps in the messages within an offset range (if you are especially concerned, you can verify this by comparing the size of the offset range against the number of messages).

// localTx is transactional: if either the metric update or the offset update fails, neither will be committed
DB.localTx { implicit session =>
  // store metric data
  val metricRows = sql"""
    update txn_data set metric = metric + ${metric}
    where topic = ${osr.topic}
  """.update.apply()
  if (metricRows != 1) {
    throw new Exception("...")
  }

  // store offsets
  val offsetRows = sql"""
    update txn_offsets set off = ${osr.untilOffset}
    where topic = ${osr.topic} and part = ${osr.partition} and off = ${osr.fromOffset}
  """.update.apply()
  if (offsetRows != 1) {
    throw new Exception("...")
  }
}

The example code throws an exception, which results in the transaction being rolled back. Other failure-handling strategies may be appropriate, as long as they also result in the transaction being rolled back.

Future Improvements
Although this feature is considered experimental in Spark 1.3, the underlying KafkaRDD design has been in production at Kixer for months. It currently handles billions of messages per day, with batch sizes ranging from 2 seconds to 5 minutes. That said, there are known areas for improvement (and probably a few unknown ones as well).
1> Connection pooling. Currently, Kafka consumer connections are created as needed; pooling should help efficiency. Hopefully this can be done in a way that integrates nicely with the ongoing work towards a Kafka producer API for Spark.
2> Kafka metadata API. The class for interacting with Kafka is currently private, which means that if you want low-level access to Kafka metadata, you need to duplicate some of that work. This is partly because the Kafka consumer offset APIs are still in flux; once that code has proven to be stable, it would be nice to have a user-facing API for interacting with Kafka metadata.
3> Batch generation policies. Right now, rate limiting is the only tuning available for defining the next batch of the stream. We have some use cases that involve larger adjustments, such as a fixed time delay. A flexible way of defining batch generation policies might be useful.

If there are other improvements you can think of, please let me know.

That completes the translation. I spent quite a while on some of the more complex sentences, working them out from context and related background knowledge; corrections are welcome wherever the result falls short.

