Spark 1.3 adds a direct stream (createDirectStream) for consuming Kafka messages. Here's how to use it:

KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

ssc: the StreamingContext.
kafkaParams: Kafka parameters, including the list of Kafka brokers, etc.
topicsSet: the set of topics to read.

This method creates an input stream that reads messages directly from the Kafka brokers, without creating any receiver (that is, it does not go through Kafka's high-level consumer API). The stream guarantees that each Kafka message is processed only once.
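Before diving into the implementation, here is a minimal end-to-end usage sketch; the broker addresses, topic name, and batch interval below are placeholder assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamExample")
    val ssc = new StreamingContext(conf, Seconds(10))            // assumed batch interval

    // "metadata.broker.list" points the direct stream at the brokers; no ZooKeeper is needed here
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // assumed addresses
    val topicsSet = Set("my-topic")                               // assumed topic name

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    stream.map(_._2).count().print()   // e.g. count messages per batch

    ssc.start()
    ssc.awaitTermination()
  }
}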
- No receivers required: the stream does not create any receiver. It queries Kafka directly for offsets and does not need ZooKeeper to store the offsets that have already been consumed. Of course, many existing Kafka monitoring tools read their data from ZooKeeper, so if you want to keep using such tools you have to write the offsets back to ZooKeeper yourself; org.apache.spark.streaming.kafka.HasOffsetRanges gives you access to the offsets of each batch (see the first sketch after this list).
- Failure recovery: with checkpointing enabled, the driver can recover quickly after a failure. The Kafka offsets currently being read are saved as part of the checkpoint, and after recovery processing resumes from those offsets, so no message is lost and each message is read and processed once (see the second sketch after this list).
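First, a sketch of reading the offset ranges of each batch so they can be written back to ZooKeeper. The saveToZookeeper helper is hypothetical; plug in whatever ZooKeeper client your monitoring setup uses.

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// `stream` is the DStream returned by createDirectStream in the example above
stream.foreachRDD { rdd =>
  // every RDD produced by the direct stream carries the offset range it covers
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  offsetRanges.foreach { o =>
    // hypothetical helper: persist o.untilOffset under the consumer group's path in ZooKeeper,
    // so that ZooKeeper-based monitoring tools keep seeing consumption progress
    saveToZookeeper(o.topic, o.partition, o.untilOffset)
  }
}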
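Second, a minimal sketch of driver recovery via StreamingContext.getOrCreate; the checkpoint directory, broker address, and topic are assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RecoverableDirectStream {
  val checkpointDir = "hdfs:///tmp/direct-stream-checkpoint"   // assumed path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableDirectStream")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)   // batch offsets are saved here via DStreamCheckpointData

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // assumed address
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))                              // assumed topic
    stream.map(_._2).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // after a driver failure, the context (including the saved Kafka offsets) is rebuilt
    // from the checkpoint instead of calling createContext() again
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}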
The implementation of createDirectStream:

def createDirectStream[
  K: ClassTag,
  V: ClassTag,
  KD <: Decoder[K]: ClassTag,
  VD <: Decoder[V]: ClassTag](
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    topics: Set[String]
): InputDStream[(K, V)] = {
  // create the message handler: by default each record becomes a (key, message) pair
  val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
  val kc = new KafkaCluster(kafkaParams)
  val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
  (for {
    topicPartitions <- kc.getPartitions(topics).right
    leaderOffsets <- (if (reset == Some("smallest")) {
      // find the leader for each topic/partition and read its smallest offset
      // (the smallest offset is not necessarily 0)
      kc.getEarliestLeaderOffsets(topicPartitions)
    } else {
      // find the leader for each topic/partition and read its largest offset;
      // the stream will then only process newly arriving messages, much like tail -f
      kc.getLatestLeaderOffsets(topicPartitions)
    }).right
  } yield {
    val fromOffsets = leaderOffsets.map { case (tp, lo) =>
      (tp, lo.offset)
    }
    // create the stream from the StreamingContext, starting offsets, message handler, etc.
    new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }).fold(
    errs => throw new SparkException(errs.mkString("\n")),
    ok => ok
  )
}
The generated DirectKafkaInputDStream:
class DirectKafkaInputDStream[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[K]: ClassTag,
  T <: Decoder[V]: ClassTag,
  R: ClassTag](
    @transient ssc_ : StreamingContext,
    val kafkaParams: Map[String, String],
    val fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: MessageAndMetadata[K, V] => R
) extends InputDStream[R](ssc_) with Logging {

  val maxRetries = context.sparkContext.getConf.getInt(
    "spark.streaming.kafka.maxRetries", 1)

  // create the checkpoint data used to save the offset ranges of each batch
  protected[streaming] override val checkpointData =
    new DirectKafkaInputDStreamCheckpointData
...
The current Kafka offsets: after each batch of messages has been processed, currentOffsets is advanced. These offsets are also saved in checkpointData, so if the driver fails they can be restored and no message is missed.

  protected var currentOffsets = fromOffsets
  // For each topic/partition, find the current Kafka leader and read the largest offset
  // it holds. If a leader has changed (e.g. because of a broker failure), this is where
  // the change is detected as early as possible.
  @tailrec
  protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
    val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
    // Either.fold would confuse @tailrec, do it manually
    if (o.isLeft) {
      val err = o.left.get.toString
      if (retries <= 0) {
        throw new SparkException(err)
      } else {
        log.error(err)
        Thread.sleep(kc.config.refreshLeaderBackoffMs)
        latestLeaderOffsets(retries - 1)
      }
    } else {
      o.right.get
    }
  }
...
Per-batch computation: compute generates a KafkaRDD covering the range from currentOffsets up to untilOffsets.

  override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
    val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
    val rdd = KafkaRDD[K, V, U, T, R](
      context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)
    currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
    Some(rdd)
  }
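The clamp call above limits how far a single batch may read, but its body is not shown in this walkthrough. As a rough sketch of what it does in Spark 1.3 (treat this as an approximation rather than the exact source): when spark.streaming.kafka.maxRatePerPartition is configured, each partition's until-offset is capped at currentOffset plus the per-partition message budget; otherwise the leader's latest offset is used unchanged.

  // approximate sketch of the rate-limiting helper used by compute(); maxMessagesPerPartition is
  // an Option[Long] derived from spark.streaming.kafka.maxRatePerPartition and the batch duration
  protected def clamp(
      leaderOffsets: Map[TopicAndPartition, LeaderOffset]): Map[TopicAndPartition, LeaderOffset] = {
    maxMessagesPerPartition.map { mmp =>
      leaderOffsets.map { case (tp, lo) =>
        // never read more than mmp messages from one partition in a single batch
        tp -> lo.copy(offset = Math.min(currentOffsets(tp) + mmp, lo.offset))
      }
    }.getOrElse(leaderOffsets)
  }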
...
  private[streaming]
  class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) {
    def batchForTime = data.asInstanceOf[mutable.HashMap[
      Time, Array[OffsetRange.OffsetRangeTuple]]]

    override def update(time: Time) {
      batchForTime.clear()
      generatedRDDs.foreach { kv =>
        val a = kv._2.asInstanceOf[KafkaRDD[K, V, U, T, R]].offsetRanges.map(_.toTuple).toArray
        batchForTime += kv._1 -> a
      }
    }

    override def cleanup(time: Time) { }

    // recover from failure: generatedRDDs has to be rebuilt from the saved offset ranges
    override def restore() {
      // this is assuming that the topics don't change during execution, which is true currently
      val topics = fromOffsets.keySet
      val leaders = kc.findLeaders(topics).fold(
        errs => throw new SparkException(errs.mkString("\n")),
        ok => ok
      )

      batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
        logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}")
        generatedRDDs += t -> new KafkaRDD[K, V, U, T, R](
          context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, messageHandler)
      }
    }
  }
}
Integration of Spark/Kafka