Integration of Spark/Kafka


Spark 1.3 adds a direct stream (createDirectStream) for handling Kafka messages. Here's how to use it:

KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
ssc: the StreamingContext
kafkaParams: Kafka parameters, including the Kafka brokers, etc.
topicsSet: the set of topics to read
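
A minimal usage sketch (the broker address, topic name, and variable names are illustrative assumptions, not from the original article):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// ssc is an already-created StreamingContext
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topicsSet = Set("events")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)   // a DStream of (key, value) pairs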

This method creates an input stream that reads messages directly from the Kafka brokers, rather than creating a receiver (that is, it does not read messages through Kafka's high-level consumer API).
This stream guarantees that each Kafka message is processed exactly once.
*-No receivers required: the stream does not create any receiver. It queries the Kafka offsets directly and does not need ZooKeeper to store the offsets that have already been consumed.
Of course, many existing Kafka monitoring tools read their data from ZooKeeper, so if you want to keep using Kafka's monitoring tools, you need to write your own code to update the ZooKeeper offsets.
You can refer to org.apache.spark.streaming.kafka.HasOffsetRanges (see the sketch after this list).
*-Failure recovery: after enabling the checkpoint mechanism, the driver can quickly recover from a failure.
When the driver fails, the current Kafka read offsets are also saved; on recovery, processing continues from those offsets, ensuring that no message is lost and each message is read and processed once.
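
A minimal sketch of both points, assuming a hypothetical broker address, topic name, and checkpoint directory (the ZooKeeper-update step itself is left as a placeholder comment):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamSketch {
  val checkpointDir = "/tmp/direct-kafka-checkpoint"   // hypothetical path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD { rdd =>
      // Each batch RDD knows exactly which offset range it covers.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { r =>
        // If your monitoring tools read offsets from ZooKeeper, push
        // r.topic / r.partition / r.untilOffset to ZooKeeper here.
        println(s"${r.topic} ${r.partition} ${r.fromOffset} -> ${r.untilOffset}")
      }
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On driver restart, getOrCreate rebuilds the context (including the saved
    // offsets) from the checkpoint instead of calling createContext again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}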

def createDirectStream[
  K: ClassTag,
  V: ClassTag,
  KD <: Decoder[K]: ClassTag,
  VD <: Decoder[V]: ClassTag](
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    topics: Set[String]
): InputDStream[(K, V)] = {
  // Create the message handler: map each MessageAndMetadata to a (key, value) pair.
  val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
  val kc = new KafkaCluster(kafkaParams)
  val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)

  (for {
    topicPartitions <- kc.getPartitions(topics).right
    leaderOffsets <- (if (reset == Some("smallest")) {
      // Find the leader for each topic/partition and read its smallest offset
      // (the smallest offset is not necessarily 0).
      kc.getEarliestLeaderOffsets(topicPartitions)
    } else {
      // Find the leader for each topic/partition and read its largest offset, so the
      // stream only processes new Kafka messages, a bit like the `tail` command.
      kc.getLatestLeaderOffsets(topicPartitions)
    }).right
  } yield {
    val fromOffsets = leaderOffsets.map { case (tp, lo) =>
      (tp, lo.offset)
    }
    // Create the stream from the StreamingContext, the starting offsets, etc.
    new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }).fold(
    errs => throw new SparkException(errs.mkString("\n")),
    ok => ok
  )
}
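
If you want the stream to start from the earliest available offsets instead of only new messages, set auto.offset.reset in the Kafka parameters. A small sketch (the broker address is an assumption):

// "smallest" is the Kafka 0.8 value for starting at the earliest available offset;
// without it, the direct stream defaults to the latest offsets (new messages only).
val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092,broker2:9092",
  "auto.offset.reset" -> "smallest")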

The generated DirectKafkaInputDStream:

class DirectKafkaInputDStream[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[K]: ClassTag,
  T <: Decoder[V]: ClassTag,
  R: ClassTag](
    @transient ssc_ : StreamingContext,
    val kafkaParams: Map[String, String],
    val fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: MessageAndMetadata[K, V] => R
) extends InputDStream[R](ssc_) with Logging {
  val maxRetries = context.sparkContext.getConf.getInt(
    "spark.streaming.kafka.maxRetries", 1)

  // Create the checkpoint data.
  protected[streaming] override val checkpointData =
    new DirectKafkaInputDStreamCheckpointData
...

  // The current Kafka offsets. They are advanced after each batch of messages is
  // processed, and they are also saved in checkpointData, so if the driver fails
  // they can be restored and no message is missed.
  protected var currentOffsets = fromOffsets

  // Get the leader of each partition of each topic and read the largest offset each
  // leader currently holds. If a Kafka leader has changed (e.g. because of a machine
  // failure), this is how the change is discovered as soon as possible.
  @tailrec
  protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
    val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
    // Either.fold would confuse @tailrec, do it manually
    if (o.isLeft) {
      val err = o.left.get.toString
      if (retries <= 0) {
        throw new SparkException(err)
      } else {
        log.error(err)
        Thread.sleep(kc.config.refreshLeaderBackoffMs)
        latestLeaderOffsets(retries - 1)
      }
    } else {
      o.right.get
    }
  }
...

  // The per-batch stream computation: generate a KafkaRDD from currentOffsets and untilOffsets.
  override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
    val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
    val rdd = KafkaRDD[K, V, U, T, R](
      context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

    currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
    Some(rdd)
  }

...


  private[streaming]
  class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) {
    def batchForTime = data.asInstanceOf[mutable.HashMap[
      Time, Array[OffsetRange.OffsetRangeTuple]]]

    override def update(time: Time) {
      batchForTime.clear()
      generatedRDDs.foreach { kv =>
        val a = kv._2.asInstanceOf[KafkaRDD[K, V, U, T, R]].offsetRanges.map(_.toTuple).toArray
        batchForTime += kv._1 -> a
      }
    }

    override def cleanup(time: Time) { }

    // Recover from failure: generatedRDDs needs to be rebuilt from the checkpointed offsets.
    override def restore() {
      // this is assuming that the topics don't change during execution, which is true currently
      val topics = fromOffsets.keySet
      val leaders = kc.findLeaders(topics).fold(
        errs => throw new SparkException(errs.mkString("\n")),
        ok => ok
      )

      batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
        logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}")
        generatedRDDs += t -> new KafkaRDD[K, V, U, T, R](
          context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, messageHandler)
      }
    }
  }

}
