Spark 1.3 adds a direct stream (createDirectStream) for consuming Kafka messages. Here's how to use it:

KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

ssc: the StreamingContext.
kafkaParams: Kafka parameters, including the list of Kafka brokers, etc.
topicsSet: the set of topics to read.

This method creates an input stream that reads messages directly from the Kafka brokers, without creating any receiver (that is, it does not go through Kafka's high-level consumer API). The stream guarantees that each Kafka message is processed only once.
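Before diving into the implementation, here is a minimal end-to-end usage sketch; the broker addresses, topic name, and batch interval below are placeholder assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamExample")
    val ssc = new StreamingContext(conf, Seconds(10))            // assumed batch interval

    // "metadata.broker.list" points the direct stream at the brokers; no ZooKeeper is needed here
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // assumed addresses
    val topicsSet = Set("my-topic")                               // assumed topic name

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    stream.map(_._2).count().print()   // e.g. count messages per batch

    ssc.start()
    ssc.awaitTermination()
  }
}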
- No receivers required: the stream does not create any receiver. It queries Kafka directly for offsets and does not need ZooKeeper to store the offsets that have already been consumed. Of course, many existing Kafka monitoring tools read their data from ZooKeeper, so if you want to keep using such tools you have to write the offsets back to ZooKeeper yourself; org.apache.spark.streaming.kafka.HasOffsetRanges gives you access to the offsets of each batch (see the first sketch after this list).
- Failure recovery: with checkpointing enabled, the driver can recover quickly after a failure. The Kafka offsets currently being read are saved as part of the checkpoint, and after recovery processing resumes from those offsets, so no message is lost and each message is read and processed once (see the second sketch after this list).
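First, a sketch of reading the offset ranges of each batch so they can be written back to ZooKeeper. The saveToZookeeper helper is hypothetical; plug in whatever ZooKeeper client your monitoring setup uses.

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// `stream` is the DStream returned by createDirectStream in the example above
stream.foreachRDD { rdd =>
  // every RDD produced by the direct stream carries the offset range it covers
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  offsetRanges.foreach { o =>
    // hypothetical helper: persist o.untilOffset under the consumer group's path in ZooKeeper,
    // so that ZooKeeper-based monitoring tools keep seeing consumption progress
    saveToZookeeper(o.topic, o.partition, o.untilOffset)
  }
}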
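Second, a minimal sketch of driver recovery via StreamingContext.getOrCreate; the checkpoint directory, broker address, and topic are assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object RecoverableDirectStream {
  val checkpointDir = "hdfs:///tmp/direct-stream-checkpoint"   // assumed path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableDirectStream")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)   // batch offsets are saved here via DStreamCheckpointData

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // assumed address
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))                              // assumed topic
    stream.map(_._2).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // after a driver failure, the context (including the saved Kafka offsets) is rebuilt
    // from the checkpoint instead of calling createContext() again
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}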
The implementation of createDirectStream:

def createDirectStream[
  K: ClassTag,
  V: ClassTag,
  KD <: Decoder[K]: ClassTag,
  VD <: Decoder[V]: ClassTag](
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    topics: Set[String]
): InputDStream[(K, V)] = {
  // create the message handler: by default each record becomes a (key, message) pair
  val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
  val kc = new KafkaCluster(kafkaParams)
  val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
  (for {
    topicPartitions <- kc.getPartitions(topics).right
    leaderOffsets <- (if (reset == Some("smallest")) {
      // find the leader for each topic/partition and read its smallest offset
      // (the smallest offset is not necessarily 0)
      kc.getEarliestLeaderOffsets(topicPartitions)
    } else {
      // find the leader for each topic/partition and read its largest offset;
      // the stream will then only process newly arriving messages, much like tail -f
      kc.getLatestLeaderOffsets(topicPartitions)
    }).right
  } yield {
    val fromOffsets = leaderOffsets.map { case (tp, lo) =>
      (tp, lo.offset)
    }
    // create the stream from the StreamingContext, starting offsets, message handler, etc.
    new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }).fold(
    errs => throw new SparkException(errs.mkString("\n")),
    ok => ok
  )
}
The generated DirectKafkaInputDStream:
class DirectKafkaInputDStream[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[K]: ClassTag,
  T <: Decoder[V]: ClassTag,
  R: ClassTag](
    @transient ssc_ : StreamingContext,
    val kafkaParams: Map[String, String],
    val fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: MessageAndMetadata[K, V] => R
) extends InputDStream[R](ssc_) with Logging {

  val maxRetries = context.sparkContext.getConf.getInt(
    "spark.streaming.kafka.maxRetries", 1)

  // create the checkpoint data used to save the offset ranges of each batch
  protected[streaming] override val checkpointData =
    new DirectKafkaInputDStreamCheckpointData
...
The current Kafka offsets: after each batch of messages has been processed, currentOffsets is advanced. These offsets are also saved in checkpointData, so if the driver fails they can be restored and no message is missed.

  protected var currentOffsets = fromOffsets
  // For each topic/partition, find the current Kafka leader and read the largest offset
  // it holds. If a leader has changed (e.g. because of a broker failure), this is where
  // the change is detected as early as possible.
  @tailrec
  protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
    val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
    // Either.fold would confuse @tailrec, do it manually
    if (o.isLeft) {
      val err = o.left.get.toString
      if (retries <= 0) {
        throw new SparkException(err)
      } else {
        log.error(err)
        Thread.sleep(kc.config.refreshLeaderBackoffMs)
        latestLeaderOffsets(retries - 1)
      }
    } else {
      o.right.get
    }
  }
...
Per-batch computation: compute generates a KafkaRDD covering the range from currentOffsets up to untilOffsets.

  override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
    val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
    val rdd = KafkaRDD[K, V, U, T, R](
      context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)
    currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
    Some(rdd)
  }
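The clamp call above limits how far a single batch may read, but its body is not shown in this walkthrough. As a rough sketch of what it does in Spark 1.3 (treat this as an approximation rather than the exact source): when spark.streaming.kafka.maxRatePerPartition is configured, each partition's until-offset is capped at currentOffset plus the per-partition message budget; otherwise the leader's latest offset is used unchanged.

  // approximate sketch of the rate-limiting helper used by compute(); maxMessagesPerPartition is
  // an Option[Long] derived from spark.streaming.kafka.maxRatePerPartition and the batch duration
  protected def clamp(
      leaderOffsets: Map[TopicAndPartition, LeaderOffset]): Map[TopicAndPartition, LeaderOffset] = {
    maxMessagesPerPartition.map { mmp =>
      leaderOffsets.map { case (tp, lo) =>
        // never read more than mmp messages from one partition in a single batch
        tp -> lo.copy(offset = Math.min(currentOffsets(tp) + mmp, lo.offset))
      }
    }.getOrElse(leaderOffsets)
  }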
...
  private[streaming]
  class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) {
    def batchForTime = data.asInstanceOf[mutable.HashMap[
      Time, Array[OffsetRange.OffsetRangeTuple]]]

    override def update(time: Time) {
      batchForTime.clear()
      generatedRDDs.foreach { kv =>
        val a = kv._2.asInstanceOf[KafkaRDD[K, V, U, T, R]].offsetRanges.map(_.toTuple).toArray
        batchForTime += kv._1 -> a
      }
    }

    override def cleanup(time: Time) { }

    // recover from failure: generatedRDDs has to be rebuilt from the saved offset ranges
    override def restore() {
      // this is assuming that the topics don't change during execution, which is true currently
      val topics = fromOffsets.keySet
      val leaders = kc.findLeaders(topics).fold(
        errs => throw new SparkException(errs.mkString("\n")),
        ok => ok
      )

      batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
        logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}")
        generatedRDDs += t -> new KafkaRDD[K, V, U, T, R](
          context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, messageHandler)
      }
    }
  }
}
Integration of Spark/Kafka