"Frustration translation"spark structure Streaming-2.1.1 + Kafka integration Guide (Kafka Broker version 0.10.0 or higher)


Apache Kafka is publish-subscribe messaging rethought as a distributed, partitioned, replicated commit log service. Before starting with the Spark integration, read the Kafka documentation carefully.

The Kafka project introduced a new consumer API between versions 0.8 and 0.10, so there are two separate Spark Streaming packages available. Choose the right package for your brokers and the features you need; note that the 0.8 integration is compatible with later 0.9 and 0.10 brokers, but the 0.10 integration is not compatible with earlier brokers.

                             spark-streaming-kafka-0-8   spark-streaming-kafka-0-10
Broker version               0.8.2.1 or higher           0.10.0 or higher
API stability                Stable                      Experimental
Language support             Scala, Java, Python         Scala, Java
Receiver DStream             Yes                         No
Direct DStream               Yes                         Yes
SSL / TLS support            No                          Yes
Offset Commit API            No                          Yes
Dynamic Topic Subscription   No                          Yes
Creating a Kafka Source (Batch Queries)

(Only Structured Streaming can be used this way, and it requires Kafka brokers 0.10.0 or later.)

Each row in the source has the following schema:

Column          Type
key             binary
value           binary
topic           string
partition       int
offset          long
timestamp       long
timestampType   int


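For illustration, a minimal sketch of loading this source as a batch DataFrame (the broker address and topic name here are placeholders, not values from this guide):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("KafkaBatchSource").getOrCreate()

  // The resulting DataFrame carries the columns listed in the table above.
  val df = spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1")
    .load()

  df.printSchema()  // key, value, topic, partition, offset, timestamp, timestampType
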
The following options must be set for the Kafka source, for both batch and streaming queries.

assign
  Value: a JSON string, e.g. {"topicA":[0,1],"topicB":[2,4]}
  Meaning: Specific TopicPartitions to consume. Only one of the "assign", "subscribe", or "subscribePattern" options can be specified for the Kafka source.

subscribe
  Value: a comma-separated list of topics
  Meaning: The topic list to subscribe to. Only one of the "assign", "subscribe", or "subscribePattern" options can be specified for the Kafka source.

subscribePattern
  Value: a Java regex string
  Meaning: The pattern used to subscribe to topic(s). Only one of the "assign", "subscribe", or "subscribePattern" options can be specified for the Kafka source.

kafka.bootstrap.servers
  Value: a comma-separated list of host:port pairs
  Meaning: The Kafka "bootstrap.servers" configuration.
"Bootstrap.servers", "localhost:9092,anotherhost:9092"
As configured:
Kafkaparams = map[string, Object] (
  "Bootstrap.servers", "localhost:9092",//,anotherhost:9092
  " Key.deserializer "-Classof[stringdeserializer],
  " Value.deserializer ", Classof[stringdeserializer], "
  group.id", "Use_a_separate_group_id_for_each_stream", "
  Auto.offset.reset", "latest",
  " Enable.auto.commit "-(false: Java.lang.Boolean)
)
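
As a sketch of how the three mutually exclusive subscription options above look when creating a Structured Streaming source (broker addresses, topic names, and partition numbers are placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("KafkaSubscribeModes").getOrCreate()

  // subscribe: a fixed, comma-separated list of topics
  val bySubscribe = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribe", "topic1,topic2")
    .load()

  // subscribePattern: a Java regex matched against topic names
  val byPattern = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("subscribePattern", "topic.*")
    .load()

  // assign: explicit TopicPartitions given as a JSON string
  val byAssign = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("assign", """{"topicA":[0,1],"topicB":[2,4]}""")
    .load()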

The following configurations are optional:

startingOffsets
  Value: "earliest", "latest" (streaming only), or a JSON string, e.g. """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}"""
  Default: "latest" for streaming, "earliest" for batch
  Query type: streaming and batch
  Meaning: The start point when a query is started: "earliest" starts from the earliest offsets, "latest" starts from the latest offsets, and a JSON string specifies a starting offset for each TopicPartition. In the JSON, -2 as an offset can be used to refer to earliest and -1 to latest. Note: for batch queries, latest (either implicitly or by using -1 in the JSON) is not allowed. For streaming queries, this only applies when a new query is started; resuming always picks up from where the query left off. Partitions newly discovered during a query start at earliest.

endingOffsets
  Value: "latest" or a JSON string, e.g. {"topicA":{"0":50,"1":-1},"topicB":{"0":-1}}
  Default: "latest"
  Query type: batch
  Meaning: The end point when a batch query ends: "latest" refers to the latest offsets, and a JSON string specifies an ending offset for each TopicPartition. In the JSON, -1 as an offset can be used to refer to latest; -2 (earliest) is not allowed as an ending offset.

failOnDataLoss
  Value: true or false
  Default: true
  Query type: streaming
  Meaning: Whether to fail the query when it is possible that data has been lost (for example, topics are deleted or offsets are out of range). This may be a false alarm; you can disable it when it does not work as expected. A batch query always fails if it cannot read any data from the provided offsets because of lost data.

kafkaConsumer.pollTimeoutMs
  Value: long
  Default: 512
  Query type: streaming and batch
  Meaning: The timeout, in milliseconds, for polling data from Kafka in executors.

fetchOffset.numRetries
  Value: int
  Default: 3
  Query type: streaming and batch
  Meaning: The number of times to retry before giving up fetching Kafka offsets.

fetchOffset.retryIntervalMs
  Value: long
  Default: 10
  Query type: streaming and batch
  Meaning: The number of milliseconds to wait before retrying to fetch Kafka offsets.

maxOffsetsPerTrigger
  Value: long
  Default: none
  Query type: streaming and batch
  Meaning: Rate limit on the maximum number of offsets processed per trigger interval. The specified total number of offsets is proportionally split across TopicPartitions of different volume.
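
For illustration, a sketch of a batch query bounded by explicit offsets using the startingOffsets and endingOffsets options described above (topic, partition, and offset values are placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("KafkaBoundedBatch").getOrCreate()

  val df = spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1")
    .option("subscribe", "topic1")
    // partition 0 starts at offset 23; -2 means "earliest" for partition 1
    .option("startingOffsets", """{"topic1":{"0":23,"1":-2}}""")
    // partition 0 ends at offset 50; -1 means "latest" for partition 1
    .option("endingOffsets", """{"topic1":{"0":50,"1":-1}}""")
    .load()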

Kafka's own configurations can be set via DataStreamReader.option with the "kafka." prefix, e.g. stream.option("kafka.bootstrap.servers", "host:port"). For the possible Kafka parameters, see the Kafka consumer configuration documentation.
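
For example, a sketch passing an ordinary Kafka consumer setting through the "kafka." prefix ("max.partition.fetch.bytes" is a standard Kafka consumer config; the value shown is illustrative):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("KafkaPrefixedOptions").getOrCreate()

  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host:port")
    .option("kafka.max.partition.fetch.bytes", "1048576")
    .option("subscribe", "topic1")
    .load()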

Note that the following Kafka parameters cannot be set; the Kafka source will throw an exception if they are:

group.id: The Kafka source automatically creates a unique group ID for each query.

auto.offset.reset: Set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally rather than relying on the Kafka consumer, which ensures that no data is missed when new topics/partitions are dynamically subscribed. Note that startingOffsets only applies when a new streaming query is started; resuming always picks up from where the query left off.

key.deserializer: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the keys.

value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.

enable.auto.commit: The Kafka source does not commit any offsets.

interceptor.classes: The Kafka source always reads keys and values as byte arrays. It is not safe to use ConsumerInterceptor, as it may break the query.
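
Since the deserializers are fixed, keys and values arrive as binary; a sketch of deserializing them with DataFrame operations, assuming UTF-8 string payloads (names are placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("KafkaDeserialize").getOrCreate()
  import spark.implicits._

  val raw = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1")
    .option("subscribe", "topic1")
    .load()

  // Keys and values are binary columns; cast them explicitly.
  val typed = raw
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .as[(String, String)]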

Deployment

Like any Spark application, spark-submit is used to launch your application. spark-sql-kafka-0-10_2.11 and its dependencies can be added directly to spark-submit using --packages, for example:

  ./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 ...

For more details about submitting applications with external dependencies, see the Application Submission Guide.
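
Alternatively, if the application is built with sbt, a rough equivalent is to declare the same artifact as a build dependency (a sketch; it assumes the build targets Scala 2.11 so that the _2.11 artifact is resolved):

  // build.sbt (sketch)
  scalaVersion := "2.11.8"

  // Resolves to org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1
  libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.1"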
