Spark Streaming + Kafka Hands-On Tutorial


When reprinting this article, please cite the original: http://qifuguang.me/2015/12/24/Spark-streaming-kafka actual combat course/

Overview

Kafka is a distributed publish-subscribe messaging system; put simply, it is a message queue, with the advantage that data is persisted to disk (introducing Kafka is not the focus of this article, so I will not say much about it). Kafka has quite a few usage scenarios, for example as a buffer queue between asynchronous systems, and in many cases we end up with a design like the following:

Some data (such as logs) is written to Kafka for persistent storage; another service then consumes the data from Kafka, performs business-level analysis, and writes the results to HBase or HDFS.

Because this design is so common, big data streaming frameworks such as Storm have long supported seamless integration with Kafka. Naturally, Spark, as a rising star, also provides native support for Kafka.

This article walks through a hands-on Spark Streaming + Kafka example.

Purpose

The goal of this article is to implement a very simple piece of functionality:

Log data flows into Kafka, and a Spark Streaming program consumes the log data from Kafka. Each log entry is a string, which is split by spaces so that the number of occurrences of each word can be counted in real time.

Specific Implementation

Deploy ZooKeeper

Download ZooKeeper from the official website and unzip it.

Go to ZooKeeper's bin directory and start ZooKeeper with the following command:

./zkServer.sh start ../conf/zoo.cfg 1>/dev/null 2>&1 &

Use the ps command to check whether ZooKeeper has actually started.

Deploy Kafka

Download Kafka from the official website and unzip it.

Start Kafka with the following command from Kafka's bin directory:

./kafka-server-start.sh ../config/server.properties 1>/dev/null 2>&1 &

Use the ps command to check whether Kafka has started.

Write the Spark Program

Create a new Maven project in IntelliJ.

Add the Spark Streaming dependencies to pom.xml. Since we want to integrate with Kafka, the spark-streaming-kafka package is also needed. The pom file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.winwill.spark</groupId>
    <artifactId>kafka-spark-demo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.6</version>
        </dependency>
    </dependencies>
</project>

Now write the business logic. In this example we use a direct stream (createDirectStream); the differences between the direct stream and the receiver-based stream are described in more detail below. We create a KafkaSparkDemoMain object; the code, which contains detailed comments, is as follows:

package com.winwill.spark

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Duration, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

/**
 * @author qifuguang
 * @date 15/12/25 17:13
 */
object KafkaSparkDemoMain {
    def main(args: Array[String]) {
        val sparkConf = new SparkConf().setMaster("local[2]").setAppName("kafka-spark-demo")
        val scc = new StreamingContext(sparkConf, Duration(5000))
        scc.checkpoint(".")                  // because updateStateByKey is used, a checkpoint directory must be set
        val topics = Set("kafka-spark-demo") // the Kafka topic we need to consume
        val kafkaParam = Map(
            "metadata.broker.list" -> "localhost:9091" // Kafka broker list address
        )

        val stream: InputDStream[(String, String)] = createStream(scc, kafkaParam, topics)
        stream.map(_._2)                       // take the value
            .flatMap(_.split(" "))             // split each line into words by spaces
            .map(r => (r, 1))                  // map each word to a pair
            .updateStateByKey[Int](updateFunc) // update the existing state with the data of the current batch
            .print()                           // print the first 10 elements
        scc.start()                            // really start the program
        scc.awaitTermination()                 // block and wait
    }

    val updateFunc = (currentValues: Seq[Int], preValue: Option[Int]) => {
        val curr = currentValues.sum
        val pre = preValue.getOrElse(0)
        Some(curr + pre)
    }

    /**
     * Create a stream to fetch data from Kafka.
     * @param scc        Spark Streaming context
     * @param kafkaParam Kafka related configuration
     * @param topics     set of topics to consume
     * @return
     */
    def createStream(scc: StreamingContext, kafkaParam: Map[String, String], topics: Set[String]) = {
        KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](scc, kafkaParam, topics)
    }
}
See the Effect

Run the Spark program

Use the kafka-console-producer tool to write the following data into Kafka:

Observe the output of the Spark program


As you can see, as soon as we write data into Kafka, the Spark program counts the number of occurrences of each word so far almost in real time (not truly real time; it depends on the configured duration, for example with a duration of 5 s there may be up to 5 s of processing delay).

The Difference between DirectStream and Stream

From a high-level perspective, the earlier Kafka integration approach (the receiver method) uses a WAL (write-ahead log) and works as follows: Kafka receivers running on Spark workers/executors continuously read data from Kafka using Kafka's high-level consumer API. The received data is stored in the executors' memory and is also written to the WAL. Only after the received data has been persisted to the log do the Kafka receivers update the Kafka offsets in ZooKeeper. Because the received data and the WAL storage location information are stored reliably, this information can be used to recover from an error and continue processing data if a failure occurs.
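For comparison, a receiver-based stream is created with KafkaUtils.createStream, which tracks offsets in ZooKeeper through the high-level consumer API. The following is only a minimal sketch; the ZooKeeper address, consumer group name, and thread count are assumptions for a local setup, not values from the original article.

import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based stream (sketch, for comparison with createDirectStream above).
// To make the receiver reliable, the WAL must be enabled explicitly, e.g.:
// sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val receiverStream = KafkaUtils.createStream(
    scc,                          // the StreamingContext from the example above
    "localhost:2181",             // assumed local ZooKeeper quorum
    "kafka-spark-demo-group",     // assumed (illustrative) consumer group id
    Map("kafka-spark-demo" -> 1)  // topic -> number of receiver threads
)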

This receiver-based method guarantees that data received from Kafka is not lost. However, in failure scenarios some data may be processed more than once: if the system fails after part of the received data has been reliably saved to the WAL but before the corresponding Kafka offsets in ZooKeeper have been updated, the data becomes inconsistent. Spark Streaming knows the data has been received, but Kafka believes it has not, so when the system returns to normal Kafka sends that data again.

The root cause of this inconsistency is that the two systems cannot atomically update their records of what has been received. To solve the problem, only one system should maintain the consistent view of what has been sent or received, and that system needs full control over recovery from failures. Based on these considerations, the community decided to store all consumed offset information only in Spark Streaming and to use Kafka's low-level consumer API to read data from arbitrary positions.

To build this, the newly introduced direct API takes a completely different approach from receivers and the WAL. Instead of starting receivers to continuously pull data from Kafka and write it to the WAL, it simply determines the offset range that each batch interval needs to read; when each batch's job runs, the data corresponding to those offsets is read from Kafka. The offset information is also stored reliably (checkpointed) and can be read directly when recovering from a failure.
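As a concrete illustration of where those offsets live, the direct stream exposes them per batch through the HasOffsetRanges interface of spark-streaming-kafka. Below is a minimal, illustrative sketch (the println logging is not part of the original program); note that the cast only works when foreachRDD is called directly on the stream returned by createDirectStream, before other transformations.

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Each batch's RDD of a direct stream carries the exact Kafka offset range it reads,
// which is the information that gets checkpointed and reused on recovery.
stream.foreachRDD { rdd =>
    val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    ranges.foreach { r =>
        println(s"topic=${r.topic} partition=${r.partition} from=${r.fromOffset} until=${r.untilOffset}")
    }
}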

Note that after a failure, Spark Streaming may re-read and re-process data segments from Kafka. However, because of the exactly-once processing semantics, the final result after re-processing is consistent with the result that would have been produced without the failure.

As a result, the direct API eliminates the need for the WAL and for receivers while ensuring that each Kafka record is effectively received exactly once, which lets Spark Streaming and Kafka integrate nicely. Overall, these features make stream-processing pipelines highly fault-tolerant, efficient, and easy to use.

The contents of this section refer to: http://dataunion.org/12102.html
