Spark Streaming + Kafka Hands-On Tutorial


When reprinting this article, please cite the original: http://qifuguang.me/2015/12/24/Spark-streaming-kafka actual combat course/

Overview

Kafka is a distributed publish-subscribe messaging system; put simply, it is a message queue, with the advantage that data is persisted to disk (introducing Kafka is not the focus of this article, so I will not say much about it). Kafka has quite a few usage scenarios, for example as a buffer queue between asynchronous systems, and in many cases we end up with a design like the following:

Some data (such as logs) is written to Kafka for persistent storage; another service then consumes the data from Kafka, performs business-level analysis, and writes the results to HBase or HDFS.

Because this design is so common, big data streaming frameworks such as Storm have long supported seamless integration with Kafka. Naturally, Spark, as a rising star, also provides native support for Kafka.

This article walks through a hands-on Spark Streaming + Kafka example.

Purpose

The goal of this article is to implement a very simple piece of functionality:

Log data flows into Kafka, and a Spark Streaming program consumes the log data from Kafka. Each log entry is a string, which is split by spaces so that the number of occurrences of each word can be counted in real time.

Specific Implementation

Deploy ZooKeeper

Download ZooKeeper from the official website and unzip it.

Go to ZooKeeper's bin directory and start ZooKeeper with the following command:

./zkServer.sh start ../conf/zoo.cfg 1>/dev/null 2>&1 &

Use the ps command to check whether ZooKeeper has actually started.

Deploy Kafka

Download Kafka from the official website and unzip it.

Start Kafka with the following command from Kafka's bin directory:

./kafka-server-start.sh ../config/server.properties 1>/dev/null 2>&1 &

Use the ps command to check whether Kafka has started.

Write the Spark Program

Create a new Maven project in IntelliJ.

Add the Spark Streaming dependencies to pom.xml. Since we want to integrate with Kafka, the spark-streaming-kafka package is also needed. The pom file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.winwill.spark</groupId>
    <artifactId>kafka-spark-demo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.10</artifactId>
            <version>1.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.6</version>
        </dependency>
    </dependencies>
</project>

Now write the business logic. In this example we use a direct stream (createDirectStream); the differences between the direct stream and the receiver-based stream are described in more detail below. We create a KafkaSparkDemoMain object; the code, which contains detailed comments, is as follows:

package com.winwill.spark

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Duration, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

/**
 * @author qifuguang
 * @date 15/12/25 17:13
 */
object KafkaSparkDemoMain {
    def main(args: Array[String]) {
        val sparkConf = new SparkConf().setMaster("local[2]").setAppName("kafka-spark-demo")
        val scc = new StreamingContext(sparkConf, Duration(5000))
        scc.checkpoint(".")                  // because updateStateByKey is used, a checkpoint directory must be set
        val topics = Set("kafka-spark-demo") // the Kafka topic we need to consume
        val kafkaParam = Map(
            "metadata.broker.list" -> "localhost:9091" // Kafka broker list address
        )

        val stream: InputDStream[(String, String)] = createStream(scc, kafkaParam, topics)
        stream.map(_._2)                       // take the value
            .flatMap(_.split(" "))             // split each line into words by spaces
            .map(r => (r, 1))                  // map each word to a pair
            .updateStateByKey[Int](updateFunc) // update the existing state with the data of the current batch
            .print()                           // print the first 10 elements
        scc.start()                            // really start the program
        scc.awaitTermination()                 // block and wait
    }

    val updateFunc = (currentValues: Seq[Int], preValue: Option[Int]) => {
        val curr = currentValues.sum
        val pre = preValue.getOrElse(0)
        Some(curr + pre)
    }

    /**
     * Create a stream to fetch data from Kafka.
     * @param scc        Spark Streaming context
     * @param kafkaParam Kafka related configuration
     * @param topics     set of topics to consume
     * @return
     */
    def createStream(scc: StreamingContext, kafkaParam: Map[String, String], topics: Set[String]) = {
        KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](scc, kafkaParam, topics)
    }
}
See the Effect

Run the Spark program

Use the kafka-console-producer tool to write the following data into Kafka:

Observe the output of the Spark program


As you can see, as soon as we write data into Kafka, the Spark program counts the number of occurrences of each word so far almost in real time (not truly real time; it depends on the configured duration, for example with a duration of 5 s there may be up to 5 s of processing delay).

The Difference between DirectStream and Stream

From a high-level perspective, the earlier Kafka integration approach (the receiver method) uses a WAL (write-ahead log) and works as follows: Kafka receivers running on Spark workers/executors continuously read data from Kafka using Kafka's high-level consumer API. The received data is stored in the executors' memory and is also written to the WAL. Only after the received data has been persisted to the log do the Kafka receivers update the Kafka offsets in ZooKeeper. Because the received data and the WAL storage location information are stored reliably, this information can be used to recover from an error and continue processing data if a failure occurs.
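For comparison, a receiver-based stream is created with KafkaUtils.createStream, which tracks offsets in ZooKeeper through the high-level consumer API. The following is only a minimal sketch; the ZooKeeper address, consumer group name, and thread count are assumptions for a local setup, not values from the original article.

import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based stream (sketch, for comparison with createDirectStream above).
// To make the receiver reliable, the WAL must be enabled explicitly, e.g.:
// sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val receiverStream = KafkaUtils.createStream(
    scc,                          // the StreamingContext from the example above
    "localhost:2181",             // assumed local ZooKeeper quorum
    "kafka-spark-demo-group",     // assumed (illustrative) consumer group id
    Map("kafka-spark-demo" -> 1)  // topic -> number of receiver threads
)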

This receiver-based method guarantees that data received from Kafka is not lost. However, in failure scenarios some data may be processed more than once: if the system fails after part of the received data has been reliably saved to the WAL but before the corresponding Kafka offsets in ZooKeeper have been updated, the data becomes inconsistent. Spark Streaming knows the data has been received, but Kafka believes it has not, so when the system returns to normal Kafka sends that data again.

The root cause of this inconsistency is that the two systems cannot atomically update their records of what has been received. To solve the problem, only one system should maintain the consistent view of what has been sent or received, and that system needs full control over recovery from failures. Based on these considerations, the community decided to store all consumed offset information only in Spark Streaming and to use Kafka's low-level consumer API to read data from arbitrary positions.

To build this, the newly introduced direct API takes a completely different approach from receivers and the WAL. Instead of starting receivers to continuously pull data from Kafka and write it to the WAL, it simply determines the offset range that each batch interval needs to read; when each batch's job runs, the data corresponding to those offsets is read from Kafka. The offset information is also stored reliably (checkpointed) and can be read directly when recovering from a failure.
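As a concrete illustration of where those offsets live, the direct stream exposes them per batch through the HasOffsetRanges interface of spark-streaming-kafka. Below is a minimal, illustrative sketch (the println logging is not part of the original program); note that the cast only works when foreachRDD is called directly on the stream returned by createDirectStream, before other transformations.

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Each batch's RDD of a direct stream carries the exact Kafka offset range it reads,
// which is the information that gets checkpointed and reused on recovery.
stream.foreachRDD { rdd =>
    val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    ranges.foreach { r =>
        println(s"topic=${r.topic} partition=${r.partition} from=${r.fromOffset} until=${r.untilOffset}")
    }
}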

Note that after a failure, Spark Streaming may re-read and re-process data segments from Kafka. However, because of the exactly-once processing semantics, the final result after re-processing is consistent with the result that would have been produced without the failure.

As a result, the direct API eliminates the need for the WAL and for receivers while ensuring that each Kafka record is effectively received exactly once, which lets Spark Streaming and Kafka integrate nicely. Overall, these features make stream-processing pipelines highly fault-tolerant, efficient, and easy to use.

The contents of this section refer to: http://dataunion.org/12102.html
