Build real-time data processing systems using Kafka and Spark Streaming


Original link: http://www.ibm.com/developerworks/cn/opensource/os-cn-spark-practice2/index.html?ca=drs-&utm_source=Tuicool

Introduction

In many areas, such as stock market trend analysis, weather data monitoring, and website user behavior analysis, data is generated quickly, arrives in large volumes, and has strong real-time requirements, so it is difficult to collect and store it first and process it afterwards; the traditional data processing architecture cannot meet these needs. Stream computing emerged to better handle this kind of data: unlike the traditional architecture, the stream computing model captures and processes data in real time, performs calculation and analysis according to business needs, and finally saves or distributes the results to the components that need them. Starting from how real-time data is generated and flows through a system, this article uses a practical case to show the reader how to build a real-time data processing system with Apache Kafka and the Spark Streaming module. Of course, this article is only meant as a starting point, because building a good and robust real-time data processing system is not something a single article can cover. Before reading this article, you are assumed to have a basic understanding of the Apache Kafka distributed messaging system and to be able to do simple programming with the Spark Streaming API. Next, let's look at how to build a simple real-time data processing system.

About Kafka

Kafka is a distributed, high-throughput, easy-to-scale, topic-based publish/subscribe messaging system. It was first developed by LinkedIn and was open-sourced and contributed to the Apache Software Foundation in 2011. In general, Kafka has the following typical application scenarios:

    • As a message queue. Because Kafka has high throughput and built-in topic partitioning, replication, and fault tolerance, it is well suited to large-scale, high-intensity message processing systems.
    • As a data source for stream computing systems. The system that generates the stream data acts as the producer and publishes the message data to Kafka topics, and stream computing systems (Storm, Spark Streaming, etc.) consume and process the data in real time. This is also the application scenario covered in this article.
    • As a source of system user behavior data. In this scenario, the system publishes users' behavior data, such as visited pages, dwell times, search logs, and topics of interest, to Kafka topics in real time or periodically, as a data source for downstream systems.
    • Log aggregation. Kafka can be used as an alternative to a log collection system; we can aggregate system log data into different Kafka topics by category.
    • Event sourcing. In an event-driven system, we can design events in a reasonable format and store them as Kafka messages so that the corresponding system modules can process them in real time or periodically. Because Kafka supports large data volumes and has replication and fault tolerance mechanisms, it can make an event-driven system more robust and efficient.

Of course, Kafka supports other application scenarios as well; we do not list them all here. For a more detailed introduction to Kafka, please refer to the Kafka official website. Note that the Kafka version used in this article is 0.8.2.1, built against Scala 2.10.

About Spark Streaming

The Spark Streaming module is an extension of Spark Core designed to process continuous data streams in a high-throughput, fault-tolerant manner. The external data sources currently supported by Spark Streaming include Flume, Kafka, Twitter, ZeroMQ, TCP sockets, and so on.

A Discretized Stream (also called a DStream) is the basic abstraction Spark Streaming provides for a continuous data stream. Internally, a DStream is represented as a series of continuous RDDs (Resilient Distributed Datasets), each of which contains the data that arrives within a certain time interval. Operations on a DStream are therefore translated by the Spark Streaming engine into operations on the underlying RDDs. The operation types for DStreams are:

    • Transformations: similar to RDD operations, Spark Streaming provides a series of transformation operations that derive new DStreams, such as map, union, filter, and transform.
    • Window operations: window operations let you process data by setting a window length and a sliding interval. Common operations include reduceByWindow, reduceByKeyAndWindow, and window.
    • Output operations: output operations push DStream data to external systems or storage platforms, such as HDFS or a database. Similar to RDD actions, output operations are what actually trigger the DStream transformations. Common operations include print, saveAsTextFiles, saveAsHadoopFiles, and foreachRDD.

For more information on DStream operations, please refer to the Spark Streaming Programming Guide on the Spark website.
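To make the three operation types concrete, here is a minimal, self-contained Scala sketch. It is separate from the case study in this article; the socket source, host, port, and word-count logic are illustrative assumptions only.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamOperationsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DStream Operations Sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Input DStream from a TCP socket (hypothetical host/port, for illustration only)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations: split lines into words and count them per 5-second batch
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // Window operation: count words over the last 30 seconds, sliding every 10 seconds
    val windowedCounts = lines.flatMap(_.split(" ")).map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    // Output operations: print to the console; these are what actually trigger execution
    wordCounts.print()
    windowedCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that without at least one output operation, none of the transformations above would ever be executed.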

Kafka Cluster Setup Steps

1. Machine Preparation

In this article, we prepare three machines to build the Kafka cluster. Their IP addresses are 192.168.1.1, 192.168.1.2, and 192.168.1.3, and the three machines can reach each other over the network.

2. Download and install kafka_2.10-0.8.2.1

Download address: https://kafka.apache.org/downloads.html

Once the download is complete, upload it to one of the target machines, such as 192.168.1.1, and unzip it using the following command:

Listing 1. Kafka installation package decompression command
tar -xvf kafka_2.10-0.8.2.1.tgz

Installation is complete.

3. Create the Zookeeper Data directory and set the server number

Perform the following operations on all three servers.

Switch to the current user's working directory, such as /home/fams, create a directory for ZooKeeper to store its data, and then create the server number file in that directory.

Listing 2. Commands to create the data directory and server number file
mkdir zk_data
echo N > zk_data/myid

Note that you need to make sure N takes a different value on each of the three servers, for example 1, 2, and 3 respectively.

4. Edit the Zookeeper configuration file

The ZooKeeper service is built into the Kafka installation package. Enter the Kafka installation directory, such as /home/fams/kafka_2.10-0.8.2.1, edit the config/zookeeper.properties file, and add the following configuration:

Listing 3. Zookeeper Configuration Items
tickTime=2000
dataDir=/home/fams/zk_data/
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888

These configuration items are interpreted as follows:

    • tickTime: the heartbeat interval between ZooKeeper servers, in milliseconds.
    • dataDir: the ZooKeeper data directory; we also save the ZooKeeper server ID file (myid) to this directory, as described above.
    • clientPort: the port the ZooKeeper server listens on, waiting for client connections.
    • initLimit: the maximum number of heartbeat intervals tolerated when follower servers in the ZooKeeper cluster establish their initial connection to the leader server.
    • syncLimit: the maximum number of heartbeat intervals tolerated for requests and responses between follower servers and the leader server in the cluster.
    • server.N: N is the number of the ZooKeeper cluster server. Taking the configured value 192.168.1.1:2888:3888 as an example, 192.168.1.1 is the IP address of the server, port 2888 is the port used to exchange data with the leader server, and 3888 is the port used to elect a new leader server.

5. Edit Kafka configuration file

A. Edit the config/server.properties file

Add or modify the following configuration.

Listing 4. Kafka Broker Configuration Items
broker.id=0
port=9092
host.name=192.168.1.1
zookeeper.connect=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181
log.dirs=/home/fams/kafka-logs

These configuration entries are interpreted as follows:

    • broker.id: the unique identifier of the Kafka broker; it must not be duplicated within the cluster.
    • port: the port the broker listens on for Producer or Consumer connections.
    • host.name: the IP address or host name of the current broker server.
    • zookeeper.connect: the ZooKeeper address information the broker, acting as a ZooKeeper client, can connect to.
    • log.dirs: the directory where the broker stores its message log data.

B. Edit the config/producer.properties file

Add or modify the following configurations:

Listing 5. Kafka Producer Configuration Items
metadata.broker.list=192.168.1.1:9092,192.168.1.2:9092,192.168.1.3:9092
producer.type=async

These configuration entries are interpreted as follows:

    • metadata.broker.list: the list of broker addresses in the cluster.
    • producer.type: the producer type; async for an asynchronous producer, sync for a synchronous producer.

C. Edit the config/consumer.properties file

Listing 6. Kafka Consumer Configuration Items
zookeeper.connect=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181

The configuration entries are interpreted as follows:

    • zookeeper.connect: the list of ZooKeeper server addresses the consumer can connect to.

6. Upload the modified installation package to the other machines

Now that we have modified all the required configuration files on the 192.168.1.1 machine, package the Kafka installation directory with the following commands and upload it to the 192.168.1.2 and 192.168.1.3 machines.

Listing 7. Commands to package and upload the Kafka installation package
tar -cvf kafka_2.10-0.8.2.1.tar kafka_2.10-0.8.2.1
scp kafka_2.10-0.8.2.1.tar fams@192.168.1.2:/home/fams
scp kafka_2.10-0.8.2.1.tar fams@192.168.1.3:/home/fams

After uploading, we need to extract the tar package we just uploaded on the 192.168.1.2 and 192.168.1.3 machines; the command is the same as in Listing 1. Then, on both machines, modify broker.id and host.name in the config/server.properties file: set broker.id to 1 and 2 respectively, and change host.name to the IP address of the current machine.

7. Start Zookeeper and Kafka services

Run the following commands on each of the three machines to start the ZooKeeper and Kafka services.

Listing 8. Start the Zookeeper service
nohup bin/zookeeper-server-start.sh config/zookeeper.properties &
Listing 9. Start the Kafka service
nohup bin/kafka-server-start.sh config/server.properties &

8. Verifying the installation

We have two verification steps.

The first step is to use the following command on all three machines to check whether the Kafka and ZooKeeper service processes are present.

Listing 10. View Kafka and Zookeeper service processes
ps -ef | grep kafka

The second step is to create a message topic and verify, through the console producer and console consumer, that messages can be produced and consumed properly.

Listing 11. Create a message topic
bin/kafka-topics.sh --create --replication-factor 3 --partitions 3 --topic user-behavior-topic --zookeeper 192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181

Run the following command to open the console producer.

Listing 12. Start the Console Producer
bin/kafka-console-producer.sh --broker-list 192.168.1.1:9092 --topic user-behavior-topic

Open the console consumer on another machine.

Listing 13. Start the Console Consumer
bin/kafka-console-consumer.sh --zookeeper 192.168.1.2:2181 --topic user-behavior-topic --from-beginning

If you enter a message in the producer console and can see it appear in the consumer console, the installation was successful.

Case Introduction and programming implementation

1. Case Introduction

In this case, we assume that a forum needs to calculate, in near real time, the heat of its web pages based on the number of user clicks on each page, the dwell time, and whether the user liked the page, and then dynamically update the forum's hot topics module to show the links of the hottest topics.

2. Case Analysis

For a user visiting the forum, we abstract his or her behavior data in order to explain how the topic heat of a web page is computed.

First, we use a vector to represent the user's behavior on a page, namely the number of clicks, the dwell time, and whether the user liked the page. For example, it can be expressed as follows:

(page001.html, 1, 0.5, 1)

The first item of the vector is the ID of the page, the second is the number of clicks on the page from entering the site until leaving, the third is the dwell time in minutes, and the fourth is the like flag: 1 means like, -1 means dislike, and 0 means neutral.

Secondly, we assign a weight to each behavior according to its contribution to the topic heat of the page. In this article, we assume the click count weight is 0.8, because the user may browse the topic again simply because there is no other better topic; the dwell time weight is also 0.8, because the user may open multiple tab pages at the same time while really focusing on only one of the topics; and the like weight is 1, because a like generally means the user is interested in the topic of the page.

Finally, we define the following formula to calculate the contribution of a piece of behavior data to the heat of a web page.

f(x, y, z) = 0.8x + 0.8y + z

So for the behavior data above, (page001.html, 1, 0.5, 1), applying the formula gives:

H(page001) = f(x, y, z) = 0.8x + 0.8y + z = 0.8*1 + 0.8*0.5 + 1*1 = 2.2

Readers may note that in this process we ignore the user identity; that is, we do not care who the user is, only about that user's contribution to the heat of the page.
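As a quick check of the arithmetic, here is a tiny Scala sketch of this heat-contribution function. The function and parameter names are ours for illustration; the real computation in this article happens inside the Spark Streaming job in Listing 15.

// f(x, y, z) = 0.8x + 0.8y + z
// x = click count, y = dwell time in minutes, z = like flag (1, 0, or -1)
def heat(clicks: Double, stayMinutes: Double, likeFlag: Double): Double =
  0.8 * clicks + 0.8 * stayMinutes + likeFlag

// For the sample behavior data (page001.html, 1, 0.5, 1):
println(heat(1, 0.5, 1)) // 2.2, matching the worked example above (up to floating-point rounding)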

3. Producing behavior data messages

In this case, we use a program to simulate user behavior; every 5 seconds it randomly pushes 0 to 50 behavior data messages to the user-behavior-topic topic. Obviously this program plays the role of the message producer, which in practical applications would be provided by some other system. To simplify message processing, we define the message format as follows:

page ID | click count | dwell time (minutes) | like flag

We also assume that the site has only 100 pages. The following is the Scala implementation of this class.

Listing 14. UserBehaviorMsgProducer class source
import scala.util.Random
import java.util.Properties
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig
import kafka.producer.Producer

class UserBehaviorMsgProducer(brokers: String, topic: String) extends Runnable {
  private val brokerList = brokers
  private val targetTopic = topic
  private val props = new Properties()
  props.put("metadata.broker.list", this.brokerList)
  props.put("serializer.class", "kafka.serializer.StringEncoder")
  props.put("producer.type", "async")
  private val config = new ProducerConfig(this.props)
  private val producer = new Producer[String, String](this.config)

  private val PAGE_NUM = 100
  private val MAX_MSG_NUM = 3
  private val MAX_CLICK_TIME = 5
  private val MAX_STAY_TIME = 10
  // like: 1; dislike: -1; no feeling: 0
  private val LIKE_OR_NOT = Array[Int](1, 0, -1)

  def run(): Unit = {
    val rand = new Random()
    while (true) {
      // how many user behavior messages will be produced in this batch
      val msgNum = rand.nextInt(MAX_MSG_NUM) + 1
      try {
        // generate messages with a format like page1|2|7.123|1
        for (i <- 0 to msgNum) {
          var msg = new StringBuilder()
          msg.append("page" + (rand.nextInt(PAGE_NUM) + 1))
          msg.append("|")
          msg.append(rand.nextInt(MAX_CLICK_TIME) + 1)
          msg.append("|")
          msg.append(rand.nextInt(MAX_STAY_TIME) + rand.nextFloat())
          msg.append("|")
          msg.append(LIKE_OR_NOT(rand.nextInt(3)))
          println(msg.toString())
          // send the generated message to the broker
          sendMessage(msg.toString())
        }
        println("%d user behavior messages produced.".format(msgNum + 1))
      } catch {
        case e: Exception => println(e)
      }
      try {
        // sleep for 5 seconds after sending a micro batch of messages
        Thread.sleep(5000)
      } catch {
        case e: Exception => println(e)
      }
    }
  }

  def sendMessage(message: String) = {
    try {
      val data = new KeyedMessage[String, String](this.targetTopic, message)
      producer.send(data)
    } catch {
      case e: Exception => println(e)
    }
  }
}

object UserBehaviorMsgProducerClient {
  def main(args: Array[String]) {
    if (args.length < 2) {
      println("Usage: UserBehaviorMsgProducerClient 192.168.1.1:9092 user-behavior-topic")
      System.exit(1)
    }
    // start the message producer thread
    new Thread(new UserBehaviorMsgProducer(args(0), args(1))).start()
  }
}

4. Writing the Spark Streaming program to consume messages

After figuring out the problem to solve, you can start coding the implementation. For the problem in this case, the basic steps in the implementation are as follows:

    • Build a StreamingContext instance for Spark and enable the checkpoint feature, because we need to use the updateStateByKey primitive to accumulate and update the page topic heat values.
    • Use the KafkaUtils.createStream method provided by Spark to consume the message topic; this method returns a ReceiverInputDStream object instance.
    • For each message, use the formula above to calculate the topic heat value of the web page.
    • Define an anonymous function that adds the previously accumulated heat value of a page to the newly calculated value to obtain the latest heat value.
    • Call the updateStateByKey primitive and pass in the anonymous function defined above to update the page heat values.
    • Finally, after obtaining the latest results, sort them and print the 10 pages with the highest heat values.

The source code is as follows.

Listing 15. WebPagePopularityValueCalculator class source
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.Duration

object WebPagePopularityValueCalculator {
  private val checkpointDir = "popularity-data-checkpoint"
  private val msgConsumerGroup = "user-behavior-topic-message-consumer-group"

  def main(args: Array[String]) {
    if (args.length < 2) {
      println("Usage: WebPagePopularityValueCalculator zkserver1:2181,zkserver2:2181,zkserver3:2181 consumeMsgDataTimeInterval(secs)")
      System.exit(1)
    }
    val Array(zkServers, processingInterval) = args
    val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator")
    val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))
    // using updateStateByKey requires enabling checkpointing
    ssc.checkpoint(checkpointDir)
    val kafkaStream = KafkaUtils.createStream(
      // Spark streaming context
      ssc,
      // zookeeper quorum, e.g. zkserver1:2181,zkserver2:2181,...
      zkServers,
      // kafka message consumer group ID
      msgConsumerGroup,
      // Map of (topic_name -> numPartitions) to consume; each partition is consumed in its own thread
      Map("user-behavior-topic" -> 3))
    val msgDataRDD = kafkaStream.map(_._2)
    // for debug use only
    // println("Coming data in this interval...")
    // msgDataRDD.print()
    // e.g. page37|5|1.5119122|-1
    val popularityData = msgDataRDD.map { msgLine =>
      {
        val dataArr: Array[String] = msgLine.split("\\|")
        val pageID = dataArr(0)
        // calculate the popularity value
        val popValue: Double = dataArr(1).toFloat * 0.8 + dataArr(2).toFloat * 0.8 + dataArr(3).toFloat * 1
        (pageID, popValue)
      }
    }
    // sum the previous popularity value and the current value
    val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
      iterator.flatMap(t => {
        val newValue: Double = t._2.sum
        val stateValue: Double = t._3.getOrElse(0)
        Some(newValue + stateValue)
      }.map(sumedValue => (t._1, sumedValue)))
    }
    val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00)))
    val stateDStream = popularityData.updateStateByKey[Double](updatePopularityValue,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
    // set the checkpoint interval to avoid too frequent data checkpointing, which may
    // significantly reduce operation throughput
    stateDStream.checkpoint(Duration(8 * processingInterval.toInt * 1000))
    // after the calculation, sort the result and only show the top 10 hot pages
    stateDStream.foreachRDD { rdd =>
      {
        val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
        val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
        topKData.foreach(x => {
          println(x)
        })
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}

Deployment and Testing

Readers can refer to the following steps to deploy and test the sample programs provided in this case.

The first step is to start the behavior message producer program, which can be started directly in the Scala IDE, but you need to add two startup parameters: the first is the Kafka broker address, and the second is the name of the target message topic.

Figure 1. UserBehaviorMsgProducer class startup parameters

After starting it, you can see behavior message data being generated in the console.

Figure 2. Preview of the generated behavior message data

The second step is to start the Spark Streaming program that acts as the consumer of the behavior messages. It needs to be started in the Spark cluster environment, with the following command:

Listing 16. WebPagePopularityValueCalculator class start command
bin/spark-submit \
--jars $SPARK_HOME/lib/spark-streaming-kafka_2.10-1.3.1.jar,\
$SPARK_HOME/lib/spark-streaming-kafka-assembly_2.10-1.3.1.jar,\
$SPARK_HOME/lib/kafka_2.10-0.8.2.1.jar,\
$SPARK_HOME/lib/kafka-clients-0.8.2.1.jar \
--class com.ibm.spark.exercise.streaming.WebPagePopularityValueCalculator \
--master spark://<spark_master_ip>:7077 \
--num-executors 4 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 2 \
/home/fams/sparkexercise.jar \
192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181 2

Because our program uses (or indirectly invokes) Kafka's APIs and needs to call Spark Streaming's Kafka integration API (KafkaUtils.createStream), we need to upload the jar packages referenced in the start command to every machine in the Spark cluster in advance (in this case we upload them to the lib directory of the Spark installation, i.e. $SPARK_HOME/lib) and reference them in the startup command.

After starting, we can see messages printed to the command-line console: the 10 pages with the highest heat values computed so far.

Figure 3. Preview of the current web topic heat ranking

We can also go to the Spark web console to check the current running state of the Spark program; the default address is http://<spark_master_ip>:8080.

Figure 4. Running status of the Spark Streaming program


Precautions

To build an efficient and robust streaming data computing system with Spark Streaming, we also need to pay attention to the following aspects.

  • Set the data processing interval reasonably; that is, make sure the processing time of each batch is less than the batch interval, so that the previous batch is finished before the next batch arrives. This obviously has to be determined based on your Spark cluster's computing power and the volume of input data.
  • Increase the capacity to read input data as much as possible. When Spark Streaming integrates with external systems such as Kafka or Flume, we can launch multiple ReceiverInputDStream object instances to avoid making data reception the bottleneck of the system.
  • Although in this case we simply print the (near) real-time computation results, in practice these results are often saved to a database or HDFS, or sent back to Kafka for other systems to consume for further business processing.
  • Because stream computing has strict real-time requirements, any system pause caused by a JVM full GC is unacceptable. Besides using memory sensibly in the program and regularly cleaning up unneeded cached data, the CMS (Concurrent Mark and Sweep) garbage collector is also the GC recommended by Spark; it effectively keeps GC-induced pauses at a very low level. We can add the CMS GC-related parameters with the --driver-java-options option of the spark-submit command.
  • Spark officially provides two approaches for integrating Kafka with Spark Streaming. The first is the receiver-based approach, which implements a Kafka consumer inside a receiver to receive message data; the second is the direct approach, which does not use a receiver but periodically queries the current offsets of the Kafka message partitions to define the offset range of the messages to be processed in each batch. This article uses the first approach, because at the time of writing the second approach was still experimental.
  • If you integrate Kafka and Spark Streaming with the receiver-based approach, you need to consider data loss caused by Driver or Worker node failure; under the default configuration data may be lost, unless you enable the Write Ahead Log (WAL) feature. In that case, the message data received from Kafka is synchronously written to the WAL and saved on a reliable distributed file system such as HDFS. You can turn this feature on by setting the spark.streaming.receiver.writeAheadLog.enable configuration key to true in the Spark configuration file (conf/spark-defaults.conf). Enabling the WAL reduces the throughput of a single receiver, so we may need to run multiple receivers in parallel to compensate.
  • Because the updateStateByKey operation requires the checkpoint feature to be enabled, and frequent checkpointing makes processing time grow and throughput drop, by default the checkpoint interval is the larger of the streaming program's batch interval and 10 seconds. The officially recommended interval is 5 to 10 times the batch interval of the streaming program. It can be set with dstream.checkpoint(checkpointInterval); the parameter must be wrapped in the Duration class, in milliseconds. (A combined sketch covering several of these points follows this list.)
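As promised above, here is a rough, hedged sketch of how several of these precautions (enabling the write-ahead log, running multiple receivers in parallel, and using a larger checkpoint interval) might be combined in one driver program. It reuses the topic name, consumer group, and ZooKeeper addresses from this article, but the receiver count, batch interval, and checkpoint interval are illustrative assumptions, not recommendations from the original text.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PrecautionsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Precautions Sketch")
      // Enable the write-ahead log so received data survives Driver/Worker failures
      // (equivalent to setting the same key in conf/spark-defaults.conf)
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(2)) // batch interval: assumed value
    ssc.checkpoint("popularity-data-checkpoint")

    // Run several receivers in parallel so that receiving data does not become the bottleneck,
    // then union them into a single DStream.
    val zkQuorum = "192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181"
    val numReceivers = 3 // assumed value
    val kafkaStreams = (1 to numReceivers).map { _ =>
      // With the WAL enabled, a serialized, non-replicated storage level is usually sufficient
      KafkaUtils.createStream(ssc, zkQuorum, "user-behavior-topic-message-consumer-group",
        Map("user-behavior-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
    }
    val unifiedStream = ssc.union(kafkaStreams)

    // Checkpoint the stream less frequently than every batch (here: 5x the batch interval)
    // to limit the throughput cost of checkpointing.
    unifiedStream.checkpoint(Seconds(10))

    unifiedStream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that in the receiver-based approach each createStream call creates one receiver, and each receiver occupies one core, so the cluster must have enough cores for both the receivers and the processing tasks.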

