With the continuous development of big data technology, the demand for real-time processing of big data keeps growing. Traditional batch processing frameworks such as MapReduce fall short in certain areas, such as real-time user recommendation and user behavior analysis. To meet this demand for low latency, a batch of streaming analysis and real-time computing frameworks such as S4, Samza, and Storm emerged. Thanks to its internal scheduling mechanism and fast distributed computing capability, Spark can perform iterative calculations extremely quickly; this allows it to handle real-time processing to some extent, and Spark Streaming is a streaming framework built on top of it.
Introduction to Streaming Big Data Processing Frameworks
Samza
Samza is a distributed stream processing framework and an open source project from LinkedIn. It is built on the Kafka message queue to provide real-time stream processing. More precisely, Samza uses Apache Kafka in a modular fashion, so it could be built on other message queue frameworks, but its starting point and default implementation are based on Apache Kafka.
Essentially, Samza is a higher-level abstraction on top of the message queue system, an implementation of an application pattern that applies a streaming framework to a message queue system.
In general, compared with Storm, Samza relies entirely on Apache Kafka for data transport and on Hadoop YARN for cluster management, so Samza itself is responsible only for the specific processing logic, plus RocksDB-based state management. Constrained by Kafka and YARN, its topology is not very flexible.
Storm
Storm is an open source big data processing system. Unlike other systems, it is designed for distributed real-time processing and is language independent. Storm is not just a traditional big data analysis system; it can also be used to build complex event processing (CEP) systems. Functionally, CEP systems are generally divided into two categories, computation-oriented and detection-oriented, and both can be implemented in Storm with user-defined algorithms. For example, CEP can be used to identify meaningful events in a stream of events and then process those events in real time.
What distinguishes the Storm framework from other big data solutions is the way it processes data. Apache Hadoop is essentially a batch processing system whose target application mode is offline analysis: data is loaded into the Hadoop Distributed File System (HDFS) and evenly distributed across the nodes for processing (for the HDFS data balancing rules, see the author's IBM article "HDFS Data Balance Rules and Experiment Introduction"). When processing is complete, the resulting data is written back to HDFS, where it can be used by whoever initiated the processing. Storm, in contrast, supports building topologies that transform unending streams of data. Unlike Hadoop jobs, these transformations never stop automatically; they keep processing data as it arrives. This is Storm's streaming, real-time processing model.
Spark Streaming
Spark Streaming is similar to Apache Storm and is used for stream processing. According to its official documentation, Spark Streaming is characterized by high throughput and strong fault tolerance. It supports many data input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data has been ingested, Spark's high-level primitives such as map, reduce, join, and window can be used to operate on it, and the results can be stored in many places, such as HDFS or a database. In addition, Spark Streaming integrates seamlessly with MLlib (machine learning) and GraphX (graph processing).
In Spark Streaming, the unit of processing is a batch rather than a single record, although the data is still collected record by record. The system therefore needs an interval during which data accumulates before being processed together; this interval is the batch interval. The batch interval is the core concept and key parameter of Spark Streaming: it determines how frequently Spark Streaming jobs are submitted and how long data processing is delayed, and it also affects the throughput and performance of the processing.
Spark Streaming example
We can start the WordCount program with the following command, as shown in Listing 1.
Listing 1. Run the WordCount program
./bin/run-example org.apache.spark.examples.streaming.JavaRecoverableNetworkWordCount
localhost 9999 wordcountdata wordcountdata
As an application framework built on Spark, Spark Streaming inherits the programming style of Spark.
Listing 2. WordCount sample source code
// SPACE is a precompiled pattern used to split each line into words
private static final Pattern SPACE = Pattern.compile(" ");

// createContext() builds the complete streaming job. It is only invoked when no
// checkpoint data exists in checkpointDirectory; otherwise the context is recovered.
private static JavaStreamingContext createContext(String ip, int port,
    String checkpointDirectory, String outputPath) {
  final File outputFile = new File(outputPath);
  SparkConf sparkConf = new SparkConf().setAppName("JavaRecoverableNetworkWordCount");
  // Create the context with a 1 second batch size
  JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
  ssc.checkpoint(checkpointDirectory);

  // Create a socket stream on target ip:port and count the
  // words in the input stream of \n delimited text (e.g. generated by 'nc')
  JavaReceiverInputDStream<String> lines = ssc.socketTextStream(ip, port);
  JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterable<String> call(String x) {
      return Lists.newArrayList(SPACE.split(x));
    }
  });
  JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
    new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    }).reduceByKey(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });
  // Print the counts of each batch and append them to the output file
  wordCounts.foreachRDD(new Function2<JavaPairRDD<String, Integer>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Integer> rdd, Time time) throws IOException {
      String counts = "Counts at time " + time + " " + rdd.collect();
      System.out.println(counts);
      System.out.println("Appending to " + outputFile.getAbsolutePath());
      Files.append(counts + "\n", outputFile, Charset.defaultCharset());
      return null;
    }
  });
  return ssc;
}

// In main(): ip, port, checkpointDirectory and outputPath come from the command-line
// arguments. The factory creates a brand-new StreamingContext only when no checkpoint exists.
JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
  @Override
  public JavaStreamingContext create() {
    return createContext(ip, port, checkpointDirectory, outputPath);
  }
};
JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDirectory, factory);
ssc.start();
ssc.awaitTermination();
As shown in Listing 2, building a Spark Streaming application generally requires 4 steps.
Construct a StreamingContext object
Just as using Spark begins with creating a SparkContext object, using Spark Streaming requires creating a StreamingContext object. The parameters needed to create a StreamingContext are largely the same as for SparkContext, including specifying the master and setting the application name. Note the parameter Durations.seconds(1) in Listing 2: Spark Streaming must be told the time interval at which to process data, for example 1 second, and it will then use 1 second as the time window for processing. This parameter must be set appropriately according to the user's requirements and the cluster's processing capacity; its lifetime spans the entire lifetime of the StreamingContext and it cannot be reset afterwards.
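As a minimal sketch of this step (the class name, the local[2] master setting, the host and port, and the 2-second interval below are illustrative choices rather than part of the original example, and the code assumes the Spark 1.x Java API used in Listing 2), a StreamingContext and its batch interval can be created like this:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BatchIntervalSketch {
  public static void main(String[] args) throws Exception {
    // local[2] is only for local testing; with a receiver, at least two threads are needed
    SparkConf conf = new SparkConf().setAppName("BatchIntervalSketch").setMaster("local[2]");
    // The second argument is the batch interval: data received in each 2-second
    // window is packaged into one batch and submitted as one job. A larger interval
    // usually raises throughput at the cost of latency, and it cannot be reset later.
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));
    // A trivial source and output operation so the context has something to run
    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
    lines.print();
    ssc.start();            // from here on, one job is submitted per batch interval
    ssc.awaitTermination();
  }
}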
Create InputDStream
Like Storm's spouts, Spark Streaming needs a data source to be specified. For example, with socketTextStream, Spark Streaming uses a socket connection as the data source to read data. Of course, Spark Streaming supports many other data sources, including Kafka streams, Flume streams, file streams, and other network streams.
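The sketch below shows two of these sources side by side; the host, port, and directory are placeholder values, and Kafka or Flume sources would additionally require the corresponding connector libraries, which are not shown here.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class InputSourceSketch {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("InputSourceSketch").setMaster("local[2]");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

    // Socket source: read \n-delimited text from a TCP server (e.g. started with `nc -lk 9999`)
    JavaReceiverInputDStream<String> socketLines = ssc.socketTextStream("localhost", 9999);

    // File source: watch a directory and stream in new text files as they appear
    JavaDStream<String> fileLines = ssc.textFileStream("/tmp/streaming-input");

    socketLines.print();
    fileLines.print();
    ssc.start();
    ssc.awaitTermination();
  }
}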
Operate on the DStream
Users can apply various operations to the DStream obtained from the data source. For example, WordCount is a typical word counting flow: the data obtained from the data source within the current time window is split into words, the words are mapped and counted with a MapReduce-style algorithm, and finally print() is used to output the result.
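Besides the per-batch counting in Listing 2, the same word pairs can also be counted over a sliding window, which illustrates the window primitive mentioned earlier. The following sketch assumes the Spark 1.x Java API with Java 8 lambdas; the host, port, and the 30-second window with a 10-second slide are illustrative values.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class WindowedCountSketch {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("WindowedCountSketch").setMaster("local[2]");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

    // Split each line into words and pair every word with an initial count of 1
    JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")));
    JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<String, Integer>(s, 1));

    // Count per batch (as in Listing 2) ...
    JavaPairDStream<String, Integer> perBatch = pairs.reduceByKey((i1, i2) -> i1 + i2);
    // ... and count over a sliding 30-second window that advances every 10 seconds
    JavaPairDStream<String, Integer> windowed = pairs.reduceByKeyAndWindow(
        (i1, i2) -> i1 + i2, Durations.seconds(30), Durations.seconds(10));

    perBatch.print();
    windowed.print();
    ssc.start();
    ssc.awaitTermination();
  }
}

Note that the window length and the slide interval must both be multiples of the batch interval.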
Start Spark Streaming
All the previous steps only set up the execution flow: the program has not actually connected to the data source or performed any operation on the data; it has merely declared the computations to be executed. Only when ssc.start() is called does the program actually begin to carry out all of the intended operations.