Overview
Hadoop's MapReduce and Spark SQL can only perform offline computation and cannot satisfy business scenarios with strict real-time requirements, such as real-time recommendation and real-time website performance analysis. Streaming computation solves these problems, and Spark Streaming is a commonly used streaming computation framework. As one of the five core components of Spark, Spark Streaming natively supports multiple data sources and can be used together with Spark MLlib and GraphX. It offers high throughput and a fault-tolerance mechanism. Data can come from Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed with high-level operators such as map, reduce, join, and window; the processed data can then be pushed out to file systems and databases. In short, Spark Streaming processes data from different data sources in real time and outputs the results to an external file system.
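To make this concrete, here is a minimal sketch of such a pipeline: it reads from a TCP socket, applies a windowed word count, and writes the results to a file system. It is not taken from the official examples; the host, port, window sizes, application name, and output path are all assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Build a streaming context with a 1-second batch interval (illustrative values)
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingOverviewSketch")
val ssc = new StreamingContext(conf, Seconds(1))
// Ingest lines from a TCP socket; Kafka, Flume, etc. would use their own connectors
val lines = ssc.socketTextStream("localhost", 9999)
// High-level operators: map, reduce, window, ...
val words = lines.flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
// Push the processed data out, e.g. to a file system (the path is an assumption)
counts.saveAsTextFiles("hdfs:///tmp/wordcounts")
ssc.start()
ssc.awaitTermination()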
Working principle
Coarse-grained
Spark Streaming receives the real-time data stream and cuts the data into small blocks according to a specified time interval, then hands each small block to the Spark engine for processing.
Fine-grained
Spark Streaming receives the real-time input data stream and splits the data into batches. For example, the data collected every 1 second is encapsulated into one batch; each batch is then handed to Spark's computation engine for processing, producing a result stream that is likewise composed of batches.
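For example, the batch interval is fixed when the StreamingContext is created. A minimal sketch, assuming an existing SparkContext sc (as in spark-shell) and a local socket source:
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Every 1 second of received data is packaged into one batch (one RDD)
val ssc = new StreamingContext(sc, Seconds(1))
val input = ssc.socketTextStream("localhost", 9999)
// The result is again a stream of batches: one count is printed per batch interval
input.count().print()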
Spark Streaming provides a high-level abstraction called DStream (Discretized Stream), which represents a continuous data stream. A DStream can be created from input data sources such as Kafka, Flume, ZeroMQ, and Kinesis, or by applying high-level operators such as map, reduce, join, and window to other DStreams.
Internally, a DStream is a series of continuously generated RDDs. The RDD is Spark Core's core abstraction: an immutable, distributed dataset. Each RDD in a DStream contains the data of one time interval.
Operators applied to a DStream, such as map, are translated under the hood into operations on each RDD in the DStream. For example, applying a map operation to a DStream produces a new DStream; internally, the map is applied to the RDD of each time interval in the input DStream, and each resulting RDD becomes the RDD of the corresponding time interval in the new DStream. These underlying RDD transformations are still executed by Spark Core's computation engine; Spark Streaming wraps Spark Core, hides the details, and provides developers with an easy-to-use high-level API.
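The following sketch illustrates this, assuming an existing StreamingContext ssc and a socket source (all names are illustrative): a map on a DStream amounts to mapping over the RDD of every batch, and foreachRDD exposes those per-batch RDDs directly.
val lines = ssc.socketTextStream("localhost", 9999)
// map on the DStream ...
val upper = lines.map(_.toUpperCase)
// ... is equivalent to applying the same map to the RDD of every batch
val upperViaTransform = lines.transform(rdd => rdd.map(_.toUpperCase))
// foreachRDD hands over the underlying RDD of each time interval
upper.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}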
(Figure: basic working principle of Spark Streaming)
Hands-on practice
WordCount example (real-time counting)
Requirement: type characters interactively and count the occurrences of the entered characters in real time with Spark Streaming.
Code description
A similar Spark Streaming example ships in the examples directory of the Spark installation, and the corresponding source code can be viewed on GitHub. We use the JavaNetworkWordCount example here; its usage is documented in the code.
We submit the job to Spark in the following two ways.
spark-submit
./spark-submit --master local[2] --class org.apache.spark.examples.streaming.JavaNetworkWordCount --name NetworkWordCount ../examples/jars/spark-examples_2.11-2.1.0.jar localhost 9999
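Here --master local[2] runs locally with two threads (one for the receiver and one for processing), --class selects the example's main class inside the examples jar, and the trailing localhost 9999 are the hostname and port the example connects to for its text stream; adjust the jar path and version to match your installation.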
Test
nc -lk 9999
If the prompt nc: command not found appears, the nc package is not installed; install it with the following commands:
yum install nc -y
yum install nmap -y
spark-shell submission
Start spark-shell
./spark-shell --master local[2]
Execute the following code after startup
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Create a streaming context from the existing SparkContext with a 1-second batch interval
val ssc = new StreamingContext(sc, Seconds(1))
// Connect to the socket opened by nc; replace 192.168.30.130 with the host where nc is running
val lines = ssc.socketTextStream("192.168.30.130", 9999)
// Split each line into words and count the occurrences of each word within the batch
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// Print the counts of the current batch to the console
wordCounts.print()
// Start the computation and block until it terminates
ssc.start()
ssc.awaitTermination()
The difference between the two:
spark-submit is used in production environments, while spark-shell is used for testing code during development.
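For reference, here is a minimal sketch of the same word count written as a standalone application that could be packaged into a jar and submitted with spark-submit; the object name, application name, master setting, and host are illustrative, not from the original example.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountApp {
  def main(args: Array[String]): Unit = {
    // local[2]: at least two threads are needed locally, one for the receiver and one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCountApp")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}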