Features:
Spark Streaming provides scalable, high-throughput, fault-tolerant processing of real-time data streams.
Spark Streaming can ingest data from many sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets, and offers high-level operators such as map, reduce, join, and window for expressing complex processing logic.
Spark Streaming can push the processed data to file systems, databases, or live dashboards for display.
You can apply Spark's machine learning and graph processing algorithms to Spark Streaming's data streams.
Spark Streaming receives data from real-time streams and divides it into small batches that are then processed by the Spark engine; in effect, Spark Streaming processes data streams as a sequence of micro-batches.
Spark Streaming provides a high-level abstraction for such a continuous data stream, called a discretized stream, or DStream. A DStream can be created from an input source such as Kafka, Flume, or Kinesis, or derived from another DStream through operators. Internally, a DStream is represented as a sequence of RDDs.
Introductory example analysis
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

// Local mode with two threads (one for the socket receiver, one for processing), 1-second batches.
SparkConf conf = new SparkConf().setAppName("stream1").setMaster("local[2]");
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(1));
// DStream of text lines received from a TCP source (host "localhost", port 9999).
JavaReceiverInputDStream<String> lines = jsc.socketTextStream("localhost", 9999);
// Split each line into words and map every word to a (word, 1) pair.
JavaPairDStream<String, Long> pairs = lines
        .flatMap(str -> Arrays.asList(str.split(" ")).iterator())
        .mapToPair(str -> new Tuple2<String, Long>(str, 1L));
// Sum the counts per word within each batch and print the result.
JavaPairDStream<String, Long> res = pairs.reduceByKey((v1, v2) -> v1 + v2);
res.print();
jsc.start();                // start the streaming computation
try {
    jsc.awaitTermination(); // wait for the computation to terminate
} catch (InterruptedException e) {
    e.printStackTrace();
}
StreamingContext (here its Java wrapper, JavaStreamingContext) is the entry point of Spark Streaming; in this example the batch interval is set to 1 second.
Using this context object, we create a DStream that represents the data stream coming from a TCP source, which is identified by a host name (here, localhost) and a port (here, 9999).
Here lines is the data stream received from the data server; each of its records is one line of text. Next, we need to split these lines into words on spaces.
flatMap is a one-to-many mapping operator: it maps each record of the source DStream to multiple records, producing a new DStream. In this example, each line in lines is split by flatMap into multiple words, yielding a new DStream of words. We can then count those words.
That DStream of words is converted, via mapToPair (a one-to-one mapping), into a DStream named pairs containing (word, 1) key-value pairs, and reduceByKey is then applied to pairs to obtain the frequency of each word in each batch.
Note that the code above only defines the computation; at this point Spark Streaming has not started processing any data. Processing does not begin until the streaming context is started, which is why the example ends with:
jsc.start();              // start the streaming computation
jsc.awaitTermination();   // wait for the computation to terminate
First, run netcat (a small utility available on most Unix-like systems) as the data server:
$ nc -lk 9999
Then run the Spark Streaming program. If you now type a few words into the terminal running netcat, those words and their counts will appear in the terminal running the example.
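For reference, a hedged example of launching the compiled program with spark-submit (the class and jar names below are placeholders, and the master is already set to local[2] inside the code):
$ spark-submit --class com.example.Stream1 stream1-example.jar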
Note that a StreamingContext internally creates a SparkContext object (SparkContext is the entry point of every Spark application), which can be accessed from the streaming context; in the Java example above, that is jsc.sparkContext().
StreamingContext also takes a second constructor parameter, the batch interval, whose value should be chosen according to the latency requirements of the application and the available cluster resources.
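As a minimal sketch (the 5-second interval and variable names are illustrative only), a streaming context can also be built on top of an existing SparkContext, and the underlying context can be read back from it:
// Additional import: org.apache.spark.api.java.JavaSparkContext
SparkConf conf = new SparkConf().setAppName("stream1").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);                              // the single SparkContext for this JVM
JavaStreamingContext jsc = new JavaStreamingContext(sc, Durations.seconds(5)); // batch interval: 5 seconds
JavaSparkContext underlying = jsc.sparkContext();                              // the same SparkContext, read back from the streaming context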
Points to watch:
Once a StreamingContext has been started, no new streaming computation can be added to it or modified.
Once a StreamingContext has been stopped, it cannot be restarted.
Only one StreamingContext can be active in a JVM at the same time.
StreamingContext.stop() also stops the associated SparkContext. If you do not want to stop the SparkContext, set the optional stopSparkContext parameter of StreamingContext.stop() to false.
A SparkContext can be reused to create multiple StreamingContexts, as long as the previous StreamingContext is stopped with stopSparkContext set to false before the next one is created; a minimal sketch of this pattern follows this list.
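A minimal sketch of the reuse pattern (variable names are illustrative; sc is an existing JavaSparkContext):
JavaStreamingContext first = new JavaStreamingContext(sc, Durations.seconds(1));
// ... define and run the first streaming computation ...
first.stop(false);                       // stop streaming only; the shared SparkContext stays alive
JavaStreamingContext second = new JavaStreamingContext(sc, Durations.seconds(5));
// ... define the next streaming computation on the same SparkContext ...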
Discrete data streams (DStreams)
A discretized stream (DStream) is the most basic abstraction in Spark Streaming. It represents a continuous stream of data, either received from a data source or produced by transforming another DStream. Internally, a DStream consists of a series of consecutive RDDs, each of which contains the data of one batch interval.
Any operator applied to a DStream is translated into operations on its underlying RDDs, and those RDD transformations are still computed by the Spark engine. The DStream operators hide these details and give developers a more convenient high-level API.
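One operator that makes this relationship explicit is transform, which applies an arbitrary RDD-to-RDD function to every batch. A small sketch, reusing the lines stream from the introductory example:
// Drop empty lines from every micro-batch by working directly on the underlying RDDs.
JavaDStream<String> nonEmpty = lines.transform(rdd -> rdd.filter(line -> !line.isEmpty()));
nonEmpty.print();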
Input DStream and receiver
An input DStream represents the stream of data flowing in from a streaming source. In the earlier example, the lines object is an input DStream representing the data received from the netcat server. Every input DStream (except file streams) is associated with a Receiver, an object dedicated to pulling data from the source into memory.
Spark Streaming provides two categories of built-in streaming sources:
Basic sources: Sources that can be used directly in the StreamingContext API, such as file systems, socket connections, or Akka actors.
Advanced sources: Sources that depend on additional tools, such as Kafka, Flume, Kinesis, and Twitter. These sources require extra dependencies; see the linking section of the official guide for details.
Note that if you need to pull data from several sources at the same time, you need to create several input DStreams, which in turn creates several receivers. Keep in mind that each receiver runs as a long-lived task inside a Spark executor, so every receiver permanently occupies one of the CPU cores allocated to the Spark Streaming application.
Therefore, when running locally, be sure to set the master to "local[n]", where n > the number of receivers.
Similarly, when a Spark Streaming application runs on a cluster, the number of CPU cores allocated to it must be greater than the total number of receivers; otherwise, the application will only receive data and never process it.
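For example (a sketch; the second port, 9998, is an arbitrary choice), an application with two socket receivers needs at least "local[3]" when run locally: two cores for the receivers plus at least one for processing:
SparkConf conf = new SparkConf().setAppName("twoReceivers").setMaster("local[3]");
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(1));
JavaReceiverInputDStream<String> a = jsc.socketTextStream("localhost", 9999); // receiver #1
JavaReceiverInputDStream<String> b = jsc.socketTextStream("localhost", 9998); // receiver #2
JavaDStream<String> merged = a.union(b); // combine both streams before further processing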
Basic data source
We have already used ssc.socketTextStream(…) to receive text data over a TCP connection. Besides TCP sockets, the StreamingContext API also supports creating streams from files or Akka actors.
File streams (File Streams): a file stream can be created from any file system compatible with the HDFS API (including HDFS, S3, NFS, etc.) as follows:
streamingContext.fileStream<KeyClass, ValueClass, InputFormatClass>(dataDirectory);
Spark Streaming will monitor the dataDirectory directory and process any new files in the directory (currently, nested directories are not supported). Note:
The data format of each file must be consistent.
The files in dataDirectory must be created by moving or renaming.
Once the file is moved into dataDirectory, it cannot be changed. So if this file is subsequently written, the newly written data will not be read.
For simple text files, the simpler way is to call streamingContext.textFileStream(dataDirectory).
In addition, a file data stream is not receiver-based, so it does not need a dedicated CPU core.
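A minimal sketch of the simple text-file form (the directory path below is a placeholder):
// Watch the directory and turn every newly arrived text file into lines of the stream.
JavaDStream<String> fileLines = jsc.textFileStream("hdfs://namenode/path/to/dataDirectory");
fileLines.print();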
Queue of RDDs as a Stream: to test a Spark Streaming application, you can create a DStream from a queue of RDDs by calling streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue is treated as one batch of data in the DStream and processed like any other stream data.
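A hedged test sketch (reusing the jsc context from the introductory example; the queue contents are arbitrary):
// Each RDD pushed into the queue is served as one micro-batch of the resulting DStream.
Queue<JavaRDD<Integer>> rddQueue = new LinkedList<>();   // java.util.Queue, java.util.LinkedList
for (int i = 0; i < 3; i++) {
    rddQueue.add(jsc.sparkContext().parallelize(Arrays.asList(1, 2, 3, 4, 5)));
}
JavaDStream<Integer> testStream = jsc.queueStream(rddQueue);
testStream.print();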
Custom data source
An input DStream can also be created from a custom data source: all you need to do is implement a custom receiver that reads data from that source and pushes it into Spark. See: http://spark.apache.org/docs/...
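A heavily abbreviated sketch of the shape such a receiver takes (the data it produces here is a stand-in, not from the linked guide):
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// A toy receiver that pushes one constant line per second into Spark.
class ConstantLineReceiver extends Receiver<String> {
    ConstantLineReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }
    @Override
    public void onStart() {
        // Start a background thread that produces data and hands it to Spark via store().
        new Thread(() -> {
            while (!isStopped()) {
                store("hello from a custom receiver");
                try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
            }
        }).start();
    }
    @Override
    public void onStop() {
        // Nothing to clean up: the thread above checks isStopped() and exits on its own.
    }
}

// Plug the receiver into the streaming context:
JavaReceiverInputDStream<String> custom = jsc.receiverStream(new ConstantLineReceiver());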
Receiver reliability
From the point of view of reliability, there are roughly two kinds of data sources. Sources such as Kafka and Flume allow the transferred data to be acknowledged: a system that receives data from such reliable sources can acknowledge the data it has received, which ensures that no data is lost under any kind of failure. Receivers can therefore be divided into two categories:
Reliable receiver: a reliable receiver sends an acknowledgment to the reliable source once the data has been received and stored in Spark (with replication).
Unreliable receiver: an unreliable receiver does not send any acknowledgment.