Lesson 84: An In-Depth Analysis of StreamingContext, DStream, and Receiver
This lesson is divided into four parts: the first covers StreamingContext functions and source code analysis; the second covers DStream functions and source code analysis; the third covers Receiver functions and source code analysis; and the last combines StreamingContext, DStream, and Receiver to analyze the overall flow.
First, StreamingContext functions and source code analysis:
1. Create the application's main entry point via the Spark Streaming object (jssc) and, on the driver, receive source data from the data service on port 9999:
2. The main functions of Spark Streaming's StreamingContext are:
- It is the entry point of the main program;
- It provides various methods for creating DStreams to receive data from different input sources (for example: Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets);
- When instantiating a StreamingContext through its constructors, you can specify the master URL and AppName, pass in a SparkConf configuration object, or reuse a SparkContext that has already been created;
- Incoming data is turned into DStream objects;
- The start method of the StreamingContext instance launches the application's stream-computation framework, and the stop method shuts it down.
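The lifecycle above (construct the context, register input streams, then start/stop) can be sketched in plain Python. This is a conceptual stand-in, not Spark's actual API; the class and method names (`MiniStreamingContext`, `socket_text_stream`) are invented for illustration.

```python
# Conceptual sketch of the StreamingContext role described above.
# All names here are illustrative stand-ins, NOT Spark's real API.

class MiniStreamingContext:
    def __init__(self, master, app_name, batch_interval_s):
        # Mirrors passing a master URL / app name (or a SparkConf) to the constructor.
        self.master = master
        self.app_name = app_name
        self.batch_interval_s = batch_interval_s
        self.input_streams = []
        self.running = False

    def socket_text_stream(self, host, port):
        # Registers an input source; Spark would return a DStream here.
        stream = {"host": host, "port": port}
        self.input_streams.append(stream)
        return stream

    def start(self):
        # In Spark, start() launches the receivers and the job scheduler.
        if not self.input_streams:
            raise RuntimeError("no input streams registered")
        self.running = True

    def stop(self):
        self.running = False

# Usage mirroring the text: entry point on the driver, data service on port 9999.
ssc = MiniStreamingContext("local[2]", "NetworkWordCount", batch_interval_s=1)
lines = ssc.socket_text_stream("localhost", 9999)
ssc.start()
# ... batches would be processed here ...
ssc.stop()
```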
Second, DStream functions and source code analysis:
1. A DStream is a template for RDDs; DStream is an abstract class, and so is RDD.
2. The subclasses that implement DStream are as shown:
3. Take the socketTextStream method of a StreamingContext instance as an example. Executing it returns a DStream instance; its source-code call flow is as follows:
socket.getInputStream obtains the data, and a while loop stores the received data (to memory or disk).
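The receive loop just described (get the socket's input stream, then loop and store the incoming data) can be sketched with Python's standard `socket` module. The list standing in for memory/disk block storage and the name `receive_loop` are illustrative assumptions, not Spark internals.

```python
# Sketch of the receive loop described above: read from the socket's input
# stream in a loop and store each line. A plain list stands in for Spark's
# block storage (memory/disk); names are illustrative, not Spark's code.
import socket

def receive_loop(sock, store):
    """Read newline-delimited text from `sock` and store each line."""
    with sock.makefile("rb") as stream:   # analogous to socket.getInputStream
        for raw in stream:                # the while-loop over incoming data
            store.append(raw.decode("utf-8").rstrip("\n"))

# Usage with a local socket pair standing in for a server on port 9999:
a, b = socket.socketpair()
a.sendall(b"hello spark\nhello streaming\n")
a.close()                                 # end of stream ends the loop
store = []
receive_loop(b, store)
b.close()
```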
Third, Receiver functions and source code analysis:
1. A Receiver represents the data input: it receives external input data, for example fetching data from Kafka;
2. Receivers run on worker nodes;
3. A receiver on a worker node fetches data from the Kafka distributed messaging framework; the concrete implementation class is KafkaReceiver;
4. Receiver is an abstract class; the subclasses that implement data fetching are as shown:
5. If the implementation classes above do not meet your requirements, you can define your own receiver class: you only need to extend the Receiver abstract class and implement your subclass's business requirements.
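The extension point in item 5 can be sketched as an abstract base class plus a subclass. The `on_start`/`on_stop`/`store` shape loosely mirrors Spark's Receiver API, but this is a plain-Python illustration, not Spark code; `ListReceiver` is an invented toy subclass.

```python
# Sketch of the custom-receiver pattern: extend an abstract Receiver base
# class and implement the data-fetching logic in a subclass.
# Plain-Python illustration only; not Spark's actual Receiver class.
from abc import ABC, abstractmethod

class Receiver(ABC):
    def __init__(self):
        self._stored = []

    def store(self, item):
        # In Spark this would hand data to block storage; here we just buffer it.
        self._stored.append(item)

    @abstractmethod
    def on_start(self):
        """Begin receiving data (Spark calls this on a worker node)."""

    @abstractmethod
    def on_stop(self):
        """Clean up resources when the receiver is stopped."""

class ListReceiver(Receiver):
    """A toy receiver that 'fetches' records from an in-memory list
    (a KafkaReceiver would pull from Kafka brokers instead)."""
    def __init__(self, source):
        super().__init__()
        self.source = source

    def on_start(self):
        for record in self.source:
            self.store(record)

    def on_stop(self):
        pass

r = ListReceiver(["a", "b", "c"])
r.on_start()
```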
Fourth, combined flow analysis of StreamingContext, DStream, and Receiver:
(1) InputStream represents the data input stream (for example: socket, Kafka, Flume, etc.);
(2) Transformation represents a series of operations on the data, such as flatMap, map, etc.;
(3) OutputStream represents the output of the data, such as the print method in WordCount.
After the data flows in, it eventually generates jobs, and execution is ultimately based on Spark Core's RDDs. When a DStream applies transformations to incoming data, nothing actually runs at that point: the StreamingContext generates a "DStream chain" and a DStreamGraph from the transformations, and the DStreamGraph is a template for the DAG, managed by the framework. When we specify a time interval, the driver side triggers jobs at that interval based on the specific function given on the output DStream, such as print in WordCount. That function is passed to a ForEachDStream, which hands it to the RDD generated by the last DStream; the print operation on that RDD is the action that triggers the RDD computation.
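The lazy "DStream chain" behavior described above can be sketched in a few lines of Python: transformations only link nodes into a chain, and nothing computes until a per-interval job is generated from the final output node. All names here (`MiniDStream`, `generate_job`) are illustrative, not Spark's internals.

```python
# Sketch of the lazy DStream chain: map/flat_map only build the graph;
# generate_job (standing in for the per-interval job trigger) walks the
# chain and computes the batch. Illustrative names, not Spark's code.

class MiniDStream:
    def __init__(self, parent=None, func=None):
        self.parent = parent      # link in the "DStream chain"
        self.func = func          # transformation to apply per batch

    def map(self, f):
        # Building the chain does not run f; it only records it.
        return MiniDStream(self, lambda batch: [f(x) for x in batch])

    def flat_map(self, f):
        return MiniDStream(self, lambda batch: [y for x in batch for y in f(x)])

    def generate_job(self, batch):
        # Called once per interval: walk the chain, compute this batch's data.
        if self.parent is not None:
            batch = self.parent.generate_job(batch)
        return self.func(batch) if self.func else batch

# Only the output operation (like print / ForEachDStream) triggers execution:
source = MiniDStream()
words = source.flat_map(str.split).map(str.upper)
result = words.generate_job(["hello spark", "hello streaming"])
```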
Summary:
With Spark Streaming you can handle a wide variety of data-source types, such as databases, HDFS, server logs, and network streams. It is more powerful than you might imagine, yet it is often not used, and the real reason is that people do not understand Spark and Spark Streaming itself.
Written by: the IMF Spark Streaming Enterprise-Level Development Practice Team
Chief editor: Liaoliang
Note:
Material from: DT Big Data DreamWorks (IMF Legendary Action top-secret course)
For more exclusive content, please follow the WeChat public account: DT_Spark
If you are interested in big data and Spark, you can listen free of charge to teacher Liaoliang's permanently free public Spark class, every night at 20:00, in YY room number 68917580.