Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources, such as Kafka, Flume, Kinesis, or TCP sockets, and can be processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed data can be pushed out to file systems, databases, and live dashboards.
Internally, it works as follows: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, batch by batch.
Computation flow: Spark Streaming decomposes a streaming computation into a series of short batch jobs. The batch engine is Spark Core. Spark Streaming divides its input data into segments of the batch interval (for example, 1 second), forming a discretized stream (DStream). Each segment of data is converted into an RDD (Resilient Distributed Dataset) in Spark, and each transformation on the DStream becomes a transformation on the underlying RDDs, producing intermediate results held in memory. Depending on the needs of the business, the streaming computation can accumulate these intermediate results or store them on external storage.
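The micro-batch model described above can be sketched in plain Java without any Spark dependency. This is a conceptual illustration only; the class and method names (`discretize`, `process`) are invented for this sketch and are not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Conceptual sketch: a "discretized stream" is just the input chopped into
// fixed-size batches, with the same short batch job applied to each batch.
// This mimics what Spark Streaming does with DStreams and RDDs; it is NOT
// Spark code.
public class MicroBatchSketch {
    // Split the incoming records into batches of `batchSize` elements
    // (standing in for a 1-second batch interval).
    static <T> List<List<T>> discretize(List<T> stream, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < stream.size(); i += batchSize) {
            batches.add(stream.subList(i, Math.min(i + batchSize, stream.size())));
        }
        return batches;
    }

    // Apply the same batch computation to every batch, collecting one
    // result per batch -- the "final result stream".
    static <T, R> List<R> process(List<List<T>> batches, Function<List<T>, R> job) {
        List<R> results = new ArrayList<>();
        for (List<T> batch : batches) {
            results.add(job.apply(batch));
        }
        return results;
    }

    public static void main(String[] args) {
        List<Integer> stream = List.of(1, 2, 3, 4, 5, 6, 7);
        List<List<Integer>> batches = discretize(stream, 3);
        // Each short "batch job" here just sums its batch.
        List<Integer> sums = process(batches, b -> b.stream().mapToInt(Integer::intValue).sum());
        System.out.println(sums); // [6, 15, 7]
    }
}
```

In real Spark Streaming the batches are RDDs and the batch job is the translated DStream transformation, but the overall control flow is the same.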
WordCount Example Implementation (Java)
package stuspark.com;

import java.util.Arrays;
import java.util.Iterator;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class JavaSparkStreaming {
    public static void main(String[] args) throws InterruptedException {
        // JavaStreamingContext is the main entry point for streaming functionality.
        // Create a local StreamingContext with two execution threads and a batch
        // interval of 1 second. The master must be local[n] with n >= 2 ("at least
        // 2 threads"), because the receiver occupies one thread in a continuous
        // loop while receiving data, leaving the other thread(s) to process it.
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Create a DStream that represents streaming data from a TCP source,
        // specified as a hostname (e.g. localhost) and port (e.g. 9999). The
        // lines DStream represents the data that will be received from the data
        // server; each record in this stream is a line of text.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Split each line into words on spaces. flatMap is a one-to-many DStream
        // operation that creates a new DStream by generating multiple new records
        // (words) from each record (line) in the source DStream. The Java API
        // provides convenience classes such as FlatMapFunction to help define
        // DStream transformations. A DStream is essentially a template for RDDs:
        // before execution, Spark Streaming translates each DStream operation of
        // each batch into an operation on the corresponding RDD.
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String x) {
                return Arrays.asList(x.split(" ")).iterator();
            }
        });

        // Count the words: map each word to a (word, 1) pair ...
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<>(s, 1);
            }
        });

        // ... then sum the counts for each word.
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        // Print the first ten elements of each RDD generated in this DStream.
        wordCounts.print();

        jssc.start();            // Start the computation
        jssc.awaitTermination(); // Wait for the computation to terminate
    }
}
1. Create a StreamingContext object. Just as initializing Spark requires creating a SparkContext, using Spark Streaming requires creating a StreamingContext. The parameters are basically the same as for SparkContext, including specifying the master and setting the application name (such as NetworkWordCount). Note the parameter Durations.seconds(1): Spark Streaming must be told the time interval at which to process data. In the example above this is 1 s, so Spark Streaming uses 1 second as the time window for processing. This parameter should be set according to the user's requirements and the processing capacity of the cluster.
2. Create an InputDStream. Like a Storm spout, Spark Streaming needs to be told its data source. In the example above, socketTextStream makes Spark Streaming read data from a socket connection. Spark Streaming also supports a variety of other data sources, including Kafka, Flume, HDFS/S3, Kinesis, and Twitter.
3. Operate on the DStream. The user can perform various operations on the DStream obtained from the data source. The example above shows a typical WordCount execution flow: the data received in the current time window is first split into words, the counts are then computed using the mapToPair and reduceByKey methods, and finally the print() method outputs the result.
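The per-batch logic of that flow (split into words, map each word to a pair, reduce by key) can be shown without Spark. This is a plain-Java sketch that mirrors the DStream operations in the example; the class and method names are invented for illustration and this is not Spark code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the WordCount pipeline applied to one batch of lines:
// flatMap (split on spaces) -> mapToPair (word, 1) -> reduceByKey (sum).
public class WordCountSketch {
    static Map<String, Integer> countWords(List<String> batch) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : batch) {
            for (String word : line.split(" ")) {     // flatMap: line -> words
                if (word.isEmpty()) continue;
                counts.merge(word, 1, Integer::sum);  // mapToPair + reduceByKey
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(List.of("hello world", "hello spark"));
        System.out.println(counts.get("hello")); // 2
        System.out.println(counts.get("world")); // 1
    }
}
```

In Spark Streaming this computation runs once per batch interval on each batch's RDD, which is why the printed counts reset every second instead of accumulating.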
4. Start Spark Streaming. All the preceding steps only set up the execution plan: the program has not actually connected to the data source or done anything with the data. Only after jssc.start() is called does the program actually perform all the planned actions.
First, run Netcat (a small utility found in most Unix-like systems) as a data server on port 9999, e.g. with `nc -lk 9999`.
Then, in a different terminal, start the example so that it connects to the same host and port (localhost 9999), for example by submitting the compiled application with spark-submit.
Any lines you type in the terminal running the Netcat server will then be counted and printed on the screen every second.
Learning Notes: Spark Streaming WordCount in Java