1. What is Spark Streaming?
Spark Streaming is a scalable, high-throughput framework for processing real-time streaming data, built on top of Spark. The data can come from a variety of sources, such as Kafka, Flume, Twitter, ZeroMQ, or TCP sockets. The framework supports the usual operations on streaming data, such as map, reduce, and join, and the processed results can be stored in a file system or a database.
With Spark Streaming, you can build data pipelines using the same APIs as for batch data and push streaming data through those pipelines. In addition, Spark Streaming's "micro-batching" approach provides fairly good resilience against some kinds of task failure.
2. The basic principles of Spark Streaming
Spark Streaming's basic approach to data processing is to cut the incoming stream into small slices by time and process each slice in a batch-like manner.
Concretely, Spark Streaming splits the real-time input data stream into chunks of duration ΔT (for example, 1 second), treats each chunk as an RDD, and processes it with RDD operations.
In other words, Spark Streaming decomposes the streaming computation into a series of short batch jobs: the input data is divided into segments that form a DStream, each segment is converted into an RDD inside Spark, the operations on the DStream become operations on those RDDs, and the intermediate results of the RDD operations are kept in memory.
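As a minimal sketch of this idea (the socket source on localhost:9999 is an illustrative assumption, not part of the original text), foreachRDD exposes the RDD behind each batch, so ordinary RDD operations can be applied to it:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchDemo")
ssc = StreamingContext(sc, 1)                      # batch interval ΔT = 1 second
lines = ssc.socketTextStream("localhost", 9999)    # any input source would do

def handle_batch(time, rdd):
    # each 1-second slice of the stream arrives here as an ordinary RDD
    print("batch at %s contains %d records" % (time, rdd.count()))

lines.foreachRDD(handle_batch)
ssc.start()
ssc.awaitTermination()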
3. DStream
We mentioned the DStream above; so what exactly is a DStream?
A DStream is essentially the RDD wrapped in a streaming abstraction: it represents the continuous stream of data we are working on. Similar to an RDD, a DStream supports three kinds of operations: transformations, window operations, and output operations.
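Here is a rough sketch of the three kinds of operations; the socket source on localhost:9999 and the window sizes are illustrative assumptions:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamOps")
ssc = StreamingContext(sc, 2)
lines = ssc.socketTextStream("localhost", 9999)

# transformation: derive a new DStream from an existing one
words = lines.flatMap(lambda line: line.split(" "))

# window operation: look at the last 10 seconds of data, sliding every 2 seconds
windowed = words.window(10, 2)

# output operation: push data out of the DStream (here, print a few elements per batch)
windowed.pprint()

ssc.start()
ssc.awaitTermination()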
4. Advantages of Spark Streaming
Spark Streaming is a real-time computing framework built on Spark that extends Spark's ability to handle large-scale streaming data.
Real-time: it can run on 100+ nodes and achieve second-level latency. Spark Streaming decomposes the streaming computation into multiple Spark jobs, and each piece of data is processed through the Spark task scheduler. With batch sizes chosen between roughly 0.5 and 2 seconds (by comparison, Storm's minimum latency is around 100 ms), Spark Streaming can cover all quasi-real-time streaming scenarios except those with extremely strict latency requirements.
Efficient and fault-tolerant: fault tolerance is critical for streaming computation. Every RDD in Spark is an immutable, distributed, recomputable dataset that records the deterministic operations used to build it, so as long as the input data is fault-tolerant, any erroneous or unavailable RDD partition can be recomputed from the original input through those transformations. Since Spark Streaming is built on RDDs, the same recovery mechanism applies to its micro-batches. (A sketch of enabling checkpointing, which complements this lineage-based recovery, follows this list of advantages.)
Throughput: Spark Streaming can be combined with Spark's batch processing and interactive queries, and its throughput is at least 2 to 5 times that of Storm. It also provides a simple, batch-like interface for implementing complex algorithms.
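The lineage-based recovery described above is automatic. In practice, a Spark Streaming application usually also enables checkpointing, so that metadata and generated RDDs are written to reliable storage and the context can be rebuilt after a driver failure. The following is a minimal sketch; the checkpoint directory, socket source, and port are illustrative assumptions, not part of the original example:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "/tmp/streaming-checkpoint"  # illustrative path

def create_context():
    sc = SparkContext("local[2]", "FaultTolerantWordCount")
    ssc = StreamingContext(sc, 2)
    ssc.checkpoint(CHECKPOINT_DIR)                    # persist metadata and generated RDDs
    lines = ssc.socketTextStream("localhost", 8888)   # illustrative source
    lines.count().pprint()                            # an output operation so the job has work to do
    return ssc

# reuse the saved checkpoint if one exists, otherwise build the context from scratch
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()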
Next, we connect Spark Streaming to a TCP socket to illustrate how to use it:
1 Creating a StreamingContext Object
First, import the StreamingContext module, which provides all of the stream-processing functionality:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamWordCount")
# create a local SparkContext that uses 2 execution threads

ssc = StreamingContext(sc, 2)
# create a StreamingContext with a batch interval (time slice) of 2 seconds
2 Creating a DStream Object
We need to connect to an open TCP port to receive the stream data. Since the source here is a TCP socket, we use the socketTextStream() function:
lines = ssc.socketTextStream("localhost", 8888)
# create a DStream whose data source is the TCP socket at port 8888 on localhost
3 Working with the DStream
We now start processing lines: first split the data received in the current batch into words, then run a standard MapReduce-style word count.
words = lines.flatMap(lambda line: line.split(" "))
# use flatMap and split to break each line received in the batch into words
4 Counting the words
The resulting words DStream is a stream of individual words; next we count them:
pairs = words.map(lambda word: (word, 1))
# the map operation turns each word into a (word, 1) tuple

wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# reduceByKey sums the counts for each word, producing (word, frequency) tuples
5 Output data
Write the processed data out to files:
" /home/feige/streaming/ss " # output folder prefix, Spark streaming automatically uses the current timestamp to generate a different folder name wordcounts.saveastextfiles (outputFile) # Output The result
6 Launching the app
To make the program actually run, you need to start the Spark Streaming computation: call the start() function to launch it, and then awaitTermination() to wait for the processing to finish or be interrupted.
ssc.start()
# launch the Spark Streaming application
ssc.awaitTermination()
# wait for the computation to terminate
Open a terminal and run:
nc -lk 8888
The -l option of nc creates a listening port that waits for new connections; the -k option keeps nc listening for further connections after the current one ends and must be used together with -l.
After executing the above command, do not close this terminal; we will type the data to be processed into it:
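For example, you could type lines like the following (any text works; these lines are purely illustrative):

hello spark hello streaming
spark streaming reads from this socket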
Open a new terminal and run our Spark Streaming application:
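Assuming the code above is saved in a file called streaming_word_count.py (the file name is an assumption for illustration), the application can be submitted from the new terminal like this:

spark-submit streaming_word_count.py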
This is the Spark Streaming application running.
Now let's look at the effect of the program execution: every 2 seconds, it processes whatever input has arrived on the monitored socket:
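Because the job uses saveAsTextFiles(), every batch creates a new output directory named with the prefix plus a timestamp in milliseconds. A listing might look roughly like this (the timestamps and counts are purely illustrative):

/home/feige/streaming/ss-1465223880000/part-00000
/home/feige/streaming/ss-1465223882000/part-00000
...
# each part file holds the (word, frequency) tuples of its batch, e.g.:
('hello', 2)
('spark', 1)
('streaming', 1)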
Conclusion:
I have been under quite a lot of pressure lately, with many small things to deal with, but I believe everything will get better once this period passes. Keep going!