1. Introduction to Spark Streaming
1.1 Overview
Spark Streaming is an extension of the Spark core API that enables high-throughput, fault-tolerant processing of real-time streaming data. Data can be ingested from many sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets, and then processed with complex algorithms expressed through high-level functions such as map, reduce, join, and window. Finally, the processed results can be pushed to file systems, databases, and live dashboards. In keeping with the idea of "one stack to rule them all", you can also apply Spark's other sub-frameworks, such as machine learning and graph computation, to the data streams.
Figure: Spark Streaming processing flow
Like Spark's other sub-frameworks, Spark Streaming is built on Spark Core. Its internal processing mechanism is to receive real-time streaming data, split it into batches according to a configured time interval, process each batch with the Spark engine, and finally produce a batch of result data.
Each batch of data corresponds to an RDD instance in the Spark kernel, so the DStream representing the stream can be viewed as a set of RDDs, that is, a sequence of RDDs. Put simply, the incoming data is cut into batches and placed in a first-in, first-out queue; the Spark engine then takes one batch at a time from the queue, wraps it in an RDD, and processes it. This is a typical producer-consumer model, and it carries the classic producer-consumer problem: how to coordinate the production rate with the consumption rate.

1.2 Terminology Definitions
- Discretized stream (DStream): Spark Streaming's abstraction of a continuous, real-time data stream. Each real-time data stream being processed corresponds to a DStream instance in Spark Streaming.
- Batch data: the first step of discretization, in which the real-time stream is chopped into batches along time slices, turning stream processing into batch processing of time-slice data. As time progresses, these per-batch results form a corresponding result data stream.
- Time slice or batch interval: a man-made measure for splitting the data stream, with the time slice as the unit of splitting. The data of one time slice corresponds to one RDD instance.
- Window length: the length of time of stream data covered by a window. It must be a multiple of the batch interval.
- Sliding interval: the length of time that elapses between one window and the next. It must be a multiple of the batch interval.
- Input DStream: a special DStream that connects Spark Streaming to an external data source to read data.

1.3 Comparison of Storm and Spark Streaming
- Processing model and latency
Although both frameworks provide scalability and fault tolerance, their processing models are fundamentally different. Storm processes one event at a time and can achieve sub-second latency, while Spark Streaming processes multiple events as a batch within a short time window. As a result, Storm can achieve sub-second latency, whereas Spark Streaming incurs a certain delay.
- Fault tolerance and data guarantees
However, the trade-off for latency is the data guarantee under failure, and Spark Streaming's fault tolerance provides better support for stateful computation. In Storm, every record must be tagged and tracked as it moves through the system, so Storm can only guarantee that each record is processed at least once; when recovering from an error state, a record may be processed multiple times. This means mutable state can be updated twice, producing incorrect results.
On the other hand, Spark Streaming only needs to track records at the batch level, so it can guarantee that each batch is processed only once, even if a node fails. Although Storm's Trident library can also guarantee that a record is processed exactly once, it relies on transactional state updates, which are slower and must be implemented by the user.
- Implementation and programming API
Storm is implemented mainly in Clojure, while Spark Streaming is implemented in Scala. This is worth remembering if you want to look into how these frameworks are implemented or customize them. Storm was developed by BackType and Twitter; Spark Streaming was developed at UC Berkeley.
Storm provides a Java API and also supports APIs in other languages. Spark Streaming supports Scala and Java (and, in fact, Python as well).
- Batch processing framework integration
One of the great features of Spark Streaming is that it runs on the Spark framework. You can therefore use the same code for batch processing and for Spark Streaming programs, or query the stream interactively in Spark. This removes the need to write separate programs for stream processing and for historical data processing.
- Production support
Storm has been around for years and has been used in Twitter's production environment since 2011, as well as at other companies. Spark Streaming is a newer project and, as far as the author knows, was only adopted in production by Sharethrough in 2013.
Storm is the streaming solution in Hortonworks' Hadoop data platform, while Spark Streaming appears in MapR's distribution and Cloudera's enterprise data platform. In addition, Databricks provides commercial support for Spark, including Spark Streaming.
Both can run inside their own cluster frameworks; in addition, Storm can run on Mesos, while Spark Streaming can run on YARN and Mesos.

2. Operating Principle

2.1 Streaming Architecture
Spark Streaming is a high-throughput, fault-tolerant stream processing system for real-time data streams. It can perform complex operations such as map, reduce, and join on data from a variety of sources, including Kafka, Flume, Twitter, ZeroMQ, and TCP sockets, and save the results to an external file system or database, or push them to a real-time dashboard.
- Computation flow: Spark Streaming decomposes streaming computation into a series of short batch jobs. The batch engine is Spark Core: Spark Streaming divides the input data into segments of the configured batch size (for example, 1 second), converts each segment into an RDD (Resilient Distributed Dataset), and turns each DStream transformation into a transformation on the corresponding RDDs, producing intermediate results in memory. Depending on the needs of the business, the intermediate results of the streaming computation can be further combined or stored on external devices. The following figure shows the entire process of Spark Streaming.
Figure: Spark Streaming architecture
- Fault tolerance: fault tolerance is of paramount importance for streaming computation. First, recall the fault-tolerance mechanism of the RDD in Spark: each RDD is an immutable, recomputable, distributed dataset that records the lineage of the deterministic operations that produced it. As long as the input data is fault tolerant, any RDD partition that fails or becomes unavailable can be recomputed from the original input data by replaying those operations.
For Spark Streaming, the lineage of the RDDs is shown in the figure below. Each ellipse represents an RDD, and each circle inside an ellipse represents a partition of that RDD. The RDDs in each column form a DStream (there are three DStreams in the figure), and the last RDD in each row is the intermediate result RDD produced for one batch. Every RDD in the diagram is connected through lineage. Because the input data of Spark Streaming can come from disk, such as HDFS (which keeps multiple copies), or from the network (Spark Streaming replicates each block of network input data to two machines to guarantee fault tolerance), any lost partition of any RDD can be recomputed in parallel on other machines. This fault-tolerance recovery method is more efficient than that of a continuous-operator model such as Storm's.
Figure: Lineage of RDDs in Spark Streaming
- Real-time behavior: any discussion of real-time behavior depends on the application scenario of the streaming framework. Spark Streaming decomposes a streaming computation into multiple Spark jobs, and the processing of each batch goes through Spark's DAG decomposition and task-set scheduling. For the current version of Spark Streaming, the smallest practical batch size is around 0.5 to 2 seconds (Storm's current minimum latency is around 100 ms), so Spark Streaming can satisfy all quasi-real-time streaming scenarios except those with extremely demanding latency requirements, such as high-frequency real-time trading.
- Scalability and throughput: Spark can scale linearly to 100 nodes (4 cores per node) on EC2 and can process 6 GB/s of data (60 million records/s) with a few seconds of latency; its throughput is 2 to 5 times that of the popular Storm. Figure 4 shows a test done by Berkeley using the WordCount and Grep use cases, where each node's throughput in Spark Streaming is 670k records/s for Grep, versus 115k records/s for Storm.
Figure: Spark Streaming vs. Storm throughput comparison

2.2 Programming Model
DStream (discretized stream) is the basic abstraction of Spark Streaming and represents a continuous stream of data. A DStream can be obtained either from an external input source or by transforming an existing DStream. Internally, a DStream is represented by a sequence of RDDs over time, and each RDD contains the data of one specific time interval, as shown in Figure 7-3.
Figure 7-3: Generation of the discrete RDD sequence of a DStream along the timeline
Operations on the data in a DStream are likewise mapped onto its internal RDDs. As shown in Figure 7-4, a transformation on a DStream is carried out through transformations on the underlying RDDs to generate a new DStream; the execution engine here is Spark.

2.2.1 How to Use Spark Streaming
As an application framework built on Spark, Spark Streaming inherits Spark's programming style and is quick to pick up for users who already know Spark. The WordCount example provided with Spark Streaming is used below to describe how Spark Streaming is used.
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
1. Create a StreamingContext object. Just as Spark initialization requires creating a SparkContext object, using Spark Streaming requires creating a StreamingContext object. The parameters needed are essentially the same as for SparkContext, including specifying the master and setting the application name (such as NetworkWordCount). Note the parameter Seconds(1): Spark Streaming needs to be told the interval at which to batch the data; in the example above this is 1 second, so Spark Streaming uses 1 second as the time window for processing. This parameter should be set according to the user's requirements and the processing capacity of the cluster;
2. Create an input DStream. Like a Storm spout, Spark Streaming needs to be told its data source. In the example above, socketTextStream is used, so Spark Streaming reads data from a socket connection. Of course, Spark Streaming supports a variety of other data sources, including Kafka, Flume, HDFS/S3, Kinesis, and Twitter.
3. Operate on the DStream. The user can apply various operations to the DStream obtained from the data source. The example above shows a typical WordCount execution flow: the data obtained from the data source in the current time window is first split into words, the counts are then computed using the map and reduceByKey methods, and finally the print() method outputs the results;
4. Start Spark Streaming. Everything before this point only sets up the execution plan: the program has not yet connected to the data source or touched any data. Only when ssc.start() is called does the program actually start performing all the expected operations.
At this point we have a general idea of how Spark Streaming is used; in later chapters we will delve into its execution through the source code.

2.2.2 Input Sources for DStreams
All operations in Spark Streaming are stream-based, and the input source is the starting point of this series of operations. An input DStream is a DStream that represents the source of the input data stream. Spark Streaming has two built-in categories of stream sources:
- Basic sources: sources directly available in the StreamingContext API, for example file systems, socket connections, and Akka actors;
- Advanced sources: sources such as Kafka, Flume, Kinesis, and Twitter, which can be created through extra utility classes.

2.2.2.1 Basic Sources
In the earlier example of how to use Spark Streaming, we already saw the ssc.socketTextStream() method, which creates a DStream from text data received over a TCP socket connection. Besides sockets, the StreamingContext API also provides methods for creating DStreams from files and Akka actors as input sources.
Spark Streaming provides the StreamingContext.fileStream(dataDirectory) method, which reads data from files in any HDFS-compatible file system (such as HDFS, S3, NFS, and so on) and creates a DStream from them. Spark Streaming monitors the dataDirectory directory, and any file created in that directory is processed (files written into nested directories are not supported). Note that all files must have the same data format; files must be created in dataDirectory by atomically moving or renaming them into the directory; and once moved, a file must not be changed, so if a file is continuously appended to, the new data will not be read. For simple text files, the simpler method StreamingContext.textFileStream(dataDirectory) can be used to read the data.
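As a minimal sketch (reusing the ssc and the word-count pattern from the WordCount example in section 2.2.1; the directory path below is a placeholder), a file-based DStream can be created like this:

// Monitor a directory for newly created text files; each new file becomes part of a batch.
// The path is a placeholder; replace it with a real HDFS or local directory.
val fileLines = ssc.textFileStream("hdfs://namenode:8020/user/spark/streaming-input")
val fileWordCounts = fileLines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
fileWordCounts.print()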
Spark Streaming can also create a DStream from a custom actor stream, receiving data through Akka actors with the method StreamingContext.actorStream(actorProps, actorName). In addition, the StreamingContext.queueStream(queueOfRDDs) method creates a DStream backed by a queue of RDDs, where each RDD pushed into the queue is treated as one batch of data in the DStream, as sketched below.
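A minimal sketch of a queue-based DStream, handy for local testing (it reuses the ssc from the WordCount example; the queue contents are illustrative only):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

val rddQueue = new mutable.Queue[RDD[Int]]()   // the queue that backs the DStream
val queueDStream = ssc.queueStream(rddQueue)   // each queued RDD becomes one batch
queueDStream.map(_ * 2).print()

ssc.start()
for (i <- 1 to 5) {
  // Push one RDD per second; Spark Streaming consumes one per batch interval.
  rddQueue.synchronized { rddQueue += ssc.sparkContext.makeRDD(1 to 100) }
  Thread.sleep(1000)
}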
2.2.2.2 Advanced Sources

This type of source requires interfacing with external, non-Spark libraries, some of which have complex dependencies (such as Kafka and Flume). Therefore, creating DStreams from these sources requires declaring the dependencies explicitly. For example, to create a DStream from the stream of Twitter tweets, you must follow these steps:
1) Linking: add the spark-streaming-twitter_2.10 dependency to your SBT or Maven project.
2) Programming: import the TwitterUtils class and create a DStream with the TwitterUtils.createStream method, as in the sketch after this list.
3) Deployment: package all dependent JARs (including spark-streaming-twitter_2.10 and its transitive dependencies) with the application and deploy it.
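A minimal sketch of step 2 (it reuses the ssc from the WordCount example; passing None for the authorization makes Twitter4J read its OAuth credentials from system properties):

import org.apache.spark.streaming.twitter.TwitterUtils

// Create a DStream of tweets and extract the hashtags from each one.
val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
hashTags.print()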
Note that these advanced sources are generally not available in the Spark shell, so applications based on them cannot be tested there. If you really must use them in the Spark shell, you need to download the corresponding Maven artifact's JAR along with its dependencies and add them to the classpath.
Some of these advanced sources are as follows:
- Twitter: Spark Streaming's TwitterUtils class uses Twitter4J. Twitter4J supports any method of providing authentication information; you can receive the public stream of tweets, or a filtered stream based on keywords.
- Flume: Spark Streaming can receive data from Flume.
- Kafka: Spark Streaming can receive data from Kafka; a minimal sketch follows this list.
- Kinesis: Spark Streaming can receive data from Kinesis.
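As a hedged illustration of the Kafka case (it reuses the ssc from the WordCount example; the ZooKeeper address, consumer group, and topic name are placeholders), a receiver-based Kafka DStream can be created with KafkaUtils from the spark-streaming-kafka artifact:

import org.apache.spark.streaming.kafka.KafkaUtils

// ZooKeeper quorum, consumer group, and topic-to-thread map are placeholders.
val zkQuorum = "zookeeper-host:2181"
val group    = "spark-streaming-demo"
val topics   = Map("my-topic" -> 1)

// createStream returns (key, message) pairs; keep only the message text.
val kafkaLines = KafkaUtils.createStream(ssc, zkQuorum, group, topics).map(_._2)
kafkaLines.print()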
One thing to reiterate: before you start writing your own Spark Streaming program, be sure to add the JAR of the corresponding advanced-source dependency to the appropriate artifact of your SBT or Maven project. The common input sources and their corresponding JAR packages are shown in the figure below.
In addition to the built-in sources, an input DStream can also be created from a custom data source; all that is needed is to implement a user-defined receiver, as sketched below.
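A minimal sketch of a user-defined receiver, modeled on the socket case (the class name and the host/port are illustrative; the receiver hands each line it reads to Spark Streaming via store()):

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomSocketReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a thread that connects to the socket and pushes lines into Spark Streaming.
    new Thread("Custom Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do here: the receiving thread checks isStopped() and exits on its own.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)                      // hand the record to Spark Streaming
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again") // ask Spark Streaming to restart the receiver
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

// Usage: val customLines = ssc.receiverStream(new CustomSocketReceiver("localhost", 9999))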
2.2.3 Operations on DStreams

Like RDDs, DStreams offer their own set of operations, which can be divided into three categories: normal transformation operations, window transformation operations, and output operations.

2.2.3.1 Normal Transformation Operations

The normal transformation operations are shown in the following table:
Transformation | Meaning
map(func) | Returns a new DStream by passing each element of the source DStream through the function func.
flatMap(func) | Similar to map, but each input element can be mapped to 0 or more output elements.
filter(func) | Returns a new DStream containing only the elements of the source DStream for which func returns true.
repartition(numPartitions) | Changes the number of partitions of the DStream to the value given by numPartitions.
union(otherStream) | Returns a new DStream containing the union of the elements of the source DStream and otherStream.
count() | Counts the number of elements in each RDD of the source DStream and returns a new DStream whose RDDs each contain a single element.
reduce(func) | Aggregates the elements of each RDD of the source DStream using the function func (which takes two arguments and returns one result), returning a new DStream whose RDDs each contain a single element.
countByValue() | Computes the frequency of each element in each RDD of the DStream and returns a new DStream of (K, Long) pairs, where K is the element type and Long is the element's frequency.
reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) key-value pairs, returns a new DStream of (K, V) pairs in which the values of each key are aggregated using the function func. Note: by default this submits tasks with Spark's default parallelism (2 in local mode, 8 in cluster mode); the optional numTasks argument sets a different number of parallel tasks.
join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) key-value pairs, returns a new DStream of (K, (V, W)) pairs.
cogroup(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) key-value pairs, returns a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func) | Returns a new DStream by applying an RDD-to-RDD function to each RDD of the source DStream; this can be used to apply arbitrary RDD operations on a DStream.
updateStateByKey(func) | Returns a new "state" DStream in which the state of each key is updated by applying the given function func to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state for each key.
Among the operations listed above, the transform() method and the updateStateByKey() method are worth exploring in depth:
- The transform(func) operation
The transform operation, along with its variant transformWith, allows an arbitrary RDD-to-RDD function to be applied on a DStream. It can be used to apply any RDD operation that is not directly exposed in the DStream API. For example, joining every batch of a data stream against another dataset is not exposed directly in the DStream API, but transform makes it easy, which makes DStreams very powerful. You can, for instance, join the input stream against a pre-computed RDD of spam information (which may itself be generated by Spark) and then filter on the result to do real-time data cleansing, along the lines of the official pseudo-code, sketched below. In fact, machine learning and graph computation algorithms can also be applied inside transform.
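A minimal sketch of this idea, assuming the wordCounts DStream from the WordCount example and an illustrative blacklist (in practice spamInfoRDD would be loaded from storage rather than parallelized inline):

// Hypothetical pre-computed blacklist of (word, flag) pairs.
val spamInfoRDD = ssc.sparkContext.parallelize(Seq(("spamword", true)))

val cleanedDStream = wordCounts.transform { rdd =>
  rdd.leftOuterJoin(spamInfoRDD)                       // join each batch with the blacklist
     .filter { case (_, (_, flag)) => flag.isEmpty }   // keep only words not in the blacklist
     .map { case (word, (count, _)) => (word, count) } // restore the (word, count) shape
}
cleanedDStream.print()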
- The updateStateByKey operation
The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use it, you must take two steps:
(1) Define the state: the state can be any data type.
(2) Define the state update function: a function that specifies how the state is updated from the previous state and the new values obtained from the input stream.
Let's illustrate with an example. Suppose you want to keep a running word count over a text stream. Here, the running count is the state, and it is an integer. We define the following update function:
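Following the official WordCount example, the update function might look like this:

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  // Add the counts from the current batch to the previous running count (0 if there is none yet).
  val newCount = runningCount.getOrElse(0) + newValues.sum
  Some(newCount)
}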
This function is applied to a DStream containing key-value pairs (in this example, the DStream of (word, 1) pairs). For every key (such as each word in WordCount) it calls the update function, where newValues holds the values that arrived in the current batch and runningCount holds the previous state.
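Wiring it up might look like this (again reusing the pairs DStream from the WordCount example; note that stateful operations require a checkpoint directory, whose path here is a placeholder):

ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")  // required for stateful operations

// The state DStream of running counts, updated once per batch.
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
runningCounts.print()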
2.2.3.2 Window Transformation Operations
Spark Streaming also provides window computations, which let you apply transformations over a sliding window of data. The window transformation operations are as follows (a usage sketch follows the table):
Transformation | Meaning
window(windowLength, slideInterval) | Returns a new DStream computed from windowed batches of the source DStream.
countByWindow(windowLength, slideInterval) | Returns the number of elements in the DStream over a sliding window.
reduceByWindow(func, windowLength, slideInterval) | Aggregates the elements of the source DStream over a sliding window using func, producing a new DStream.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) key-value pairs, returns a new DStream of (K, V) pairs in which the values of each key are aggregated with the function func over batches in a sliding window.
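As referenced above, here is a minimal sketch of a windowed word count, reusing the pairs DStream from the WordCount example (with a 1-second batch interval, the 30-second window length and 10-second sliding interval are both multiples of the batch interval, as required):

// Count words over the last 30 seconds of data, recomputed every 10 seconds.
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()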