"Note" This series of articles, as well as the use of the installation package/test data can be in the "big gift –spark Getting Started Combat series" get
1 Spark Streaming Introduction
1.1 Overview
Spark Streaming is an extension of the Spark core API that enables high-throughput, fault-tolerant processing of real-time streaming data. It supports a variety of data sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets. After data is acquired from a source, it can be processed with high-level functions such as map, reduce, join, and window to implement complex algorithms, and the results can finally be stored in file systems, databases, or live dashboards. Because Spark follows the "one stack to rule them all" philosophy, the other Spark sub-frameworks, such as machine learning and graph computation, can also be applied to the streaming data.
The following diagram shows how Spark Streaming processes streaming data:
All of Spark's sub-frameworks are built on Spark Core. Spark Streaming's internal processing mechanism is to receive real-time streaming data, split it into batches at a certain time interval, process each batch with the Spark engine, and finally produce a batch of result data.
Each batch of data corresponds to an RDD instance in the Spark kernel, so the DStream that represents the stream can be regarded as a sequence of RDDs. Put simply, incoming data is grouped into batches and placed in a first-in, first-out queue; the Spark engine then takes one batch at a time from the queue, wraps it in an RDD, and processes it. This is a typical producer-consumer model, and it brings the classic producer-consumer problem with it: how to coordinate the production rate with the consumption rate.
1.2 Terminology Definitions
- Discretized stream (DStream): Spark Streaming's abstraction of a continuous, real-time data stream. Each real-time data stream being processed corresponds to a DStream instance in Spark Streaming.
- Batch data: the result of the first step of discretization, in which the live stream is cut into batches by time slice, converting stream processing into batch processing of time-sliced data. As time progresses, these processed batches form a corresponding stream of results.
- Time slice or batch interval: the man-made yardstick for slicing the data stream; the time slice is the basis on which the data is split. The data in one time slice corresponds to one RDD instance.
- Window length: the length of time of stream data covered by a window. It must be a multiple of the batch interval.
- Sliding interval: the length of time that elapses from one window to the next. It must also be a multiple of the batch interval (see the code sketch after this list for where these terms appear in the API).
- Input DStream: a special DStream that connects Spark Streaming to an external data source to read data.
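To make these terms concrete, here is a minimal sketch showing where the batch interval, window length, and sliding interval appear in the Scala API (the socket source on localhost:9999 is only an illustrative assumption):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("TermsSketch")

// Batch interval: the stream is cut into batches every 2 seconds;
// the data of each batch corresponds to one RDD.
val ssc = new StreamingContext(conf, Seconds(2))

// Input DStream: connects Spark Streaming to an external data source.
val lines = ssc.socketTextStream("localhost", 9999)

// Window length = 6 s, sliding interval = 4 s;
// both are multiples of the 2 s batch interval.
val windowedLines = lines.window(Seconds(6), Seconds(4))

windowedLines.count().print()
ssc.start()
ssc.awaitTermination()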
1.3 Comparison of Storm and Spark Streaming
- Processing model and latency
Although both frameworks provide scalability and fault tolerance, their processing models are fundamentally different: Storm processes one event at a time, while Spark Streaming processes events in small batches within a short time window. As a result, Storm can achieve sub-second latency, whereas Spark Streaming incurs a certain delay.
- Fault tolerance and data guarantees
The trade-off, however, lies in the data guarantees provided by fault tolerance: Spark Streaming's fault tolerance offers better support for stateful computation. In Storm, each record has to be tracked as it moves through the system, so Storm only guarantees that each record is processed at least once; when recovering from a failure, a record may be processed multiple times. This means that mutable state may be updated twice, producing incorrect results.
Spark Streaming, on the other hand, only needs to track processing at the batch level, so it can guarantee that each batch is processed exactly once, even if a node fails. (Storm's Trident library can also guarantee that each record is processed exactly once, but it relies on transactional state updates, which are slower and must be implemented by the user.)
- Implementation and programming APIs
Storm is implemented mainly in Clojure, while Spark Streaming is implemented in Scala. This is worth remembering if you want to look into how the two frameworks are implemented or to customize them. Storm was developed by BackType and Twitter; Spark Streaming was developed at UC Berkeley.
Storm provides a Java API and also supports APIs in other languages. Spark Streaming supports Scala and Java (and, in fact, Python as well).
- Batch-processing framework integration
One attractive feature of Spark Streaming is that it runs on the Spark framework. You can therefore write batch code and reuse it in a Spark Streaming program, or run interactive queries in Spark. This removes the need to write separate programs for stream processing and for historical data processing.
- Production support
Storm has been around for several years; it has run in Twitter's internal production environment since 2011 and is used by other companies as well. Spark Streaming is a newer project; as far as the author knows, it was only adopted in production by Sharethrough in 2013.
Storm is the streaming solution in Hortonworks' Hadoop data platform, while Spark Streaming appears in MapR's distribution and Cloudera's enterprise data platform. In addition, Databricks provides commercial support for Spark, including Spark Streaming.
Both can run in their own cluster frameworks; beyond that, Storm can run on Mesos, while Spark Streaming can run on both YARN and Mesos.
2 Operating principle
2.1 Streaming architecture
Spark Streaming is a high-throughput, fault-tolerant stream processing system for real-time data streams. It can perform complex operations such as map, reduce, and join on data from a variety of sources such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets, and save the results to external file systems or databases, or feed them to a live dashboard.
- Computation flow: Spark Streaming decomposes streaming computation into a series of short batch jobs. The batch engine is Spark Core: Spark Streaming divides the input data into segments of one batch size (for example, 1 second), converts each segment into an RDD (Resilient Distributed Dataset) in Spark, turns the DStream transformations of Spark Streaming into RDD transformations in Spark, and keeps the intermediate results of the RDD operations in memory. The whole streaming computation can accumulate the intermediate results or write them out to external storage, depending on the needs of the business. The figure shows the entire flow of Spark Streaming.
- Fault tolerance: Fault tolerance is critical for streaming computation. First, recall the fault-tolerance mechanism of RDDs in Spark. Each RDD is an immutable, distributed, recomputable data set that records its deterministic lineage of operations, so as long as the input data is fault tolerant, any partition of any RDD that fails or becomes unavailable can be recomputed from the original input data through the recorded transformations.
For Spark Streaming, the lineage of RDDs is as shown in the figure: each ellipse represents an RDD, each circle inside an ellipse represents a partition of that RDD, each column of RDDs represents a DStream (there are three DStreams in the figure), and the last RDD in each row is the intermediate result RDD produced for each batch. Every RDD in the figure is connected through lineage. Because the input data of Spark Streaming can come from disk, such as HDFS (with multiple replicas), or from the network (Spark Streaming copies each block of network input data to two machines for fault tolerance), any lost partition of any RDD can be recomputed in parallel on other machines. This fault-tolerant recovery method is more efficient than that of continuous-computation models such as Storm.
- Real-time behavior: The discussion of real-time behavior depends on the application scenario. Spark Streaming decomposes a streaming computation into multiple Spark jobs, and each segment of data goes through Spark's DAG decomposition and task-set scheduling. In the current version of Spark Streaming, the minimum usable batch size is in the range of 0.5 to 2 seconds (Storm's current minimum latency is around 100 ms), so Spark Streaming can satisfy quasi-real-time streaming scenarios, but not those with extremely high real-time requirements such as high-frequency trading.
- Scalability and throughput: Spark can already scale linearly to 100 nodes (4 cores per node) on EC2 and process 6 GB/s of data (60 million records/s) with a latency of a few seconds; its throughput is 2 to 5 times that of the popular Storm. Figure 4 shows a test performed at Berkeley with the WordCount and Grep use cases: in the Grep test, the throughput of each node in Spark Streaming was 670k records/s, while Storm's was 115k records/s.
2.2 Programming model
DStream (discretized stream) is Spark Streaming's underlying abstraction; it represents a continuous stream of data. A DStream can be obtained either from an external input source or by transforming an existing DStream. Internally, a DStream is represented by a sequence of RDDs over a time series, and each RDD contains the data of the stream within one specific time interval, as shown in the figure.
Operations on the data in a DStream are likewise mapped onto its internal RDDs. As shown in the figure, a DStream operation produces a new DStream through transformations on the underlying RDDs; the execution engine here is Spark.
2.2.1 How to use Spark Streaming
As an application framework built on top of Spark, Spark Streaming inherits Spark's programming style, so users who already know Spark can get started quickly. The WordCount example provided with Spark Streaming is used below to describe how Spark Streaming is used.
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// Create a local StreamingContext with two working threads and a batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream that will connect to hostname:port, like localhost:9999
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
- Create a StreamingContext object. Just as initializing Spark requires creating a SparkContext, using Spark Streaming requires creating a StreamingContext. The parameters are basically the same as for SparkContext, including specifying the master and setting the application name (here NetworkWordCount). Note the parameter Seconds(1): Spark Streaming needs to be told at what interval to process the data. In the example above this is 1 second, so Spark Streaming uses 1 s as the time window for processing. This parameter must be set appropriately according to the application's requirements and the cluster's processing capacity.
- Create an input DStream. Like Storm's Spout, Spark Streaming needs to be told where its data comes from. In the example above, socketTextStream makes Spark Streaming read data from a socket connection. Spark Streaming supports many other data sources, including Kafka, Flume, HDFS/S3, Kinesis, and Twitter.
- Operate on the DStream. The user can apply all kinds of operations to the DStream obtained from the data source. The example above is a typical WordCount: the data received in the current time window is first split into words, the counts are then computed with map and reduceByKey, and finally the results are printed with print().
- Start Spark Streaming. All of the preceding steps only build the execution plan; the program has not connected to the data source or touched any data. Only when ssc.start() is called does the program actually perform all of the expected operations.
At this point we have a general idea of how Spark Streaming is used; in later chapters we will dig into the execution of Spark Streaming through its source code.
2.2.2 Input sources for DStreams
All operations in Spark Streaming are stream-based, and the input source is the starting point of this series of operations. Input DStreams represent the source of the input data stream. Spark Streaming provides two categories of built-in streaming sources:
- Basic sources: sources directly available in the StreamingContext API, for example file systems, socket connections, and Akka actors;
- Advanced sources: sources such as Kafka, Flume, Kinesis, and Twitter, which can be created with additional utility classes.
2.2.2.1 Basic Sources
In the earlier example of how to use Spark Streaming, we already saw the ssc.socketTextStream() method, which creates a DStream from text data received over a TCP socket connection. Besides sockets, the StreamingContext API also provides methods for creating DStreams from files and Akka actors as input sources.
Spark Streaming provides the streamingContext.fileStream(dataDirectory) method to read data from files in any HDFS-compatible file system (such as HDFS, S3, NFS, and so on) and create a DStream. Spark Streaming monitors the dataDirectory directory and processes any files created in it (files written into nested directories are not supported). Note that the files must all have the same data format, files must be placed into dataDirectory by atomically moving or renaming them into it, a file must not be changed once it has been moved, and if data is continuously appended to a file, the new data will not be read. For plain text files, the simpler method streamingContext.textFileStream(dataDirectory) can be used.
Spark Streaming can also create DStreams from streams of custom actors, receiving data through Akka actors with the method streamingContext.actorStream(actorProps, actorName).
The streamingContext.queueStream(queueOfRDDs) method creates an RDD-queue-based DStream; each RDD pushed into the queue is treated as one batch of data in the DStream.
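The following is a minimal sketch of these basic sources; the directory path and the queue contents are illustrative assumptions:

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("BasicSources")
val ssc = new StreamingContext(conf, Seconds(1))

// File source: watches a directory and treats each new text file as part of the stream.
val fileLines = ssc.textFileStream("hdfs://namenode:8020/user/test/streaming-input")

// RDD-queue source: each RDD pushed into the queue becomes one batch (handy for testing).
val rddQueue = new mutable.Queue[RDD[Int]]()
val queueStream = ssc.queueStream(rddQueue)
rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)

fileLines.print()
queueStream.count().print()

ssc.start()
ssc.awaitTermination()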
2.2.2.2 Advanced Sources
This category of sources requires interfacing with external non-Spark libraries, some of which have complex dependencies (such as Kafka and Flume). Therefore, creating DStreams from these sources requires explicitly declaring the dependencies. For example, to create a DStream from the stream of Twitter tweets, you must follow these steps:
1) Linking: add the spark-streaming-twitter_2.10 dependency to your SBT or Maven project.
2) Programming: import the TwitterUtils class and create a DStream with the TwitterUtils.createStream method (see the sketch after this list).
3) Deploying: package all dependent JARs (including spark-streaming-twitter_2.10 and its transitive dependencies) with the application and deploy it.
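A sketch of steps 1 and 2, assuming Spark 1.x artifact coordinates and that the Twitter OAuth credentials have already been provided (for example as twitter4j system properties):

// Step 1 (build.sbt): libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" % sparkVersion

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val conf = new SparkConf().setMaster("local[2]").setAppName("TwitterSource")
val ssc = new StreamingContext(conf, Seconds(2))

// Step 2: create a DStream of tweets, optionally filtered by keywords.
val filters = Seq("spark", "streaming")   // illustrative keywords
val tweets = TwitterUtils.createStream(ssc, None, filters)

tweets.map(_.getText).print()
ssc.start()
ssc.awaitTermination()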
Note that these advanced sources are not available in the Spark shell, so applications based on them cannot be tested there. If you really must use them in the Spark shell, you have to download the corresponding Maven artifact's JAR along with its dependencies and add them to the classpath.
Some of these advanced sources are:
- Twitter: Spark Streaming's TwitterUtils uses Twitter4J. Twitter4J supports any method of providing authentication information; you can either get the public stream or get a filtered stream based on keywords.
- Flume: Spark Streaming can receive data from Flume.
- Kafka: Spark Streaming can receive data from Kafka (a sketch follows this list).
- Kinesis: Spark Streaming can receive data from Kinesis.
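For example, a minimal sketch of the receiver-based Kafka source (the ZooKeeper quorum, consumer group, and topic name are illustrative assumptions; the spark-streaming-kafka_2.10 dependency is required):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaSource")
val ssc = new StreamingContext(conf, Seconds(2))

val zkQuorum = "zk1:2181,zk2:2181"      // illustrative ZooKeeper quorum
val group    = "spark-streaming-demo"   // illustrative consumer group
val topics   = Map("logs" -> 1)         // topic -> number of receiver threads

// Each element of the resulting DStream is a (key, message) pair.
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topics)
messages.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()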
To reiterate: before writing your own Spark Streaming program, be sure to add the JAR of the corresponding advanced-source dependency to the artifacts of your SBT or Maven project. The common input sources and their corresponding JARs are shown in the table.
In addition, an input DStream can also be created from a custom data source; all that is required is to implement a user-defined receiver.
2.2.3 DStream Operations
Like RDDs, DStreams provide their own set of operations, which can be divided into three categories: normal transformation operations, window transformation operations, and output operations.
2.2.3.1 Normal Transformation Operations
The normal transformation operations are shown in the following table:
Among the operations listed above, the transform() method and the updateStateByKey() method are worth exploring in depth:
- transform(func) operation
The transform operation, along with its variant transformWith, allows an arbitrary RDD-to-RDD function to be applied to a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, joining every batch of a data stream with another dataset is not directly exposed in the DStream API, but it is easy to do with transform, which makes DStreams very powerful: you can join the input stream with a pre-computed dataset of spam information (which might itself be generated by Spark) and then filter on the result to perform real-time data cleansing, as the sketch below, based on the official pseudo-code, shows. Machine-learning and graph-computation algorithms can also be applied inside transform.
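A sketch along the lines of the official pseudo-code; it assumes the ssc and wordCounts values from the WordCount example above, and spamInfoRDD with its HDFS path is only an illustrative stand-in for a pre-computed spam dataset:

import org.apache.spark.rdd.RDD

// Pre-computed (word, isSpam) flags, e.g. produced by another Spark job.
val spamInfoRDD: RDD[(String, Boolean)] =
  ssc.sparkContext.objectFile[(String, Boolean)]("hdfs:///spam/info")

val cleanedDStream = wordCounts.transform { rdd =>
  rdd.join(spamInfoRDD)                              // (word, (count, isSpam))
     .filter { case (_, (_, isSpam)) => !isSpam }    // drop the words marked as spam
     .mapValues { case (count, _) => count }         // back to (word, count)
}
cleanedDStream.print()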
- updateStateByKey operation
The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use it, you must take two steps:
(1) Define the state: the state can be of any data type.
(2) Define the state update function: a function that specifies how to update the state using the previous state and the new values from the input stream.
Let's illustrate with an example. Suppose you want to maintain a running word count over a text stream. Here the running count is the state, and it is an integer. We define the following update function:
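A minimal sketch of such an update function and its use; the pairs DStream of (word, 1) tuples is the one from the WordCount example, and the checkpoint path is illustrative:

// Combines this batch's new values for a key with the previously accumulated count.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount = runningCount.getOrElse(0) + newValues.sum
  Some(newCount)
}

// updateStateByKey requires checkpointing to be enabled.
ssc.checkpoint("hdfs:///checkpoints/wordcount")
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
runningCounts.print()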
This function is applied to a DStream containing key-value pairs (as in the earlier example, the DStream of (word, 1) pairs). It calls the update function for every key (such as every word in WordCount), where newValues holds the latest values and runningCount holds the previous count.
2.2.3.2 Window Transformation Operations
Spark Streaming also provides windowed computations, which allow transformations to be applied over a sliding window of data. The window transformation operations are as follows:
A window operation covers N batches of data; how much data it covers is determined by the window length (window duration), that is, the duration of the window, and the batch data inside it is processed only once the window length has been reached. Besides the window length, a window operation has another important parameter, the sliding interval (slide duration), which specifies how often the window slides forward to form a new window. By default the sliding interval equals the batch interval, while the window length is generally set larger than both. Note that both the sliding interval and the window length must be set to integer multiples of the batch interval.
As shown in the figure, the batch interval is 1 unit of time, the window length is 3 units, and the sliding interval is 2 units. For the initial window (time 1 to time 3), the batch data is processed only after the window length is reached. Note that the initial window may not yet be full when data first starts flowing in, but as time progresses the window will eventually be filled. Every 2 time units the window slides once: new data enters the window, the window drops the data of the earliest 2 time units, and the data of the most recent 2 time units is added to form a new window (time 3 to time 5).
For window operations, the batch interval, the window length, and the sliding interval are the three key time concepts and the key to understanding window operations.
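For example, a sketch of a windowed word count, assuming the pairs DStream and the imports from the WordCount example above:

// Window length = 30 s, sliding interval = 20 s;
// both must be integer multiples of the batch interval.
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduce function applied within the window
  Seconds(30),                 // window length
  Seconds(20)                  // sliding interval
)
windowedWordCounts.print()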
2.2.3.3 Output operation
Spark Streaming allows DStream data to be written out to external systems, such as databases or file systems. Because output operations are what actually make the transformed data available to external systems, they trigger the actual execution of all the DStream transformations (similar to actions on RDDs). The following table lists the main output operations currently available:
dstream.foreachRDD is a very powerful output operation that lets you send data to external systems. However, it is important to use it correctly and efficiently; some common mistakes and how to avoid them are shown below.
Writing data to an external system usually requires creating a connection object (such as a TCP connection to a remote server) and using it to send data to the remote system. With this in mind, a developer may inadvertently create the connection object on the Spark driver and then try to use it on the Spark workers to save the records of the RDDs, as in the following code:
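A sketch of this incorrect pattern; dstream, createNewConnection, and send are placeholders for the application's output DStream and whatever client the external system needs:

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()   // executed on the driver -- wrong
  rdd.foreach { record =>
    connection.send(record)                // the connection would have to be serialized to the workers
  }
}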
This is incorrect, because it requires the connection object to be serialized and sent from the driver to the workers. Connection objects can rarely be transferred between machines like this; the problem may show up as serialization errors (the connection object is not serializable) or initialization errors (the connection object needs to be initialized on the worker), and so on. The correct solution is to create the connection object on the worker.
However, creating a connection object has time and resource overhead, so creating and destroying a connection object for every record causes unnecessary overhead and noticeably reduces the overall throughput of the system. A better solution is to use rdd.foreachPartition: create a single connection object per partition and use it to send all the records of that RDD partition to the external system.
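A sketch of the per-partition pattern, with the same placeholders as above:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()   // created on the worker, once per partition
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}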
This amortizes the connection-creation overhead over many records. Finally, this can be optimized further by reusing connection objects across multiple RDDs/batches: maintain a static pool of connection objects that can be reused when pushing multiple batches of RDDs to the external system, reducing the overhead even more.
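And a sketch of reusing connections across batches; ConnectionPool stands for an assumed static, lazily initialized pool of connections:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = ConnectionPool.getConnection()   // borrow a pooled connection
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)       // return it to the pool for reuse
  }
}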
Note that the connections in the static pool should be created lazily, on demand, so that data is sent to the external systems more efficiently. Also note that DStreams are executed lazily, just as RDDs are triggered by actions. By default, output operations are executed one at a time, in the order in which they are defined in the application.
2.3 Fault tolerance, persistence, and performance tuning
2.3.1 Fault Tolerance
DStreams are built on RDDs, so the fault-tolerance properties of RDDs still apply. Let's first recall the basic properties of Spark RDDs:
- An RDD is an immutable, deterministically recomputable, distributed data set. If some partitions of an RDD are lost, they can be recomputed using the lineage information;
- If any partition of an RDD is lost because of a worker node failure, that partition can be recomputed from the original fault-tolerant data set;
- Since all data transformations in Spark are based on RDDs, even if the cluster fails, all intermediate results can be recomputed as long as the input data set is still available.
Spark Streaming can read data from file systems such as HDFS and S3; in that case all the data can be recomputed and there is no need to worry about data loss. In most cases, however, Spark Streaming receives data over the network. To achieve the same fault tolerance, the data received over the network is replicated across multiple worker nodes in the cluster (the default replication factor is 2). When a failure occurs, this leads to two kinds of data that need to be handled:
1) Data received and replicated: if one worker node fails, the system can recompute from the other copy of the data that still exists.
2) Data received but only buffered, awaiting replication: if this data is lost, it can be read again from an external file system such as HDFS through the RDD dependencies.
In addition, there are two kinds of node failures we should care about:
(1) Worker node failure: as described above, the system either recomputes from another worker node that holds a replica of the data, or reads the data again from the external file system, depending on which kind of data was affected.
(2) Driver node failure: if the driver node running the Spark Streaming application fails, the StreamingContext is obviously lost and all data in memory is lost with it. For this case, Spark Streaming applications have an inherent structure in their computation: the same Spark computation is performed periodically on each micro-batch. This structure allows the application's state (the checkpoint) to be stored periodically in reliable storage and restored when the driver restarts. In practice, this is configured with the ssc.checkpoint() function: Spark Streaming periodically writes the DStream metadata to HDFS, and when the driver node fails, the lost StreamingContext can be reconstructed from the saved checkpoint information, as sketched below.
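A minimal sketch of this recovery pattern; the checkpoint path is illustrative, and the body of functionToCreateContext would contain the application's own DStream setup:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDirectory = "hdfs:///checkpoints/my-streaming-app"

def functionToCreateContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableApp")
  val ssc = new StreamingContext(conf, Seconds(1))
  // ... define the input DStreams and transformations here ...
  ssc.checkpoint(checkpointDirectory)   // periodically write DStream metadata to HDFS
  ssc
}

// After a driver failure the context is rebuilt from the checkpoint;
// otherwise it is created fresh by the function above.
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
ssc.start()
ssc.awaitTermination()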
Finally, let's look at the improvements to Spark Streaming's fault tolerance in the Spark 1.2 release.
A real-time streaming system must be able to run 24/7, so it needs to be able to recover from all kinds of failures. From the start, Spark Streaming has supported recovery from both driver and worker failures. However, for some data sources, input data could still be lost after recovering from a failure. In the Spark 1.2 release, preliminary support for write-ahead logs (also known as journaling) was added to Spark Streaming, improving the recovery mechanism and making zero data loss reliable for more data sources.
For sources such as files, the driver recovery mechanism is enough to guarantee zero data loss, because all the data is stored in a fault-tolerant file system such as HDFS or S3. For other sources such as Kafka and Flume, however, some of the received data may only be buffered in memory and not yet processed, and it can be lost. This is due to the distributed way Spark applications operate: when the driver process fails, all the executors running in the standalone/YARN/Mesos cluster are killed along with all the data in their memory. In Spark Streaming, data received from sources such as Kafka and Flume is buffered in executor memory until it has been processed; even if the driver is restarted, this buffered data cannot be recovered. To avoid such data loss, the write-ahead log feature was introduced in the Spark 1.2 release.
The flow with write-ahead logging is: 1) when a Spark Streaming application starts (that is, when the driver starts), the associated StreamingContext uses the SparkContext to launch the receivers as long-running tasks; these receivers receive streaming data and save it into Spark's memory for processing; 2) the receivers notify the driver; 3) the metadata of the received blocks is sent to the driver's StreamingContext. This metadata includes: (a) the block reference ID used to locate the data in executor memory, and (b) the offset information of the block data in the log (if write-ahead logging is enabled).
The life cycle of the received data is shown in the figure.
Systems such as Kafka can maintain reliability by replicating data. Enabling the write-ahead log means that the same data is effectively replicated twice: once by Kafka and once by Spark Streaming. A future version of Spark will include native support for Kafka's fault-tolerance mechanism, thereby avoiding the second copy.
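A minimal sketch of enabling this feature, assuming the spark.streaming.receiver.writeAheadLog.enable configuration key from Spark 1.2+ and an illustrative checkpoint path (the log files live under the checkpoint directory, so checkpointing must be enabled as well):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WALEnabledApp")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // write received data to the WAL

val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs:///checkpoints/wal-app")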
2.3.2 Persistence
Like RDDs, DStreams can persist the data stream in memory through the persist() method. The default persistence level is MEMORY_ONLY_SER, that is, the data is stored in memory in serialized form. The benefit is that for computations that iterate over the data multiple times, the speed advantage is significant. For window-based operations such as reduceByWindow and reduceByKeyAndWindow, and for state-based operations such as updateStateByKey, persistence in memory is the default.
For data received over the network (Kafka, Flume, sockets, and so on), the default persistence strategy is to replicate the data on two machines, again for fault tolerance.
In addition, windowed and stateful operations must be checkpointed: the checkpoint directory is specified via the StreamingContext, and the checkpoint interval of a DStream is set via its checkpoint() method; the interval must be a multiple of the sliding interval, as sketched below.
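A minimal sketch, assuming the ssc and wordCounts values (and imports) from the WordCount example and an illustrative checkpoint path:

import org.apache.spark.storage.StorageLevel

// Explicit persistence (MEMORY_ONLY_SER is already the default for DStreams).
wordCounts.persist(StorageLevel.MEMORY_ONLY_SER)

// Checkpointing: the directory is set on the StreamingContext,
// the interval on the DStream, as a multiple of the sliding interval.
ssc.checkpoint("hdfs:///checkpoints/wordcount")
wordCounts.checkpoint(Seconds(10))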
2.3.3 Performance Tuning
1. Optimizing run time
- Increase parallelism: make sure the resources of the whole cluster are used, instead of concentrating tasks on a few specific nodes. For operations that involve a shuffle, increase their parallelism to make fuller use of cluster resources;
- Reduce the cost of serialization and deserialization: by default, Spark Streaming stores received data in serialized form to reduce memory usage, but serialization and deserialization consume CPU time, so a more efficient serialization method (Kryo) or a custom serialization interface uses the CPU more efficiently (see the configuration sketch after this list);
- Set a reasonable batch interval: in Spark Streaming there can be dependencies between jobs, and a later job can only be submitted after the previous one has finished. If a job's execution time exceeds the batch interval, the next job cannot be submitted on time, which further delays the following jobs and causes them to pile up. Therefore, set a batch interval that ensures each job can finish within it;
- Reduce the cost of task submission and distribution: normally the Akka framework distributes tasks promptly enough, but when the batch interval is very small (e.g., 500 ms), the latency of submitting and distributing tasks becomes unacceptable. Using standalone mode or coarse-grained Mesos mode usually has lower latency than fine-grained Mesos mode.
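A configuration sketch of these run-time tunings; the values are illustrative, not recommendations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("TunedStreamingApp")
  // Use Kryo to reduce the CPU cost of serialization/deserialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Raise the default parallelism so shuffle operations use the whole cluster.
  .set("spark.default.parallelism", "64")

// Choose a batch interval the cluster can keep up with, so jobs do not queue up.
val ssc = new StreamingContext(conf, Seconds(2))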
2. Optimizing memory usage
- Control the batch size (the amount of data per batch interval): Spark Streaming stores all data received during a batch interval in Spark's available memory, so the available memory of the current nodes must be able to hold all of the data of a batch interval; otherwise, new resources must be added to increase the cluster's processing capacity;
- Clean up data that is no longer needed in a timely manner: Spark Streaming keeps all received data in memory, so data that is no longer needed should be cleaned up promptly to ensure Spark Streaming has enough spare memory. Set a reasonable spark.cleaner.ttl to clean up stale data, but set it carefully so that data still required by later operations is not treated as expired and cleaned up prematurely (see the sketch after this list);
- Observe and adjust GC strategies appropriately: garbage collection can affect the normal execution of jobs, lengthen job execution time, and cause a series of unpredictable problems. Observe how GC behaves and adopt different GC strategies to further reduce the impact of memory reclamation on job execution.
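A configuration sketch of these memory-related tunings; the values and GC options are illustrative assumptions:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MemoryTunedStreamingApp")
  // Limit the per-receiver ingest rate to keep a batch within available memory.
  .set("spark.streaming.receiver.maxRate", "10000")
  // Clean up metadata/RDDs older than this many seconds (set with care).
  .set("spark.cleaner.ttl", "3600")
  // Example GC strategy for executors: a concurrent collector to shorten pauses.
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")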