SOURCE Link: Spark Streaming: The upstart of large-scale streaming data processing
Summary: Spark Streaming is the rising star of large-scale streaming data processing. It decomposes a streaming computation into a series of short batch jobs. This article explains the architecture and programming model of Spark Streaming, analyzes its core technology with practical examples, and presents concrete application scenarios and optimization schemes.
Any discussion of Spark Streaming has to start with BDAS (Berkeley Data Analytics Stack), UC Berkeley's software stack for data analysis. From its point of view, current big data processing can be divided into the following three types.
- Complex batch data processing, which typically takes from ten minutes to several hours.
- Interactive queries over historical data, which typically take from ten seconds to a few minutes.
- Processing of real-time data streams (streaming data processing), which typically takes from hundreds of milliseconds to a few seconds.
There is relatively mature open source software for each of these scenarios: MapReduce for batch data processing, Impala for interactive queries, and Storm for streaming data processing. Most Internet companies, however, encounter all three scenarios at the same time, and in practice they run into the following inconveniences.
- Input and output data cannot be shared seamlessly across the three scenarios; formats have to be converted back and forth.
- Each piece of open source software needs its own development and maintenance team, which increases cost.
- It is difficult to coordinate resource allocation among the different systems within the same cluster.
BDAS is a software stack built around Spark that uses a unified, memory-based computing model to cover all three scenarios. It supports batch, interactive, and streaming processing, is compatible with distributed file systems such as HDFS and S3, and can be deployed on popular cluster resource managers such as YARN and Mesos. The BDAS architecture is shown in Figure 1: Spark can replace MapReduce for batch processing and, thanks to its memory-based design, is particularly good at iterative and interactive workloads, while Shark provides SQL queries over large-scale data and is compatible with Hive HQL. This article focuses on Spark Streaming, the large-scale stream processing component of BDAS.
Figure 1 BDAS software stack
Spark Streaming Architecture
- Computation flow: Spark Streaming decomposes a streaming computation into a series of short batch jobs. The batch engine here is Spark: Spark Streaming divides its input data into segments according to the batch size (for example, 1 second), forming a DStream (discretized stream). Each segment of data is converted into an RDD (Resilient Distributed Dataset) in Spark, so a transformation on a DStream in Spark Streaming becomes a transformation on RDDs in Spark, with the RDDs turned into intermediate results held in memory. Depending on the needs of the business, the streaming computation can either accumulate these intermediate results or write them to external storage. Figure 2 shows the overall flow of Spark Streaming, and the sketch after Figure 2 illustrates how each batch maps to an RDD.
Figure 2 Spark Streaming architecture diagram
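To make the decomposition concrete, here is a minimal sketch (the master URL, port, and names are illustrative, not from the original article) showing that a transformation declared once on a DStream is re-applied to a fresh RDD for every batch:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "BatchDecomposition", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
// map is declared once on the DStream, but at runtime it runs on a new RDD per 1-second batch
val lineLengths = lines.map(_.length)
lineLengths.foreachRDD { rdd =>
  // rdd holds exactly the data received during the current batch interval
  println(s"batch contained ${rdd.count()} records")
}
ssc.start()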
- Fault tolerance: Fault tolerance is critical for streaming computation. First we need to be clear about the fault-tolerance mechanism of RDDs in Spark. Each RDD is an immutable, distributed, recomputable dataset that records its deterministic lineage of operations, so as long as the input data is fault tolerant, any RDD partition that fails or becomes unavailable can be recomputed from the original input data by replaying the transformations.
Figure 3 Lineage of the RDDs in Spark Streaming
- Figure 3 shows the lineage relationships of the RDDs in Spark Streaming. Each ellipse in the figure represents an RDD, each circle inside an ellipse represents a partition of that RDD, the RDDs in each column together form one DStream (there are three DStreams in the figure), and the last RDD in each row is the intermediate-result RDD produced for each batch. We can see that every RDD in the diagram is connected through lineage. Because the input data of Spark Streaming can come either from disk, such as HDFS (which keeps multiple replicas), or from the network (in which case Spark Streaming replicates the data received from the network to two machines to guarantee fault tolerance), any lost RDD partition can be recomputed in parallel on other machines. This fault-tolerant recovery is more efficient than that of continuous-operator models such as Storm.
- Real-time performance: Any discussion of real-time performance has to consider the intended scenarios of the streaming framework. Spark Streaming decomposes a streaming computation into multiple Spark jobs, and the processing of each segment of data goes through Spark's DAG decomposition and task-set scheduling. For the current version of Spark Streaming, the smallest practical batch size lies between 0.5 and 2 seconds (Storm's current minimum latency is around 100 ms), so Spark Streaming can satisfy all quasi-real-time streaming scenarios except those with extremely demanding latency requirements, such as high-frequency real-time trading.
- Scalability and throughput: Spark can currently scale linearly to 100 nodes (4 cores per node) on EC2 and process 6 GB/s of data (60M records/s) with a latency of a few seconds; its throughput is 2 to 5 times that of the popular Storm. Figure 4 shows a test done by Berkeley using the WordCount and Grep use cases, in which each Spark Streaming node achieves a throughput of 670k records/s for Grep, while Storm achieves 115k records/s.
Figure 4 Spark Streaming vs Storm throughput comparison diagram
The Programming Model of Spark Streaming
Programming in Spark Streaming is very similar to programming in Spark. In Spark, programs operate on RDDs; in Spark Streaming, they operate on DStreams. The familiar WordCount example below illustrates the input, transformation, and output operations in Spark Streaming.
- Spark Streaming initialization: A StreamingContext must be created before any DStream operation. The most important parameters are the first and the third: the first specifies the cluster address on which Spark Streaming runs, and the third specifies the size of the batch window used at runtime. In this example, each Spark job processes 1 second of input data.
val ssc = new StreamingContext("spark://...", "WordCount", Seconds(1), [Homes], [Jars])
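For reference, a StreamingContext can also be built from a SparkConf in later Spark Streaming APIs; a minimal sketch (the master URL is a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("spark://...").setAppName("WordCount")
val ssc = new StreamingContext(conf, Seconds(1))   // 1-second batch window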
- Spark Streaming input operations: Spark Streaming already supports a rich set of input sources, broadly divided into two categories. One is file-based input, such as monitoring a directory of an HDFS file system at the batch-size interval and using changes in the directory's contents as input to Spark Streaming. The other is network streams; Kafka, Flume, Twitter, and TCP sockets are currently supported. In the WordCount example, a network socket is used as the input stream: we listen on a specific port and obtain the input DStream (lines).
val lines = ssc.socketTextStream("localhost", 8888)
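Other input sources follow the same pattern. For instance, a hedged sketch of a Kafka input stream, assuming the external spark-streaming-kafka module (the ZooKeeper address, consumer group id, and topic name below are placeholders):

import org.apache.spark.streaming.kafka.KafkaUtils

// createStream(ssc, zkQuorum, groupId, topics): topics maps each topic to the
// number of receiver threads; the resulting DStream carries (key, message) pairs
val kafkaLines = KafkaUtils.createStream(ssc, "zk-host:2181", "wordcount-group", Map("weblogs" -> 1))
  .map(_._2)   // keep only the message body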
- Spark Streaming transformations: Similar to RDD operations in Spark, Spark Streaming turns one or more DStreams into a new DStream through transformation operations. Common operations include map, filter, flatMap, and join, as well as groupByKey/reduceByKey, which require a shuffle. In the WordCount example, we first split the DStream (lines) into words and then add up the counts of identical words; the resulting wordCounts is the (word, count) intermediate result for each batch.
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
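As a small illustrative extension of the running example (not part of the original article), other transformations compose in the same way; for instance, filtering out stop words before counting:

val stopWords = Set("the", "a", "of")
val filteredCounts = words
  .filter(word => !stopWords.contains(word))   // drop common stop words
  .map(word => (word, 1))
  .reduceByKey(_ + _)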
In addition, Spark Streaming provides window operations, which involve two parameters: the window length (window duration) and the sliding interval (slide duration); both must be multiples of the batch size. For example, to compute the WordCount over the last 5 seconds, recomputed every 1 second, we count the words in each of the last 5 one-second batches and then add them up to obtain the word counts within the window.
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, Seconds(5), Seconds(1))
The approach above, however, is not efficient. It is more efficient to compute incrementally: for example, to compute the WordCount over the 5-second window ending at t+4 seconds, we can take the statistics of the 5-second window ending at t+3, add the statistics for [t+3, t+4], and subtract the statistics for [t-2, t-1] (Figure 5). In this way the statistics of the middle three seconds are reused, improving efficiency.
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(5), Seconds(1))
Figure 5 Full recomputation and incremental computation of sliding windows in Spark Streaming
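One practical note, hedged because the original article does not mention it: the incremental (inverse-reduce) form of reduceByKeyAndWindow keeps state across batches, so Spark Streaming requires a checkpoint directory to be configured (the HDFS path below is a placeholder):

ssc.checkpoint("hdfs://namenode:8020/checkpoints/wordcount")   // required for the incremental window form
val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow(_ + _, _ - _, Seconds(5), Seconds(1))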
- Spark Streaming output operations: For output, Spark Streaming supports printing data to the screen and saving it to files. In the WordCount example we save the DStream wordCounts to files on HDFS.
wordCounts.saveAsHadoopFiles("WordCount", "txt")
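Two other illustrative output choices (not in the original example): printing the first elements of every batch to the console, or saving each batch as text files under a prefix:

wordCounts.print()                               // show a sample of each batch on the driver console
wordCounts.saveAsTextFiles("WordCount", "txt")   // one directory of text files per batch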
- Spark Streaming start: After the operations above, Spark Streaming has not started working yet. We still have to call start(); only then does Spark Streaming begin listening on the corresponding port, collecting data, and computing statistics.
ssc.start()
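In later Spark Streaming versions the driver is also expected to block until the computation is stopped; a common, version-dependent pattern (hedged):

ssc.start()
ssc.awaitTermination()   // block the driver until the streaming computation terminates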
Spark Streaming Case Study
In Internet applications, website traffic statistics are a common pattern that requires computing different statistics at different granularities. They have both real-time requirements and complex statistical requirements involving aggregation, deduplication, and joins. Traditionally, with the Hadoop MapReduce framework it is easy to implement the more complex statistical requirements, but real-time performance cannot be guaranteed; with a streaming framework such as Storm, real-time performance is guaranteed, but the complexity of implementing the requirements rises sharply. Spark Streaming strikes a balance between the two and can implement fairly complex statistical requirements in a quasi-real-time way with little effort. The following shows how to build a real-time traffic statistics framework with Kafka and Spark Streaming (a minimal end-to-end sketch follows the three steps below).
- Data staging: Kafka, as a distributed message queue, offers excellent throughput as well as high reliability and scalability. Here Kafka serves as the log-delivery middleware: it receives the traffic logs collected from clients and, upon requests from Spark Streaming, delivers the traffic logs in order to the Spark Streaming cluster.
- Data processing: The Spark Streaming cluster connects to the Kafka cluster, pulls the traffic logs from it, and processes them. Spark Streaming fetches data from the Kafka cluster in real time and stores it in its available memory; the data is processed when each batch window arrives.
- Result storage: To support front-end presentation and page requests, the processed results are written to a database.
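A minimal, hypothetical end-to-end sketch of the pipeline described above; the Kafka addresses, topic name, log format, and the writeToDatabase helper are placeholders, not details from the original article:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object TrafficStats {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TrafficStats")
    val ssc = new StreamingContext(conf, Seconds(10))   // illustrative 10-second batch window

    // 1. Data staging: pull traffic logs from the Kafka cluster
    val logs = KafkaUtils.createStream(ssc, "zk-host:2181", "traffic-stats", Map("access-logs" -> 2))
      .map(_._2)

    // 2. Data processing: count page views per URL within each batch
    val pageViews = logs
      .map(line => line.split(" ")(0))   // assume the first field of a log line is the URL
      .map(url => (url, 1L))
      .reduceByKey(_ + _)

    // 3. Result storage: write each batch's results to a database for the front end
    pageViews.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // writeToDatabase is a hypothetical helper: open one connection per
        // partition and insert the (url, count) pairs it yields.
        // partition.foreach { case (url, count) => writeToDatabase(url, count) }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}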
Compared with traditional processing frameworks, the Kafka + Spark Streaming architecture has several advantages.
- The efficiency and low latency of the Spark framework guarantee the quasi-real-time performance of Spark Streaming jobs.
- The rich API and flexibility provided by the Spark framework make it possible to write fairly complex algorithms in a concise way.
- The high consistency of the programming model makes Spark Streaming easy to pick up and ensures that business logic can be reused across real-time and batch processing.
While running the Kafka + Spark Streaming traffic statistics application, problems such as insufficient memory and GC-induced blocking can appear. The following describes how to tune a Spark Streaming application to reduce, or even avoid, the impact of these issues.
Performance tuning
Optimize run time
- Increase the degree of parallelism. Make sure to use the resources of the entire cluster instead of concentrating tasks on a few specific nodes. For operations that involve a shuffle, increase their parallelism so that cluster resources are used more fully.
- Reduce the cost of data serialization and deserialization. By default, Spark Streaming serializes the data it receives before storing it, to reduce memory usage. Serialization and deserialization, however, consume CPU time, so a more efficient serializer (Kryo) and custom serialization interfaces let the CPU be used more efficiently (a configuration sketch follows this list).
- Set a reasonable batch window. In Spark Streaming there may be dependencies between jobs: a later job can only be submitted after the previous job has finished. If a job's execution time exceeds the batch window, it cannot be submitted on time, which further delays the next job and causes subsequent jobs to pile up. It is therefore necessary to choose a batch window within which jobs can reliably complete.
- Reduce the cost of task submission and distribution. Normally the Akka framework ensures that tasks are distributed in a timely manner, but when the batch window is very small (e.g. 500 ms), the latency of submitting and distributing tasks becomes unacceptable. Using standalone mode or coarse-grained Mesos mode usually gives lower latency than fine-grained Mesos mode.
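A hedged configuration sketch for the knobs discussed above; the values are illustrative rather than recommendations from the original article:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("TrafficStats")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // use Kryo instead of Java serialization
  .set("spark.default.parallelism", "64")                                  // spread shuffle tasks across the cluster
// choose a batch window large enough that each job finishes before the next one starts
val ssc = new StreamingContext(conf, Seconds(2))
// parallelism can also be raised per shuffle operation, e.g. reduceByKey(_ + _, 64)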
Optimize Memory usage
- Control the batch size. Spark Streaming stores all the data received within a batch window in Spark's available memory, so you must make sure that the current nodes have at least enough available memory to hold all the data of one batch window. Otherwise, new resources must be added to increase the cluster's processing capacity.
- Clean up data that is no longer needed. As mentioned above, Spark Streaming keeps all received data in its available memory, so processed data that is no longer needed should be cleaned up promptly to ensure Spark Streaming has spare memory. Set a reasonable spark.cleaner.ttl so that stale, unused data is removed in time (see the sketch after this list).
- Observe and adjust the GC strategy appropriately. GC can interfere with the normal execution of jobs, prolong job execution time, and cause a series of unpredictable problems. Observe how GC behaves and adopt different GC strategies to further reduce the impact of memory reclamation on job execution.
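And a hedged sketch of the memory-related settings mentioned above (the values are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("TrafficStats")
  .set("spark.cleaner.ttl", "3600")                                   // drop metadata and cached data older than one hour (older Spark versions)
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")  // use CMS to shorten GC pauses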
Summary
Spark Streaming provides an efficient, fault-tolerant, quasi-real-time, large-scale streaming framework that sits in the same software stack as batch processing and interactive queries, which lowers the learning cost: if you have learned Spark programming, you have essentially learned Spark Streaming programming, and if you understand Spark's scheduling and storage, Spark Streaming will feel familiar. Readers interested in open source software are welcome to contribute to the community. Spark is currently in the Apache Incubator, and given the current trend, Spark Streaming will surely be used on a much wider scale.