First, caching or persistence
Like RDDs, DStreams allow developers to persist the stream's data in memory. Calling the persist() method on a DStream automatically persists every RDD of that DStream in memory. This is useful if the data in the DStream will be computed more than once. For window-based operations such as reduceByWindow and reduceByKeyAndWindow, and for state-based operations such as updateStateByKey, persistence is enabled by default, so the developer does not need to call persist() explicitly. For input streams received over the network (from Kafka, Flume, sockets, etc.), the default persistence level replicates the data to two nodes for fault tolerance. Note that, unlike RDDs, the default persistence level of DStreams keeps the data serialized in memory.
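As a minimal sketch of explicit persistence, the snippet below caches a DStream that feeds two separate computations; the socket source on localhost:9999 and the word-splitting logic are assumptions for illustration, not part of this article.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("PersistSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

// The DStream is used by two computations below, so persist it once;
// the default DStream storage level keeps the data serialized in memory.
words.persist()

words.count().print()                               // first use: records per batch
words.map(w => (w, 1)).reduceByKey(_ + _).print()   // second use: word counts per batch

ssc.start()
ssc.awaitTermination()
```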
Second, Checkpointing
A streaming application typically runs around the clock, so it must be resilient to failures unrelated to the application logic (such as system faults, JVM crashes, etc.). To make this possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system to recover from failures. There are two types of checkpointing.

Metadata checkpointing saves the information defining the streaming computation to a fault-tolerant storage system such as HDFS. It is used to recover from a failure of the node running the application driver. Metadata includes: 1. Configuration: the configuration used to create the Spark Streaming application; 2. DStream operations: the set of DStream operations that define the streaming application; 3. Incomplete batches: batches whose jobs are queued but not yet finished.

Data checkpointing saves the generated RDDs to a reliable storage system. This is required for stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on the RDDs of previous batches, so the length of the dependency chain keeps growing over time. To avoid unbounded growth of recovery time, the intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage to truncate the dependency chain.

In short, metadata checkpointing is mainly needed to recover from driver failures, whereas data checkpointing is needed whenever stateful transformations are used, even in otherwise simple applications.

When to enable checkpointing: an application must enable checkpointing in the following two cases: 1. It uses a stateful transformation. If updateStateByKey or reduceByKeyAndWindow is used in the application, a checkpoint directory must be provided so that the RDDs can be checkpointed periodically. 2. It needs to recover from failures of the driver running the application. Metadata checkpoints are used to recover the progress information. Note that simple streaming applications without the aforementioned stateful transformations can run without checkpointing. In that case, recovery from driver failures will be partial (data that was received but not yet processed may be lost). This is often acceptable, and many Spark Streaming applications are run this way.

How to configure checkpointing: set a directory in a fault-tolerant, reliable file system (HDFS, S3, etc.) to which checkpoint information will be saved. This is done with the streamingContext.checkpoint(checkpointDirectory) method, which is sufficient for the stateful transformations mentioned earlier. In addition, if you want to recover from driver failures, your streaming application should be rewritten to behave as follows: 1. When the program is started for the first time, it creates a new StreamingContext, sets up all the streams, and then calls start(). 2. When the program is restarted after a failure, it recreates the StreamingContext from the checkpoint data in the checkpoint directory.

```scala
// Function to create and set up a new StreamingContext
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(...)    // new context
  val lines = ssc.socketTextStream(...)  // create DStreams
  ...
  ssc.checkpoint(checkpointDirectory)    // set checkpoint directory
  ssc
}

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,
// irrespective of whether it is being started or restarted
context. ...

// Start the context
context.start()
context.awaitTermination()
```

If checkpointDirectory exists, the context will be recreated from the checkpoint data. If the directory does not exist, the functionToCreateContext function is called to create a new context and set up the DStreams. Besides using getOrCreate, you must also ensure that the driver process is automatically restarted when it fails; this can only be done by the deployment infrastructure used to run the application.

Note that checkpointing RDDs has a storage cost, which increases the processing time of the batches whose RDDs are checkpointed. Therefore, the checkpoint interval needs to be set carefully. With very small batch sizes (say, 1 second of data), checkpointing every batch can significantly reduce throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which also has detrimental effects. For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds; it can be set via dstream.checkpoint. Typically, a checkpoint interval of 5-10 times the sliding interval of the DStream is a good setting to try.
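As a short sketch of setting the checkpoint interval on a stateful DStream (the checkpoint directory, socket source, and 10-second interval are illustrative assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CheckpointIntervalSketch")
val ssc  = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")   // fault-tolerant checkpoint directory

val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) =>
    Some(values.sum + state.getOrElse(0)))                 // stateful transformation

// Checkpoint the state DStream every 10 seconds, i.e. roughly 5-10 times the
// batch/sliding interval, as suggested above.
counts.checkpoint(Seconds(10))
counts.print()
```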
Third, deploying and upgrading applications
**Deploy Application**: There are a few requirements for running a Spark Streaming application:

1. A cluster with a cluster manager - this is a general requirement of any Spark application, described in detail in the deployment guide.
2. Package the application as a jar - you must compile your streaming application into a jar. If you start the application with spark-submit, you do not need to include Spark and Spark Streaming in the jar. However, if your application uses advanced sources (such as Kafka or Flume), you must package the corresponding external artifact and its dependencies into the application jar that is deployed. For example, an application that uses TwitterUtils must include spark-streaming-twitter_2.10 and all of its transitive dependencies in the application jar.
3. Enough memory for the executors - since received data must be stored in memory, the executors must be configured with enough memory to hold it. Note that if you perform a 10-minute window operation, the system must keep at least the last 10 minutes of data in memory, so the memory requirements of the application depend on the operations used.
4. Configure checkpointing - if the streaming application requires checkpointing, then a fault-tolerant storage directory compatible with the Hadoop API must be configured as the checkpoint directory; the streaming application writes checkpoint information there for failure recovery.
5. Configure automatic restart of the application driver - to recover automatically from driver failures, the deployment infrastructure that runs the streaming application must monitor the driver process and restart it if it fails. Different cluster managers provide different tools for this:
   - Spark Standalone: a Spark application driver can be submitted to run inside the Spark standalone cluster, that is, the driver itself runs on a worker node. Furthermore, the standalone cluster manager can be instructed to supervise the driver and relaunch it if it fails, either with a non-zero exit code (such as exit(1)) or because the node running the driver fails.
   - YARN: YARN provides a similar mechanism for automatically restarting the application.
   - Mesos: Mesos provides this functionality through Marathon.
6. Configure write ahead logs - Spark 1.2 introduced an experimental feature called write ahead logs. If it is enabled, all data received by the receivers is also written to a write ahead log in the configured checkpoint directory. This prevents data loss on driver recovery, thus ensuring zero data loss. It is enabled by setting the configuration parameter spark.streaming.receiver.writeAheadLog.enable to true. However, these stronger semantics may come at the cost of receiver throughput, which can be mitigated by running more receivers in parallel to increase aggregate throughput. In addition, when the write ahead log is enabled, replicating the received data within Spark is no longer recommended because the log is already stored in a replicated storage system; this can be achieved by setting the storage level of the input DStream to StorageLevel.MEMORY_AND_DISK_SER. A configuration sketch follows this list.
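The configuration sketch for step 6 might look as follows; the checkpoint directory, host, and port are placeholders rather than values from this article.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WriteAheadLogSketch")
  // Write received data to a write ahead log in the checkpoint directory (Spark 1.2+).
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")   // assumed checkpoint directory

// With the write ahead log enabled, in-memory replication is redundant, so a
// non-replicated storage level is used for the input DStream.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
```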
**Upgrading Application Code**: If a running Spark Streaming application needs to be upgraded with new application code, there are two possible approaches. 1. Start the upgraded application and run it in parallel with the existing (non-upgraded) application. Once the new application has warmed up (that is, it has received the same data as the old one), the old application can be shut down. Note that this only works for data sources that can send the data to two destinations (one for the old application and one for the new one). 2. Gracefully shut down the existing application (using StreamingContext.stop(...) or JavaStreamingContext.stop(...)), which ensures that data that has already been received is completely processed before shutdown. Then start the upgraded application, which will pick up processing from the point where the old application left off. Note that this approach only works with input sources that buffer data on the source side (such as Kafka and Flume), because data needs to be buffered while the old application is down and the upgraded application has not yet started.
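As a rough illustration of the graceful shutdown in the second approach (assuming ssc is the StreamingContext of the old, running application):

```scala
// Stop the old application gracefully: wait for all received data to be processed
// before stopping the StreamingContext (and the underlying SparkContext).
ssc.stop(stopSparkContext = true, stopGracefully = true)
```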
Fourth, monitoring applications
In addition to Spark's monitoring capabilities, Spark Streaming adds some features of its own. When a StreamingContext is used, the Spark web UI shows an additional Streaming tab that displays statistics about running receivers (whether the receivers are alive, the number of records received, receiver errors, etc.) and completed batches (batch processing times, queueing delays, etc.). This can be used to monitor the progress of the streaming application. Two metrics in the web UI are particularly important: Processing Time, the time taken to process each batch of data, and Scheduling Delay, the time a batch waits in the queue for the processing of previous batches to finish. If the batch processing time is consistently longer than the batch interval, or the scheduling delay keeps increasing, the system cannot process the batches as fast as they are being generated and is falling behind. In that case, consider reducing the batch processing time. The progress of a Spark Streaming program can also be monitored through the StreamingListener interface, which provides receiver status and processing times. Note that this is a developer API and may provide more information in the future.
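A rough sketch of such monitoring with the developer API StreamingListener is shown below; the log format is an assumption, and ssc is an existing StreamingContext.

```scala
import org.apache.spark.streaming.scheduler._

class BatchMonitor extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    // Report the two metrics highlighted above for each completed batch.
    println(s"Batch ${info.batchTime}: scheduling delay = ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processing time = ${info.processingDelay.getOrElse(-1L)} ms")
  }

  override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = {
    println(s"Receiver error: ${receiverError.receiverInfo.lastErrorMessage}")
  }
}

ssc.addStreamingListener(new BatchMonitor)
```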
Fifth, performance tuning
Getting the best performance out of a Spark Streaming application on a cluster requires some tuning. This chapter introduces a number of parameters and configurations that can improve the performance of a Spark Streaming application. At a high level, you need to consider two things: 1. reducing the processing time of each batch of data by making efficient use of cluster resources; 2. setting the right batch size (interval) so that batches are processed as fast as they are received. The following sections cover reducing batch processing time, setting the right batch interval, and memory tuning.
Sixth, reducing the batch processing time
Parallelism of data receiving: receiving data over the network (from Kafka, Flume, sockets, etc.) requires deserializing the data and storing it in Spark. If data receiving becomes a bottleneck, consider parallelizing it. Note that each input DStream creates a single receiver (running on a worker machine) that receives a single stream of data. Receiving multiple data streams can therefore be achieved by creating multiple input DStreams and configuring them to receive different partitions of the data stream from the source. For example, a single Kafka input DStream receiving two topics of data can be split into two Kafka input streams, each receiving one topic. This runs two receivers on two workers, allowing data to be received in parallel and increasing overall throughput. The multiple DStreams can then be unioned into a single DStream, and the transformations that were applied to the single input DStream can be applied to the unified stream:

```scala
val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()
```

Another parameter to consider is the receiver's block interval. For most receivers, the received data is coalesced into blocks before being stored in Spark's memory. The number of blocks in each batch determines the number of tasks used to process the received data in map-like transformations. The block interval is controlled by the configuration parameter spark.streaming.blockInterval, which defaults to 200 milliseconds. An alternative to multiple input streams or receivers is to explicitly repartition the input data stream (using inputStream.repartition()), which distributes the received batches of data across the machines in the cluster before further processing (a combined configuration sketch appears at the end of this section).

Parallelism of data processing: cluster resources may be under-utilized if the number of parallel tasks used in any stage of the computation is not high enough. For example, for distributed reduce operations such as reduceByKey and reduceByKeyAndWindow, the default number of parallel tasks is controlled by the spark.default.parallelism configuration property. You can pass the level of parallelism as an argument (see PairDStreamFunctions), or set spark.default.parallelism to change the default.

Data serialization: the overhead of data serialization can be significant, especially with sub-second batch intervals. There are two aspects: serialization of RDD data in Spark (see the Spark tuning guide; note that, unlike in core Spark, streaming RDDs are persisted as serialized byte arrays by default to reduce garbage-collection pauses), and serialization of the input data (data received from external sources arrives as bytes, which must be deserialized and then re-serialized using Spark's serialization format, so deserialization of input data can become a bottleneck).

Task launching overheads: if the number of tasks launched per second is high (say, 50 or more), the overhead of sending tasks to the slaves becomes significant and makes it difficult to achieve sub-second latencies. The overhead can be reduced by the following changes: 1. Task serialization: using Kryo serialization for tasks can reduce task sizes, and hence the time taken to send them to the slaves. 2. Execution mode.
Running Spark in Standalone mode or coarse-grained Mesos mode leads to shorter task launch times than running Spark in fine-grained Mesos mode. Refer to the Running on Mesos guide for more details.
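The knobs discussed in this section can be combined roughly as follows; the concrete values (block interval, parallelism, partition count) are assumptions to be tuned per workload.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BatchTimeTuningSketch")
  .set("spark.streaming.blockInterval", "100")   // block interval in ms: smaller blocks => more tasks per batch
  .set("spark.default.parallelism", "32")        // default parallelism for distributed reduces
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // Kryo serialization

val ssc = new StreamingContext(conf, Seconds(2))

// Alternative to multiple receivers: explicitly repartition the received stream
// across the cluster before the heavy processing stages.
val lines = ssc.socketTextStream("localhost", 9999)
val repartitioned = lines.repartition(16)
```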
Seventh, setting the right batch interval
For a Spark Streaming application running on a cluster to be stable, the system must be able to process data as fast as it is received (that is, the processing rate must be greater than or equal to the receiving rate). This can be checked in the streaming web UI: the batch processing time should be less than the batch interval. Depending on the nature of the streaming computation, the batch interval can significantly affect the data rate the application can sustain. Consider the WordCountNetwork example: at a particular data rate, the system may be able to keep up with printing word counts every 2 seconds (a 2-second batch interval) but not every 500 milliseconds. So, to sustain the expected data rate in production, a suitable batch interval (i.e., batch size) must be set. A good approach for finding the right batch size is to test the application with a conservative batch interval (say, 5-10 seconds) and a low data rate. To verify that the system keeps up with the data rate, check the end-to-end delay of each processed batch (look for "Total delay" in the Spark driver's log4j logs, or use the StreamingListener interface). If the delay is stable, the system is stable; if it keeps growing, the system cannot keep up with the data rate and is unstable. You can then try increasing the data rate or reducing the batch size for further testing. Note that a momentary increase in delay due to a temporary spike in the data rate may be acceptable, as long as the delay falls back to a low value (i.e., less than the batch size).
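A conservative starting point might look like the following minimal sketch; the 5-second interval is just an assumed initial value to be tightened once the observed "Total delay" stays stable at the tested data rate, and conf is an existing SparkConf.

```scala
// Start testing with a conservative batch interval and lower it gradually.
val ssc = new StreamingContext(conf, Seconds(5))
```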
Eighth, memory tuning
Tuning memory usage and the garbage collection behavior of Spark applications is described in detail in the Spark tuning guide. Here we highlight a few parameters that are strongly recommended for reducing garbage collection pauses in Spark Streaming applications and thereby achieving more consistent batch processing times.

1. Default persistence level of DStreams: unlike RDDs, the default persistence level of DStreams keeps the data serialized in memory (StorageLevel.MEMORY_ONLY_SER for DStreams versus StorageLevel.MEMORY_ONLY for RDDs). Even though keeping data serialized increases serialization/deserialization overhead, it significantly reduces garbage collection pauses.
2. Clearing persistent RDDs: by default, the persistent RDDs generated by Spark Streaming are cleared from memory by Spark's built-in LRU policy. If spark.cleaner.ttl is set, persistent RDDs older than that value are cleared periodically; as mentioned earlier, this value needs to be set carefully based on the operations of the Spark Streaming application. A better approach is to set the configuration option spark.streaming.unpersist to true, which lets the system figure out which RDDs are no longer needed and unpersist them. This reduces the memory usage of Spark RDDs and may also improve garbage collection behavior.
3. Concurrent garbage collector: using the concurrent mark-and-sweep garbage collector further reduces garbage collection pause times. Although concurrent GC reduces overall system throughput, it is still recommended when more consistent batch processing times are desired.
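A configuration sketch for these options might look like the following; the GC flag and values are assumptions to adapt per deployment.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MemoryTuningSketch")
  // Let Spark Streaming figure out which generated RDDs can be unpersisted early.
  .set("spark.streaming.unpersist", "true")
  // Use the concurrent mark-and-sweep collector on executors for shorter GC pauses.
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")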
Ninth, fault-tolerance semantics
To understand the fault-tolerance semantics of Spark Streaming, first recall the basic fault-tolerance semantics of RDDs: 1. An RDD is an immutable, deterministically re-computable, distributed dataset. Each RDD remembers the lineage of deterministic operations that were used on a fault-tolerant input dataset to create it. 2. If any partition of an RDD is lost due to a node failure, that partition can be recomputed from the original fault-tolerant dataset by replaying the lineage of operations. Assuming all RDD transformations are deterministic, the data in the final transformed RDD will always be the same irrespective of failures in the Spark cluster.

Data received over the network falls into two categories: 1. data that has been received and replicated, which survives the failure of a single worker node because another node holds a copy of it; 2. data that has been received but only buffered for replication, which can only be recovered by re-reading it from the source, since no other copy exists.

There are two kinds of failures to be concerned about: 1. failure of a worker node: any worker node running executors can fail, in which case all in-memory data on that node is lost; if any receivers were running on the failed node, their buffered data is lost as well; 2. failure of the driver node: if the driver node running the Spark Streaming application fails, then obviously the SparkContext is lost and all executors, along with their in-memory data, are lost with it.

Semantics with files as the input source: if all of the input data is already present in a fault-tolerant file system such as HDFS, Spark Streaming can always recover from any failure and process all of the data. This gives exactly-once semantics: all of the data will be processed exactly once no matter what fails.

Semantics with receiver-based input sources: for input sources based on receivers, the fault-tolerance semantics depend on both the failure scenario and the type of receiver. As discussed earlier, there are two kinds of receivers: 1. reliable receivers, which acknowledge the reliable source only after the received data has been replicated; if such a receiver fails, the source will not have received acknowledgments for the buffered (unreplicated) data, so when the receiver is restarted the source resends that data and no data is lost; 2. unreliable receivers, which can lose data when a worker or driver node fails.

Semantics of output operations: since all data is modeled as RDDs, whose lineage determines deterministic recomputation, every recomputation produces the same result. As a consequence, all DStream transformations have exactly-once semantics: the final transformed result is the same even if a worker node fails. However, output operations such as foreachRDD have at-least-once semantics, which means the transformed data may be written to an external system more than once in the event of a worker failure. When data is saved to HDFS with saveAs***Files, writing it multiple times is acceptable, because the file is simply overwritten with the same data.
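Given the at-least-once nature of output operations, a common way to obtain effectively exactly-once output is to make the writes idempotent. The sketch below illustrates the idea only: wordCounts (assumed to be a DStream[(String, Long)]), the external store, and the saveToExternalStore helper are hypothetical and not part of this article.

```scala
import org.apache.spark.TaskContext

// Hypothetical sink; stands in for whatever external, idempotent store is used.
def saveToExternalStore(idempotencyKey: String, records: Iterator[(String, Long)]): Unit = {
  // e.g. upsert the records under idempotencyKey so replays overwrite rather than duplicate
}

wordCounts.foreachRDD { (rdd, batchTime) =>
  rdd.foreachPartition { partitionRecords =>
    // An identifier built from the batch time and partition index is stable across
    // re-executions of the same partition, so the write can be made idempotent.
    val idempotencyKey = s"${batchTime.milliseconds}-${TaskContext.get.partitionId()}"
    saveToExternalStore(idempotencyKey, partitionRecords)
  }
}
```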