Spark Streaming Practice and Optimization


Published in the February 2016 issue of the journal Programmer. Link: http://geek.csdn.net/news/detail/54500

Xu Xin, Dong Xicheng

In stream computing, Spark Streaming and Storm are currently the two most widely used compute engines. Spark Streaming is an important part of the Spark ecosystem and builds on the Spark compute engine. As shown in Figure 1, Spark Streaming supports many data sources, such as Kafka, Flume, and TCP sockets. Its internal data representation is the DStream (discretized stream), whose interface closely resembles that of the RDD, which makes it very friendly to Spark users. The core idea of Spark Streaming is the "micro-batch": the data stream is split by time, the data in each slice corresponds to an RDD, and each RDD is computed quickly by the Spark engine. Because Spark Streaming uses this micro-batch approach, it is strictly a near-real-time processing system rather than a true streaming system.

Figure 1: Spark Streaming data flow
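To make the micro-batch model concrete, here is a minimal, self-contained sketch (not code from the project described below; the host, port, and word-count logic are only illustrative). A 2-second batch interval turns the stream into a sequence of RDDs that the ordinary Spark engine processes:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MicroBatchExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("MicroBatchExample")
        // Every 2-second slice of the stream becomes one RDD (one micro-batch).
        val ssc = new StreamingContext(conf, Seconds(2))

        // The DStream interface mirrors the RDD interface: map, filter, reduceByKey, ...
        val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical host and port
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }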

Storm is the other well-known open-source streaming engine in this field. It is a true streaming system: it reads records from the data source one at a time and processes them individually. Compared with Spark Streaming, Storm has faster response times (well under one second), which makes it better suited to low-latency scenarios such as credit card fraud detection and advertising systems. The advantage of Spark Streaming over Storm is higher throughput, with response times that are still acceptable (on the order of seconds), plus compatibility with other libraries in the Spark ecosystem, such as MLlib and GraphX. As a result, Spark Streaming is a better choice for systems that are not latency-sensitive but have high data rates.

Spark Streaming Applications at Hulu

Hulu is a professional online video site in the United States. A large number of users watch videos online every day, generating a large amount of viewing-behavior data. This data is fed through a collection system into Hulu's big data platform for storage and further processing. On the big data platform, each team designs its own algorithms to analyze and mine the data and generate business value: the recommendation team mines users' interests from the data to make accurate recommendations, the advertising team pushes the most suitable ads based on users' historical behavior, and the data team analyzes every dimension of the data to provide a reliable basis for company strategy.

Hulu's big data platform follows the Lambda architecture, a general-purpose big data processing framework that includes an offline batch layer, an online speed (accelerator) layer, and a serving layer, as shown in Figure 2. The serving layer typically exposes data through HTTP services or custom clients; the offline batch layer typically uses batch computing frameworks such as Spark and MapReduce for data analysis; and the online speed layer typically uses streaming real-time computing frameworks such as Spark Streaming and Storm.

Figure 2: Lambda architecture schematic diagram

For the real-time computing part, Hulu uses Kafka, Codis, and Spark Streaming internally. The project is introduced below by following the flow of the data.

    1. Collecting Data

Data collection from server logs consists of two main parts:

• Users' video-viewing, ad-click, and other behaviors on the web, mobile apps, set-top boxes, and other devices are recorded in the logs of the respective Nginx services.

• Flume imports the user behavior data into both HDFS and Kafka: the data in HDFS is used for offline analysis, while the data in Kafka is used for streaming real-time analysis.

Figure 3: Hulu data collection process

    2. Storing Tag Data

Hulu uses HBase to store user tag data, including basic attributes such as gender, age, and subscription status, as well as preference attributes inferred by models. These attributes need to be fed into the computation models, but HBase is slow for random reads, so the data must be synchronized to a cache server to speed up access. Redis is a widely used open-source cache server, but it is a single-node system and does not handle large amounts of cached data well. To address Redis's limited scalability, Wandoujia open-sourced Codis, a distributed Redis solution. Hulu packages Codis into a Docker image and has implemented one-click deployment of the cache system, with automatic monitoring and repair. For finer-grained monitoring, Hulu runs several separate Codis caches (a small access sketch follows the list):

• Codis-profile, which synchronizes user attributes from HBase;

• Codis-action, which caches user behavior from Kafka;

• Codis-result, which records computation results.
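Because Codis speaks the Redis protocol, any standard Redis client can read and write these caches. The following is only an illustrative sketch, not Hulu's code: the proxy addresses, ports, and key layout are assumptions, and the open-source Jedis client is used as an example.

    import redis.clients.jedis.Jedis

    object CodisCaches {
      // Hypothetical Codis proxy endpoints for the three caches.
      val profile = new Jedis("codis-profile-proxy", 19000)
      val action  = new Jedis("codis-action-proxy", 19000)
      val result  = new Jedis("codis-result-proxy", 19000)

      // Read user attributes synchronized from HBase (hypothetical key scheme).
      def userProfile(userId: String): Option[String] =
        Option(profile.get(s"profile:$userId"))

      // Append a new behavior record coming from Kafka.
      def appendAction(userId: String, behavior: String): Unit =
        action.rpush(s"action:$userId", behavior)

      // Store a model result for the service layer to read.
      def writeResult(userId: String, value: String): Unit =
        result.set(s"result:$userId", value)
    }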

    3. Real-time Data Processing

Once everything is ready, the Spark Streaming program is started (a simplified sketch follows these steps):

1) Spark Streaming starts Kafka receivers and continuously pulls data from the Kafka servers;

2) Every two seconds, the Kafka data is assembled into an RDD and handed to the Spark engine for processing;

3) For each user behavior, Spark fetches that user's behavior record from the Codis-action cache and appends the new behavior to it;

4) Spark reads all relevant user attributes from Codis-action and Codis-profile, runs the advertising and recommendation models, and finally writes the results to Codis-result, from which the service layer reads them in real time.
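A heavily simplified sketch of this pipeline is shown below. It is not the production code: the ZooKeeper address, topic name, consumer group, receiver count, and the processRecord helper are assumptions used only to illustrate the structure of 2-second micro-batches read from Kafka followed by per-record work against the Codis caches.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object RealTimePipeline {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("RealTimePipeline")
        val ssc = new StreamingContext(conf, Seconds(2))   // 2-second micro-batches

        // Receiver-based Kafka input (Spark 1.x API); all names are illustrative.
        val stream = KafkaUtils.createStream(
          ssc, "zk1:2181,zk2:2181", "behavior-consumers", Map("user-behavior" -> 4),
          StorageLevel.MEMORY_AND_DISK_SER)

        stream.foreachRDD { rdd =>
          rdd.foreachPartition { partition =>
            partition.foreach { case (_, record) =>
              // 1) append the behavior to Codis-action,
              // 2) read attributes from Codis-profile,
              // 3) run the models and write results to Codis-result.
              processRecord(record)   // hypothetical business-logic helper
            }
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }

      def processRecord(record: String): Unit = { /* model code omitted */ }
    }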

Spark Streaming Optimization Experience

In practice, we first made sure the business logic was complete, so that the system ran stably with a small Kafka input and the input and output met the project's requirements. We then started tuning Spark Streaming parameters, such as the number of executors, the number of cores, and the receiver rate. Eventually we found that parameter tuning alone could not fully satisfy this project's business scenario, which led to the further optimizations summarized below:

    1. Executor Initialization

Many machine-learning models need to run an initialization routine the first time they execute, and may also connect to external databases; this often takes 5-10 minutes and is a potential source of instability. In a Spark Streaming application, once a receiver finishes initializing it starts receiving data continuously, and the driver periodically schedules jobs to consume it. If an executor needs several minutes of preparation after it starts, the first job will remain unfinished for that entire period, and the driver will not schedule any new jobs in the meantime. Data then backs up in the Kafka receivers, and as the backlog grows, most of that data is promoted from the young generation to the old generation, putting heavy pressure on the Java GC and easily crashing the application.

Our solution was to modify the Spark core so that each executor runs a user-defined initialization function before it accepts any tasks; independent user logic can be executed inside this function. Sample code is as follows:

// sc is the SparkContext; setupEnvironment is the API added by Hulu's extension
sc.setupEnvironment(() => {
  Application.initialize()  // user application initialization; takes several minutes to run
})

This scheme requires changing Spark's task scheduler: each executor starts in an uninitialized state, and the scheduler assigns only the initialization task (which runs the function above) to executors in that state. Once the initialization task completes, the scheduler marks the executor as initialized, and it can then be assigned normal computation tasks.
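Where modifying the Spark core is not an option, a commonly used alternative (not what this project did, shown only as a hedged sketch) is lazy, once-per-executor initialization through a JVM singleton; its drawback is that the first task on each executor still pays the multi-minute delay inside a running job:

    // Runs inside each executor JVM; the lazy val body executes at most once per executor.
    object ModelEnv {
      lazy val ready: Boolean = {
        Application.initialize()   // same expensive setup as above (hypothetical)
        true
      }
    }

    // Inside a task: the first task on each executor pays the initialization cost,
    // later tasks on that executor reuse the already-initialized state.
    rdd.foreachPartition { partition =>
      ModelEnv.ready                    // forces initialization at most once per JVM
      partition.foreach(processRecord)  // processRecord is a hypothetical per-record helper
    }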

    2. Asynchronous Processing of Business Logic in a Task

In this project, the models' input parameters all come from Codis, and the models themselves may also access external storage. As a direct result, model computation time is unstable, with much of it spent waiting on the network.

To improve system throughput, increasing parallelism is a common optimization, but it does not apply to this project's scenario. Spark's job scheduling policy waits for all tasks of the current job to finish before scheduling the next job. If the running time of individual tasks is unstable, a few slow tasks easily drag down the entire job, leaving resources underutilized; the higher the parallelism, the worse this problem gets. Another common fix for unstable task times is to lengthen the micro-batch interval of Spark Streaming, but that increases the latency of the whole real-time system, so we did not take that route either.

Instead, we process the business logic inside a task asynchronously. As the following code shows, in the synchronous version the business logic executes inside the task and the processing time is unpredictable. In the asynchronous version, the task wraps the business logic in a thread and hands it to a thread pool; the task then finishes immediately and the executor reports back to the driver, so the whole asynchronous path takes under 100 ms. In addition, when the thread pool's backlog becomes too large (the qsize > 100 case in the code), the task temporarily falls back to synchronous processing. Combined with the backpressure mechanism (see the parameter spark.streaming.backpressure.enabled below), this guarantees the system does not crash from excessive data backlog. Our experiments show that this scheme greatly improves system throughput.

  // Synchronous processing
  // runBusinessLogic is the business logic in the task; its execution time is variable
  rdd.foreachPartition(partition => runBusinessLogic(partition))

  // Asynchronous processing; threadPool is the thread pool
  rdd.foreachPartition(partition => {
    val qsize = threadPool.getQueue.size  // number of tasks backlogged in the thread pool
    if (qsize > 100) {
      runBusinessLogic(partition)  // temporarily fall back to synchronous processing
    }
    threadPool.execute(new Runnable {
      override def run() = runBusinessLogic(partition)
    })
  })
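The code above assumes a threadPool object already exists on each executor. One plausible way to construct it, with purely illustrative pool and queue sizes, is a shared java.util.concurrent.ThreadPoolExecutor:

    import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

    // One pool per executor JVM; core/max sizes and keep-alive are illustrative.
    object ExecutorThreadPool {
      val threadPool = new ThreadPoolExecutor(
        8, 8,                                  // core and maximum pool size
        60L, TimeUnit.SECONDS,                 // idle-thread keep-alive time
        new LinkedBlockingQueue[Runnable]())   // work queue inspected via getQueue.size
    }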

Asynchronous processing also has a drawback: if an executor fails, the business logic held in its thread pool cannot be recomputed, so some data is lost. We verified experimentally that data loss only happens when an executor crashes, which is uncommon and acceptable in this project's scenario.

    3. Stability of the Kafka Receiver

This project uses the Kafka receiver in Spark Streaming, which internally calls Kafka's official client, ZookeeperConsumerConnector. Its strategy is that each client registers itself as an ephemeral node under a fixed ZooKeeper path, so all clients know of each other's existence and automatically coordinate the allocation of Kafka partitions. The drawback is that whenever any client's connection state to ZooKeeper changes (disconnects or reconnects), all clients go through ZooKeeper coordination again and redistribute the Kafka partitions; during that time every client drops its Kafka connections, and the system receives no Kafka data until the reassignment succeeds. On a poor network with many receivers, this strategy makes data input unstable, and many Spark Streaming users have hit this problem. In our system, this strategy has no significant negative impact. It is worth noting that the Kafka client and ZooKeeper have a default setting zookeeper.session.timeout.ms=6000, meaning a client's ZooKeeper session is considered valid for 6 seconds. Our clients were repeatedly disconnected from ZooKeeper because full-GC pauses exceeded 6 seconds, then reconnected, affecting all clients and making system performance unstable, so the project sets zookeeper.session.timeout.ms=30000.
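With the receiver-based Kafka API in Spark 1.x, this setting can be passed to the consumer through the kafkaParams map. The sketch below only illustrates where the parameter goes; it assumes an existing StreamingContext ssc, and the ZooKeeper quorum, group id, and topic map are hypothetical.

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map(
      "zookeeper.connect" -> "zk1:2181,zk2:2181",
      "group.id" -> "behavior-consumers",
      "zookeeper.session.timeout.ms" -> "30000")   // survive long full-GC pauses

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Map("user-behavior" -> 4), StorageLevel.MEMORY_AND_DISK_SER)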

    4. YARN Resource Preemption

Within Hulu, a long-running service like Spark Streaming shares the YARN cluster with batch applications such as MapReduce, Spark, and Hive. In this shared environment, the Spark Streaming service was often destabilized by batch applications consuming large amounts of network or CPU resources (we use cgroups for resource isolation, but the results are poor). A more serious problem is that if an individual container crashes, the driver must ask YARN for a new one, and if the whole application crashes it must be restarted; Spark Streaming cannot guarantee that enough resources will be granted quickly, so the quality of the online service cannot be guaranteed. To solve this, Hulu uses YARN's label-based scheduling to set aside several nodes in the cluster specifically for Spark Streaming and other long-running services, so they do not compete with batch programs for resources.

    5. Improving Monitoring Information

Monitoring reflects how the system performs at runtime and is the basis for all optimization. Hulu uses Graphite and Grafana as a third-party monitoring system: key performance metrics, such as computation time and invocation counts, are sent to the Graphite server and shown as intuitive charts on Grafana web pages.
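Graphite accepts metrics over its plaintext protocol, one "metric.path value timestamp" line per metric, on TCP port 2003 by default. A minimal reporting sketch (the host name and metric path are assumptions, and production code would reuse the connection) might look like this:

    import java.io.PrintWriter
    import java.net.Socket

    object GraphiteReporter {
      // Send a single metric using Graphite's plaintext protocol; host and port are assumptions.
      def report(path: String, value: Double): Unit = {
        val socket = new Socket("graphite.example.com", 2003)
        val out = new PrintWriter(socket.getOutputStream, true)
        try {
          out.println(s"$path $value ${System.currentTimeMillis() / 1000}")
        } finally {
          out.close()
          socket.close()
        }
      }
    }

    // e.g. GraphiteReporter.report("streaming.task.compute_time_ms", 87.0)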

Figure 4: Graphite monitoring information, showing the number of logs remaining in Kafka; each line corresponds to the historical backlog of one partition

Figure 4 plots the number of unconsumed logs in Kafka, with each line corresponding to the historical backlog of one partition; most of the time the backlog is close to 0, as expected. At 09:55 in the figure, the backlog spikes sharply and then quickly returns toward 0. After checking various data we confirmed that the Kafka input had remained stable while Spark Streaming execution suddenly slowed down; the backpressure mechanism kicked in, the Kafka receivers reduced their read rate, and data accumulated in Kafka. After a while, Spark Streaming returned to normal and quickly consumed the backlog.

An intuitive monitoring system effectively exposes problems and deepens our understanding of the system, which in turn helps us strengthen it. In our practice, the main monitoring indicators are:

• the amount of data remaining (unconsumed) in Kafka;

• Spark job running time and scheduling time;

• the computation time of each task;

• the number of Codis accesses, their latency, and the hit rate.

In addition, scripts regularly analyze these statistics and send email alerts on anomalies. When the Kafka backlog shown in Figure 4 grows too large, alert emails are sent continuously. Our experience is that the more detailed the monitoring, the easier later optimization becomes.

    6. Parameter Optimization

Some of the key parameters in this project are listed below (an illustrative configuration sketch follows the list):

• spark.yarn.max.executor.failures: the number of executor failures allowed; if the limit is exceeded, the whole Spark Streaming application fails, so it should be set to a large value.

• spark.yarn.executor.memoryOverhead: the JVM overhead per executor, separate from heap memory; setting it too small causes memory-overflow exceptions.

• spark.receivers.num: the number of Kafka receivers.

• spark.streaming.receiver.maxRate: the maximum rate at which each receiver accepts data; set it roughly 50% above the expected peak.

• spark.streaming.backpressure.enabled: enables the backpressure mechanism; if the current system delay grows too long, the receivers automatically reduce their ingestion rate to keep the system from crashing under excessive data backlog.

• spark.locality.wait: when scheduling a task, the system tries to honor data locality; if the wait exceeds this limit, locality is given up. This parameter directly affects task scheduling time.

• spark.cleaner.ttl: the time-to-live of metadata inside Spark; when streaming runs for a long time, too much accumulated metadata degrades performance.
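For reference, such parameters can be set on a SparkConf or passed with --conf to spark-submit. The values below are purely illustrative and are not the ones used in the project; spark.receivers.num is treated as an application-level setting that the job itself reads.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("SparkStreamingService")
      .set("spark.yarn.max.executor.failures", "100")        // tolerate many executor failures
      .set("spark.yarn.executor.memoryOverhead", "2048")     // MB of off-heap overhead per executor
      .set("spark.receivers.num", "4")                       // application-level: number of Kafka receivers
      .set("spark.streaming.receiver.maxRate", "15000")      // records/sec per receiver, ~50% above peak
      .set("spark.streaming.backpressure.enabled", "true")   // let receivers slow down under delay
      .set("spark.locality.wait", "500ms")                   // bound locality-driven scheduling delay
      .set("spark.cleaner.ttl", "7200")                      // seconds before old metadata is cleaned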

Summary

The Spark Streaming product has been in production for more than a year, through several Spark upgrades from the early 0.8 releases to the recent 1.5.x releases. Overall, Spark Streaming is an excellent real-time computing framework that can be used in production. However, it still has some shortcomings: Spark uses both heap and off-heap memory but lacks effective monitoring of them, which makes OOM problems hard to analyze and debug; and newer Spark versions have shown occasional issues, such as block loss during shuffle and memory overflows.
