1. Framework Overview
The architecture of event processing is as follows.
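(The architecture figure is not reproduced here. Based on the components described below, the flow is roughly: event sources → Flume → Kafka → Spark Streaming application, which enriches the events with DataFrames backed by reference data read from Hive.)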
2. Optimization Summary
When we deployed the entire solution for the first time, the Kafka and Flume components performed very well, but the Spark Streaming application took 4-8 minutes to process a single batch. There were two reasons for this delay: first, we use DataFrames to enrich the data, and the enrichment requires reading a large amount of data from Hive; second, our parameter configuration was not ideal.
To optimize our processing time, we made two improvements: first, caching the appropriate data and partitions; second, changing configuration parameters to tune the Spark application. The spark-submit command used to run the Spark application is shown below. Through parameter optimization and code improvements, we cut the processing time from 4-8 minutes to less than 25 seconds.
/opt/app/dev/spark-1.5.2/bin/spark-submit \
  --jars /opt/cloudera/parcels/CDH/jars/zkclient-0.3.jar,/opt/cloudera/parcels/CDH/jars/kafka_2.10-0.8.1.1.jar,/opt/app/dev/jars/datanucleus-core-3.2.2.jar,/opt/app/dev/jars/datanucleus-api-jdo-3.2.1.jar,/opt/app/dev/jars/datanucleus-rdbms-3.2.1.jar \
  --files /opt/app/dev/spark-1.5.2/conf/hive-site.xml,/opt/app/dev/jars/log4j-eir.properties \
  --queue spark_service_pool \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.ui.showConsoleProgress=false" \
  --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=6G -XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
  --conf "spark.sql.tungsten.enabled=false" \
  --conf "spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory" \
  --conf "spark.eventLog.enabled=true" \
  --conf "spark.sql.codegen=false" \
  --conf "spark.sql.unsafe.enabled=false" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
  --conf "spark.streaming.backpressure.enabled=true" \
  --conf "spark.locality.wait=1s" \
  --conf "spark.streaming.blockInterval=1500ms" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --driver-memory 10G \
  --executor-memory 8G \
  --executor-cores 20 \
  --num-executors 20 \
  --class com.bigdata.streaming.OurApp \
  /opt/app/dev/jars/OurStreamingApplication.jar external_props.conf
The following describes the changed parameters in detail.
2.1 Driver Options
Note that the driver runs in cluster mode on Spark on YARN. Because a Spark Streaming application is a long-running task, it produces large log files. To address this, we limit the number of messages written to the logs and use RollingFileAppender to cap their size. We also disabled spark.ui.showConsoleProgress to suppress the console progress messages.
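The contents of log4j-eir.properties are not shown in the article; as a rough illustration only, a RollingFileAppender setup along these lines caps the log size. The appender name, file location, size limit, and backup count below are assumptions.
# Illustrative log4j 1.2 configuration: log only WARN and above, roll the file at 50 MB, keep 5 backups.
log4j.rootLogger=WARN, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n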
Through testing, we found that the driver frequently ran out of memory because the permanent generation filled up (the permanent generation is where classes, methods, and so on are stored, and it is never reallocated). Raising the permanent generation size to 6 GB solved the problem.
spark.driver.extraJavaOptions=-XX:MaxPermSize=6G
2.2 Garbage Collection
Because our Spark Streaming application is a long-running process, we noticed after it had been running for a while that GC pause times were too long. We wanted to reduce these pauses or keep them in the background. Switching to the concurrent mark-sweep (CMS) collector with UseConcMarkSweepGC is one way to do this.
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
2.3 Disable Tungsten
Tungsten is a major overhaul of Spark's execution engine. Its first release had problems, however, so we disabled it for now.
spark.sql.tungsten.enabled=false
spark.sql.codegen=false
spark.sql.unsafe.enabled=false
2.4 Enable Back Pressure
Spark Streaming runs into trouble when the batch processing time is longer than the batch interval; in other words, when Spark reads data more slowly than Kafka delivers it. If execution continues at that throughput for too long, the application becomes unstable and the executors receiving the data run out of memory. Enabling back pressure lets Spark Streaming adjust the receiving rate based on how previous batches performed; setting the following parameter solved this problem for us.
spark.streaming.backpressure.enabled=true
2.5 Adjust Locality and Block Configuration
The following two parameters are complementary. One determines how long a task waits for an executor with local data; the other determines how the Spark Streaming receiver groups incoming data into blocks. Larger blocks are generally better, but if the data is not local to an executor it must be moved over the network to wherever the task runs. We have to strike a balance between the two settings, because we want neither blocks that are too large nor long waits for locality; we want every task to finish within a few seconds.
Therefore, we lowered the locality wait from 3 s to 1 s, and we set the block interval to 1.5 s.
--conf "spark.locality.wait=1s" --conf "spark.streaming.blockInterval=1500ms" \
2.6 Merge Temporary Files
On the ext4 file system, we recommend enabling this feature; it makes the shuffle reuse output files, so far fewer temporary files are created.
--conf "spark.shuffle.consolidateFiles=true" \
2.7 Executor Configuration
When you configure a Kafka DStream, you can specify the number of concurrent consumer threads. However, the consumers of a single Kafka DStream all run on the same node; to consume a Kafka topic in parallel across multiple machines, we have to instantiate multiple DStreams. Their corresponding RDDs can be unioned, or, alternatively, multiple instances of the application can be run as members of the same Kafka consumer group.
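The article does not show the application code, but a minimal sketch of the multiple-DStream approach with the Spark 1.5 receiver-based Kafka API might look like the following. The ZooKeeper quorum, topic name, consumer group, batch interval, and receiver count are all placeholder assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MultiReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OurStreamingApplication")
    val ssc  = new StreamingContext(conf, Seconds(30))   // assumed batch interval

    // Each createStream call starts its own receiver, so several receivers
    // can consume the topic in parallel on different executors.
    val numReceivers = 5                                  // assumed
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(
        ssc,
        "zk1:2181,zk2:2181,zk3:2181",   // ZooKeeper quorum (placeholder)
        "our_consumer_group",           // Kafka consumer group (placeholder)
        Map("events" -> 1),             // topic -> threads per receiver (placeholder)
        StorageLevel.MEMORY_AND_DISK_SER)
    }

    // Union the per-receiver DStreams into one DStream for downstream processing.
    val unified = ssc.union(streams)
    unified.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}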
To achieve this, we configured 20 executors, each with 20 cores.
--executor-memory 8G --executor-cores 20 --num-executors 20
2.8 Caching
Cache an RDD before it is reused, but remember to remove it from the cache before the next iteration. Caching data that is needed multiple times is very useful. However, do not let the number of partitions grow too large; keeping the partition count down minimizes scheduling latency. The following rule of thumb gives the number of partitions.
# of executors * # of cores = # of partitions
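The article does not include the enrichment code, so the following is only a sketch of the cache-then-unpersist pattern inside foreachRDD; the table, column, and payload names are invented for illustration, and the 400 partitions come from the 20 executors × 20 cores configured above.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.dstream.DStream

object CachingSketch {
  // stream carries (key, value) pairs from the Kafka DStream; hiveContext is created once on the driver.
  def enrich(stream: DStream[(String, String)], hiveContext: HiveContext): Unit = {
    val numPartitions = 20 * 20            // # of executors * # of cores

    stream.foreachRDD { rdd =>
      // Cache the Hive reference data that the enrichment join reads repeatedly
      // within this batch, partitioned to match the cluster's core count.
      val lookup = hiveContext.table("dim_device")       // placeholder table name
        .repartition(numPartitions)
        .cache()

      val events   = hiveContext.read.json(rdd.map(_._2))   // message values as JSON (assumed)
      val enriched = events.join(lookup, "device_id")        // placeholder join column
      enriched.write.mode("append").saveAsTable("enriched_events")  // placeholder target

      // Remove the cached data before the next iteration (the next batch).
      lookup.unpersist()
    }
  }
}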