Spark Streaming real-time processing applications

1. Framework Overview

The event-processing architecture is straightforward: events flow through Flume and Kafka into a Spark Streaming application, which enriches them with reference data read from Hive.

2. Optimization Summary

When we first deployed the entire solution, the Kafka and Flume components performed very well, but the Spark Streaming application took 4-8 minutes to process a single batch. There were two reasons for this delay: first, we used DataFrames to enrich the data, and the enrichment required reading a large amount of data from Hive; second, our parameter configuration was not ideal.

To cut the processing time, we made two kinds of improvements: caching the appropriate data with a sensible number of partitions, and changing configuration parameters to optimize the Spark application. The spark-submit command used to run the application is shown below. Through parameter tuning and code improvements, we reduced the processing time from 4-8 minutes to under 25 seconds.

/opt/app/dev/spark-1.5.2/bin/spark-submit \
  --jars /opt/cloudera/parcels/CDH/jars/zkclient-0.3.jar,/opt/cloudera/parcels/CDH/jars/kafka_2.10-0.8.1.1.jar,/opt/app/dev/jars/datanucleus-core-3.2.2.jar,/opt/app/dev/jars/datanucleus-api-jdo-3.2.1.jar,/opt/app/dev/jars/datanucleus-rdbms-3.2.1.jar \
  --files /opt/app/dev/spark-1.5.2/conf/hive-site.xml,/opt/app/dev/jars/log4j-eir.properties \
  --queue spark_service_pool \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.ui.showConsoleProgress=false" \
  --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=6G -XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
  --conf "spark.sql.tungsten.enabled=false" \
  --conf "spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory" \
  --conf "spark.eventLog.enabled=true" \
  --conf "spark.sql.codegen=false" \
  --conf "spark.sql.unsafe.enabled=false" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
  --conf "spark.streaming.backpressure.enabled=true" \
  --conf "spark.locality.wait=1s" \
  --conf "spark.streaming.blockInterval=1500ms" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --driver-memory 10G \
  --executor-memory 8G \
  --executor-cores 20 \
  --num-executors 20 \
  --class com.bigdata.streaming.OurApp \
  /opt/app/dev/jars/OurStreamingApplication.jar external_props.conf

The changed parameters are described in detail below.

2.1 Driver options

Note that in Spark on YARN cluster mode, the driver runs on the cluster rather than on the submitting machine. Because a Spark Streaming application is a long-running task, its logs can grow very large. To solve this problem, we limited the number of messages written to the logs and used a RollingFileAppender to cap their size. We also disabled spark.ui.showConsoleProgress to suppress console progress messages.
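For reference, a minimal log4j configuration along these lines might look like the sketch below. It is not the original log4j-eir.properties; the log path, size limit, and backup count are illustrative assumptions.

log4j.rootLogger=WARN, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
# Size-bounded log file; Spark on YARN substitutes the container log directory.
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n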

Testing showed that our driver frequently ran out of memory because the permanent generation filled up. (The permanent generation is where classes, methods, and similar metadata are stored; it is not reclaimed by normal collections.) Raising the permanent generation size to 6 GB solved the problem:

spark.driver.extraJavaOptions=-XX:MaxPermSize=6G
2.2 Garbage collection

Because our Spark Streaming application is a long-running process, we noticed after some time in production that GC pause times were too long; we wanted to reduce them or keep collection work in the background. Switching to the concurrent mark-sweep (CMS) collector via -XX:+UseConcMarkSweepGC did the trick, on both the driver and the executors:

--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
2.3 Disable Tungsten

Tungsten is a major overhaul of the Spark execution engine. However, its first version had problems, so we temporarily disabled it:

spark.sql.tungsten.enabled=false
spark.sql.codegen=false
spark.sql.unsafe.enabled=false
2.4 Enable back pressure

Spark Streaming runs into trouble when the batch processing time exceeds the batch interval; in other words, when Spark reads data more slowly than Kafka delivers it. If execution stays above this throughput for too long, the backlog grows and the receiving executors eventually run out of memory. Enabling back pressure, which throttles the ingestion rate to match the processing rate, solves this problem:

spark.streaming.backpressure.enabled=true
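If a hard ceiling on ingestion is also desired, back pressure can be combined with a per-receiver rate limit. This parameter is not part of the original command, and the value is an illustrative assumption:

spark.streaming.receiver.maxRate=10000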
2.5 Adjust locality and block configuration

The following two parameters are complementary. The first determines how long a task waits for a data-local slot before Spark ships the data over the network to a free executor; the second determines how often the Spark Streaming receiver groups incoming data into blocks. The larger the blocks, the better, but if the data is not local to an executor it must be moved across the network to wherever the task executes. We had to find a good balance between the two parameters, because we wanted neither oversized blocks nor long waits for locality; we wanted all tasks to finish within a few seconds.

We therefore lowered the locality wait from the default 3 s to 1 s, and set the block interval to 1.5 s.

--conf "spark.locality.wait=1s" --conf "spark.streaming.blockInterval=1500ms" \
2.6 Consolidate shuffle files

On the ext4 file system, we recommend enabling this option. It makes Spark consolidate intermediate shuffle output, so far fewer temporary files are generated:

--conf "spark.shuffle.consolidateFiles=true" \
2.7 Executor configuration

When you configure a Kafka DStream, you can specify the number of concurrent consumer threads. However, all the consumers of a single Kafka DStream run inside one receiver, on a single node. So, to consume a Kafka topic concurrently from multiple machines, we must instantiate multiple DStreams and union their underlying RDDs, with all the receivers joining the same Kafka consumer group (see the sketch at the end of this section).

To make room for this parallelism, we configured 20 executors, each with 20 cores:

--executor-memory 8G --executor-cores 20 --num-executors 20
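A minimal sketch of the multiple-DStream pattern in Scala, using the Spark 1.5-era receiver-based Kafka API, is shown below. The ZooKeeper quorum, topic name, consumer group, thread count, and number of streams are illustrative assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("OurStreamingApplication")
val ssc = new StreamingContext(sparkConf, Seconds(30)) // batch interval: assumed value

// Create several receiver-based streams in the same consumer group, so that
// the topic's partitions are spread across receivers on different executors.
val numStreams = 5 // illustrative
val streams = (1 to numStreams).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "our-consumer-group", Map("events" -> 4))
}

// Union the per-receiver streams into a single DStream before processing.
val unified = ssc.union(streams)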
2.8 Cache method

Cache an RDD before reusing it, but remember to remove it from the cache (unpersist it) before the next iteration. Caching data that is needed multiple times is very valuable. However, do not let the number of partitions grow too large; keeping the partition count down minimizes scheduling latency. We used the following formula to calculate the number of partitions:

# of executors * # of cores = # of partitions
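A minimal sketch of this caching pattern in Scala, applied to the Hive reference data used for enrichment, follows. The table name, partition count, and refresh logic are illustrative assumptions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("OurStreamingApplication"))
val sqlContext = new HiveContext(sc)

// 20 executors * 20 cores = 400 partitions, per the formula above.
var reference = sqlContext.table("reference_data").repartition(400).cache()

// ... use `reference` to enrich each batch ...

// Before the next iteration, drop the stale copy from the cache, then reload.
reference.unpersist()
reference = sqlContext.table("reference_data").repartition(400).cache()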
