1. Framework Overview
The architecture of event processing is as follows.
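(The architecture figure is not reproduced here. Based on the components described below, the flow is roughly: event sources → Flume → Kafka → Spark Streaming application, which enriches the events with DataFrames backed by reference data read from Hive.)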
2. Optimization Summary
When we deployed the entire solution for the first time, the Kafka and Flume components performed very well, but the Spark Streaming application took 4-8 minutes to process a single batch. There were two reasons for this delay: first, we use DataFrames to enrich the data, and the enrichment requires reading a large amount of data from Hive; second, our parameter configuration was not ideal.
To optimize our processing time, we made two improvements: first, caching the appropriate data and partitions; second, changing configuration parameters to tune the Spark application. The spark-submit command used to run the Spark application is shown below. Through parameter optimization and code improvements, we cut the processing time from 4-8 minutes to less than 25 seconds.
/opt/app/dev/spark-1.5.2/bin/spark-submit \
  --jars /opt/cloudera/parcels/CDH/jars/zkclient-0.3.jar,/opt/cloudera/parcels/CDH/jars/kafka_2.10-0.8.1.1.jar,/opt/app/dev/jars/datanucleus-core-3.2.2.jar,/opt/app/dev/jars/datanucleus-api-jdo-3.2.1.jar,/opt/app/dev/jars/datanucleus-rdbms-3.2.1.jar \
  --files /opt/app/dev/spark-1.5.2/conf/hive-site.xml,/opt/app/dev/jars/log4j-eir.properties \
  --queue spark_service_pool \
  --master yarn \
  --deploy-mode cluster \
  --conf "spark.ui.showConsoleProgress=false" \
  --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=6G -XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
  --conf "spark.sql.tungsten.enabled=false" \
  --conf "spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory" \
  --conf "spark.eventLog.enabled=true" \
  --conf "spark.sql.codegen=false" \
  --conf "spark.sql.unsafe.enabled=false" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
  --conf "spark.streaming.backpressure.enabled=true" \
  --conf "spark.locality.wait=1s" \
  --conf "spark.streaming.blockInterval=1500ms" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --driver-memory 10G \
  --executor-memory 8G \
  --executor-cores 20 \
  --num-executors 20 \
  --class com.bigdata.streaming.OurApp \
  /opt/app/dev/jars/OurStreamingApplication.jar external_props.conf
The following describes the changed parameters in detail.
2.1 Driver Options
Note that the driver runs in cluster mode on Spark on YARN. Because a Spark Streaming application is a long-running task, it produces large log files. To address this, we limit the number of messages written to the logs and use RollingFileAppender to cap their size. We also disabled spark.ui.showConsoleProgress to suppress the console progress messages.
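The contents of log4j-eir.properties are not shown in the article; as a rough illustration only, a RollingFileAppender setup along these lines caps the log size. The appender name, file location, size limit, and backup count below are assumptions.
# Illustrative log4j 1.2 configuration: log only WARN and above, roll the file at 50 MB, keep 5 backups.
log4j.rootLogger=WARN, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n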
Through testing, we found that the driver frequently ran out of memory because the permanent generation filled up (the permanent generation is where classes, methods, and so on are stored, and it is never reallocated). Raising the permanent generation size to 6 GB solved the problem.
spark.driver.extraJavaOptions=-XX:MaxPermSize=6G
2.2 Garbage Collection
Because our Spark Streaming application is a long-running process, we noticed after it had been running for a while that GC pause times were too long. We wanted to reduce these pauses or keep them in the background. Switching to the concurrent mark-sweep (CMS) collector with UseConcMarkSweepGC is one way to do this.
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
2.3 Disable Tungsten
Tungsten is a major overhaul of Spark's execution engine. Its first release had problems, however, so we disabled it for now.
spark.sql.tungsten.enabled=false
spark.sql.codegen=false
spark.sql.unsafe.enabled=false
2.4 Enable Back Pressure
Spark Streaming runs into trouble when the batch processing time is longer than the batch interval; in other words, when Spark reads data more slowly than Kafka delivers it. If execution continues at that throughput for too long, the application becomes unstable and the executors receiving the data run out of memory. Enabling back pressure lets Spark Streaming adjust the receiving rate based on how previous batches performed; setting the following parameter solved this problem for us.
spark.streaming.backpressure.enabled=true
2.5 Adjust Locality and Block Configuration
The following two parameters are complementary. One determines how long a task waits for an executor with local data; the other determines how the Spark Streaming receiver groups incoming data into blocks. Larger blocks are generally better, but if the data is not local to an executor it must be moved over the network to wherever the task runs. We have to strike a balance between the two settings, because we want neither blocks that are too large nor long waits for locality; we want every task to finish within a few seconds.
Therefore, we lowered the locality wait from 3 s to 1 s, and we set the block interval to 1.5 s.
--conf "spark.locality.wait=1s" --conf "spark.streaming.blockInterval=1500ms" \
2.6 Merge Temporary Files
On the ext4 file system, we recommend enabling this feature; it makes the shuffle reuse output files, so far fewer temporary files are created.
--conf "spark.shuffle.consolidateFiles=true" \
2.7 Executor Configuration
When you configure a Kafka DStream, you can specify the number of concurrent consumer threads. However, the consumers of a single Kafka DStream all run on the same node; to consume a Kafka topic in parallel across multiple machines, we have to instantiate multiple DStreams. Their corresponding RDDs can be unioned, or, alternatively, multiple instances of the application can be run as members of the same Kafka consumer group.
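The article does not show the application code, but a minimal sketch of the multiple-DStream approach with the Spark 1.5 receiver-based Kafka API might look like the following. The ZooKeeper quorum, topic name, consumer group, batch interval, and receiver count are all placeholder assumptions.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MultiReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OurStreamingApplication")
    val ssc  = new StreamingContext(conf, Seconds(30))   // assumed batch interval

    // Each createStream call starts its own receiver, so several receivers
    // can consume the topic in parallel on different executors.
    val numReceivers = 5                                  // assumed
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(
        ssc,
        "zk1:2181,zk2:2181,zk3:2181",   // ZooKeeper quorum (placeholder)
        "our_consumer_group",           // Kafka consumer group (placeholder)
        Map("events" -> 1),             // topic -> threads per receiver (placeholder)
        StorageLevel.MEMORY_AND_DISK_SER)
    }

    // Union the per-receiver DStreams into one DStream for downstream processing.
    val unified = ssc.union(streams)
    unified.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}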
To achieve this, we configured 20 executors, each with 20 cores.
--executor-memory 8G --executor-cores 20 --num-executors 20
2.8 Caching
Cache an RDD before it is reused, but remember to remove it from the cache before the next iteration. Caching data that is needed multiple times is very useful. However, do not let the number of partitions grow too large; keeping the partition count down minimizes scheduling latency. The following rule of thumb gives the number of partitions.
# of executors * # of cores = # of partitions
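The article does not include the enrichment code, so the following is only a sketch of the cache-then-unpersist pattern inside foreachRDD; the table, column, and payload names are invented for illustration, and the 400 partitions come from the 20 executors × 20 cores configured above.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.dstream.DStream

object CachingSketch {
  // stream carries (key, value) pairs from the Kafka DStream; hiveContext is created once on the driver.
  def enrich(stream: DStream[(String, String)], hiveContext: HiveContext): Unit = {
    val numPartitions = 20 * 20            // # of executors * # of cores

    stream.foreachRDD { rdd =>
      // Cache the Hive reference data that the enrichment join reads repeatedly
      // within this batch, partitioned to match the cluster's core count.
      val lookup = hiveContext.table("dim_device")       // placeholder table name
        .repartition(numPartitions)
        .cache()

      val events   = hiveContext.read.json(rdd.map(_._2))   // message values as JSON (assumed)
      val enriched = events.join(lookup, "device_id")        // placeholder join column
      enriched.write.mode("append").saveAsTable("enriched_events")  // placeholder target

      // Remove the cached data before the next iteration (the next batch).
      lookup.unpersist()
    }
  }
}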