Spark Streaming in practice: scaling data volume from 1% to full

Source: Internet
Author: User
Tags: memory usage, redis, time interval, redis cluster

Overview

- Background
- Spark parameter optimization: increase executor-cores, resize executor-memory, set num-executors
- First-batch backpressure policy
- 2.x message-queue bug workaround
- PHP-side limit processing
- Actions: processing speed at 1%; from 1% to 10% (peak, off-peak, status description); from 10% to 50% (use a pipeline to raise the QPS of Redis); from 50% to full volume (peak-period state analysis)

Architecture

Background

More than 12 modules and more than 24 Kafka topics now feed our call-chain computing platform. Initially, considering Spark processing speed and the demonstrated feasibility of sampling (see the Google Dapper paper), we processed only a 1% sample. However, the recently added WF logs require full-volume processing, so we had to raise the processing speed of the Spark job to meet this requirement.

Spark parameter optimization

Our batch interval is 1 minute. Before optimization, the Spark job needed about 50s to process 2w (20,000) trace IDs, leaving no headroom for growth in data volume, so the job had to be sped up. We made the following adjustments.

Increase executor-cores

Increasing concurrency is the most effective way to improve execution speed. But because we were using standalone Redis, setting executor-cores to 2 produced connection-refused errors; after some research we found this happens when the number of concurrent connections to a standalone Redis is multiplied. We later migrated to Redis Cluster, which resolved the problem; executor-cores is now set to 4.

Adjust executor-memory

executor-memory was originally set to 4G. We found our task did not need that much; 1G was enough for the 1% sample. Our cluster has 800G of memory in total and other tasks run on it, so if you do not need the memory, shrink this parameter. The right value depends on how the task actually runs: if GC time is large, or OOM occurs, increase it appropriately. Because we are moving to full volume, I temporarily raised it to 6G and will adjust after observing actual runs.

Set num-executors

num-executors went from the original 30 to the current 56 (56 divides evenly across our 8 slaves, which is convenient).
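Putting the three adjustments together, the submit command might look like the sketch below (the application jar name and any app-specific flags are placeholders, not from the original post):

```shell
# Sketch only: final values from the tuning above
# (--executor-memory 6G reflects the move to full volume).
spark-submit \
  --master yarn \
  --num-executors 56 \
  --executor-cores 4 \
  --executor-memory 6G \
  your-streaming-app.jar
```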

First-batch backpressure policy

Limit the amount of data processed in the first batch, because on a cold start the first batch after the job launches is otherwise too large and memory usage spikes:

spark.streaming.backpressure.enabled=true
spark.streaming.backpressure.initialRate=200
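As a back-of-envelope check (assuming our 1-minute batch interval), `initialRate` bounds how many records per receiver the first batch can pull, which is what keeps the cold start small:

```python
# spark.streaming.backpressure.initialRate caps the FIRST batch's ingest
# rate per receiver (records/sec); later batches are governed by the
# backpressure feedback loop.
initial_rate = 200       # records/sec per receiver, from the config above
batch_interval_s = 60    # our job runs on a 1-minute batch interval

first_batch_max = initial_rate * batch_interval_s
print(first_batch_max)   # 12000 records per receiver at most in batch 1
```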
Spark 2.x message-queue bug workaround

See the separate write-up analyzing the causes of Spark Streaming stage stalls.

PHP-side limit processing

Due to the CPU limit of a single Docker instance (4 cores, 16G), one Docker can currently process only 2w trace IDs per minute, and the current ES cluster cannot store 200w records per minute either. So I added a limit on the PHP side: all IDs are read from the SortedSet, but only 2w are pushed to the queue and the rest are deleted.

We will change this strategy once our smart computing system comes online. Core code:
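The core PHP snippet itself is not reproduced above; as a rough illustration (in Python, with hypothetical names, not the original code), the cap-and-trim logic amounts to:

```python
def cap_trace_ids(all_ids, limit=20_000):
    """Keep only the first `limit` trace IDs for the processing queue.

    In the real system the IDs come from a Redis SortedSet and the
    overflow is deleted from Redis; here we just split the list.
    """
    kept = all_ids[:limit]
    dropped = all_ids[limit:]      # deleted in the real implementation
    return kept, dropped

kept, dropped = cap_trace_ids([f"trace-{i}" for i in range(50_000)])
print(len(kept), len(dropped))    # 20000 30000
```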

Action

We raised the volume in 3 stages: from 1% to 10%, from 10% to 50%, and from 50% to full volume, optimizing at each stage to gradually meet the requirements.

Processing speed at 1%

With the above optimizations, our peak processing time dropped from the original ~50s to about 15s, and off-peak from ~20s to about 10s: an efficiency improvement of 50%+.

From 1% to 10%

Peak

The take-out business peak starts at about 11:00; we looked at the processing speed during this period.

Average processing time: around 30s.

Off-peak

Around 2:00 it is basically off-peak.

Job processing speed

Stage processing speed (2 stages)


Average processing time: around 15s.

Status description

The ratio of time spent storing to HBase versus storing to Redis is 2:1, but the HBase path includes data computation and aggregation, so this is normal. Redis is pure storage and consumes about 1/3 of the time, but it has not hit a bottleneck, so we left it alone for now.

From 10% to 50%

Peak

Average processing time: 1min.

Off-peak

Average processing time: 50s.

Status description

Off-peak time has reached around 45s, and time spent storing data to Redis is now close to 1:1 with HBase. It seems we have to optimize Redis's write QPS. We started from two aspects: using a Jedis pipeline for writes, and raising the Redis Cluster's QPS.

Using a pipeline

Because the official JedisCluster does not provide a pipeline API, we used a community-developed JedisClusterPipeline, and the effect was obvious: off-peak Redis storage time dropped to about 1s.
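The win from pipelining is batching many commands into one network round trip per node. A simplified sketch of that batching idea (in Python, hypothetical; the real code uses JedisClusterPipeline) is:

```python
def pipeline_batches(commands, batch_size=1000):
    """Group commands so each group is flushed to Redis in a single
    round trip instead of paying one round trip per command."""
    for i in range(0, len(commands), batch_size):
        yield commands[i:i + batch_size]

# e.g. 5,500 ZADDs at ~1 ms round trip each would cost ~5.5s serially;
# in batches of 1,000 only ~6 round trips of latency are paid.
cmds = [("ZADD", f"trace:{i}", 1, "span") for i in range(5_500)]
batches = list(pipeline_batches(cmds))
print(len(batches), len(batches[-1]))  # 6 500
```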

Raising the QPS of Redis

We measured our QPS with a benchmark: it reached 11w, so writing 80w entries (peak) should supposedly take well under 20s. We then asked a professional op; the answer was that a SortedSet is an ordered structure, so storing into it is much slower, and its QPS cannot be raised by adding nodes. So we gave up on this approach.
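For context, the arithmetic behind "under 20s" (assuming the benchmark throughput held for SortedSet writes, which, as the op explained, it does not):

```python
# Ideal-case estimate from the benchmark above; real SortedSet (ZADD)
# throughput is lower than the simple ops redis-benchmark measures.
benchmark_qps = 110_000      # ~11w ops/sec measured
peak_writes = 800_000        # ~80w entries at peak

ideal_seconds = peak_writes / benchmark_qps
print(round(ideal_seconds, 1))   # 7.3 -- comfortably under 20s in theory
```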

From 50% to full volume

Off-peak

Average processing time: around 30s. Data volume is around 10G; about 55w trace IDs.

Afternoon

Average processing time: >1min. Data volume is around 20G; about 100w trace IDs.

Peak

Already game over: data volume about 30G, about 170w trace IDs.

Status analysis

At this phase Redis time is no longer the problem; the problem is that processing a 20G batch already takes too long, close to the 1-minute batch interval. So we need to tune the Spark parameters again.

After a whole day of tuning we found that parameter adjustment alone is not enough: with only 8 machines of 12 CPUs each, any combination of num-executors and executor-cores has already hit the machines' compute ceiling, yet processing time is still over 1min. So parameters cannot solve it. We can only improve from the following 3 directions:

1. Raise the write QPS of HBase: storing 170w trace IDs / 30G of data in one batch is very large, taking about 15s.
2. Add machines to the Spark-on-YARN cluster to increase the CPU count and compute capacity.
3. Code-level optimization: see whether there is room to optimize.
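The parallelism ceiling described above can be made concrete (a sketch; the cap ignores any YARN vcore overcommit):

```python
TOTAL_CORES = 8 * 12   # 8 machines x 12 CPUs = 96 physical cores

def effective_parallelism(num_executors, executor_cores):
    """Concurrent task slots actually backed by physical cores."""
    return min(num_executors * executor_cores, TOTAL_CORES)

# Raising the parameters past the hardware buys nothing:
print(effective_parallelism(24, 4))   # 96
print(effective_parallelism(56, 4))   # still 96
```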

Since the first two require third parties to resolve, we will optimize from the code level first.
