In-Depth Analysis of Dynamic Batch Size and the RateController in Spark Streaming


Contents of this issue:

    • batchDuration and processing time
    • Dynamic batch size

Spark Streaming has many operators. Are there operators whose time consumption can be expected to scale roughly linearly with data volume?

For example, do a join operation and an ordinary map operation show the same linear pattern as the data grows? If not, simply increasing the batchDuration as the data volume grows will not solve the problem: the data volume is one factor, but the operators used in the computation are another.
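To make the question concrete, the scaling of two operators can be measured directly. The following is a minimal illustrative sketch (our own example, not from the original material: local mode, synthetic integer pairs, unique join keys) that times a map and a join over growing inputs; the join's shuffle generally makes it scale worse than the map:

    import org.apache.spark.sql.SparkSession

    // Sketch: time a plain map and a join as the input grows,
    // to see whether both follow a linear law.
    object OperatorScaling {
      def timeMs[T](body: => T): Long = {
        val t0 = System.nanoTime()
        body
        (System.nanoTime() - t0) / 1000000
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("operator-scaling").getOrCreate()
        val sc = spark.sparkContext
        for (n <- Seq(100000, 1000000, 10000000)) {
          val a = sc.parallelize(1 to n).map(i => (i, i)).cache()
          val b = sc.parallelize(1 to n).map(i => (i, i * 2)).cache()
          a.count(); b.count() // materialize first, so only the operator is timed
          val mapMs  = timeMs(a.mapValues(_ + 1).count())
          val joinMs = timeMs(a.join(b).count())
          println(s"n=$n map=${mapMs}ms join=${joinMs}ms")
        }
        spark.stop()
      }
    }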

  

Using batch size to adapt our stream-processing applications:

As online stream-processing applications become more and more important and data flows grow larger, a single traditional machine can no longer both hold the incoming data and process it in time, so the processing has to be distributed. The input to a distributed stream processor is, say, the data that flows in during one second: a single machine can neither hold it nor process it promptly. That is what makes it big data, to say nothing of real-time or online processing. The most important question when processing this data is how different operators and workloads affect our processing time, and whether that effect matches our expectations.
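Before looking at those effects, a minimal Spark Streaming skeleton shows where the batchDuration enters the picture (a sketch; the socket source, host, port, and the 10-second interval are placeholder choices):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MinimalStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("minimal-stream")
        // batchDuration: every 10 seconds the data received so far
        // becomes one batch and is submitted to the cluster as a job.
        val ssc = new StreamingContext(conf, Seconds(10))

        // Placeholder source: lines from a TCP socket.
        ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }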

There are many computation frameworks, and they share a common trait: they apply the MapReduce idea to a continuous stream of data in order to process what is received. MapReduce is an idea; both Hadoop and Spark are implementations of that idea. One very strong aspect of MapReduce implementations is fault tolerance: each comes with its own complete fault-tolerance mechanism. When a stream processor handles data online, the MapReduce fault-tolerance mechanism lets it recover quickly from failures.
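In Spark Streaming this fault tolerance shows up concretely as checkpointing: metadata and generated state are persisted so a failed driver can be rebuilt. A minimal sketch (the checkpoint path and source are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointedStream {
      // Placeholder checkpoint location; normally HDFS/S3 in production.
      val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("checkpointed-stream")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)
        ssc.socketTextStream("localhost", 9999).count().print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Recover the context from the checkpoint if one exists,
        // otherwise build it from scratch.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }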

Building a stable processing pipeline involves many dimensions, such as real-time requirements and peak load. For example, a system that normally processes 1 GB of data per second may suddenly face a peak that requires processing 100 TB. How should the streaming system as a whole deal with this situation?

Earlier stream-processing systems took one of two approaches. One kind of framework dynamically adjusts resources such as memory and CPU. The other discards part of the data when it cannot keep up, which raises the question of how data integrity is guaranteed, i.e. how to ensure every record is eventually processed. For genuine bursts, directly adjusting memory, CPU, and other resources is expensive and hard to get right.

The processing model of Spark Streaming treats the data of each batchDuration as one batch, puts the batches into a queue, and processes them continuously:

Each batch of Spark Streaming data is placed in the queue and then processed one by one on the cluster. Both the data itself and the metadata are queued; a job obtains information from the queue that controls its entire execution. As the data scale grows larger and larger, it is not enough to simply add memory, CPU, and other hardware resources, because in many cases the behavior does not change according to a linear law.

What causes the delay in processing a batch of data:

01. Receiving the data and putting it into the batch queue (so the batch size strongly affects this delay)

02. Waiting time (the scheduling delay while the batch sits in the queue)

03. Processing time
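A stream is stable only when the waiting time stays near zero and the processing time stays below the batchDuration. All three components can be observed per batch with a StreamingListener; a minimal sketch (class name and logging format are ours):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Logs the delay components of every completed batch:
    // schedulingDelay = time the batch waited in the queue,
    // processingDelay = time spent actually running the jobs,
    // totalDelay      = schedulingDelay + processingDelay.
    class DelayLogger extends StreamingListener {
      override def onBatchCompleted(
          batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        println(
          s"batch=${info.batchTime} " +
          s"records=${info.numRecords} " +
          s"scheduling=${info.schedulingDelay.getOrElse(-1L)}ms " +
          s"processing=${info.processingDelay.getOrElse(-1L)}ms " +
          s"total=${info.totalDelay.getOrElse(-1L)}ms")
      }
    }

    // Registered on the StreamingContext before ssc.start():
    //   ssc.addStreamingListener(new DelayLogger)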

 Static processing model:

  

The dotted line in the figure is the safety zone: below it, the data flowing in can be digested within one batchDuration. Comparing a reduce with a join, different operators follow different scaling laws; processing time does not simply grow linearly with the amount of data, and many other factors affect stream processing.

The usual practice is to pick a batchDuration for the stream, configure a few parameters directly (say, one batch every 10 s), and then just let it process; this is not advisable. As can be seen, reality does not work this way once the amount of data changes. A setup that runs well at the original data volume is typically evaluated by linear extrapolation: for example, if a single node processes 100 MB of data per second, a linear estimate is used to predict behavior at 500 MB, and the static parameters are then set accordingly for the existing hardware resources (memory, CPU, network). Such an evaluation is inaccurate and hard to trust, because the cost of consuming the data is difficult to predict across different workloads.

When the data rate changes, a statically configured system becomes unstable. If changing the batch size can keep the system relatively stable, then what needs to be designed is an algorithm or implementation that tunes the batch size instead of adjusting memory, CPU, and other hardware resources: when the batch is small, or shrinks to an appropriate size, this should be the better idea, giving low latency, flexibility, generality, and simplicity.
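In released versions of Spark Streaming the batchDuration itself stays fixed; what the framework adapts dynamically is the ingestion rate, driven by the RateController when backpressure is enabled. A sketch of the relevant settings (the numeric values are illustrative only):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("backpressure-demo")
      // Let the RateController adapt the receiving rate to what the
      // cluster actually managed to process in recent batches.
      .set("spark.streaming.backpressure.enabled", "true")
      // Rate used for the first batch, before any statistics exist.
      .set("spark.streaming.backpressure.initialRate", "10000")
      // Hard upper bounds, still respected when backpressure is on:
      // for receiver-based sources...
      .set("spark.streaming.receiver.maxRate", "20000")
      // ...and for the direct Kafka stream, per partition.
      .set("spark.streaming.kafka.maxRatePerPartition", "2000")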

  

To keep adjusting the batch size, job statistics must be collected continuously. In the dynamic-adjustment mode the corresponding parameters are reconfigured while processing runs: before the next run, the statistics of the previous run are examined to decide whether the model needs to be adjusted. This is harder than it sounds, because some behavior is non-linear: resources you assume scale linearly may not, the data scale differs, the computation differs, and there are many unpredictable factors. Hence the need to implement dynamic adjustment of the batch size.
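This feedback loop is exactly what Spark's backpressure implements internally with a PID-based rate estimator that consults the previous batch's statistics. A deliberately simplified sketch of the idea (our illustration, not Spark's actual PIDRateEstimator):

    // Simplified proportional feedback in the spirit of Spark Streaming's
    // internal PID rate estimator: after each batch, compare the current
    // target rate with the rate the cluster actually sustained, and step
    // the target toward the sustainable value.
    class SimpleRateEstimator(var targetRatePerSec: Double,
                              gain: Double = 1.0,
                              minRatePerSec: Double = 100.0) {

      def onBatchCompleted(numRecords: Long, processingDelayMs: Long): Unit = {
        if (numRecords > 0 && processingDelayMs > 0) {
          // Records per second the cluster actually achieved on this batch.
          val achievedRate = numRecords.toDouble * 1000 / processingDelayMs
          // Positive error means the target overshoots what is sustainable.
          val error = targetRatePerSec - achievedRate
          targetRatePerSec = math.max(minRatePerSec,
                                      targetRatePerSec - gain * error)
        }
      }
    }

Fed from a listener like the DelayLogger above, the resulting targetRatePerSec would be pushed down to the receivers as their rate limit, which is precisely the RateController's job.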

  

Note:

      • Data from: Liaoliang (Spark release version customization course)
      • Sina Weibo: http://www.weibo.com/ilovepains
