Spark Release Notes 9

Source: Internet
Author: User

Thanks to DT Big Data DreamWorks for its technical support. DT Big Data DreamWorks specializes in customized Spark releases.

Overview of this issue:

1. The full life cycle of a Receiver

First, we find the entry point for the data source.


The Receiver is ingeniously designed, and many parts of it are worth careful study.

Before looking at how Spark implements the Receiver, consider how we would do it ourselves without Spark. A receiver has to accept incoming data continuously. How would we build that, and how would we start the receiver?

Let us approach the question from the following directions.

Receivers are started as part of application launch, and each receiver corresponds one to one with an InputDStream. If we start multiple receivers, it does not matter whether a partition holds more than one of them. The problem lies in resource scheduling: several receivers may be started on the same machine, which unbalances the load, and a receiver may also fail to start. Because different partitions of the RDD correspond to different tasks, an executor failure on any one machine can cause a task to fail.

What we require is that the receivers keep running for as long as the cluster is running. If one misbehaving receiver could keep the whole job from executing, that would be unacceptable.

Therefore, neither of our assumptions is feasible. The feasible approach is to let a receiver fail without affecting the normal operation of the job: a receiver failure is handled by fault tolerance, and the receiver eventually runs successfully. Let us see how Spark itself achieves this receiver fault tolerance.

In effect, InputDStreams and receivers correspond one to one.

However, this can still lead to an unbalanced load, because receivers run on different machines. In addition, starting a receiver may fail.

So far we have not seen the code that starts the receivers. Where is it?


Next comes the method that starts a receiver.

This code further confirms that each receiver has exactly one InputDStream corresponding to it.


The driver decides which executor each receiver runs on.
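To make the driver-side decision concrete, here is a minimal sketch of balanced placement in plain Python. It is not Spark's scheduling code: the function name `schedule_receivers`, the executor names, and the least-loaded rule are all assumptions chosen for illustration.

```python
from collections import Counter


def schedule_receivers(receiver_ids, executors):
    """Assign each receiver to the currently least-loaded executor.

    Hypothetical stand-in for the driver-side decision, not Spark's policy.
    """
    if not executors:
        raise ValueError("no executors available")
    load = Counter({executor: 0 for executor in executors})
    placement = {}
    for receiver_id in receiver_ids:
        # min() returns the first executor on ties, so the result is deterministic.
        target = min(executors, key=lambda executor: load[executor])
        placement[receiver_id] = target
        load[target] += 1
    return placement


# Five receivers over three executors: no executor gets more than two.
placement = schedule_receivers([0, 1, 2, 3, 4], ["exec-a", "exec-b", "exec-c"])
```

Because the decision is made in one place, the driver can see the whole picture and avoid piling several receivers onto the same machine.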


If a receiver is terminated deliberately, its job does not need to be restarted.

The task that starts a receiver is not retried.
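One way such a no-retry rule could be enforced is to look at the task's attempt number and only start the receiver on the first attempt. The sketch below assumes that mechanism; `start_receiver_task` and its parameters are hypothetical names, not Spark APIs.

```python
def start_receiver_task(attempt_number, start_fn):
    """Start the receiver only on the first attempt of the task.

    attempt_number and start_fn are hypothetical parameters for this sketch.
    Returns True if the receiver was started, False if the attempt was skipped.
    """
    if attempt_number == 0:
        start_fn()
        return True
    # A retried task exits immediately, leaving rescheduling to the framework.
    return False


started = []
first = start_receiver_task(0, lambda: started.append("receiver-0"))
retry = start_receiver_task(1, lambda: started.append("receiver-0"))
```

Skipping the retry keeps the decision about where to run the receiver next with the component that has the full view of the cluster, instead of rerunning it blindly.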

To start a receiver, a Spark job is submitted.


The following question is important:

Is a single job started for all receivers, or does each receiver get its own job? The code loops over the receivers and starts one job per receiver.

This removes the drawbacks of starting every receiver as a task of one shared job: each receiver corresponds to one job, which in turn has one task. Load imbalance is minimized, and a receiver failure no longer brings down the whole job. It also helps with task skew.
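The difference between the two approaches can be illustrated with a small model in plain Python. Everything here is hypothetical: the receiver names, the `RuntimeError` used to simulate a failed start, and the two helper functions stand in for "one shared job" and "one job per receiver".

```python
def start_ok():
    return "running"


def start_broken():
    # Simulated start failure, for illustration only.
    raise RuntimeError("no free slot")


receivers = {"socket": start_ok, "kafka": start_broken, "flume": start_ok}


def run_as_one_job(receivers):
    """All receivers are tasks of one job: the first failure fails everything."""
    return {name: start() for name, start in receivers.items()}


def run_one_job_each(receivers):
    """One job per receiver: a failure is recorded and the others carry on."""
    status = {}
    for name, start in receivers.items():
        try:
            status[name] = start()
        except RuntimeError as err:
            status[name] = f"failed: {err}"
    return status


isolated = run_one_job_each(receivers)
```

With one job per receiver, `isolated` still reports the healthy receivers as running, while `run_as_one_job(receivers)` raises as soon as one receiver fails.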

When a receiver is restarted, the executor it was previously running on is dropped from consideration.

This is an excellent design: it ensures that a receiver is eventually started successfully, whatever happens.

Once the task fails, the framework issues a RestartReceiver request to start the receiver again. The design can fairly be called perfect.
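The restart-on-failure behaviour can be sketched as a loop that keeps starting the receiver until it comes up or the tracker is stopped. This is a simplified model, not Spark code: `run_until_started`, `FlakyStart`, and the `max_restarts` cap (added only so the sketch always terminates) are assumptions.

```python
def run_until_started(start_fn, tracker_stopped, max_restarts=50):
    """Keep restarting a receiver until it starts or the tracker is stopped.

    Returns the number of restarts that were needed, or None if the tracker
    was already stopped. max_restarts is a safety cap for this sketch only.
    """
    restarts = 0
    while not tracker_stopped():
        try:
            start_fn()
            return restarts
        except RuntimeError:
            if restarts == max_restarts:
                raise
            restarts += 1
    return None


class FlakyStart:
    """Hypothetical receiver start that fails a fixed number of times."""

    def __init__(self, failures):
        self.remaining = failures

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise RuntimeError("receiver task failed")


restarts_needed = run_until_started(FlakyStart(failures=3), tracker_stopped=lambda: False)
```

The job that hosts the receiver may fail, but the loop around it does not give up, which is what lets a receiver failure stay invisible to the rest of the application.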

A thread pool is used to start the receivers concurrently, because the data received by different receivers is not necessarily coupled.
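As a rough illustration of starting independent receivers concurrently, the sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`. The helper `start_all`, the receiver names, and the pool size are hypothetical and do not reflect Spark's own thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def start_all(receiver_names, start_fn, max_workers=4):
    """Start every receiver on a shared thread pool and collect the results.

    Each start is independent, so none of them waits for another to finish.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(start_fn, name) for name in receiver_names}
        # result() blocks until that particular start has completed.
        return {name: future.result() for name, future in futures.items()}


results = start_all(["socket", "kafka", "flume"], lambda name: f"{name}: started")
```

Since the receivers share no data, running their start-up in parallel costs nothing in correctness and shortens the time until all of them are up.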

One question remains unresolved: how is the specific machine that each receiver runs on determined?

The last line of the relevant code makes sure that the executors are alive (by default 50 threads with a concurrency of 20). A Spark Streaming application is unlikely to have more than 50 data sources.
