Spark Release Notes 9

Last Update: 2016-05-21 Source: Internet

Author: User

Thanks to DT Big Data DreamWorks for its technical support. DT Big Data DreamWorks specializes in customizing Spark distributions.

Overview of this issue:

1 The full life cycle of a Receiver

First, we locate the entry point of the data source. The entry point is as follows

Receiver is ingeniously designed, and much of that design is worth studying closely.

Before looking at Receiver itself, consider a thought experiment: if Spark did not exist and we had to build a receiver that continuously accepts incoming data, how would we do it? In particular, how would we start the receiver?

We can approach the question from the following directions.

The approach is shown below

In this approach, the receivers are started as part of the application launch, and each receiver corresponds one-to-one to an input stream. If we start multiple receivers inside one job, it does not matter that the RDD has more than one partition. There is a problem, though. From a resource scheduling point of view, several receivers may be started on the same machine, which leads to unbalanced load and may also cause receivers to fail to start. Because the different partitions of the RDD correspond to different tasks, possibly on different machines, an executor failure on any of those machines can cause the task to fail.

What we require is that, as long as the cluster is running, the receivers run normally. If one receiver that does not work properly could prevent the whole job from executing, that would be unacceptable.

Therefore, neither of our assumptions is workable. The workable approach is that a receiver may fail without affecting the normal operation of the job. A receiver failure is handled by fault tolerance, and the receiver eventually runs successfully. Let us see how Spark itself achieves this receiver fault tolerance.

In fact, we can take it that input streams and receivers correspond one-to-one.

However, this can result in unbalanced load, because the receivers run on different machines. In addition, starting a receiver may fail.

So far, we still have not seen the code that starts the receivers. Where is it?

Next comes the method that starts a receiver.

This code further confirms that each receiver has exactly one input stream corresponding to it.

The driver decides on which executor each receiver runs

Terminating a receiver does not require restarting a job

A receiver start is not retried at the task level

To start a receiver, a Spark job is launched
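The point that a receiver start is not retried can be modelled with a small sketch. This is a toy model in plain Python, not Spark code: `start_receiver_task`, `attempt_number` and `started` are hypothetical names standing in for the task that starts a receiver and the attempt counter the scheduler hands to it. The task starts the receiver only on its first attempt. On a scheduler-driven retry it returns at once, which leaves the decision of where to restart the receiver to the driver.

```python
def start_receiver_task(receiver_name, attempt_number, started):
    """Toy model of the task that starts a receiver (hypothetical names).

    Only the first attempt starts the receiver. A retry issued by the task
    scheduler exits immediately, so that the driver can reschedule the
    receiver itself, possibly on another executor.
    """
    if attempt_number == 0:
        started.append(receiver_name)
        return "started"
    return "skipped"


started = []
first = start_receiver_task("receiver-0", 0, started)
retry = start_receiver_task("receiver-0", 1, started)
print(first, retry, started)
```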

The following question is important:

Is one job started for all the receivers, or does each receiver get its own job? The code loops over the receivers and starts them one by one, and each receiver gets its own job.

This removes the drawbacks of starting all the receivers as tasks of a single job: each receiver corresponds to one job, which in turn has one task. Load imbalance is minimized, and the failure of one receiver does not bring down the entire job. The design also helps against task skew.
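The difference between the two designs can be sketched with a toy model. The code below is plain Python and does not use any Spark API; `run_job`, `healthy`, `broken` and the receiver names are assumptions made for illustration. A "job" here simply fails as soon as one of its tasks raises, which is enough to show why one job per receiver isolates failures.

```python
def run_job(tasks):
    """Run every task of one job; the job fails as soon as any task fails."""
    for task in tasks:
        task()  # an exception here aborts the whole job


def healthy():
    return None


def broken():
    raise RuntimeError("receiver could not start")


# Hypothetical receivers: r1 cannot start, the others can.
receivers = {"r0": healthy, "r1": broken, "r2": healthy}

# Design 1: a single job whose tasks are all the receivers.
try:
    run_job(list(receivers.values()))
    single_job_ok = True
except RuntimeError:
    single_job_ok = False

# Design 2: loop over the receivers and start one job per receiver.
per_receiver_ok = {}
for name, start in receivers.items():
    try:
        run_job([start])
        per_receiver_ok[name] = True
    except RuntimeError:
        per_receiver_ok[name] = False

print(single_job_ok)
print(per_receiver_ok)
```

In the first design the failure of `r1` takes the whole job with it. In the second only the job that belongs to `r1` fails, and `r0` and `r2` are unaffected.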

When a receiver is restarted, executors that are no longer usable are excluded.

This is an excellent design that ensures a receiver can be started successfully in any case.

Once the task fails, the framework triggers RestartReceiver. The design can fairly be called flawless.
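The restart behaviour can be illustrated with a minimal sketch, again in plain Python rather than Spark code. `start_with_restart`, `FlakyReceiverJob` and the `max_restarts` bound are hypothetical; the bound exists only so that the demo terminates. The loop resubmits the receiver job each time it fails, and the `continue` stands in for the restart request that the framework would issue.

```python
def start_with_restart(start_job, max_restarts):
    """Keep resubmitting the receiver job until it succeeds (toy model).

    `max_restarts` only bounds this demo; it is an assumption of the sketch.
    Returns the number of submissions that were needed.
    """
    for submission in range(1, max_restarts + 2):
        try:
            start_job()
            return submission
        except RuntimeError:
            continue  # stands in for a restart request
    raise RuntimeError("receiver never started")


class FlakyReceiverJob:
    """Hypothetical job that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("start failed")


job = FlakyReceiverJob(failures=2)
submissions = start_with_restart(job, max_restarts=5)
print(submissions)
```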

A thread pool is used to start the receivers concurrently, because the data received by different receivers is independent of each other.
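As an illustration of starting receivers concurrently, here is a short sketch built on Python's standard `concurrent.futures` module. It is not the Spark implementation: `start_receiver`, the stream ids and the pool size of 4 are assumptions chosen for the example. Since the receivers share no data, each start can be submitted to the pool independently.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def start_receiver(stream_id, received, lock):
    """Hypothetical receiver start: records which stream it serves."""
    with lock:  # the shared list is only used to observe the result
        received.append(stream_id)
    return stream_id


received = []
lock = threading.Lock()
stream_ids = [0, 1, 2, 3]

# Submit one start per receiver; the pool runs them concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(start_receiver, sid, received, lock)
               for sid in stream_ids]
    results = [f.result() for f in futures]

print(sorted(received))
```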

At this point one doubt remains unresolved: how is it decided which machine a given receiver runs on? The code is as follows

The last line of the code ensures that the executors are alive (by default 50 threads, with a concurrency of 20). For a Spark Streaming application, there are unlikely to be more than 50 data sources.
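To make the goal of balanced placement concrete, the following sketch assigns each receiver to the executor that currently holds the fewest receivers. It is a simplified stand-in and not Spark's actual scheduling policy; `place_receivers` and the executor names are hypothetical. It also assumes that the list of live executors is already known, which is the condition the check described above is meant to establish.

```python
def place_receivers(receiver_ids, executors):
    """Assign each receiver to the currently least-loaded executor."""
    load = {executor: 0 for executor in executors}
    placement = {}
    for rid in receiver_ids:
        # On a tie, min() keeps the first executor in list order.
        target = min(executors, key=lambda e: load[e])
        placement[rid] = target
        load[target] += 1
    return placement, load


placement, load = place_receivers(
    [0, 1, 2, 3, 4], ["exec-a", "exec-b", "exec-c"]
)
print(placement)
print(load)
```

With five receivers and three executors, no executor ends up with more than one receiver above any other.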

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
