Thanks to DT Big Data DreamWorks for providing technical support; DT Big Data DreamWorks specializes in Spark release customization.
Overview of this issue:
Reflections on the whole life cycle of data receiving
In a big data processing framework, performance comes first; everything else is secondary. Because the data volume is so large, one careless redundant operation can cost minutes, or even tens of minutes.
Following general architecture design principles, receiving data and storing data are handled by different objects.
The Spark Streaming data-receiving life cycle can be seen as an MVC pattern: ReceiverSupervisor plays the role of the Controller (C), and Receiver plays the View (V).
The first component to start is the ReceiverTracker: it opens its message-communication endpoint and starts the receiver execution thread.
Start a receiver along with its scheduled executors.
Get the receivers from the ReceiverInputDStreams, distributes them to the
* worker nodes as a parallel collection, and runs them.
It is important to note that a Receiver must be serializable, since it is shipped to the worker nodes for execution.
The main code for message communication between ReceiverSupervisor and ReceiverTracker is as follows:
/** Divides received data records into data blocks and pushes them into BlockManager. */
Note that the supervisor's onStart() method is called before the receiver's onStart() method, because the receiver's onStart() uses objects, such as the BlockGenerator, that are initialized by the supervisor's onStart().
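This ordering can be shown with a small, self-contained sketch. The class below is illustrative only, not Spark's actual ReceiverSupervisor; the method names merely mimic the real ones, and the callOrder buffer exists purely to make the ordering observable.

```scala
// Self-contained sketch of the start-up ordering inside a supervisor's
// start(): the supervisor's onStart() runs first, so the BlockGenerator
// already exists by the time the receiver's onStart() needs it.
// (Illustrative stand-in, not Spark's actual implementation.)
class SupervisorSketch {
  val callOrder = scala.collection.mutable.ArrayBuffer[String]()

  // Would create the BlockGenerator and related machinery.
  private def onStart(): Unit = callOrder += "supervisor.onStart"

  // Would invoke receiver.onStart(), which relies on the BlockGenerator.
  private def startReceiver(): Unit = callOrder += "receiver.onStart"

  def start(): Unit = {
    onStart()       // initialize first...
    startReceiver() // ...then let the receiver use what was initialized
  }
}
```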
* Note: Do not create BlockGenerator instances directly inside receivers. Use
* `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
This vividly illustrates that one BlockGenerator serves exactly one DStream.
Receiving data in a Receiver should be non-blocking, so a separate thread should be started to do the actual receiving.
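A minimal, self-contained sketch of this pattern follows. The base class is only a stand-in for Spark's actual Receiver class; store() and isStopped() mimic its real methods, and the join() at the end exists only to make the sketch deterministic (a real onStart() would return immediately after starting the thread).

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Stand-in for org.apache.spark.streaming.receiver.Receiver (assumed shape).
abstract class ReceiverSketch {
  @volatile private var stopped = false
  val stored = new ConcurrentLinkedQueue[String]()
  def store(record: String): Unit = stored.add(record) // hand record to the supervisor
  def isStopped(): Boolean = stopped
  def stop(): Unit = stopped = true
  def onStart(): Unit
}

class NonBlockingReceiver extends ReceiverSketch {
  override def onStart(): Unit = {
    // The blocking receive work runs on its own thread so that
    // onStart() itself does not block the caller.
    val t = new Thread("receive-loop") {
      override def run(): Unit = {
        var i = 0
        while (!isStopped() && i < 3) { // a real loop would run until stopped
          store(s"record-$i")
          i += 1
        }
      }
    }
    t.setDaemon(true)
    t.start()
    t.join() // only for determinism in this sketch; normally omitted
  }
}
```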
By default, a block is generated every 200 milliseconds. There is a best practice for performance tuning in production: spark.streaming.blockInterval is best not set below 50 milliseconds, because in general the resulting fragments and small files become too numerous, and too many handles consume memory or disk space, degrading performance. Of course, the optimal interval for merging data into a block varies with the data flow rate, so it must be analyzed case by case. In principle, the resulting block size should balance processing speed against the number of handles.
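As a tuning sketch (the application name and interval value below are illustrative assumptions, not recommendations from the text), the block interval can be set on the SparkConf before the StreamingContext is created:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BlockIntervalTuning") // hypothetical app name
  // Generate a block every 100 ms instead of the 200 ms default;
  // as noted above, values below ~50 ms tend to create too many
  // small blocks and handles.
  .set("spark.streaming.blockInterval", "100ms")

val ssc = new StreamingContext(conf, Seconds(2))
// With a 2 s batch and a 100 ms block interval, each batch holds
// roughly 20 blocks per receiver, i.e. about 20 tasks per batch.
```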
Completed blocks are pushed onward (toward BlockManager) by a thread that polls the block queue every 10 milliseconds.
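The polling loop can be sketched with a plain blocking queue. This is a self-contained simulation of the idea, not Spark's BlockGenerator code: the 10 ms poll timeout lets the thread notice promptly when the generator has stopped and the queue is drained.

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch of a block-pushing loop (assumed shape, not
// Spark's actual implementation): poll the queue of completed blocks
// with a 10 ms timeout and "push" each one that appears.
object PushLoopSketch {
  def drain(queue: ArrayBlockingQueue[String], stopWhenEmpty: => Boolean): Seq[String] = {
    val pushed = ArrayBuffer[String]()
    var done = false
    while (!done) {
      val block = queue.poll(10, TimeUnit.MILLISECONDS) // wait at most 10 ms
      if (block != null) pushed += block                // push to "BlockManager"
      else if (stopWhenEmpty) done = true               // stopped and drained: exit
    }
    pushed.toSeq
  }
}
```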
The ReceiverTracker then sends a message to start all the receivers:
/**
 * Start a receiver along with its scheduled executors.
 */
private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  def shouldStartReceiver: Boolean = {
    // It's okay to start when trackerState is Initialized or Started
    !(isTrackerStopping || isTrackerStopped)
  }

  val receiverId = receiver.streamId
  if (!shouldStartReceiver) {
    onReceiverJobFinish(receiverId)
    return
  }

  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // Function to start the receiver on the worker node
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }
Spark Release Notes 10