Spark Release Notes 10


Thanks to DT Big Data DreamWorks for technical support; DT Big Data DreamWorks specializes in Spark release customization.

Overview of this issue:

A reflection on the full lifecycle of data receiving

In a big data processing framework, performance comes first; everything else is secondary. With large data volumes, one careless redundant operation can cost several minutes, or more than ten.

Following general architecture design principles, receiving data and storing data are handled by different objects.

The data reception lifecycle in Spark Streaming can be viewed as an MVC pattern: ReceiverSupervisor plays the Controller (C), and Receiver plays the View (V).

The first component to start is ReceiverTracker.

ReceiverTracker first starts its message communication endpoint and then launches the receiver execution thread. The relevant doc comment in the Spark source reads:

  /**
   * Get the receivers from the ReceiverInputDStreams, distributes them to the
   * worker nodes as a parallel collection, and runs them.
   */

It is important to note that a Receiver must be serializable, since the driver ships it to the workers for execution.
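To make the serializability requirement concrete, here is a minimal, hedged sketch (plain JDK serialization, no Spark dependency; `DemoReceiver` is a made-up stand-in, not the real `Receiver[_]` class). The key idea: the object itself must survive a serialization round trip, while non-serializable runtime state (sockets, buffers) is marked `@transient` and rebuilt on the worker in `onStart()`.

```scala
import java.io._

// A minimal stand-in for a receiver: it must be Serializable because the
// driver ships it to a worker inside a task closure.
class DemoReceiver(val streamId: Int) extends Serializable {
  // A live socket or buffer would NOT be serializable; mark such fields
  // @transient and create them in onStart() on the worker instead.
  @transient var buffer: StringBuilder = _
  def onStart(): Unit = { buffer = new StringBuilder }
}

object SerializableReceiverDemo {
  // Serialize to bytes and back, mimicking the driver -> worker hop.
  def roundTrip(r: DemoReceiver): DemoReceiver = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(r); out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[DemoReceiver]
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new DemoReceiver(streamId = 0))
    copy.onStart()          // transient state is rebuilt on the "worker" side
    println(copy.streamId)  // prints 0
  }
}
```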

It is also important to note the main code for message communication between ReceiverSupervisor and ReceiverTracker, which is shown below.

/** Divides received data records into data blocks for pushing in BlockManager. */

Note that the supervisor's own onStart() method is called before the receiver's onStart() method, because the receiver's onStart() relies on state, such as the BlockGenerator, that the supervisor's onStart() initializes.
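The ordering constraint can be sketched without Spark as follows (class and method names mirror the text, not the real Spark signatures; `DemoBlockGenerator` and `DemoSupervisor` are illustrative):

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative helper: must exist and be started before the receiver runs.
class DemoBlockGenerator { var started = false; def start(): Unit = started = true }

class DemoSupervisor(events: ArrayBuffer[String]) {
  var blockGenerator: DemoBlockGenerator = _

  def onStart(): Unit = {        // supervisor side: initialize helpers first
    blockGenerator = new DemoBlockGenerator
    blockGenerator.start()
    events += "supervisor.onStart"
  }

  def startReceiver(): Unit = {  // receiver side: may rely on those helpers
    require(blockGenerator != null && blockGenerator.started)
    events += "receiver.onStart"
  }

  // The ordering matters: swap the two calls and startReceiver() would fail.
  def start(): Unit = { onStart(); startReceiver() }
}

object SupervisorOrderDemo {
  def main(args: Array[String]): Unit = {
    val events = ArrayBuffer[String]()
    new DemoSupervisor(events).start()
    println(events.mkString(" -> "))  // prints supervisor.onStart -> receiver.onStart
  }
}
```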

  /**
   * Note: Do not create BlockGenerator instances directly inside receivers. Use
   * `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
   */

This vividly illustrates that one BlockGenerator serves exactly one DStream.

Receiving data in a Receiver should be non-blocking, so a separate thread should be started to run the receive loop.
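A minimal sketch of that non-blocking pattern, without the Spark dependency (the class and method names are illustrative, not the Spark API): onStart() must return quickly, so the actual receive loop runs on its own daemon thread.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Sketch: onStart() only spawns the receive thread and returns immediately;
// the loop itself runs in the background until onStop() flips the flag.
class ThreadedReceiver {
  private val store = new LinkedBlockingQueue[String]()
  @volatile private var running = false

  def onStart(): Unit = {
    running = true
    val t = new Thread("receiver-loop") {
      override def run(): Unit = while (running) {
        store.put("record")   // stand-in for reading from a socket or queue
        Thread.sleep(10)
      }
    }
    t.setDaemon(true)
    t.start()                 // onStart() returns without blocking
  }

  def onStop(): Unit = running = false
  def received: Int = store.size
}

object ThreadedReceiverDemo {
  def main(args: Array[String]): Unit = {
    val r = new ThreadedReceiver
    r.onStart()               // returns at once; records accumulate in background
    Thread.sleep(100)
    r.onStop()
    println(r.received > 0)
  }
}
```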

By default, a block is generated every 200 milliseconds. A best practice from production: when tuning, spark.streaming.blockInterval should not be set below 50 milliseconds, because that generally produces too many small block fragments, and the excessive number of handles consumes memory or disk space and degrades performance. The optimal interval depends on the data flow rate, so how much data to merge into one block must be analyzed case by case; in principle, the resulting block size should balance processing speed against the number of handles.

A separate thread pushes the generated blocks out to storage, polling the block queue every 10 milliseconds.
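The two timers described above can be sketched as a small producer/consumer pair (a simplified model, not the real BlockGenerator; queue sizes and names are illustrative): one thread seals the current buffer into a block every block interval, another polls the block queue on a short timeout and "pushes" each block.

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

class MiniBlockGenerator(blockIntervalMs: Long) {
  private var buffer = new ArrayBuffer[String]()
  private val blocksForPushing = new ArrayBlockingQueue[Seq[String]](10)
  val pushed = new ArrayBlockingQueue[Seq[String]](10)  // stand-in for BlockManager
  @volatile private var running = true

  def addData(record: String): Unit = synchronized { buffer += record }

  // Seal the current buffer into a block (runs every blockIntervalMs).
  private def updateCurrentBuffer(): Unit = synchronized {
    if (buffer.nonEmpty) {
      blocksForPushing.put(buffer.toSeq)
      buffer = new ArrayBuffer[String]()
    }
  }

  private val blockIntervalTimer = new Thread {
    override def run(): Unit =
      while (running) { Thread.sleep(blockIntervalMs); updateCurrentBuffer() }
  }

  private val blockPushingThread = new Thread {
    override def run(): Unit =
      while (running) {
        // Poll with a ~10 ms timeout, mirroring the pushing loop in the text.
        val block = blocksForPushing.poll(10, TimeUnit.MILLISECONDS)
        if (block != null) pushed.put(block)
      }
  }

  def start(): Unit = { blockIntervalTimer.start(); blockPushingThread.start() }
  def stop(): Unit = running = false
}
```

Decoupling sealing from pushing this way means a slow store never blocks record ingestion, only the queue of sealed blocks.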

A StartAllReceivers message is then sent to start all the receivers.

/**
 * Start a receiver along with its scheduled executors
 */

private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  def shouldStartReceiver: Boolean = {
    // It's okay to start when trackerState is Initialized or Started
    !(isTrackerStopping || isTrackerStopped)
  }

  val receiverId = receiver.streamId
  if (!shouldStartReceiver) {
    onReceiverJobFinish(receiverId)
    return
  }

  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // Function to start the receiver on the worker node
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(!iterator.hasNext)
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }
  // ... (the remainder of startReceiver builds the receiver RDD and submits it as a Spark job)
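The attemptNumber() check in the excerpt is worth isolating: only the first task attempt actually starts the supervisor; if the TaskScheduler retries the task, the function exits immediately so that ReceiverTracker itself can reschedule the receiver. A minimal sketch of that guard (names and return values are illustrative, not the Spark API):

```scala
object AttemptGuardDemo {
  // First attempt runs the supervisor; any retry exits so the tracker can
  // reschedule the receiver on a freshly chosen executor.
  def runOnWorker(attemptNumber: Int, startSupervisor: () => String): String =
    if (attemptNumber == 0) startSupervisor() else "exit-for-reschedule"

  def main(args: Array[String]): Unit = {
    println(runOnWorker(0, () => "supervisor-started"))  // prints supervisor-started
    println(runOnWorker(1, () => "supervisor-started"))  // prints exit-for-reschedule
  }
}
```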

