Thanks to DT Big Data DreamWorks for providing technical support; DT Big Data DreamWorks specializes in Spark release customization.
Overview of this issue:
Reflections on the whole life cycle of data receiving
In a big data processing framework, performance comes first; everything else is secondary. Because the data volume is so large, one careless redundant operation can cost minutes, or even tens of minutes.
Following general architecture design principles, receiving data and storing data are handled by different objects.
The Spark Streaming data-receiving life cycle can be seen as an MVC pattern: ReceiverSupervisor plays the role of the Controller (C), and Receiver plays the View (V).
The first component to start is the ReceiverTracker: it opens its message-communication endpoint and starts the receiver execution thread.
Start a receiver along with its scheduled executors.
Get the receivers from the ReceiverInputDStreams, distributes them to the
* worker nodes as a parallel collection, and runs them.
It is important to note that a Receiver must be serializable, since it is shipped to the worker nodes for execution.
The main code for message communication between ReceiverSupervisor and ReceiverTracker is as follows:
/** Divides received data records into data blocks and pushes them into BlockManager. */
Note that the supervisor's onStart() method is called before the receiver's onStart() method, because the receiver's onStart() uses objects, such as the BlockGenerator, that are initialized by the supervisor's onStart().
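This ordering can be shown with a small, self-contained sketch. The class below is illustrative only, not Spark's actual ReceiverSupervisor; the method names merely mimic the real ones, and the callOrder buffer exists purely to make the ordering observable.

```scala
// Self-contained sketch of the start-up ordering inside a supervisor's
// start(): the supervisor's onStart() runs first, so the BlockGenerator
// already exists by the time the receiver's onStart() needs it.
// (Illustrative stand-in, not Spark's actual implementation.)
class SupervisorSketch {
  val callOrder = scala.collection.mutable.ArrayBuffer[String]()

  // Would create the BlockGenerator and related machinery.
  private def onStart(): Unit = callOrder += "supervisor.onStart"

  // Would invoke receiver.onStart(), which relies on the BlockGenerator.
  private def startReceiver(): Unit = callOrder += "receiver.onStart"

  def start(): Unit = {
    onStart()       // initialize first...
    startReceiver() // ...then let the receiver use what was initialized
  }
}
```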
* Note: Do not create BlockGenerator instances directly inside receivers. Use
* `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
This vividly illustrates that one BlockGenerator serves exactly one DStream.
Receiving data in a Receiver should be non-blocking, so a separate thread should be started to do the actual receiving.
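A minimal, self-contained sketch of this pattern follows. The base class is only a stand-in for Spark's actual Receiver class; store() and isStopped() mimic its real methods, and the join() at the end exists only to make the sketch deterministic (a real onStart() would return immediately after starting the thread).

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Stand-in for org.apache.spark.streaming.receiver.Receiver (assumed shape).
abstract class ReceiverSketch {
  @volatile private var stopped = false
  val stored = new ConcurrentLinkedQueue[String]()
  def store(record: String): Unit = stored.add(record) // hand record to the supervisor
  def isStopped(): Boolean = stopped
  def stop(): Unit = stopped = true
  def onStart(): Unit
}

class NonBlockingReceiver extends ReceiverSketch {
  override def onStart(): Unit = {
    // The blocking receive work runs on its own thread so that
    // onStart() itself does not block the caller.
    val t = new Thread("receive-loop") {
      override def run(): Unit = {
        var i = 0
        while (!isStopped() && i < 3) { // a real loop would run until stopped
          store(s"record-$i")
          i += 1
        }
      }
    }
    t.setDaemon(true)
    t.start()
    t.join() // only for determinism in this sketch; normally omitted
  }
}
```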
By default, a block is generated every 200 milliseconds. There is a best practice for performance tuning in production: spark.streaming.blockInterval is best not set below 50 milliseconds, because in general the resulting fragments and small files become too numerous, and too many handles consume memory or disk space, degrading performance. Of course, the optimal interval for merging data into a block varies with the data flow rate, so it must be analyzed case by case. In principle, the resulting block size should balance processing speed against the number of handles.
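As a tuning sketch (the application name and interval value below are illustrative assumptions, not recommendations from the text), the block interval can be set on the SparkConf before the StreamingContext is created:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("BlockIntervalTuning") // hypothetical app name
  // Generate a block every 100 ms instead of the 200 ms default;
  // as noted above, values below ~50 ms tend to create too many
  // small blocks and handles.
  .set("spark.streaming.blockInterval", "100ms")

val ssc = new StreamingContext(conf, Seconds(2))
// With a 2 s batch and a 100 ms block interval, each batch holds
// roughly 20 blocks per receiver, i.e. about 20 tasks per batch.
```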
Completed blocks are pushed onward (toward BlockManager) by a thread that polls the block queue every 10 milliseconds.
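The polling loop can be sketched with a plain blocking queue. This is a self-contained simulation of the idea, not Spark's BlockGenerator code: the 10 ms poll timeout lets the thread notice promptly when the generator has stopped and the queue is drained.

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch of a block-pushing loop (assumed shape, not
// Spark's actual implementation): poll the queue of completed blocks
// with a 10 ms timeout and "push" each one that appears.
object PushLoopSketch {
  def drain(queue: ArrayBlockingQueue[String], stopWhenEmpty: => Boolean): Seq[String] = {
    val pushed = ArrayBuffer[String]()
    var done = false
    while (!done) {
      val block = queue.poll(10, TimeUnit.MILLISECONDS) // wait at most 10 ms
      if (block != null) pushed += block                // push to "BlockManager"
      else if (stopWhenEmpty) done = true               // stopped and drained: exit
    }
    pushed.toSeq
  }
}
```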
The ReceiverTracker then sends a message to start all the receivers:
/**
 * Start a receiver along with its scheduled executors.
 */
private def startReceiver(
    receiver: Receiver[_],
    scheduledLocations: Seq[TaskLocation]): Unit = {
  def shouldStartReceiver: Boolean = {
    // It's okay to start when trackerState is Initialized or Started
    !(isTrackerStopping || isTrackerStopped)
  }

  val receiverId = receiver.streamId
  if (!shouldStartReceiver) {
    onReceiverJobFinish(receiverId)
    return
  }

  val checkpointDirOption = Option(ssc.checkpointDir)
  val serializableHadoopConf =
    new SerializableConfiguration(ssc.sparkContext.hadoopConfiguration)

  // Function to start the receiver on the worker node
  val startReceiverFunc: Iterator[Receiver[_]] => Unit =
    (iterator: Iterator[Receiver[_]]) => {
      if (!iterator.hasNext) {
        throw new SparkException(
          "Could not start receiver as object not found.")
      }
      if (TaskContext.get().attemptNumber() == 0) {
        val receiver = iterator.next()
        assert(iterator.hasNext == false)
        val supervisor = new ReceiverSupervisorImpl(
          receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
        supervisor.start()
        supervisor.awaitTermination()
      } else {
        // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
      }
    }
Spark Release Notes 10