The main contents of this section:
1. The data receiving architecture and design pattern
2. Source code interpretation of data receiving
Spark Streaming receives data continuously. In a Spark Streaming application that uses receivers, the Receiver and the Driver run in different processes, and after receiving data the Receiver must keep reporting it to the Driver.
Because the Driver is responsible for scheduling, any data the Receiver has received but not reported to the Driver cannot be taken into account by the Driver's scheduler (for example, its data id and block shards never enter the scheduling system).
Think about how Spark Streaming receives data:
A loop continuously receives data, the received data is stored, and the stored data is reported to the Driver. Receiving the data and storing the data should not be handled by the same object.
The design of Spark Streaming's data receiving maps onto the MVC architecture:
V: the Driver.
M: the Receiver.
C: the ReceiverSupervisor.
Because:
The Receiver is what receives the data; for example, socketTextStream obtains data from a socket.
The ReceiverSupervisor is the controller that stores the data: the Receiver is started by the ReceiverSupervisor, and in turn the Receiver stores the data it has received through the ReceiverSupervisor.
The metadata of the stored data is then reported to the Driver side.
V is the Driver: it manipulates the metadata through metadata pointers, operates on the concrete data held on other machines via those pointers, and presents the processing results.
So:
The Spark Streaming data receiving lifecycle can be seen as an MVC pattern, with ReceiverSupervisor as the Controller (C), Receiver as the Model (M), and Driver as the View (V).
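To make the analogy concrete, here is a minimal driver-side sketch (the V role); the host, port, and batch interval are placeholder values. Behind socketTextStream sits a SocketReceiver (M), which is started and supervised on a worker by a ReceiverSupervisor (C):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReceiverMvcDemo {
  def main(args: Array[String]): Unit = {
    // Driver side (V): builds the DStream graph and schedules jobs over the reported blocks.
    val conf = new SparkConf().setAppName("ReceiverMvcDemo")
    val ssc = new StreamingContext(conf, Seconds(2))

    // socketTextStream creates a ReceiverInputDStream whose Receiver (M) is a SocketReceiver;
    // on the worker, a ReceiverSupervisor (C) starts it and stores the data it receives.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()            // ReceiverTracker.start() is eventually invoked from here
    ssc.awaitTermination()
  }
}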
Source code analysis:
1. The Receiver class:
/**
 * :: DeveloperApi ::
 * Abstract class of a receiver that can be run on worker nodes to receive external data. A
 * custom receiver can be defined by defining the functions `onStart()` and `onStop()`. `onStart()`
 * should define the setup steps necessary to start receiving data,
 * and `onStop()` should define the cleanup steps necessary to stop receiving data.
 * Exceptions while receiving can be handled either by restarting the receiver with `restart(...)`
 * or stopped completely by `stop(...)`.
 *
 * A custom receiver in Scala would look like this.
 *
 * {{{
 *  class MyReceiver(storageLevel: StorageLevel) extends NetworkReceiver[String](storageLevel) {
 *      def onStart() {
 *          // Setup stuff (start threads, open sockets, etc.) to start receiving data.
 *          // Must start new thread to receive data, as onStart() must be non-blocking.
 *
 *          // Call store(...) in those threads to store received data into Spark's memory.
 *
 *          // Call stop(...), restart(...) or reportError(...) on any thread based on how
 *          // different errors need to be handled.
 *
 *          // See corresponding method documentation for more details
 *      }
 *
 *      def onStop() {
 *          // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
 *      }
 *  }
 * }}}
 *
 * A custom receiver in Java would look like this.
 *
 * {{{
 * class MyReceiver extends Receiver<String> {
 *     public MyReceiver(StorageLevel storageLevel) {
 *         super(storageLevel);
 *     }
 *
 *     public void onStart() {
 *          // Setup stuff (start threads, open sockets, etc.) to start receiving data.
 *          // Must start new thread to receive data, as onStart() must be non-blocking.
 *
 *          // Call store(...) in those threads to store received data into Spark's memory.
 *
 *          // Call stop(...), restart(...) or reportError(...) on any thread based on how
 *          // different errors need to be handled.
 *
 *          // See corresponding method documentation for more details
 *     }
 *
 *     public void onStop() {
 *          // Cleanup stuff (stop threads, close sockets, etc.) to stop receiving data.
 *     }
 * }
 * }}}
 */
@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {
2. The ReceiverSupervisor class:
/**
 * Abstract class that is responsible for supervising a Receiver in the worker.
 * It provides all the necessary interfaces for handling the data received by the receiver.
 */
private[streaming] abstract class ReceiverSupervisor(
    receiver: Receiver[_],
    conf: SparkConf
  ) extends Logging {
The ReceiverTracker launches the receivers as jobs: each job has one task, and in each task a ReceiverSupervisor starts its Receiver. Look at the start method of ReceiverTracker:
/**
 * Manages the receivers: start, execute, restart.
 * It records all the input streams, and a receiver needs to be launched for each input stream.
 * This class manages the execution of the receivers of ReceiverInputDStreams. Instance of
 * this class must be created after all input streams have been added and StreamingContext.start()
 * has been called because it needs the final set of input streams at the time of instantiation.
 * It runs on the driver side.
 *
 * @param skipReceiverLaunch Do not launch the receiver. This is useful for testing.
 */
private[streaming]
class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging {

  private val receiverInputStreams = ssc.graph.getReceiverInputStreams()
  private val receiverInputStreamIds = receiverInputStreams.map { _.id }
  private val receivedBlockTracker = new ReceivedBlockTracker(
    ssc.sparkContext.conf,
    ssc.sparkContext.hadoopConfiguration,
    receiverInputStreamIds,
    ssc.scheduler.clock,
    ssc.isCheckpointPresent,
    Option(ssc.checkpointDir)
  )
  private val listenerBus = ssc.scheduler.listenerBus

  /** Enumeration to identify the current state of the ReceiverTracker */
  object TrackerState extends Enumeration {
    type TrackerState = Value
    val Initialized, Started, Stopping, Stopped = Value
  }
  import TrackerState._

  /** State of the tracker. Protected by "trackerStateLock" */
  @volatile private var trackerState = Initialized

  // endpoint is created when generator starts.
  // This not being null means the tracker has been started and not stopped
  private var endpoint: RpcEndpointRef = null

  private val schedulingPolicy = new ReceiverSchedulingPolicy()

  // Track the active receiver job number. When a receiver job exits ultimately, countDown()
  // will be called.
  private val receiverJobExitLatch = new CountDownLatch(receiverInputStreams.size)

  /**
   * Track all receivers' information. The key is the receiver id and the value is the receiver info.
   * It's only accessed in ReceiverTrackerEndpoint.
   */
  private val receiverTrackingInfos = new HashMap[Int, ReceiverTrackingInfo]

  /**
   * Store all preferred locations for all receivers. We need this information to schedule
   * receivers. It's only accessed in ReceiverTrackerEndpoint.
   */
  private val receiverPreferredLocations = new HashMap[Int, Option[String]]

  /** Start the endpoint and receiver execution thread. */
  def start(): Unit = synchronized {
    if (isTrackerStarted) {
      throw new SparkException("ReceiverTracker already started")
    }

    if (!receiverInputStreams.isEmpty) {
      endpoint = ssc.env.rpcEnv.setupEndpoint(
        "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
      if (!skipReceiverLaunch) launchReceivers()
      logInfo("ReceiverTracker started")
      trackerState = Started
    }
  }
To transmit an RDD to the executor side, the elements in the RDD must be serializable. Receiver therefore extends Serializable, and a custom receiver must be serializable as well.
@DeveloperApi
abstract class Receiver[T](val storageLevel: StorageLevel) extends Serializable {
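Because the receiver object itself is serialized and shipped to an executor, non-serializable resources (sockets, connections, threads) should be created inside onStart() on the worker rather than held as constructor fields. A minimal illustrative sketch (the class name LineReceiver and its parameters are hypothetical, not from the Spark source):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Illustrative custom receiver: only serializable fields (host, port) live in the constructor.
// The non-serializable Socket is created on the worker inside onStart().
class LineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // onStart() must return quickly, so do the blocking I/O on a separate thread.
    new Thread("Line Receiver") {
      override def run(): Unit = {
        val socket = new java.net.Socket(host, port)  // created here, never serialized
        val lines = scala.io.Source.fromInputStream(socket.getInputStream, "UTF-8").getLines()
        while (!isStopped() && lines.hasNext) {
          store(lines.next())  // handed to the ReceiverSupervisor for storage
        }
        socket.close()
      }
    }.start()
  }

  def onStop(): Unit = { /* the receiving thread exits once isStopped() returns true */ }
}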
The ReceiverSupervisor processes the data the Receiver receives, stores it, and reports it to the Driver; the Receiver receives the data one record at a time.
The function shipped inside the RDD wraps a receiver, and which receiver that code starts depends on the input DStream you created. For socketTextStream the input DStream is, in effect, a handle referencing a SocketReceiver:
the receiver we hold at this point is only a description (a reference), and the receiver that actually receives the data is produced by getReceiver(), called from launchReceivers():
/**
 * Get the receivers from the ReceiverInputDStreams, distributes them to the
 * worker nodes as a parallel collection, and runs them.
 */
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map { nis =>
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  }

  runDummySparkJob()

  logInfo("Starting " + receivers.length + " receivers")
  endpoint.send(StartAllReceivers(receivers))
}
private[streaming]
class SocketInputDStream[T: ClassTag](
    ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}
private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {

  def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }
If the receiver RDD is empty, an RDD is created by default. The main work is processing the data received by the Receiver: the received data is handed to the ReceiverSupervisor for storage, and the metadata is reported to the ReceiverTracker. The Receiver receives data record by record, which abstractly is a while loop: receive a record, merge it into a buffer, and put the buffer into a block queue. When ReceiverSupervisorImpl starts, its onStart calls the start method of each registered BlockGenerator:
override protected def onStart() {
  registeredBlockGenerators.foreach { _.start() }
}
/**
 * Generates batches of objects received by a
 * [[org.apache.spark.streaming.receiver.Receiver]] and puts them into appropriately
 * named blocks at regular intervals. This class starts two threads,
 * one to periodically start a new batch and prepare the previous batch as a block,
 * the other to push the blocks into the block manager.
 *
 * Note: Do not create BlockGenerator instances directly inside receivers. Use
 * `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
 */
private[streaming] class BlockGenerator(
    listener: BlockGeneratorListener,
    receiverId: Int,
    conf: SparkConf,
    clock: Clock = new SystemClock()
  ) extends RateLimiter(conf) with Logging {

  private case class Block(id: StreamBlockId, buffer: ArrayBuffer[Any])

  /**
   * The BlockGenerator can be in 5 possible states, in the order as follows.
   *
   *  - Initialized: Nothing has been started.
   *  - Active: start() has been called, and it is generating blocks on added data.
   *  - StoppedAddingData: stop() has been called, the adding of data has been stopped,
   *                       but blocks are still being generated and pushed.
   *  - StoppedGeneratingBlocks: Generating of blocks has been stopped, but
   *                             they are still being pushed.
   *  - StoppedAll: Everything has stopped, and the BlockGenerator object can be GCed.
   */
  private object GeneratorState extends Enumeration {
    type GeneratorState = Value
    val Initialized, Active, StoppedAddingData, StoppedGeneratingBlocks, StoppedAll = Value
  }
  import GeneratorState._

  private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
  require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

  private val blockIntervalTimer =
    new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
  private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
  private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
  private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }

  @volatile private var currentBuffer = new ArrayBuffer[Any]
  @volatile private var state = Initialized

  /** Start block generating and pushing threads. */
  def start(): Unit = synchronized {
    if (state == Initialized) {
      state = Active
      blockIntervalTimer.start()
      blockPushingThread.start()
      logInfo("Started BlockGenerator")
    } else {
      throw new SparkException(
        s"Cannot start BlockGenerator as it is not in the Initialized state [state = $state]")
    }
  }
What is the BlockGenerator class used for? As the source code comments above describe, it combines the data received by a Receiver into blocks and then writes them to the BlockManager.
The class runs two threads internally: one periodically starts a new batch of objects and encapsulates the previous batch into a block; the other writes the blocks to the BlockManager for storage.
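A simplified sketch of what those two threads do, using the updateCurrentBuffer and keepPushingBlocks names wired up in the constructor above (synchronization and error handling are omitted, so this is illustrative rather than the verbatim source):

  // Timer callback: every blockIntervalMs, swap out the current buffer and enqueue it as a block.
  private def updateCurrentBuffer(time: Long): Unit = synchronized {
    if (currentBuffer.nonEmpty) {
      val newBlockBuffer = currentBuffer
      currentBuffer = new ArrayBuffer[Any]
      val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
      blocksForPushing.put(Block(blockId, newBlockBuffer))  // blocks while the queue is full
    }
  }

  // Pushing thread: drain the queue and hand each block to the listener, which stores it
  // through the BlockManager and reports its metadata to the driver.
  private def keepPushingBlocks(): Unit = {
    while (state == Active || !blocksForPushing.isEmpty) {
      val block = blocksForPushing.poll(10, java.util.concurrent.TimeUnit.MILLISECONDS)
      if (block != null) listener.onPushBlock(block.id, block.buffer)
    }
  }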
override def createBlockGenerator(
    blockGeneratorListener: BlockGeneratorListener): BlockGenerator = {
  // Cleanup BlockGenerators that have already been stopped
  registeredBlockGenerators --= registeredBlockGenerators.filter{ _.isStopped() }

  val newBlockGenerator = new BlockGenerator(blockGeneratorListener, streamId, env.conf)
  registeredBlockGenerators += newBlockGenerator
  newBlockGenerator
}
The BlockGenerator class extends RateLimiter. This shows that we cannot limit the speed at which data is received, but we can limit the speed at which data is stored, and in that way indirectly throttle the stream.
The BlockGenerator class has a timer (by default it merges the received data into a block every 200ms) and a thread (which writes the blocks to the BlockManager). One block every 200ms means 5 blocks, and hence 5 partitions, per second of data. If the interval is set too small, each data slice becomes too small, so each task processes too little data and performance suffers; practical experience suggests setting it no lower than 50ms.
private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
require(blockIntervalMs > 0, s"'spark.streaming.blockInterval' should be a positive value")

private val blockIntervalTimer =
  new RecurringTimer(clock, blockIntervalMs, updateCurrentBuffer, "BlockGenerator")
private val blockQueueSize = conf.getInt("spark.streaming.blockQueueSize", 10)
private val blocksForPushing = new ArrayBlockingQueue[Block](blockQueueSize)
private val blockPushingThread = new Thread() { override def run() { keepPushingBlocks() } }
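Both knobs are ordinary Spark configuration entries set on the driver. A sketch of tuning them (the 50ms value follows the rule of thumb above; spark.streaming.receiver.maxRate is the setting the RateLimiter reads to cap how many records per second each receiver may store):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("BlockIntervalTuning")
  // Generate a block every 50ms instead of the 200ms default: more, smaller partitions per batch.
  .set("spark.streaming.blockInterval", "50ms")
  // Upper bound on records stored per second per receiver, enforced by the RateLimiter.
  .set("spark.streaming.receiver.maxRate", "10000")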
Thanks to teacher Liao Liang for sharing his knowledge.
Liao Liang's contact card:
Known as "China's leading Spark expert"
Sina Weibo: http://weibo.com/ilovepains
WeChat public account: DT_Spark
Blog: http://blog.sina.com.cn/ilovepains
Mobile: 18610086859
QQ: 1740415547
Email: [Email protected]
YY classroom: live teaching daily at 20:00 on channel 68917580
Spark release notes 10: Spark Streaming source code interpretation, a thorough study of and reflection on the full lifecycle of streaming data receiving.