PySpark's JVM-side Scala code: PythonRDD
The code below is from Spark 2.2.0.
1. The PythonRDD class
This RDD type is the key to Python's access to Spark data.
It is a standard RDD implementation: it provides the usual compute, partitioner and getPartitions members.

```scala
// This PythonRDD is what PySpark's PipelinedRDD returns from its _jrdd property.
// Its parent is the _prev_jrdd passed into PipelinedRDD, i.e. the originally built data-source RDD.
private[spark] class PythonRDD(
    parent: RDD[_],               // this parent RDD is the key: Python reaches all of Spark's data sources through it
    func: PythonFunction,         // the user's Python compute logic
    preservePartitoning: Boolean)
  extends RDD[Array[Byte]](parent) {

  val bufferSize = conf.getInt("spark.buffer.size", 65536)
  val reuse_worker = conf.getBoolean("spark.python.worker.reuse", true)

  override def getPartitions: Array[Partition] = firstParent.partitions

  override val partitioner: Option[Partitioner] = {
    if (preservePartitoning) firstParent.partitioner else None
  }

  val asJavaRDD: JavaRDD[Array[Byte]] = JavaRDD.fromRDD(this)

  override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] = {
    // PythonRunner performs the task logic here. This PythonRunner is not the same thing
    // as the PythonRunner that spark-submit uses to launch py4j.
    val runner = PythonRunner(func, bufferSize, reuse_worker)
    // Run the runner's compute logic. The first argument, firstParent.iterator(...), is the
    // computed output of the Spark data-source RDD: it triggers computation of the parent RDD
    // and hands over its results. This firstParent is the same RDD that _prev_jrdd refers to
    // on the PySpark side.
    runner.compute(firstParent.iterator(split, context), split.index, context)
  }
}
```
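To see this delegation pattern in isolation, here is a minimal sketch of an RDD that, like PythonRDD, reuses its parent's partitioning and only post-processes the parent's per-partition iterator inside compute(). UpperCaseRDD is a hypothetical example, not Spark code; PythonRDD has the same shape, except that its compute() pipes the iterator through an external Python worker instead of a Scala function.

```scala
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical illustration only (not part of Spark): the same shape as PythonRDD.
class UpperCaseRDD(parent: RDD[String]) extends RDD[String](parent) {

  // Reuse the parent's partitions, as PythonRDD does via firstParent.partitions.
  override protected def getPartitions: Array[Partition] = parent.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    // parent.iterator(split, context) triggers computation of the parent partition;
    // this RDD merely transforms the resulting iterator.
    parent.iterator(split, context).map(_.toUpperCase)
  }
}
```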
2. The PythonRunner class
This is the class that actually performs the computation inside the RDD; it is not the PythonRunner that starts py4j when the application is submitted.
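The compute() method below is easier to follow once you see its overall shape in isolation: launch an external worker process, feed it input from a background daemon thread, and consume its output on the calling thread. The sketch below is not Spark code; it uses a plain `sort` subprocess (assumed to be available on the machine, e.g. on Linux or macOS) as a stand-in for the Python worker, and stdin/stdout instead of the worker socket.

```scala
import java.io.{BufferedReader, InputStreamReader, PrintWriter}

// Stripped-down sketch of the writer-thread / reader-iterator shape used by PythonRunner.
object PipeSketch {
  def main(args: Array[String]): Unit = {
    val proc = new ProcessBuilder("sort").start()

    // Writer side: analogous to WriterThread pushing the parent RDD's rows to the worker.
    val writer = new Thread {
      setDaemon(true)
      override def run(): Unit = {
        val out = new PrintWriter(proc.getOutputStream)
        Seq("banana", "apple", "cherry").foreach(line => out.println(line))
        out.close() // closing the stream signals end-of-input, like END_OF_DATA_SECTION
      }
    }
    writer.start()

    // Reader side: analogous to the stdoutIterator that pulls results back from the worker.
    val in = new BufferedReader(new InputStreamReader(proc.getInputStream))
    Iterator.continually(in.readLine()).takeWhile(_ != null).foreach(println)
    proc.waitFor()
  }
}
```

Writing and reading on separate threads matters because both sides block: if the JVM tried to write the whole partition before reading anything back, a large partition could deadlock on full pipe or socket buffers.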
```scala
/*
 * This class does three things:
 *  1. starts pyspark.daemon, which receives tasks and starts workers to execute them
 *  2. starts a WriterThread that writes the data-source data to the pyspark worker
 *  3. pulls the execution results back from the pyspark worker
 *
 * The data the WriterThread writes is the computed result of _jrdd in PySpark,
 * i.e. the data of the data-source RDD.
 */
private[spark] class PythonRunner(
    funcs: Seq[ChainedPythonFunctions],
    bufferSize: Int,
    reuse_worker: Boolean,
    isUDF: Boolean,
    argOffsets: Array[Array[Int]])
  extends Logging {

  require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")

  // The Python execution environment and command.
  private val envVars = funcs.head.funcs.head.envVars
  private val pythonExec = funcs.head.funcs.head.pythonExec
  private val pythonVer = funcs.head.funcs.head.pythonVer
  private val accumulator = funcs.head.funcs.head.accumulator

  def compute(
      inputIterator: Iterator[_],
      partitionIndex: Int,
      context: TaskContext): Iterator[Array[Byte]] = {
    val startTime = System.currentTimeMillis
    val env = SparkEnv.get
    val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",")
    envVars.put("SPARK_LOCAL_DIRS", localdir) // it's also used in monitor thread
    if (reuse_worker) {
      envVars.put("SPARK_REUSE_WORKER", "1")
    }
    // Create the pyspark worker process; what actually runs underneath is pyspark.daemon.
    // This method makes sure only one pyspark.daemon is started, and the result is a Socket
    // connected to a worker. Communication with the worker is analysed in a separate section.
    val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)
    @volatile var released = false

    // Create the WriterThread that writes the data-source data to the socket,
    // i.e. sends it to the pyspark worker.
    val writerThread = new WriterThread(env, worker, inputIterator, partitionIndex, context)

    // Register a task-completion listener that stops the WriterThread when the task finishes.
    context.addTaskCompletionListener { context =>
      writerThread.shutdownOnTaskCompletion()
      if (!reuse_worker || !released) {
        try {
          worker.close()
        } catch {
          case e: Exception =>
            logWarning("Failed to close worker socket", e)
        }
      }
    }

    writerThread.start()
    new MonitorThread(env, worker, context).start()

    val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize))

    // The iterator that pulls the execution results back from the pyspark worker.
    val stdoutIterator = new Iterator[Array[Byte]] {
      override def next(): Array[Byte] = {
        val obj = _nextObj
        if (hasNext) {
          _nextObj = read()
        }
        obj
      }

      private def read(): Array[Byte] = {
        if (writerThread.exception.isDefined) {
          throw writerThread.exception.get
        }
        try {
          stream.readInt() match {
            case length if length > 0 =>
              val obj = new Array[Byte](length)
              stream.readFully(obj)
              obj
            case 0 => Array.empty[Byte]
            case SpecialLengths.TIMING_DATA =>
              // Timing data from worker
              val bootTime = stream.readLong()
              val initTime = stream.readLong()
              val finishTime = stream.readLong()
              val boot = bootTime - startTime
              val init = initTime - bootTime
              val finish = finishTime - initTime
              val total = finishTime - startTime
              logInfo("Times: total = %s, boot = %s, init = %s, finish = %s".format(
                total, boot, init, finish))
              val memoryBytesSpilled = stream.readLong()
              val diskBytesSpilled = stream.readLong()
              context.taskMetrics.incMemoryBytesSpilled(memoryBytesSpilled)
              context.taskMetrics.incDiskBytesSpilled(diskBytesSpilled)
              read()
            case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
              // Signals that an exception has been thrown in Python
              val exLength = stream.readInt()
              val obj = new Array[Byte](exLength)
              stream.readFully(obj)
              throw new PythonException(new String(obj, StandardCharsets.UTF_8),
                writerThread.exception.getOrElse(null))
            case SpecialLengths.END_OF_DATA_SECTION =>
              // We've finished the data section of the output, but we can still
              // read some accumulator updates:
              val numAccumulatorUpdates = stream.readInt()
              (1 to numAccumulatorUpdates).foreach { _ =>
                val updateLen = stream.readInt()
                val update = new Array[Byte](updateLen)
                stream.readFully(update)
                accumulator.add(update)
              }
              // Check whether the worker is ready to be re-used.
              if (stream.readInt() == SpecialLengths.END_OF_STREAM) {
                if (reuse_worker) {
                  env.releasePythonWorker(pythonExec, envVars.asScala.toMap, worker)
                  released = true
                }
              }
              null
          }
        } catch {

          case e: Exception if context.isInterrupted =>
            logDebug("Exception thrown after task interruption", e)
            throw new TaskKilledException(context.getKillReason().getOrElse("unknown reason"))

          case e: Exception if env.isStopped =>
            logDebug("Exception thrown after context is stopped", e)
            null // exit silently

          case e: Exception if writerThread.exception.isDefined =>
            logError("Python worker exited unexpectedly (crashed)", e)
            logError("This may have been caused by a prior exception:", writerThread.exception.get)
            throw writerThread.exception.get

          case eof: EOFException =>
            throw new SparkException("Python worker exited unexpectedly (crashed)", eof)
        }
      }

      var _nextObj = read()

      override def hasNext: Boolean = _nextObj != null
    }

    // Return the iterator over the pulled results.
    new InterruptibleIterator(context, stdoutIterator)
  }

  /**
   * WriterThread implementation.
   */
  class WriterThread(
      env: SparkEnv,
      worker: Socket,
      inputIterator: Iterator[_],
      partitionIndex: Int,
      context: TaskContext)
    extends Thread(s"stdout writer for $pythonExec") {

    @volatile private var _exception: Exception = null

    private val pythonIncludes = funcs.flatMap(_.funcs.flatMap(_.pythonIncludes.asScala)).toSet
    private val broadcastVars = funcs.flatMap(_.funcs.flatMap(_.broadcastVars.asScala))

    setDaemon(true)

    /** Contains the exception thrown while writing the parent iterator to the Python process. */
    def exception: Option[Exception] = Option(_exception)

    /** Terminates the writer thread, ignoring any exceptions that may occur due to cleanup. */
    def shutdownOnTaskCompletion() {
      assert(context.isCompleted)
      this.interrupt()
    }

    // The main logic is in run(): write the broadcast variables and the environment,
    // the serialized Python execution logic, and finally the data to be computed,
    // i.e. the computed output of the data-source RDD.
    override def run(): Unit = Utils.logUncaughtExceptions {
      try {
        TaskContext.setTaskContext(context)
        val stream = new BufferedOutputStream(worker.getOutputStream, bufferSize)
        val dataOut = new DataOutputStream(stream)
        // Partition index
        dataOut.writeInt(partitionIndex)
        // Python version of driver
        PythonRDD.writeUTF(pythonVer, dataOut)
        // Write out the TaskContextInfo
        dataOut.writeInt(context.stageId())
        dataOut.writeInt(context.partitionId())
        dataOut.writeInt(context.attemptNumber())
        dataOut.writeLong(context.taskAttemptId())
        // sparkFilesDir
        PythonRDD.writeUTF(SparkFiles.getRootDirectory(), dataOut)
        // Python includes (*.zip and *.egg files)
        dataOut.writeInt(pythonIncludes.size)
        for (include <- pythonIncludes) {
          PythonRDD.writeUTF(include, dataOut)
        }
        // Broadcast variables
        val oldBids = PythonRDD.getWorkerBroadcasts(worker)
        val newBids = broadcastVars.map(_.id).toSet
        // number of different broadcasts
        val toRemove = oldBids.diff(newBids)
        val cnt = toRemove.size + newBids.diff(oldBids).size
        dataOut.writeInt(cnt)
        for (bid <- toRemove) {
          // remove the broadcast from worker
          dataOut.writeLong(-bid - 1) // bid >= 0
          oldBids.remove(bid)
        }
        for (broadcast <- broadcastVars) {
          if (!oldBids.contains(broadcast.id)) {
            // send new broadcast
            dataOut.writeLong(broadcast.id)
            PythonRDD.writeUTF(broadcast.value.path, dataOut)
            oldBids.add(broadcast.id)
          }
        }
        dataOut.flush()
        // Serialized command:
        if (isUDF) {
          dataOut.writeInt(1)
          dataOut.writeInt(funcs.length)
          funcs.zip(argOffsets).foreach { case (chained, offsets) =>
            dataOut.writeInt(offsets.length)
            offsets.foreach { offset =>
              dataOut.writeInt(offset)
            }
            dataOut.writeInt(chained.funcs.length)
            chained.funcs.foreach { f =>
              dataOut.writeInt(f.command.length)
              dataOut.write(f.command)
            }
          }
        } else {
          dataOut.writeInt(0)
          val command = funcs.head.funcs.head.command
          dataOut.writeInt(command.length)
          dataOut.write(command)
        }
        // Data values
        PythonRDD.writeIteratorToStream(inputIterator, dataOut)
        dataOut.writeInt(SpecialLengths.END_OF_DATA_SECTION)
        dataOut.writeInt(SpecialLengths.END_OF_STREAM)
        dataOut.flush()
      } catch {
        case e: Exception if context.isCompleted || context.isInterrupted =>
          logDebug("Exception thrown after task completion (likely due to cleanup)", e)
          if (!worker.isClosed) {
            Utils.tryLog(worker.shutdownOutput())
          }

        case e: Exception =>
          // We must avoid throwing exceptions here, because the thread uncaught exception handler
          // will kill the whole executor (see org.apache.spark.executor.Executor).
          _exception = e
          if (!worker.isClosed) {
            Utils.tryLog(worker.shutdownOutput())
          }
      }
    }
  }

  // Monitors whether the task is still executing.
  class MonitorThread(env: SparkEnv, worker: Socket, context: TaskContext)
    extends Thread(s"Worker Monitor for $pythonExec") {

    setDaemon(true)

    override def run() {
      // Kill the worker if it is interrupted, checking until task completion.
      // TODO: This has a race condition if interruption occurs, as completed may still become true.
      while (!context.isInterrupted && !context.isCompleted) {
        Thread.sleep(2000)
      }
      if (!context.isCompleted) {
        try {
          logWarning("Incomplete task interrupted: Attempting to kill Python Worker")
          env.destroyPythonWorker(pythonExec, envVars.asScala.toMap, worker)
        } catch {
          case e: Exception =>
            logError("Exception when trying to kill worker", e)
        }
      }
    }
  }
}
```
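Both directions of the socket traffic above rely on a simple length-prefixed framing: a positive Int announces how many payload bytes follow, 0 means an empty frame, and negative Ints are control markers (the SpecialLengths values such as TIMING_DATA, END_OF_DATA_SECTION and END_OF_STREAM). The sketch below is not Spark code and uses a placeholder value for the control marker; it only demonstrates the framing idea with in-memory streams.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

// Minimal sketch of length-prefixed framing with a negative control marker.
object FramingSketch {
  val END_OF_DATA_SECTION = -1 // placeholder control code, for illustration only

  def writeFrames(records: Seq[String], out: DataOutputStream): Unit = {
    records.foreach { r =>
      val bytes = r.getBytes(StandardCharsets.UTF_8)
      out.writeInt(bytes.length) // length prefix
      out.write(bytes)           // payload
    }
    out.writeInt(END_OF_DATA_SECTION) // control marker ends the data section
    out.flush()
  }

  def readFrames(in: DataInputStream): List[String] = {
    Iterator
      .continually(in.readInt())
      .takeWhile(_ >= 0) // stop on any (negative) control marker
      .map { length =>
        val buf = new Array[Byte](length)
        in.readFully(buf)
        new String(buf, StandardCharsets.UTF_8)
      }
      .toList
  }

  def main(args: Array[String]): Unit = {
    val bytes = new ByteArrayOutputStream()
    writeFrames(Seq("a", "bc", "def"), new DataOutputStream(bytes))
    val back = readFrames(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
    println(back) // List(a, bc, def)
  }
}
```

This is why the result iterator's read() can interleave data frames with timing data, accumulator updates and Python exceptions on a single stream: the sign of the length prefix tells it what kind of frame comes next.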
That covers the JVM-side Scala code that PySpark talks to: the PythonRDD class and its PythonRunner.