PySpark's corresponding Scala code: the PythonRDD class


The JVM-side Scala code behind PySpark: PythonRDD

The code shown here is from Spark 2.2.0.

1. The PythonRDD class

This RDD type is the key to Python's access to Spark.

It is a standard RDD implementation, providing the corresponding compute, partitioner, and getPartitions methods.

// Excerpt from org/apache/spark/api/python/PythonRDD.scala (Spark 2.2.0); imports omitted.
// This PythonRDD is what the _jrdd property of PySpark's PipelinedRDD returns.
// Its parent is the _prev_jrdd passed into PipelinedRDD, i.e. the originally built data-source RDD.
private[spark] class PythonRDD(
    parent: RDD[_],              // this parent RDD is the key: Python reaches all of Spark's data sources through it
    func: PythonFunction,        // the user-supplied Python compute logic
    preservePartitoning: Boolean)
  extends RDD[Array[Byte]](parent) {

  val bufferSize = conf.getInt("spark.buffer.size", 65536)
  val reuse_worker = conf.getBoolean("spark.python.worker.reuse", true)

  override def getPartitions: Array[Partition] = firstParent.partitions

  override val partitioner: Option[Partitioner] = {
    if (preservePartitoning) firstParent.partitioner else None
  }

  val asJavaRDD: JavaRDD[Array[Byte]] = JavaRDD.fromRDD(this)

  override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] = {
    // Call PythonRunner here to perform the task logic.
    // This PythonRunner is not the same thing as the PythonRunner used when submitting with spark-submit.
    val runner = PythonRunner(func, bufferSize, reuse_worker)
    // Run the runner's compute logic. The first argument, firstParent.iterator, is the computed
    // result of the Spark data-source RDD: it triggers computation of the parent RDD and returns
    // its result. The RDD behind this first argument is the same thing as _jrdd in PySpark.
    runner.compute(firstParent.iterator(split, context), split.index, context)
  }
}
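To see the wrapper pattern above in isolation, here is a minimal sketch (my own illustration, not Spark source) of a custom RDD that, like PythonRDD, reuses the parent's partitions and partitioner and post-processes firstParent.iterator inside compute. The class name UpperCaseRDD and its behavior are made up for the example.

import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative only: a wrapper RDD in the same shape as PythonRDD.
class UpperCaseRDD(parent: RDD[String], preservePartitioning: Boolean)
  extends RDD[String](parent) {

  // Same partition layout as the parent, exactly as PythonRDD does.
  override def getPartitions: Array[Partition] = firstParent[String].partitions

  override val partitioner: Option[Partitioner] =
    if (preservePartitioning) firstParent[String].partitioner else None

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    // firstParent.iterator(split, context) triggers the parent's computation for this
    // partition; the wrapper only transforms the resulting iterator, just as PythonRDD
    // hands it to PythonRunner.compute.
    firstParent[String].iterator(split, context).map(_.toUpperCase)
  }
}

Used as new UpperCaseRDD(sc.textFile("..."), preservePartitioning = false), it behaves like its parent with every record upper-cased, which is all PythonRDD does structurally before delegating the real work to PythonRunner.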
2. The PythonRunner class

This class performs the actual computation inside the RDD's compute method; it is not the PythonRunner that launches py4j when the application is submitted.
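Before walking through the full class, it may help to see the writer/reader split in isolation: a daemon writer thread drains the input iterator into the worker's stream while the calling thread reads results back, and any write-side exception is parked in a field instead of escaping the thread. The sketch below is a simplified stand-in (plain java.io, made-up names such as StreamWriterThread, and -1 as a stand-in end marker), not the real WriterThread.

import java.io.{DataOutputStream, OutputStream}

// Illustrative only: the writer-thread pattern used by PythonRunner, in miniature.
class StreamWriterThread(out: OutputStream, input: Iterator[Array[Byte]])
  extends Thread("stdout writer (sketch)") {

  // Any exception is captured here so the reading side can re-throw it; letting it
  // escape the thread would trigger the uncaught-exception handler instead.
  @volatile private var _exception: Exception = null
  setDaemon(true)

  def exception: Option[Exception] = Option(_exception)

  // Called from a completion callback; interrupt() unblocks a writer stuck on a full buffer.
  def shutdownOnTaskCompletion(): Unit = this.interrupt()

  override def run(): Unit = {
    val dataOut = new DataOutputStream(out)
    try {
      input.foreach { bytes =>
        dataOut.writeInt(bytes.length)   // length-prefixed frame
        dataOut.write(bytes)
      }
      dataOut.writeInt(-1)               // stand-in for an end-of-data marker
      dataOut.flush()
    } catch {
      case e: Exception =>
        _exception = e
    }
  }
}

The real WriterThread does the same thing, but with more setup traffic in front of the data: partition index, Python version, task context fields, includes, broadcasts, and the serialized command, as the code below shows.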

// Also defined in org/apache/spark/api/python/PythonRDD.scala in Spark 2.2.0; imports omitted.
/*
 * This class does three things:
 *  1. Starts pyspark.daemon, which receives tasks and starts workers to execute them.
 *  2. Starts a WriterThread that writes the data-source results to the pyspark worker.
 *  3. Pulls the execution results back from the pyspark worker.
 *
 * The data written by the WriterThread is the result of the _jrdd computation in PySpark,
 * i.e. the data of the data-source RDD.
 */
private[spark] class PythonRunner(
    funcs: Seq[ChainedPythonFunctions],
    bufferSize: Int,
    reuse_worker: Boolean,
    isUDF: Boolean,
    argOffsets: Array[Array[Int]])
  extends Logging {

  require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")

  // Python execution environment and command
  private val envVars = funcs.head.funcs.head.envVars
  private val pythonExec = funcs.head.funcs.head.pythonExec
  private val pythonVer = funcs.head.funcs.head.pythonVer
  private val accumulator = funcs.head.funcs.head.accumulator

  def compute(
      inputIterator: Iterator[_],
      partitionIndex: Int,
      context: TaskContext): Iterator[Array[Byte]] = {
    val startTime = System.currentTimeMillis
    val env = SparkEnv.get
    val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",")
    envVars.put("SPARK_LOCAL_DIRS", localdir) // it's also used in monitor thread
    if (reuse_worker) {
      envVars.put("SPARK_REUSE_WORKER", "1")
    }
    // Create the pyspark worker process; what actually runs underneath is pyspark.daemon.
    // This method ensures that only one pyspark.daemon is started.
    // The return value is a Socket; the details of communicating with the worker are covered in another section.
    val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)
    @volatile var released = false

    // Create the WriterThread, which writes the data-source data to the socket, i.e. sends it to the pyspark worker.
    val writerThread = new WriterThread(env, worker, inputIterator, partitionIndex, context)

    // Register a task-completion listener that stops the WriterThread once the task finishes.
    context.addTaskCompletionListener { context =>
      writerThread.shutdownOnTaskCompletion()
      if (!reuse_worker || !released) {
        try {
          worker.close()
        } catch {
          case e: Exception =>
            logWarning("Failed to close worker socket", e)
        }
      }
    }

    writerThread.start()
    new MonitorThread(env, worker, context).start()

    val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize))

    // Iterator that pulls the pyspark worker's execution results.
    val stdoutIterator = new Iterator[Array[Byte]] {
      override def next(): Array[Byte] = {
        val obj = _nextObj
        if (hasNext) {
          _nextObj = read()
        }
        obj
      }

      private def read(): Array[Byte] = {
        if (writerThread.exception.isDefined) {
          throw writerThread.exception.get
        }
        try {
          stream.readInt() match {
            case length if length > 0 =>
              val obj = new Array[Byte](length)
              stream.readFully(obj)
              obj
            case 0 => Array.empty[Byte]
            case SpecialLengths.TIMING_DATA =>
              // Timing data from worker
              val bootTime = stream.readLong()
              val initTime = stream.readLong()
              val finishTime = stream.readLong()
              val boot = bootTime - startTime
              val init = initTime - bootTime
              val finish = finishTime - initTime
              val total = finishTime - startTime
              logInfo("Times: total = %s, boot = %s, init = %s, finish = %s".format(
                total, boot, init, finish))
              val memoryBytesSpilled = stream.readLong()
              val diskBytesSpilled = stream.readLong()
              context.taskMetrics.incMemoryBytesSpilled(memoryBytesSpilled)
              context.taskMetrics.incDiskBytesSpilled(diskBytesSpilled)
              read()
            case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
              // Signals that an exception has been thrown in Python
              val exLength = stream.readInt()
              val obj = new Array[Byte](exLength)
              stream.readFully(obj)
              throw new PythonException(new String(obj, StandardCharsets.UTF_8),
                writerThread.exception.getOrElse(null))
            case SpecialLengths.END_OF_DATA_SECTION =>
              // We've finished the data section of the output, but we can still
              // read some accumulator updates:
              val numAccumulatorUpdates = stream.readInt()
              (1 to numAccumulatorUpdates).foreach { _ =>
                val updateLen = stream.readInt()
                val update = new Array[Byte](updateLen)
                stream.readFully(update)
                accumulator.add(update)
              }
              // Check whether the worker is ready to be re-used.
              if (stream.readInt() == SpecialLengths.END_OF_STREAM) {
                if (reuse_worker) {
                  env.releasePythonWorker(pythonExec, envVars.asScala.toMap, worker)
                  released = true
                }
              }
              null
          }
        } catch {
          case e: Exception if context.isInterrupted =>
            logDebug("Exception thrown after task interruption", e)
            throw new TaskKilledException(context.getKillReason().getOrElse("unknown reason"))

          case e: Exception if env.isStopped =>
            logDebug("Exception thrown after context is stopped", e)
            null // exit silently

          case e: Exception if writerThread.exception.isDefined =>
            logError("Python worker exited unexpectedly (crashed)", e)
            logError("This may have been caused by a prior exception:", writerThread.exception.get)
            throw writerThread.exception.get

          case eof: EOFException =>
            throw new SparkException("Python worker exited unexpectedly (crashed)", eof)
        }
      }

      var _nextObj = read()

      override def hasNext: Boolean = _nextObj != null
    }
    // Return the iterator that pulls the results.
    new InterruptibleIterator(context, stdoutIterator)
  }

  /**
   * WriterThread implementation.
   */
  class WriterThread(
      env: SparkEnv,
      worker: Socket,
      inputIterator: Iterator[_],
      partitionIndex: Int,
      context: TaskContext)
    extends Thread(s"stdout writer for $pythonExec") {

    @volatile private var _exception: Exception = null

    private val pythonIncludes = funcs.flatMap(_.funcs.flatMap(_.pythonIncludes.asScala)).toSet
    private val broadcastVars = funcs.flatMap(_.funcs.flatMap(_.broadcastVars.asScala))

    setDaemon(true)

    /** Contains the exception thrown while writing the parent iterator to the Python process. */
    def exception: Option[Exception] = Option(_exception)

    /** Terminates the writer thread, ignoring any exceptions that may occur due to cleanup. */
    def shutdownOnTaskCompletion() {
      assert(context.isCompleted)
      this.interrupt()
    }

    // The main logic is in run(): it writes the execution result of the data-source RDD to the
    // worker. It first writes the broadcast variables and the environment, then the serialized
    // Python execution logic, and finally the data-source data that needs to be computed.
    override def run(): Unit = Utils.logUncaughtExceptions {
      try {
        TaskContext.setTaskContext(context)
        val stream = new BufferedOutputStream(worker.getOutputStream, bufferSize)
        val dataOut = new DataOutputStream(stream)
        // Partition index
        dataOut.writeInt(partitionIndex)
        // Python version of driver
        PythonRDD.writeUTF(pythonVer, dataOut)
        // Write out the TaskContext info
        dataOut.writeInt(context.stageId())
        dataOut.writeInt(context.partitionId())
        dataOut.writeInt(context.attemptNumber())
        dataOut.writeLong(context.taskAttemptId())
        // sparkFilesDir
        PythonRDD.writeUTF(SparkFiles.getRootDirectory(), dataOut)
        // Python includes (*.zip and *.egg files)
        dataOut.writeInt(pythonIncludes.size)
        for (include <- pythonIncludes) {
          PythonRDD.writeUTF(include, dataOut)
        }
        // Broadcast variables
        val oldBids = PythonRDD.getWorkerBroadcasts(worker)
        val newBids = broadcastVars.map(_.id).toSet
        // number of different broadcasts
        val toRemove = oldBids.diff(newBids)
        val cnt = toRemove.size + newBids.diff(oldBids).size
        dataOut.writeInt(cnt)
        for (bid <- toRemove) {
          // remove the broadcast from worker
          dataOut.writeLong(-bid - 1) // bid >= 0
          oldBids.remove(bid)
        }
        for (broadcast <- broadcastVars) {
          if (!oldBids.contains(broadcast.id)) {
            // send new broadcast
            dataOut.writeLong(broadcast.id)
            PythonRDD.writeUTF(broadcast.value.path, dataOut)
            oldBids.add(broadcast.id)
          }
        }
        dataOut.flush()
        // Serialized command:
        if (isUDF) {
          dataOut.writeInt(1)
          dataOut.writeInt(funcs.length)
          funcs.zip(argOffsets).foreach { case (chained, offsets) =>
            dataOut.writeInt(offsets.length)
            offsets.foreach { offset =>
              dataOut.writeInt(offset)
            }
            dataOut.writeInt(chained.funcs.length)
            chained.funcs.foreach { f =>
              dataOut.writeInt(f.command.length)
              dataOut.write(f.command)
            }
          }
        } else {
          dataOut.writeInt(0)
          val command = funcs.head.funcs.head.command
          dataOut.writeInt(command.length)
          dataOut.write(command)
        }
        // Data values
        PythonRDD.writeIteratorToStream(inputIterator, dataOut)
        dataOut.writeInt(SpecialLengths.END_OF_DATA_SECTION)
        dataOut.writeInt(SpecialLengths.END_OF_STREAM)
        dataOut.flush()
      } catch {
        case e: Exception if context.isCompleted || context.isInterrupted =>
          logDebug("Exception thrown after task completion (likely due to cleanup)", e)
          if (!worker.isClosed) {
            Utils.tryLog(worker.shutdownOutput())
          }

        case e: Exception =>
          // We must avoid throwing exceptions here, because the thread uncaught exception handler
          // will kill the whole executor (see org.apache.spark.executor.Executor).
          _exception = e
          if (!worker.isClosed) {
            Utils.tryLog(worker.shutdownOutput())
          }
      }
    }
  }

  // Monitors whether the task is still executing.
  class MonitorThread(env: SparkEnv, worker: Socket, context: TaskContext)
    extends Thread(s"Worker Monitor for $pythonExec") {

    setDaemon(true)

    override def run() {
      // Kill the worker if it is interrupted, checking until task completion.
      // TODO: This has a race condition if interruption occurs, as completed may still become true.
      while (!context.isInterrupted && !context.isCompleted) {
        Thread.sleep(2000)
      }
      if (!context.isCompleted) {
        try {
          logWarning("Incomplete task interrupted: Attempting to kill Python Worker")
          env.destroyPythonWorker(pythonExec, envVars.asScala.toMap, worker)
        } catch {
          case e: Exception =>
            logError("Exception when trying to kill worker", e)
        }
      }
    }
  }
}
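The read loop in compute is easier to follow once the framing convention is seen on its own: the worker writes either a positive length followed by that many payload bytes, zero for an empty record, or a negative "special length" marker. Below is a self-contained sketch of that convention using plain java.io; the object name FramingSketch is made up, and the constant values are stand-ins chosen for the example, not necessarily the values of Spark's private SpecialLengths object.

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object FramingSketch {
  // Stand-ins for SpecialLengths; the real constants live in org.apache.spark.api.python.
  val END_OF_DATA_SECTION = -1
  val PYTHON_EXCEPTION_THROWN = -2
  val TIMING_DATA = -3
  val END_OF_STREAM = -4

  /** Read frames until END_OF_DATA_SECTION, mirroring PythonRunner's stream.readInt() match. */
  def readFrames(in: DataInputStream): Seq[Array[Byte]] = {
    val out = Seq.newBuilder[Array[Byte]]
    var done = false
    while (!done) {
      in.readInt() match {
        case length if length > 0 =>
          val payload = new Array[Byte](length)
          in.readFully(payload)        // block until the whole frame has arrived
          out += payload
        case 0 =>
          out += Array.empty[Byte]     // an empty record is encoded as length 0
        case END_OF_DATA_SECTION =>
          done = true                  // accumulator updates / END_OF_STREAM would follow here
        case other =>
          sys.error(s"special length $other not handled in this sketch")
      }
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    // Write two frames plus the end marker, then read them back.
    val buf = new ByteArrayOutputStream()
    val dataOut = new DataOutputStream(buf)
    for (msg <- Seq("hello", "pyspark")) {
      val bytes = msg.getBytes("UTF-8")
      dataOut.writeInt(bytes.length)
      dataOut.write(bytes)
    }
    dataOut.writeInt(END_OF_DATA_SECTION)
    dataOut.flush()

    val frames = readFrames(new DataInputStream(new ByteArrayInputStream(buf.toByteArray)))
    frames.foreach(f => println(new String(f, "UTF-8")))  // prints: hello, pyspark
  }
}

Running FramingSketch.main prints the two payloads back; in PythonRunner the same loop additionally handles TIMING_DATA, PYTHON_EXCEPTION_THROWN, and the accumulator updates that follow END_OF_DATA_SECTION.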
