PySpark's JVM-side Scala code: the PythonRDD object
The code discussed here is from Spark 2.2.0.
1. The PythonRDD object
This singleton object is the basic entry point that PySpark calls into on the JVM side.
This section does not cover the entire object, because most of it is a set of static entry points called from the PySpark Python code. The main functions are shown below. collectAndServe, the method called by collect (the basis of all actions on a PySpark RDD), is also defined in this object.

    private[spark] object PythonRDD extends Logging {

      // Called by pyspark.SparkContext.runJob.
      // Provides the rdd.collect functionality and submits the job.
      def runJob(
          sc: SparkContext,
          rdd: JavaRDD[Array[Byte]],
          partitions: JArrayList[Int]): Int = {
        type ByteArray = Array[Byte]
        type UnrolledPartition = Array[ByteArray]
        val allPartitions: Array[UnrolledPartition] =
          sc.runJob(rdd, (x: Iterator[ByteArray]) => x.toArray, partitions.asScala)
        val flattenedPartition: UnrolledPartition = Array.concat(allPartitions: _*)
        serveIterator(flattenedPartition.iterator,
          s"serve RDD ${rdd.id} with partitions ${partitions.asScala.mkString(",")}")
      }

      // Every PySpark RDD action is ultimately triggered through this function.
      // pyspark's RDD.collect() calls this method, which executes the RDD and submits the task.
      def collectAndServe[T](rdd: RDD[T]): Int = {
        // The rdd parameter is the _jrdd of the PySpark RDD, i.e. the corresponding
        // Scala data-source RDD or PythonRDD.
        // rdd.collect() here is what actually triggers the job.
        serveIterator(rdd.collect().iterator, s"serve RDD ${rdd.id}")
      }

      // Writes the computed result to a local socket; the PySpark side then reads
      // that socket to obtain the result.
      def serveIterator[T](items: Iterator[T], threadName: String): Int = {
        // The socket is bound to a random port on localhost.
        val serverSocket = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
        // Close the socket if no connection arrives within 3 seconds.
        serverSocket.setSoTimeout(3000)

        // Start a thread that is responsible for writing the results to the socket.
        new Thread(threadName) {
          setDaemon(true)
          override def run() {
            try {
              val sock = serverSocket.accept()
              val out = new DataOutputStream(new BufferedOutputStream(sock.getOutputStream))
              Utils.tryWithSafeFinally {
                // writeIteratorToStream does the actual writing; it handles the
                // type dispatch and serialization work.
                writeIteratorToStream(items, out)
              } {
                out.close()
              }
            } catch {
              case NonFatal(e) =>
                logError(s"Error while sending iterator", e)
            } finally {
              serverSocket.close()
            }
          }
        }.start()

        // Finally, return the socket's port so that PySpark can read the data through it.
        serverSocket.getLocalPort
      }

      // Responsible for writing out the data: it does the type checking and the
      // corresponding serialization work. PythonRunner's WriterThread also uses
      // this function when writing data.
      def writeIteratorToStream[T](iter: Iterator[T], dataOut: DataOutputStream) {
        def write(obj: Any): Unit = obj match {
          case null =>
            dataOut.writeInt(SpecialLengths.NULL)
          case arr: Array[Byte] =>
            dataOut.writeInt(arr.length)
            dataOut.write(arr)
          case str: String =>
            writeUTF(str, dataOut)
          case stream: PortableDataStream =>
            write(stream.toArray())
          case (key, value) =>
            write(key)
            write(value)
          case other =>
            throw new SparkException("Unexpected element type " + other.getClass)
        }
        iter.foreach(write)
      }
    }
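To make the framing in writeIteratorToStream concrete, here is a small standalone imitation of the same dispatch: every element becomes a length-prefixed frame, nulls are marked with a sentinel length, strings are written as UTF-8 bytes, and a key/value pair becomes two consecutive frames. The object name WriteSketch, the NullMarker value and the use of IllegalArgumentException are assumptions made for this sketch, not Spark's actual constants.

    import java.io.{ByteArrayOutputStream, DataOutputStream}
    import java.nio.charset.StandardCharsets

    object WriteSketch {
      // Sentinel for nulls; Spark uses a negative constant from SpecialLengths,
      // the exact value used here is only an assumption for this sketch.
      val NullMarker: Int = -5

      // Same dispatch structure as PythonRDD.writeIteratorToStream:
      // every element becomes one (or, for pairs, two) length-prefixed frame(s).
      def writeIteratorToStream(iter: Iterator[Any], dataOut: DataOutputStream): Unit = {
        def write(obj: Any): Unit = obj match {
          case null =>
            dataOut.writeInt(NullMarker)
          case arr: Array[Byte] =>
            dataOut.writeInt(arr.length)
            dataOut.write(arr)
          case str: String =>
            // what writeUTF(str, dataOut) does: length prefix plus UTF-8 bytes
            val bytes = str.getBytes(StandardCharsets.UTF_8)
            dataOut.writeInt(bytes.length)
            dataOut.write(bytes)
          case (key, value) =>
            write(key)                 // a pair is just two frames back to back
            write(value)
          case other =>
            throw new IllegalArgumentException("Unexpected element type " + other.getClass)
        }
        iter.foreach(write)
      }

      def main(args: Array[String]): Unit = {
        val buf = new ByteArrayOutputStream()
        val out = new DataOutputStream(buf)
        writeIteratorToStream(Iterator("hello", Array[Byte](1, 2, 3), ("k", "v"), null), out)
        out.flush()
        println(s"wrote ${buf.size()} bytes")   // (4+5) + (4+3) + (4+1) + (4+1) + 4 = 30
      }
    }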
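The serving side is just as easy to reproduce outside of Spark: bind a ServerSocket to an ephemeral localhost port, push the frames from a daemon thread, and hand the port number back to the caller, which then connects and reads them. The sketch below is not Spark code; the names ServeSketch and readAll and the -1 end-of-stream marker are invented for illustration (Spark uses its own SpecialLengths markers), but the round trip mirrors serveIterator on the JVM side plus the socket read the Python side performs after collectAndServe returns the port.

    import java.io.{BufferedOutputStream, DataInputStream, DataOutputStream}
    import java.net.{InetAddress, ServerSocket, Socket}

    object ServeSketch {

      // Serve byte arrays on an ephemeral localhost port and return the port,
      // mirroring the shape of PythonRDD.serveIterator.
      def serve(items: Iterator[Array[Byte]]): Int = {
        val serverSocket = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
        serverSocket.setSoTimeout(3000)          // give up if no one connects within 3 seconds

        new Thread("serve sketch") {
          setDaemon(true)
          override def run(): Unit = {
            try {
              val sock = serverSocket.accept()
              val out = new DataOutputStream(new BufferedOutputStream(sock.getOutputStream))
              try {
                items.foreach { arr =>           // length-prefixed frames, as in writeIteratorToStream
                  out.writeInt(arr.length)
                  out.write(arr)
                }
                out.writeInt(-1)                 // end-of-stream marker, specific to this sketch
                out.flush()
              } finally {
                out.close()
              }
            } finally {
              serverSocket.close()
            }
          }
        }.start()

        serverSocket.getLocalPort                // the caller connects to this port to read the data
      }

      // The reader side: roughly what PySpark does after runJob/collectAndServe returns the port.
      def readAll(port: Int): Seq[Array[Byte]] = {
        val sock = new Socket(InetAddress.getByName("localhost"), port)
        val in = new DataInputStream(sock.getInputStream)
        val result = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
        try {
          var len = in.readInt()
          while (len >= 0) {
            val buf = new Array[Byte](len)
            in.readFully(buf)
            result += buf
            len = in.readInt()
          }
        } finally {
          sock.close()
        }
        result.toSeq
      }

      def main(args: Array[String]): Unit = {
        val port = serve(Seq("a", "bb", "ccc").map(_.getBytes("UTF-8")).iterator)
        readAll(port).foreach(b => println(new String(b, "UTF-8")))
      }
    }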