PySpark Internal Implementation


PySpark implements the Spark API for Python. Through it, users can write Python programs that run on top of Spark and take advantage of Spark's distributed computing.

Basic Process

The overall architecture of PySpark is as follows. The implementation of the Python API relies on the Java API: the Python-side SparkContext calls JavaSparkContext through Py4J, and the latter is a wrapper around Scala's SparkContext. The functions that transform and operate on RDDs are defined by the user in the Python program; these functions are serialized and sent to each worker, each worker launches a Python process to deserialize and execute them, and the results are returned through a pipe.
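
To make this flow concrete, here is a minimal sketch (assuming a pyspark shell or an already-initialized SparkContext named sc, which is not part of the original text): the lambda passed to map is an ordinary Python function that is pickled on the driver, shipped to the workers, and executed there by Python worker processes.

# Assumes `sc` is an initialized pyspark.SparkContext (e.g. from the pyspark shell).
rdd = sc.parallelize([1, 2, 3, 4])

# The lambda is serialized on the driver, sent to each worker, and executed
# there by a Python worker process launched alongside the JVM executor.
squares = rdd.map(lambda x: x * x)

# collect() returns the results to the driver through the JVM <-> Python pipe.
print(squares.collect())   # [1, 4, 9, 16]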

The details are described below.

Start of the Python Program

Like Scala programs, Python programs are submitted for execution through SparkSubmit. SparkSubmit checks whether the submitted program is a Python program; if it is, it sets the main class to PythonRunner.

The execution code of PythonRunner is as follows:

def main(args: Array[String]) {
  val pythonFile = args(0)
  val pyFiles = args(1)
  val otherArgs = args.slice(2, args.length)
  val pythonExec = sys.env.get("PYSPARK_PYTHON").getOrElse("python") // TODO: get this from conf

  // Format python file paths before adding them to the PYTHONPATH
  val formattedPythonFile = formatPath(pythonFile)
  val formattedPyFiles = formatPaths(pyFiles)

  // Launch a Py4J gateway server for the process to connect to; this will let it see our
  // Java system properties and such
  val gatewayServer = new py4j.GatewayServer(null, 0)
  gatewayServer.start()

  // Build up a PYTHONPATH that includes the Spark assembly JAR (where this class is), the
  // python directories in SPARK_HOME (if set), and any files in the pyFiles argument
  val pathElements = new ArrayBuffer[String]
  pathElements ++= formattedPyFiles
  pathElements += PythonUtils.sparkPythonPath
  pathElements += sys.env.getOrElse("PYTHONPATH", "")
  val pythonPath = PythonUtils.mergePythonPaths(pathElements: _*)

  // Launch Python process
  val builder = new ProcessBuilder(Seq(pythonExec, "-u", formattedPythonFile) ++ otherArgs)
  val env = builder.environment()
  env.put("PYTHONPATH", pythonPath)
  env.put("PYSPARK_GATEWAY_PORT", "" + gatewayServer.getListeningPort)
  builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize
  val process = builder.start()

  new RedirectThread(process.getInputStream, System.out, "redirect output").start()

  System.exit(process.waitFor())
}

In PythonRunner, based on the configuration options and the --py-files option provided by the user on the command line, the PYTHONPATH is set and a Java GatewayServer is started for the Python program to call into. The Python interpreter configured through PYSPARK_PYTHON is then used to execute the Python file, and the user's Python program starts.
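
For illustration, the sketch below shows roughly what the Python side does with that gateway. It mirrors what pyspark's java_gateway.py does, but the exact Py4J calls here follow the current py4j API and are an approximation (newer Spark versions also pass an authentication token).

import os
from py4j.java_gateway import JavaGateway, GatewayParameters, java_import

# PythonRunner exports the gateway port to the Python child process (see above).
port = int(os.environ["PYSPARK_GATEWAY_PORT"])

# Connect back to the GatewayServer running inside the JVM.
gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port, auto_convert=True))

# Make Spark's JVM classes reachable from Python (see the next section).
java_import(gateway.jvm, "org.apache.spark.SparkConf")
java_import(gateway.jvm, "org.apache.spark.api.java.*")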

SparkContext

As in Scala, SparkContext is the entry point for invoking Spark computations. A class SparkContext is defined in the Python file context.py; it wraps a JavaSparkContext as its _jsc attribute. When the SparkContext is initialized, the launch_gateway method defined in java_gateway.py is invoked first to initialize the JavaGateway. In launch_gateway, the classes defined in Spark are imported into the SparkContext's _jvm attribute, for example:

java_import(gateway.jvm, "org.apache.spark.SparkConf")

After this, SparkContext._jvm.SparkConf in Python refers to the SparkConf class defined in Scala; you can instantiate objects of this class, invoke the objects' methods, and so on.
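
For example, in a hypothetical interactive session (assuming sc is an initialized SparkContext):

# Assumes `sc` is an initialized pyspark.SparkContext.
jconf = sc._jvm.SparkConf(False)            # instantiate the JVM-side SparkConf via Py4J
jconf.set("spark.app.name", "py4j-demo")    # call a method on the JVM object
print(jconf.get("spark.app.name"))          # the result comes back over the Py4J bridge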

After initialization, the user can invoke the methods of SparkContext, such as textFile and parallelize. The following looks at the implementation of SparkContext using these two methods as examples.

The Implementation of textFile

textFile is called, as in Scala, with a path and an optional parameter minPartitions, which gives the minimum number of partitions; the call returns an RDD. The implementation of textFile is as follows:

def textFile(self, name, minPartitions=None):
    minPartitions = minPartitions or min(self.defaultParallelism, 2)
    return RDD(self._jsc.textFile(name, minPartitions), self,
               UTF8Deserializer())

The SparkContext in Python calls JavaSparkContext.textFile; the latter returns a JavaRDD[String] (JavaRDD is a wrapper around RDD and can be treated directly as an RDD). Python then wraps the JavaRDD into a Python RDD (see below for the details of RDD).
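
For reference, a typical call looks like this (the path is only an illustration); the file is read and split by the JVM, and Python only sees the wrapped RDD of strings:

lines = sc.textFile("hdfs:///tmp/example.txt", minPartitions=4)
print(lines.count())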

The Implementation of parallelize

parallelize turns a Python list into an RDD. Call example:

>>> sc.parallelize(range(5), 5).collect()
[0, 1, 2, 3, 4]

The implementation code of parallelize is as follows:

def parallelize(self, c, numSlices=None):
    numSlices = numSlices or self.defaultParallelism
    # Calling the Java parallelize() method with an ArrayList is too slow,
    # because it sends O(n) Py4J commands.  As an alternative, serialized
    # objects are written to a file and loaded through textFile().
    tempFile = NamedTemporaryFile(delete=False, dir=self._temp_dir)
    # Make sure we distribute data evenly if it's smaller than self.batchSize
    if "__len__" not in dir(c):
        c = list(c)    # Make it a list so we can compute its length
    batchSize = min(len(c) // numSlices, self._batchSize)
    if batchSize > 1:
        serializer = BatchedSerializer(self._unbatched_serializer,
                                       batchSize)
    else:
        serializer = self._unbatched_serializer
    serializer.dump_stream(c, tempFile)
    tempFile.close()
    readRDDFromFile = self._jvm.PythonRDD.readRDDFromFile
    jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices)
    return RDD(jrdd, self, serializer)

First, the data is serialized and written to a temporary file; then PythonRDD's readRDDFromFile is called to read the bytes back from the file as a JavaRDD[Array[Byte]], which is finally wrapped into a Python RDD.

RDD

The RDD in Python wraps the RDD in Spark, and each RDD carries a corresponding deserialization function. This is because, although the elements of an RDD in Spark can have any type, the RDDs provided to the Python side are RDD[Array[Byte]] whose elements are serialized Python objects; the deserialization function is what turns those bytes back into Python objects.
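
To make this concrete, here is a small sketch of a hypothetical interactive session that inspects those two pieces, the wrapped JVM RDD and the deserializer:

# Assumes `sc` is an initialized pyspark.SparkContext.
rdd = sc.parallelize(range(10))

print(type(rdd._jrdd))            # Py4J proxy for the underlying JVM RDD of byte arrays
print(rdd._jrdd_deserializer)     # serializer used to turn those bytes back into Python objects

# Transformations are composed as Python functions and run in Python worker processes.
print(rdd.map(lambda x: x + 1).sum())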
