Background
PySpark performance enhancements: [SPARK-22216][SPARK-21187] bring significant improvements in Python performance and interoperability through fast data serialization and vectorized execution.
SPARK-22216: the main implementation of vectorized pandas UDF processing, plus fixes for related pandas/Arrow problems (a minimal sketch follows below);
SPARK-21187: a known issue that has not been fully resolved so far; the Arrow conversion still does not support BinaryType, MapType, ArrayType of TimestampType, and nested StructType.
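For orientation, here is a minimal sketch of the vectorized (pandas) UDF path introduced by SPARK-22216, assuming Spark 2.3+ with pandas and pyarrow installed; the application name and column names are illustrative only.
# A minimal sketch of a vectorized (pandas) UDF, assuming Spark 2.3+ with pandas and pyarrow installed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

@pandas_udf("long", PandasUDFType.SCALAR)
def plus_one(v):
    # v arrives as a pandas.Series: a whole Arrow batch is processed at once
    # instead of row by row, which is where the speedup comes from.
    return v + 1

spark.range(0, 10).withColumn("id_plus_one", plus_one("id")).show()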
This is a complex issue involving components such as PySpark, Arrow, pandas, and Spark SQL, each of which is a research area worth exploring on its own. This article first walks through the operating logic of PySpark, and then analyzes step by step which optimizations the issue adopts to improve performance.
Overview
On the driver side, the user instantiates a Python SparkContext object in PySpark, which eventually instantiates the Scala SparkContext object in the JVM.
The executor side does not need Py4J, because the task logic running on the executor side is sent by the driver as serialized bytecode. Although it may contain user-defined Python functions or lambda expressions, Py4J does not provide a way to call Python from Java. So, to be able to run a user-defined Python function or lambda expression on the executor side, a Python process is started separately for each task, and the Python function or lambda expression is sent to that Python process for execution via socket communication.
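As a concrete illustration (a sketch only, with an arbitrary partition count), the lambda below is pickled on the driver and shipped to the Python worker processes spawned on the executors, which exchange data with the JVM over local sockets:
# A sketch: the lambda is serialized on the driver and executed inside
# executor-side Python worker processes, not in the JVM.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python-worker-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(8), 4)
print(rdd.map(lambda x: x * x).collect())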
After a Python program is submitted through spark-submit, the main class it eventually runs can be seen in the following snippet:
// If we're running a python app, set the main class to our specific python runner
if (args.isPython && deployMode == CLIENT) {
  if (args.primaryResource == PYSPARK_SHELL) {
    args.mainClass = "org.apache.spark.api.python.PythonGatewayServer"
  } else {
    // If a python file is provided, add it to the child arguments and list of files to deploy.
    // Usage: PythonAppRunner <main python file> <extra python files> [app arguments]
    args.mainClass = "org.apache.spark.deploy.PythonRunner"
    args.childArgs = ArrayBuffer(localPrimaryResource, localPyFiles) ++ args.childArgs
    if (clusterManager != YARN) {
      // The YARN backend distributes the primary file differently, so don't merge it.
      args.files = mergeFileLists(args.files, args.primaryResource)
    }
  }
  if (clusterManager != YARN) {
    // The YARN backend handles python files differently, so don't merge the lists.
    args.files = mergeFileLists(args.files, args.pyFiles)
  }
  if (localPyFiles != null) {
    sparkConf.set("spark.submit.pyFiles", localPyFiles)
  }
}
Driver Side
When the user's Python script starts, the Python SparkContext object is instantiated first. Two things happen during instantiation: a Py4J GatewayClient is created and connects to the Py4J GatewayServer in the JVM (subsequent calls from Python into Java go through this Py4J gateway), and the SparkContext object in the JVM is instantiated through the Py4J gateway.
After these two steps, SparkContext initialization is complete, the driver is up, executor resources are requested, and task scheduling begins. The chain of processing logic defined in the user's Python script eventually triggers job submission when an action method is encountered. The job is submitted by calling the Java PythonRDD.runJob method directly through Py4J; inside the JVM this is forwarded to SparkContext.runJob. After the job completes, the JVM opens a local socket and waits for the Python process to pull the results; correspondingly, the Python process, after calling PythonRDD.runJob, pulls the results through that socket.
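To make the gateway concrete, here is a minimal sketch (the JVM method called is only an illustration) of how the driver-side Python process reaches into the JVM through Py4J; PySpark's own calls such as PythonRDD.runJob travel over the same channel:
# A sketch: sc._jvm is the Py4J JVMView the driver creates; any JVM class can be
# reached through it, which is the same channel PySpark uses for PythonRDD.runJob.
from pyspark import SparkContext

sc = SparkContext(appName="py4j-gateway-demo")
print(sc._jvm.java.lang.System.currentTimeMillis())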
PythonRunner Source Code Analysis
The main function of the PythonRunner entry point does two things: it starts a Py4J GatewayServer, and it runs the user-submitted Python script as a subprocess of the Java process.
The specific source analysis is as follows:
object PythonRunner {
  def main(args: Array[String]) {
    val pythonFile = args(0)
    val pyFiles = args(1)
    val otherArgs = args.slice(2, args.length)
    val sparkConf = new SparkConf()
    val pythonExec = sparkConf.get(PYSPARK_DRIVER_PYTHON)
      .orElse(sparkConf.get(PYSPARK_PYTHON))
      .orElse(sys.env.get("PYSPARK_DRIVER_PYTHON"))
      .orElse(sys.env.get("PYSPARK_PYTHON"))
      .getOrElse("python")

    // Format python file paths before adding them to the PYTHONPATH
    val formattedPythonFile = formatPath(pythonFile)
    val formattedPyFiles = formatPaths(pyFiles)

    // Open the Py4J gateway service for communication with the executor; run it as a daemon in another thread.
    // Launch a Py4J gateway server for the process to connect to; this will let it see our
    // Java system properties and such
    val gatewayServer = new py4j.GatewayServer(null, 0)
    val thread = new Thread(new Runnable() {
      override def run(): Unit = Utils.logUncaughtExceptions {
        gatewayServer.start()
      }
    })
    thread.setName("py4j-gateway-init")
    thread.setDaemon(true)
    thread.start()

    // Wait until the gateway server has started, so that we know which port it is bound to.
    // `gatewayServer.start()` will start a new thread and run the server code there, after
    // initializing the socket, so the thread started above will end as soon as the server is
    // ready to serve connections.
    // Note: we need to wait for the gateway server to start so that we know which port it is bound to.
    thread.join()

    // Build up a PYTHONPATH that includes the Spark assembly (where this class is), the
    // python directories in SPARK_HOME (if set), and any files in the pyFiles argument
    val pathElements = new ArrayBuffer[String]
    pathElements ++= formattedPyFiles
    pathElements += PythonUtils.sparkPythonPath
    pathElements += sys.env.getOrElse("PYTHONPATH", "")
    val pythonPath = PythonUtils.mergePythonPaths(pathElements: _*)

    // Launch Python process
    // This initializes a process that runs the python command to execute the user-submitted Python file.
    val builder = new ProcessBuilder((Seq(pythonExec, formattedPythonFile) ++ otherArgs).asJava)
    val env = builder.environment()
    env.put("PYTHONPATH", pythonPath)
    // This is equivalent to setting the -u flag; we use it because ipython doesn't support -u:
    env.put("PYTHONUNBUFFERED", "YES") // value is needed to be set to a non-empty string
    env.put("PYSPARK_GATEWAY_PORT", "" + gatewayServer.getListeningPort)
    // pass conf spark.pyspark.python to python process, the only way to pass info to
    // python process is through environment variable.
    sparkConf.get(PYSPARK_PYTHON).foreach(env.put("PYSPARK_PYTHON", _))
    sys.env.get("PYTHONHASHSEED").foreach(env.put("PYTHONHASHSEED", _))
    builder.redirectErrorStream(true) // Ugly but needed for stdout and stderr to synchronize
    try {
      val process = builder.start()
      new RedirectThread(process.getInputStream, System.out, "redirect output").start()
      val exitCode = process.waitFor()
      if (exitCode != 0) {
        throw new SparkUserAppException(exitCode)
      }
    } finally {
      gatewayServer.shutdown()
    }
  }
}
Executor Side
Looking specifically at the pyspark logic in the Spark source, its implementation is basically this: programs written with the Python API provided by PySpark initialize the gateway variable (a JavaGateway object) and the _jvm variable (a JVMView object) when the SparkContext (Python) is created, and these are used to wrap the Spark operators.
It is also worth noting that the Spark packages are imported into the gateway, so they can be used directly in PySpark:
# Import the classes used by PySpark
java_import(gateway.jvm, "org.apache.spark.SparkConf")
java_import(gateway.jvm, "org.apache.spark.api.java.*")
java_import(gateway.jvm, "org.apache.spark.api.python.*")
java_import(gateway.jvm, "org.apache.spark.ml.python.*")
java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
# TODO(davies): move into sql
java_import(gateway.jvm, "org.apache.spark.sql.*")
java_import(gateway.jvm, "org.apache.spark.sql.api.python.*")
java_import(gateway.jvm, "org.apache.spark.sql.hive.*")
java_import(gateway.jvm, "scala.Tuple2")
Precautions for use
If you want to use Arrow and pandas, note that Spark does not bundle these packages by default (the Py4J dependency is already provided by Spark); you need to install pyarrow and pandas yourself. The version requirements are as follows:
extras_require={
    'ml': ['numpy>=1.7'],
    'mllib': ['numpy>=1.7'],
    'sql': [
        'pandas>=%s' % _minimum_pandas_version,
        'pyarrow>=%s' % _minimum_pyarrow_version,
    ]
},

_minimum_pandas_version = "0.19.2"
_minimum_pyarrow_version = "0.8.0"
When calling the df.toPandas() function, you can refer to https://issues.apache.org/jira/browse/SPARK-13534 to improve its performance; the relevant configuration is shown below, followed by a short usage sketch.
val ARROW_EXECUTION_ENABLE =
  buildConf("spark.sql.execution.arrow.enabled")
    .doc("When true, make use of Apache Arrow for columnar data transfers. Currently available " +
      "for use with pyspark.sql.DataFrame.toPandas, and " +
      "pyspark.sql.SparkSession.createDataFrame when its input is a Pandas DataFrame. " +
      "The following data types are unsupported: " +
      "BinaryType, MapType, ArrayType of TimestampType, and nested StructType.")
    .booleanConf
    .createWithDefault(false)
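A minimal usage sketch, assuming pyarrow is installed and the default of false has not already been overridden elsewhere:
# A sketch: enable Arrow-based columnar transfer before calling toPandas().
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas-demo").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = spark.range(0, 1000).toPandas()  # data is transferred as Arrow record batches
print(pdf.head())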
Reference
The principles behind PySpark: http://sharkdtu.com/posts/pyspark-internal.html
https://www.jianshu.com/p/013fe44422c9
https://issues.apache.org/jira/browse/SPARK-13534
Accelerating data access for pandas users on Hadoop clusters: http://wesmckinney.com/blog/pandas-and-apache-arrow/