Configuration
PyArrow (version >= 0.8) must be installed on all running nodes.
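A quick way to sanity-check the installed version on a node (a hedged sketch, not part of the original setup):

from distutils.version import LooseVersion
import pyarrow
# assumption: any PyArrow release >= 0.8.0 satisfies the prerequisite above
assert LooseVersion(pyarrow.__version__) >= LooseVersion("0.8.0")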
Why Pandas UDFs exist
Over the past few years, Python has become the default language for data analysts. Libraries such as pandas, numpy, statsmodels, and scikit-learn are used extensively and have become the mainstream toolkit. At the same time, Spark became the standard for big data processing, and so that data analysts could use Spark, a Python API was added in version 0.7, along with support for UDFs (user-defined functions).
These UDFs are invoked once per record, and the data has to be transferred between the JVM and Python, so there is extra serialization and invocation overhead. A common workaround is therefore to define UDFs in Java or Scala and then call them from Python.
Why are Pandas UDFs fast?
Built on Apache Arrow, Pandas UDFs provide low-overhead, high-performance UDFs.
Each system has its own storage format, and 70%-80% of the time is spent on serialization and deserialization.
Apache Arrow is a cross-platform, columnar in-memory data layer that speeds up big data analytics. Row-by-row execution is replaced by vectorized pandas computation, and Python and the JVM share the same data structure, which avoids serialization overhead.
The amount of data in each vectorized batch is controlled by the spark.sql.execution.arrow.maxRecordsPerBatch parameter, which defaults to 10,000 records. If a DataFrame has a particularly large number of columns, the value can be lowered accordingly.
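For example, a minimal sketch of lowering the batch size at runtime, assuming an existing SparkSession named spark (5000 is only an illustrative value, not a recommendation):

# reduce the per-batch record count for very wide DataFrames
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")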
Some restrictions
Not all Spark SQL data types are supported; unsupported types include BinaryType, MapType, ArrayType of TimestampType, and nested StructType.
Pandas UDFs and ordinary UDFs cannot be mixed.
1. How to use Spark DF & Pandas DF
Conversion between Spark DataFrames and pandas DataFrames has been performance-optimized, but the optimization must be enabled through configuration; it is off by default.
Configuration items:
spark.sql.execution.arrow.enabled true
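The setting can also be applied at runtime; a minimal sketch, assuming an existing SparkSession named spark:

# enable Arrow-based conversion between Spark and pandas DataFrames
spark.conf.set("spark.sql.execution.arrow.enabled", "true")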
Converting between the two
import numpy as np
import pandas as pd

# initialize a pandas DataFrame
pdf = pd.DataFrame(np.random.rand(100000, 3))

# pandas DF -> Spark DF
%time df = spark.createDataFrame(pdf)

# Spark DF -> pandas DF
%time result_pdf = df.select("*").toPandas()
Performance comparison:
| execution.arrow.enabled | pandas DF -> Spark DF | Spark DF -> pandas DF |
| --- | --- | --- |
| false | 4980 ms | 722 ms |
| true | 72 ms | 79 ms |
Tip: even though conversion is much faster, the pandas DataFrame is still a single-machine object living on the driver, so you should not convert large amounts of data back to it.
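A hedged sketch of keeping the result bounded before collecting it to the driver (the 1000-row limit is only illustrative):

# collect at most 1000 rows instead of the whole DataFrame
result_pdf = df.limit(1000).toPandas()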
2. Pandas UDFs (vectorized UDFs)
The input and return value types of a scalar pandas UDF are pandas.Series.
Registering a UDF
Method 1:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

def plus_one(a):
    return a + 1

# create a DataFrame pandas UDF
plus_one_pd_udf = pandas_udf(plus_one, returnType=LongType())

# register it for use in SQL
spark.udf.register('plus_one', plus_one_pd_udf)
Method 2:
from pyspark.sql.functions import pandas_udf

# the default type is PandasUDFType.SCALAR
@pandas_udf('long')
def plus_one(a):
    return a + 1

spark.udf.register('plus_one', plus_one)
spark.udf.register can accept either an ordinary SQL_BATCHED_UDF or a SQL_SCALAR_PANDAS_UDF function.
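A minimal usage sketch for the registered UDF; the view name tmp and column x are illustrative, not from the original:

# build a small DataFrame and query the registered pandas UDF through SQL
df = spark.range(0, 10).withColumnRenamed('id', 'x')
df.createOrReplaceTempView('tmp')
spark.sql('SELECT x, plus_one(x) AS x_plus_one FROM tmp').show()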
With a pandas UDF, the physical execution plan changes from BatchEvalPython to ArrowEvalPython, and you can use explain() to check that the pandas UDF is actually in effect.
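Continuing the sketch above (plus_one is the pandas UDF defined in Method 2), a quick way to inspect the plan:

df.select(plus_one(df.x)).explain()
# the physical plan should show ArrowEvalPython rather than BatchEvalPython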
Scalar Pandas UDFs
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, udf
from pyspark.sql.types import LongType

def multiply_func(a, b):
    return a * b

# vectorized (pandas) UDF vs. row-at-a-time UDF
multiply_pd = pandas_udf(multiply_func, returnType=LongType())
multiply = udf(multiply_func, returnType=LongType())

x = pd.Series([1, 2, 3] * 10000)
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

%timeit df.select(multiply_pd(col("x"), col("x"))).count()
%timeit df.select(multiply(col("x"), col("x"))).count()
Grouped Map Pandas UDFs
Example: subtract the group mean from each value.
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding one group
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()
+---+----+
| id| v|
+---+----+
| 1|-0.5|
| 1| 0.5|
| 2|-3.0|
| 2|-1.0|
| 2| 4.0|
+---+----+
Test Cases
Data preparation: a 10-million-row DataFrame with two columns, one of type int and one of type double.
from pyspark.sql.functions import col, rand
# 10 million rows: an integer id (one value per 10,000 rows) and a random double v
df = spark.range(0, 10 * 1000 * 1000).withColumn('id', (col('id') / 10000).cast('integer')).withColumn('v', rand())
df.cache()
df.count()
Plus One
from pyspark.sql.functions import pandas_udf, PandasUDFType

# input and output are both pandas.Series of doubles
@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1

df.withColumn('v2', pandas_plus_one(df.v))
Cumulative probability
import pandas as pd
from scipy import stats

@pandas_udf('double')
def cdf(v):
    return pd.Series(stats.norm.cdf(v))

df.withColumn('cumulative_probability', cdf(df.v))
Subtract Mean
# both the input and output types are pandas.DataFrame
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby('id').apply(subtract_mean)
Some differences between Scalar and Grouped Map
| | Scalar | Grouped Map |
| --- | --- | --- |
| UDF input type | pandas.Series | pandas.DataFrame |
| UDF return type | pandas.Series | pandas.DataFrame |
| Aggregation semantics | None | Defined by the groupBy clause |
| Return size | Same as the input | Rows and columns may differ from the input |
| Return type declaration | DataType of the pandas.Series | StructType of the pandas.DataFrame |
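A small self-contained sketch of the return-size difference: a grouped map UDF may return a different number of rows than its input (here, one row per group), whereas a scalar UDF must return a Series of the same length. The DataFrame and names below are illustrative only:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# a tiny illustrative DataFrame: 3 input rows in 2 groups
sdf = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def group_mean(pdf):
    # keep one row per group, with v replaced by the group mean
    return pdf.assign(v=pdf.v.mean()).head(1)

sdf.groupby("id").apply(group_mean).show()  # 2 output rows from 3 input rows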
Performance Comparison
| Type | UDF | Pandas UDF |
| --- | --- | --- |
| plus_one | 2.54 s | 1.28 s |
| cdf | 2 min 2 s | 1.52 s |
| subtract mean | 1 min 8 s | 4.4 s |
Configuration and test environment: Spark 2.3, Anaconda 4.4.0 (Python 2.7.13), run mode local[10].
Reference
http://spark.apache.org/docs/latest/sql-programming-guide.html#grouped-map
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
https://www.slideshare.net/PyData/improving-pandas-and-pyspark-performance-and-interoperability-with-apache-arrow