PySpark Pandas UDF

Tags: rand, scalar, serialization, pyspark, python
Configuration


PyArrow >= 0.8 must be installed on every node that runs tasks.

Why Pandas UDFs exist



Over the past few years, Python has become the default language for data analysts. Libraries such as pandas, numpy, statsmodels, and scikit-learn are used extensively and have become the mainstream toolkit. At the same time, Spark became the standard for big data processing, and to let data analysts use Spark, a Python API was added in version 0.7 together with support for UDFs (user-defined functions).



These UDFs run once per record, and the data has to be transferred between the JVM and Python, so there is extra serialization and invocation overhead. A common workaround is therefore to define UDFs in Java or Scala and then call them from Python.



Why are Pandas UDFs fast?



Built on Apache Arrow, Pandas UDFs bring low-overhead, high-performance user-defined functions.



Each system has its own storage format, and 70%-80% of the time is spent on serialization and deserialization.



Apache Arrow is a cross-platform, columnar in-memory data layer that speeds up big data analytics. Row-by-row execution is replaced by pandas vectorized computation, and Python and the JVM share the same data structure, which avoids the serialization overhead.



The amount of data processed per vectorized batch is controlled by the spark.sql.execution.arrow.maxRecordsPerBatch parameter, which defaults to 10,000. If there are particularly many columns, the value can be reduced accordingly.
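
For example, the batch size can be lowered at runtime through the standard configuration API (a minimal sketch; the value 5000 is only an illustration, not a recommendation):

# reduce the Arrow batch size for very wide DataFrames (5000 is an arbitrary example)
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")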

Some restrictions



Not all Spark SQL data types are supported; unsupported types include BinaryType, MapType, ArrayType of TimestampType, and nested StructType.



Pandas UDFs and ordinary (row-at-a-time) UDFs cannot be mixed.

1. Converting between Spark DF and pandas DF



Conversion between Spark DataFrames and pandas DataFrames is performance-optimized, but the optimization must be enabled in the configuration; it is off by default.



Configuration items:


spark.sql.execution.arrow.enabled true
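
Besides setting it in spark-defaults, the flag can also be turned on from a running session (a minimal sketch using spark.conf.set):

# enable Arrow-based conversion between Spark DF and pandas DF
spark.conf.set("spark.sql.execution.arrow.enabled", "true")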


Converting in both directions


import numpy as np
import pandas as pd

# initialize a pandas DF
pdf = pd.DataFrame(np.random.rand(100000, 3))
# pandas DF -> Spark DF
%time df = spark.createDataFrame(pdf)
# Spark DF -> pandas DF
%time result_pdf = df.select("*").toPandas()


Performance comparison:


arrow.enabled   pandas DF -> Spark DF   Spark DF -> pandas DF
false           4980 ms                 722 ms
true            72 ms                   79 ms


Tip: even with the faster conversion, the pandas DF still lives on a single machine in the driver, so you should not bring large amounts of data back with toPandas().
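
If only a sample is needed on the driver, bound it before converting (a minimal sketch; the 10,000-row limit is an arbitrary illustration):

# bring back only a bounded number of rows to the driver
small_pdf = df.limit(10000).toPandas()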

2. Pandas UDFs (vectorized UDFs)



The input and return types of a Pandas UDF are pandas.Series.

Registering the UDF



Method 1:


from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import LongType

def plus_one(a):
    return a + 1

# UDF for the DataFrame API
plus_one_pd_udf = pandas_udf(plus_one, returnType=LongType())
# register it for use in SQL
spark.udf.register('plus_one', plus_one_pd_udf)


Method 2:


from pyspark.sql.functions import pandas_udf

# the default is the PandasUDFType.SCALAR type
@pandas_udf('long')
def plus_one(a):
    return a + 1

spark.udf.register('plus_one', plus_one)


spark.udf.register accepts both ordinary UDFs (SQL_BATCHED_UDF) and scalar Pandas UDFs (SQL_SCALAR_PANDAS_UDF).
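
Once registered, the function can be called from SQL like any other UDF (a minimal sketch; the temp view name t is only an illustration, and spark.range supplies a long column named id):

spark.range(3).createOrReplaceTempView('t')
spark.sql('SELECT id, plus_one(id) AS id_plus_one FROM t').show()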



With a Pandas UDF, the physical execution plan changes from BatchEvalPython to ArrowEvalPython, so you can use explain() to check that the Pandas UDF is actually in effect.
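
For example (a sketch reusing the plus_one Pandas UDF defined above), the plan printed by explain() should contain ArrowEvalPython rather than BatchEvalPython:

spark.range(10).select(plus_one('id')).explain()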

Scalar Pandas UDFs


import pandas as pd
from pyspark.sql.functions import col, pandas_udf, udf
from pyspark.sql.types import LongType

def multiply_func(a, b):
    return a * b

# vectorized (Pandas) UDF
multiply_pd = pandas_udf(multiply_func, returnType=LongType())
# row-at-a-time UDF, for comparison
multiply = udf(multiply_func, returnType=LongType())

x = pd.Series([1, 2, 3] * 10000)
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

%timeit df.select(multiply_pd(col("x"), col("x"))).count()
%timeit df.select(multiply(col("x"), col("x"))).count()

Grouped Map Pandas UDFs


Example: subtract the mean within each group


from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding one group
    v = pdf.v
    return pdf.assign(v=v - v.mean())

df.groupby("id").apply(subtract_mean).show()

+---+----+
| id|   v|
+---+----+
|  1|-0.5|
|  1| 0.5|
|  2|-3.0|
|  2|-1.0|
|  2| 4.0|
+---+----+
Test Cases


Data preparation: a DataFrame with 10 million rows and 2 columns, one integer column and one double column.


from pyspark.sql.functions import col, rand

df = spark.range(0, 10 * 1000 * 1000) \
    .withColumn('id', (col('id') / 10000).cast('integer')) \
    .withColumn('v', rand())
df.cache()
df.count()

Plus one
from pyspark.sql.functions import pandas_udf, PandasUDFType

# input and output are both pandas.Series of doubles
@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(v):
    return v + 1

df.withColumn('v2', pandas_plus_one(df.v))

Cumulative probability

import pandas as pd
from scipy import stats

@pandas_udf('double')
def cdf(v):
    return pd.Series(stats.norm.cdf(v))

df.withColumn('cumulative_probability', cdf(df.v))

Subtract mean

# both the input and output types are pandas.DataFrame
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby('id').apply(subtract_mean)

Some differences between Scalar and Grouped Map Pandas UDFs

                           Scalar                           Grouped Map
UDF input type             pandas.Series                    pandas.DataFrame
UDF return type            pandas.Series                    pandas.DataFrame
Aggregation semantics      none                             follows the groupBy clause
Return size                same as the input                rows and columns may differ from the input
Return type declaration    a DataType for a pandas.Series   a StructType describing the pandas.DataFrame

Performance comparison

Type            UDF         Pandas UDF
plus_one        2.54 s      1.28 s
cdf             2 min 2 s   1.52 s
subtract mean   1 min 8 s   4.4 s

Test environment: Spark 2.3, Anaconda 4.4.0 (Python 2.7.13), run mode local[10].


Reference

http://spark.apache.org/docs/latest/sql-programming-guide.html#grouped-map
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
https://www.slideshare.net/PyData/improving-pandas-and-pyspark-performance-and-interoperability-with-apache-arrow

