Background
| Item | Pandas | Spark |
| --- | --- | --- |
| Working mode | Single-machine; cannot process large amounts of data | Distributed; can process large amounts of data |
| Storage | Single-machine in-memory | Distributed caching via persist()/cache() |
| Mutable | Yes | No |
| Index | Created automatically | No index |
| Row structure | pandas.Series | pyspark.sql.Row |
| Column structure | pandas.Series | pyspark.sql.Column |
| Duplicate column names allowed | No | Yes |
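To make the contrast concrete, here is a minimal sketch of the mutability and caching rows of the table above (the column name x and the toy data are illustrative, not from the source):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas: single-machine and mutable; data lives in local memory
pdf = pd.DataFrame({"x": [1, 2, 3]})
pdf["x"] = pdf["x"] + 1          # modified in place

# Spark: distributed and immutable; transformations return a new DataFrame
sdf = spark.createDataFrame(pdf)
sdf2 = sdf.withColumn("x", sdf["x"] + 1)

# persist()/cache() keeps partitions in distributed (executor) memory
sdf2.cache()
sdf2.count()                     # action that materializes the cache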
A pandas DataFrame cannot handle computation over large amounts of data; a Spark DataFrame can be used to solve this problem.

I. XGBoost prediction example

Before optimization:
import xgboost as xgb
import pandas as pd
import numpy as np

# load the model
bst = xgb.Booster()
bst.load_model("xxx.model")

# variable list
var_list = [...]

# df is a Spark DataFrame; score every record with the function below
df.rdd.map(lambda x: cal_xgb_score(x, var_list, ntree_limit=304)).toDF()

# compute the score
def cal_xgb_score(x, var_list, ntree_limit=50):
    feature_count = len(var_list)
    x1 = pd.DataFrame(np.array(x).reshape(1, feature_count), columns=var_list)
    # feature transformation (transformfun is defined elsewhere)
    y1 = transformfun(x1)
    test_x = xgb.DMatrix(y1.drop(['mobile', 'mobile_md5'], axis=1), missing=float('nan'))
    y1['score'] = bst.predict(test_x, ntree_limit=ntree_limit)
    y2 = y1[['mobile', 'mobile_md5', 'score']]
    return {'mobile': str(y2['mobile'][0]), 'mobile_md5': str(y2['mobile_md5'][0]), 'score': float(y2['score'][0])}
Every single record is converted to a pandas DataFrame on its own, which adds considerable overhead.

After optimization:
def cal_xgb_score(x, var_list, ntree_limit=50):
    feature_count = len(var_list)
    # convert the partition iterator to a list: one DataFrame per partition
    x1 = pd.DataFrame(list(x), columns=var_list)
    ...
    # convert the result DataFrame to a list of dictionaries
    return y1[['mobile', 'mobile_md5', 'score']].to_dict(orient='records')
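The source elides how the optimized function is invoked, but since it now consumes a partition iterator and returns a list of dictionaries, the call presumably goes through mapPartitions rather than map; a minimal sketch of that wiring (the exact call is an assumption):

# each partition is passed to cal_xgb_score as a single iterator, so the
# pandas DataFrame construction happens once per partition, not per record;
# toDF() infers the schema from the returned dictionaries
result_df = df.rdd.mapPartitions(
    lambda it: cal_xgb_score(it, var_list, ntree_limit=304)
).toDF()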
II. toPandas example

Before optimization:

df.toPandas()

After optimization:
import pandas as pd

def _map_to_pandas(rdds):
    return [pd.DataFrame(list(rdds))]

def topandas(df, n_partitions=None):
    if n_partitions is not None:
        df = df.repartition(n_partitions)
    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
    df_pand = pd.concat(df_pand)
    df_pand.columns = df.columns
    return df_pand
# 98 columns, 220,000 rows; column types Array/String/Long/Int; partitioned
df = spark.sql("...").sample(False, 0.002)
df.cache()
df.count()

# native toPandas
%timeit df.toPandas()

# distributed toPandas
%timeit topandas(df)

# using Apache Arrow, Spark 2.3 and above
spark.sql("set spark.sql.execution.arrow.enabled=true")
%timeit df.toPandas()
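The Arrow flag can also be set once when the session is built instead of via a SQL statement; a minimal sketch (note that Spark 3.x renamed the option to spark.sql.execution.arrow.pyspark.enabled):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Spark 2.3/2.4 option; Spark 3.x uses spark.sql.execution.arrow.pyspark.enabled
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)

pdf = spark.range(1000).toPandas()   # the conversion now goes through Arrow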
Summary
I. XGBoost predictions
Per-record processing speed increased to 3,278 records/min.
Tip: if a single partition holds too much data, it will cause an executor OOM; one way to mitigate this is sketched below.
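Raising the partition count before scoring keeps each partition small; a minimal sketch reusing the function above (the count of 200 is an illustrative value to tune against the data volume):

scored = (
    df.repartition(200)   # more partitions -> fewer rows per executor task
      .rdd.mapPartitions(lambda it: cal_xgb_score(it, var_list, ntree_limit=304))
      .toDF()
)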
II. Spark DataFrame to pandas DataFrame

| Method | Time (seconds) |
| --- | --- |
| Native toPandas | 12 |
| Distributed toPandas | 5.91 |
| Arrow toPandas | 2.52 |
The data returned by toPandas is ultimately collected into the driver's memory, so returning very large datasets is not recommended.
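A simple guard before collecting, sketched with an arbitrary threshold (MAX_ROWS is illustrative and should be sized to the driver's memory):

MAX_ROWS = 1_000_000   # illustrative limit, tune to driver memory

n = df.count()
if n > MAX_ROWS:
    raise ValueError("refusing to collect %d rows to the driver" % n)
pdf = topandas(df)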