Using Pandas DataFrames with Spark DataFrames

Tags: pyspark, xgboost
Background

Item | Pandas | Spark
Execution model | Single-machine; cannot process large amounts of data | Distributed; can process large amounts of data
Caching | Single-machine, in-memory cache | Distributed cache via persist()/cache()
Mutable | Yes | No
Index | Created automatically | No index
Row structure | pandas.Series | pyspark.sql.Row
Column structure | pandas.Series | pyspark.sql.Column
Allows duplicate column names | No | Yes
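
As a quick reference for moving between the two, the round trip is a one-liner in each direction; a minimal sketch (the column names and values are illustrative only):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pandas -> Spark: distributes single-machine data across the cluster
pdf = pd.DataFrame({'mobile': ['13800000000'], 'score': [0.5]})
sdf = spark.createDataFrame(pdf)

# Spark -> Pandas: pulls everything back to the driver, so keep it small
pdf2 = sdf.toPandas()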

A Pandas DataFrame cannot handle computations over large amounts of data; a Spark DataFrame can solve this problem.

I. XGBoost prediction example

Before optimization:

import xgboost as xgb
import pandas as pd
import numpy as np

# Load the trained model
bst = xgb.Booster()
bst.load_model("xxx.model")

# Feature (variable) list
var_list = [...]
df.rdd.map(lambda x: cal_xgb_score(x, var_list, ntree_limit=304)).toDF()

# Calculate the score for a single record
def cal_xgb_score(x, var_list, ntree_limit=50):
    feature_count = len(var_list)
    x1 = pd.DataFrame(np.array(x).reshape(1, feature_count), columns=var_list)
    # Feature transformation
    y1 = transformfun(x1)

    test_x = xgb.DMatrix(y1.drop(['mobile', 'mobile_md5'], axis=1), missing=float('nan'))
    y1['score'] = bst.predict(test_x, ntree_limit=ntree_limit)
    y2 = y1[['mobile', 'mobile_md5', 'score']]
    return {'mobile': str(y2['mobile'][0]),
            'mobile_md5': str(y2['mobile_md5'][0]),
            'score': float(y2['score'][0])}

Here every record is converted to a Pandas DataFrame individually, which adds significant overhead. After optimization:

def cal_xgb_score(x, var_list, ntree_limit=50):
    feature_count = len(var_list)
    # Convert the partition iterator to a list, so the whole partition
    # becomes a single Pandas DataFrame
    x1 = pd.DataFrame(list(x), columns=var_list)
    ...
    # Convert the Pandas DataFrame to a list of per-row dictionaries
    return y1[['mobile', 'mobile_md5', 'score']].to_dict(orient='records')
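
The source doesn't show the updated driver-side call, but since the optimized function now consumes an iterator over a whole partition and returns a list of dictionaries, it would plausibly be driven with mapPartitions rather than map; a sketch, assuming the same df, var_list, and model as above:

# Score each partition in one batch; the returned dictionaries are
# flattened by mapPartitions into an RDD of records
scored_df = df.rdd.mapPartitions(
    lambda it: cal_xgb_score(it, var_list, ntree_limit=304)
).toDF()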
II. toPandas example

Before optimization:

df.toPandas()

After optimization:

import pandas as pd

def _map_to_pandas(rdds):
    # Wrap the partition's rows in a one-element list so that
    # mapPartitions yields exactly one Pandas DataFrame per partition
    return [pd.DataFrame(list(rdds))]

def topandas(df, n_partitions=None):
    if n_partitions is not None:
        df = df.repartition(n_partitions)
    # Build one Pandas DataFrame per partition in parallel, then collect
    # and concatenate them on the driver
    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
    df_pand = pd.concat(df_pand)
    df_pand.columns = df.columns
    return df_pand

# 98 columns, 220,000 rows; column types: array/string/long/int; partitioned
df = spark.sql("...").sample(False, 0.002)

df.cache()
df.count()

# Native toPandas
%timeit df.toPandas()

# Distributed toPandas
%timeit topandas(df)

# Using Apache Arrow (Spark 2.3 and above)
spark.sql("set spark.sql.execution.arrow.enabled=true")
%timeit df.toPandas()
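
Note that in Spark 3.0 and later this flag was renamed to spark.sql.execution.arrow.pyspark.enabled, so on a Spark 3.x session the equivalent setting would be:

# Spark 3.x name for the same Arrow toggle
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")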
Summary

I. XGBoost prediction

Single-record processing speed increased to 3,278 records/min.

Tip: if a single partition holds too much data, it can cause an executor OOM; repartition into more, smaller partitions to avoid this.

II. Spark DataFrame to Pandas DataFrame

Type | Cost (seconds)
Native toPandas | 12
Distributed toPandas | 5.91
Arrow toPandas | 2.52

The data returned by toPandas ultimately lives in the driver's memory, so returning very large datasets is not recommended.
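
If only a sample is needed for local analysis, it is reasonable to cap the size before converting; a minimal sketch (the fraction and row cap here are arbitrary):

# Keep the slice pulled to the driver small
small_pdf = df.sample(False, 0.01).limit(100000).toPandas()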
