Working style
Pandas: Single-machine tool with no built-in parallelism; it does not support Hadoop, so very large datasets run into bottlenecks.
Spark: Distributed parallel computing framework with built-in parallelism; data and operations are automatically distributed across the cluster nodes and processed in memory. It supports Hadoop and can handle very large datasets.
Lazy evaluation
Pandas: Not lazily evaluated; operations execute immediately.
Spark: Lazily evaluated; transformations only run when an action is triggered.
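A minimal sketch of the difference, assuming an existing SparkSession named `spark` (the names `pdf` and `sdf` are just placeholders): the pandas filter runs immediately, while the Spark filter only runs when an action such as `count()` is called.

```python
import pandas as pd

pdf = pd.DataFrame({"age": [18, 25, 40]})
filtered = pdf[pdf["age"] > 21]      # pandas: executes immediately, result is materialized

sdf = spark.createDataFrame(pdf)     # assumes an existing SparkSession `spark`
lazy = sdf.filter(sdf["age"] > 21)   # Spark: only builds a query plan, nothing runs yet
lazy.count()                         # the action triggers the actual computation
```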
Memory cache
Pandas: Single-machine; data is cached in local memory only.
Spark: persist() or cache() keeps the transformed RDDs/DataFrames in memory.
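A small sketch of Spark caching, reusing the assumed DataFrame `sdf` from above; `cache()` only marks the data, and the first action actually fills the cache.

```python
cached = sdf.filter(sdf["age"] > 21).cache()  # mark the result for in-memory caching
cached.count()      # first action computes the result and stores it in memory
cached.count()      # later actions reuse the cached data instead of recomputing
cached.unpersist()  # release the cached data when it is no longer needed
```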
DataFrame mutability
Pandas: The pandas DataFrame is mutable.
Spark: RDDs are immutable, so the Spark DataFrame is immutable as well.
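A short illustration of the practical consequence, again with the assumed `pdf` and `sdf`: pandas modifies the frame in place, while a Spark operation returns a new DataFrame and leaves the original untouched.

```python
pdf["age"] = pdf["age"] + 1                   # pandas: mutates the existing DataFrame

sdf2 = sdf.withColumn("age", sdf["age"] + 1)  # Spark: returns a new DataFrame; `sdf` is unchanged
```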
Create
Pandas:
- convert from a Spark DataFrame: pandas_df = spark_df.toPandas()
- convert from a list, dict, or ndarray
- read CSV files
- read HDF5 files
- read Excel files
Spark:
- convert from a pandas DataFrame: spark_df = sqlContext.createDataFrame(pandas_df); createDataFrame also accepts a list (whose elements can be tuples or dicts) or an RDD
- convert from existing RDDs
- read structured data files
- read JSON datasets
- read Hive tables
- read from an external database
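A brief sketch of common creation paths; the file names are placeholders and `spark` is the assumed SparkSession (older code would use `sqlContext` instead).

```python
import pandas as pd

# pandas: from a dict, or from a CSV file (placeholder path)
pdf = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
# pdf = pd.read_csv("people.csv")

# Spark: from a list of tuples, from a pandas DataFrame, or from a structured data file
sdf = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
sdf = spark.createDataFrame(pdf)
# sdf = spark.read.json("people.json")
```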
Index
Pandas: An index is created automatically.
Spark: There is no index; an extra column has to be created if one is needed.
Row structure
Pandas: A row is a Series, part of the pandas DataFrame structure.
Spark: A row is a Row, part of the Spark DataFrame structure.
Column structure
Pandas: A column is a Series, part of the pandas DataFrame structure.
Spark: A column is a Column, part of the Spark DataFrame structure, e.g. DataFrame[name: string].
Column names
Pandas: Duplicate column names are not allowed.
Spark: Duplicate column names are allowed; columns are renamed with the alias method.
Column addition
Pandas: df["xx"] = 0
Spark: df.withColumn("xx", 0).show() raises an error; the literal must be wrapped: from pyspark.sql import functions; df.withColumn("xx", functions.lit(0)).show()
Column modification
Pandas: with an existing column df["xx"], assign df["xx"] = 1
Spark: with an existing column df["xx"], df.withColumn("xx", functions.lit(1)).show() overwrites it (again, the literal must be wrapped in lit())
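A combined sketch of adding and overwriting columns with the assumed `pdf` and `sdf`; the column name "xx" is just an example.

```python
from pyspark.sql import functions as F

pdf["xx"] = 0   # pandas: direct assignment adds the column in place
pdf["xx"] = 1   # and overwrites it the same way

# Spark: literals must be wrapped in lit(); withColumn returns a new DataFrame
# and replaces the column if one with the same name already exists
sdf = sdf.withColumn("xx", F.lit(0))
sdf = sdf.withColumn("xx", F.lit(1))
sdf.show()
```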
Show
Pandas:
- df itself prints the actual content
- there is no tree-style schema summary
Spark:
- df alone does not print the content, only the schema, e.g. DataFrame[age: bigint, name: string]
- df.show() prints the content
- df.printSchema() prints the schema as a tree
- df.collect() returns all rows to the driver
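A quick sketch of the display differences with the assumed `pdf` and `sdf`.

```python
print(pdf)            # pandas: prints the actual rows

print(sdf)            # Spark: prints only the schema, e.g. DataFrame[name: string, age: bigint]
sdf.show()            # prints the rows as a table
sdf.printSchema()     # prints the schema as a tree
rows = sdf.collect()  # returns all rows to the driver as a list of Row objects
```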
Sort
Pandas: df.sort_index() sorts by the index; df.sort_values() (formerly df.sort()) sorts by column values.
Spark: there is no index to sort by; df.sort() sorts by column values.
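A short sorting sketch with the assumed `pdf` and `sdf`; note that modern pandas uses sort_values rather than the old sort.

```python
pdf.sort_index()                            # pandas: sort by the index
pdf.sort_values("age", ascending=False)     # pandas: sort by column values

sdf.sort(sdf["age"].desc()).show()          # Spark: sort by column values
sdf.orderBy("age", ascending=False).show()  # equivalent spelling
```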
Select or slice
Pandas:
- df.name, df[], and df["name"] print the actual content
- df[0], df.ix[0] (df.ix is removed in recent pandas; use df.loc/df.iloc)
- df.head(2) returns the first two rows, df.tail(2) the last two
- slicing: df.ix[:3], df.ix[:"xx"], or df[:"xx"]
- df.loc[] selects by label, df.iloc[] selects by position
Spark:
- df[] and df["name"] do not print the content; use the show method
- df.select() selects one or more columns, e.g. df.select("name"); expressions also work, e.g. df.select(df['name'], df['age'] + 1)
- df.first() returns the first row, df.head(2) or df.take(2) the first two rows
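A selection sketch with the assumed `pdf` and `sdf`; the Spark calls print nothing useful until show() or an action is used.

```python
pdf["name"]     # pandas: prints the column contents
pdf.head(2)     # first two rows
pdf.loc[0]      # row by label
pdf.iloc[0]     # row by position

sdf.select("name").show()                       # Spark: select a column and display it
sdf.select(sdf["name"], sdf["age"] + 1).show()  # expressions in the projection
sdf.first()     # first Row
sdf.take(2)     # first two Rows as a list
```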
Filter
Pandas: df[df['age'] > 21]
Spark: df.filter(df['age'] > 21) or df.where(df['age'] > 21)
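The two filtering styles side by side, with the assumed `pdf` and `sdf`.

```python
adults = pdf[pdf["age"] > 21]          # pandas: boolean indexing

sdf.filter(sdf["age"] > 21).show()     # Spark: filter with a Column expression
sdf.where("age > 21").show()           # where() also accepts a SQL expression string
```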
Grouping and aggregation
Pandas: df.groupby("age"); df.groupby("A")["B"].mean()
Spark: df.groupBy("age"); df.groupBy("A").avg("B").show() applies a single function; to apply several functions: from pyspark.sql import functions; df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show()
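A grouping sketch with the assumed `pdf` and `sdf`, both having `name` and `age` columns.

```python
from pyspark.sql import functions as F

pdf.groupby("name")["age"].mean()                         # pandas: one aggregate
pdf.groupby("name").agg({"age": ["mean", "min", "max"]})  # several aggregates

sdf.groupBy("name").avg("age").show()                     # Spark: one aggregate
sdf.groupBy("name").agg(F.avg("age"), F.min("age"), F.max("age")).show()  # several aggregates
```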
Statistics
Pandas: df.count() returns the number of non-null values per column; df.describe() reports count, mean, std, min, 25%, 50%, 75%, max for the numeric columns.
Spark: df.count() returns the total number of rows; df.describe() reports count, mean, stddev, min, max for the columns.
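The counting difference in code, with the assumed `pdf` and `sdf`.

```python
pdf.count()            # pandas: non-null count per column (a Series)
pdf.describe()         # count, mean, std, min, 25%, 50%, 75%, max

sdf.count()            # Spark: total number of rows (a single integer)
sdf.describe().show()  # count, mean, stddev, min, max
```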
Merge
Pandas:
- concat supports merging along an axis
- merge supports merging on multiple columns; columns with the same name automatically get suffixes, and only one copy of the join key is kept
- df.join() supports merging on multiple columns
- df.append() supports appending rows
Spark:
- joins are done with df.join(); columns with the same name do not get suffixes automatically, and a row is kept only when the key values match exactly
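A small merge/join sketch; the `left`/`right` frames and the `name` key are made-up examples, and `spark` is the assumed SparkSession.

```python
import pandas as pd

left = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
right = pd.DataFrame({"name": ["Alice", "Carol"], "city": ["NY", "LA"]})

pd.concat([left, right])                   # pandas: axis-wise concatenation
left.merge(right, on="name", how="inner")  # pandas: merge on a key column

# Spark: join on one or more columns; same-named columns are not suffixed automatically
sleft = spark.createDataFrame(left)
sright = spark.createDataFrame(right)
sleft.join(sright, on="name", how="inner").show()
```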
Missing data handling
Pandas: missing data is automatically represented as NaN; fill: df.fillna(); drop: df.dropna()
Spark: missing data is not automatically converted to NaN, and no error is thrown; fill: df.na.fill(); drop: df.na.drop()
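A sketch of the two APIs; the tiny frames here are made up, and `spark` is the assumed SparkSession.

```python
import numpy as np
import pandas as pd

pdf_na = pd.DataFrame({"age": [25, np.nan, 30]})
pdf_na.fillna(0)    # pandas: replace NaN with 0
pdf_na.dropna()     # pandas: drop rows containing NaN

sdf_na = spark.createDataFrame([(25,), (None,), (30,)], ["age"])
sdf_na.na.fill(0).show()   # Spark: replace nulls with 0
sdf_na.na.drop().show()    # Spark: drop rows containing nulls
```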
SQL statements
Pandas: import sqlite3, then pd.read_sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19", conn) against an open connection
Spark:
- table registration: register the DataFrame for SQL queries with df.registerTempTable("people") or sqlContext.registerDataFrameAsTable(df, "people"), then sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
- function registration: register a function for use in SQL with sqlContext.registerFunction("stringLengthString", lambda x: len(x)), then sqlContext.sql("SELECT stringLengthString('test')")
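A runnable sketch using the current SparkSession API (the SQLContext calls above belong to older Spark versions); the in-memory SQLite database and the `people` data are made up for illustration, and `sdf` is assumed to have `name` and `age` columns.

```python
import sqlite3
import pandas as pd

# pandas: query an open database connection
conn = sqlite3.connect(":memory:")
pd.DataFrame({"name": ["Alice"], "age": [15]}).to_sql("people", conn, index=False)
teens = pd.read_sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19", conn)

# Spark: register the DataFrame as a temporary view and query it with SQL
sdf.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19").show()

# Spark: register a Python function for use inside SQL
spark.udf.register("stringLengthString", lambda x: len(x))
spark.sql("SELECT stringLengthString('test')").show()
```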
Converting between the two
pandas_df = spark_df.toPandas()
spark_df = sqlContext.createDataFrame(pandas_df)
Function application
Pandas: df.apply(f) applies the function f to each column of df
Spark: df.foreach(f) or df.rdd.foreach(f) applies f to each row of df; df.foreachPartition(f) or df.rdd.foreachPartition(f) applies f to each partition of df
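A sketch of the three call styles with the assumed `pdf` and `sdf`; foreach runs for its side effects on the executors.

```python
# pandas: apply a function to each column (axis=0) or each row (axis=1)
pdf.apply(lambda col: col.max(), axis=0)

# Spark: foreach applies a function to every Row
sdf.foreach(lambda row: print(row))

# foreachPartition receives an iterator over the Rows of one partition
def handle_partition(rows):
    for row in rows:
        pass  # e.g. write each row to an external system

sdf.foreachPartition(handle_partition)
```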
Map-reduce operations
Pandas: map(func, list) and reduce(func, list) return a sequence (in Python 3, map returns an iterator and reduce lives in functools)
Spark: df.map(func) and df.reduce(func) operate on RDDs (in current PySpark they are called on the underlying RDD: df.rdd.map(func), df.rdd.reduce(func))
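A map/reduce sketch; the Python side uses the built-ins, the Spark side goes through the DataFrame's underlying RDD (assumed `sdf` with an `age` column).

```python
from functools import reduce

squares = list(map(lambda x: x * x, [1, 2, 3]))   # plain Python map
total = reduce(lambda a, b: a + b, [1, 2, 3])     # plain Python reduce

ages = sdf.rdd.map(lambda row: row["age"])        # Spark: transformation, returns an RDD
age_sum = ages.reduce(lambda a, b: a + b)         # Spark: action, returns a single value
```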
diff operation
Pandas: has a diff operation for processing time-series data (pandas compares the current row with the previous row)
Spark: has no diff operation (rows in Spark are independent of each other and stored in a distributed fashion)
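For comparison, a sketch of pandas diff, plus a commonly used Spark workaround (not a built-in diff API) that reproduces the row-to-row difference with a window function; the `id`/`value` columns are made up, and an explicit ordering column is required because Spark rows have no inherent order.

```python
import pandas as pd
from pyspark.sql import Window, functions as F

pd.Series([1, 3, 6]).diff()   # pandas: NaN, 2.0, 3.0

sdf_ts = spark.createDataFrame([(1, 1), (2, 3), (3, 6)], ["id", "value"])
w = Window.orderBy("id")
sdf_ts.withColumn("diff", F.col("value") - F.lag("value").over(w)).show()
```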