A detailed comparison of DataFrames in Spark and pandas

Working style
Pandas: a single-machine tool with no built-in parallel mechanism. It does not support Hadoop, so it hits bottlenecks on large volumes of data.
Spark: a distributed parallel computing framework with a built-in parallel mechanism; all data and operations are automatically distributed across the cluster nodes, and distributed data is processed in memory. It supports Hadoop and can handle large amounts of data.

Lazy evaluation
Pandas: not lazily evaluated; operations execute immediately.
Spark: lazily evaluated; transformations are computed only when an action requires their result.

Memory caching
Pandas: single-machine caching.
Spark: persist() or cache() keeps the transformed RDDs in memory.

DataFrame mutability
Pandas: DataFrames are mutable.
Spark: RDDs are immutable, so Spark DataFrames are immutable as well.

Creation
Pandas: pandas_df = spark_df.toPandas() converts from a Spark DataFrame. Pandas DataFrames are also built from lists, dicts, and ndarrays, and by reading CSV, HDF5, and Excel files.
Spark: spark_df = sqlContext.createDataFrame(pandas_df) converts from a pandas DataFrame. createDataFrame also accepts lists (whose elements may be tuples or dicts) and RDDs. Spark DataFrames are also built from existing RDDs, structured data files, JSON data sets, Hive tables, and external databases.

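A minimal sketch of the two conversions, assuming a Spark 1.x-style SQLContext as used throughout this article (the local SparkContext and the sample data are only for illustration):

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext.getOrCreate()                 # local context, for illustration only
sqlContext = SQLContext(sc)

pandas_df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})
spark_df = sqlContext.createDataFrame(pandas_df)          # pandas -> Spark

# createDataFrame also accepts a list of tuples plus column names
spark_df2 = sqlContext.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])

pandas_df2 = spark_df.toPandas()                # Spark -> pandas (collects to the driver)
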
Index
Pandas: an index is created automatically.
Spark: there is no index; if you need one, you must add an extra column.

Row structure
Pandas: rows are Series objects belonging to the pandas DataFrame structure.
Spark: rows are Row objects belonging to the Spark DataFrame structure.

Column structure
Pandas: columns are Series objects belonging to the pandas DataFrame structure.
Spark: columns are Column objects belonging to the Spark DataFrame structure, e.g. DataFrame[name: string].

Column names
Pandas: duplicate column names are not allowed.
Spark: duplicate column names are allowed; rename a column with the alias method.

Column addition
Pandas: df["xx"] = 0
Spark: df.withColumn("xx", 0).show() raises an error, because withColumn expects a Column; wrap the literal instead:
from pyspark.sql import functions
df.withColumn("xx", functions.lit(0)).show()
Column modification
Pandas: with an existing df["xx"] column, df["xx"] = 1
Spark: with an existing "xx" column, df.withColumn("xx", functions.lit(1)).show() replaces it; the same lit() wrapping is required as for column addition.

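A short sketch of both operations, under the same assumptions as the conversion sketch above (df and the "xx" column are illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext, functions

sqlContext = SQLContext(SparkContext.getOrCreate())
df = sqlContext.createDataFrame([("Alice", 25)], ["name", "age"])

df.withColumn("xx", functions.lit(0)).show()    # add a new constant column
df.withColumn("age", functions.lit(1)).show()   # replace an existing column
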
Display
Pandas: df alone prints the actual contents.
Spark: df alone does not print the contents, only the schema, in the form DataFrame[age: bigint, name: string]; print the contents with df.show().
Pandas: there is no tree-structured output.
Spark: df.printSchema() prints the schema as a tree, and df.collect() returns all rows to the driver.

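For example (the sample df is illustrative; the bare df line shows the schema only in an interactive shell):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sqlContext = SQLContext(SparkContext.getOrCreate())
df = sqlContext.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])

df                  # in the shell: DataFrame[name: string, age: bigint]
df.show()           # prints the rows as a table
df.printSchema()    # prints the schema as a tree
df.collect()        # returns all rows to the driver as a list of Row objects
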
Sort
Pandas: df.sort_index() sorts by the index; df.sort() sorts by column values (df.sort_values() in newer pandas).
Spark: df.sort() sorts by column values.

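A sketch of both, assuming newer pandas where sort() has become sort_values():

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext

pdf = pd.DataFrame({"age": [30, 25]}, index=[1, 0])
pdf.sort_index()                # pandas: sort by the index
pdf.sort_values("age")          # pandas: sort by column values

sqlContext = SQLContext(SparkContext.getOrCreate())
df = sqlContext.createDataFrame([("Bob", 30), ("Alice", 25)], ["name", "age"])
df.sort("age").show()           # Spark: ascending by column values
df.sort(df.age.desc()).show()   # Spark: descending
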
Select or slice
Pandas: df.name, df[], and df["name"] all print the selected contents.
Spark: df[] and df["name"] do not print contents; use the show method. df.select() selects one or more columns, e.g. df.select("name"); expressions are allowed when slicing, e.g. df.select(df['name'], df['age'] + 1).
Pandas: df[0:1] or df.ix[0] selects the first row.
Spark: df.first() returns the first row.
Pandas: df.head(2) returns the first two rows; df.tail(2) returns the last two.
Spark: df.head(2) or df.take(2) returns the first two rows.
Pandas: slicing with df.ix[:3], df.ix[:"xx"], or df[:"xx"]; df.loc[] selects by label; df.iloc[] selects by position.
Spark: there is no label- or position-based slicing, since there is no index.

Filter
Pandas: df[df['age'] > 21]
Spark: df.filter(df['age'] > 21) or df.where(df['age'] > 21)

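A sketch (the sample data is illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sqlContext = SQLContext(SparkContext.getOrCreate())
df = sqlContext.createDataFrame([("Alice", 25), ("Bob", 19)], ["name", "age"])

df.filter(df["age"] > 21).show()   # keeps rows where age > 21
df.where(df["age"] > 21).show()    # where is an alias for filter
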
Aggregation
Pandas: df.groupby("age") groups the rows; df.groupby("A")["B"].mean() applies a single function.
Spark: df.groupBy("age") groups the rows; df.groupBy("A").avg("B").show() applies a single function. To apply multiple functions:
from pyspark.sql import functions
df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show()

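A runnable sketch of both forms (the columns A and B are illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext, functions

sqlContext = SQLContext(SparkContext.getOrCreate())
df = sqlContext.createDataFrame([("x", 1), ("x", 3), ("y", 2)], ["A", "B"])

df.groupBy("A").avg("B").show()    # a single aggregate per group
df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show()
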
Statistics
Pandas: df.count() returns the number of non-null values in each column; df.describe() reports count, mean, std, min, 25%, 50%, 75%, and max for numeric columns.
Spark: df.count() returns the total number of rows; df.describe() reports count, mean, stddev, min, and max for numeric columns.

Merge
Pandas: the concat method supports axis-wise concatenation; the merge method supports multi-column merges; df.append() supports row-wise appends. Columns with the same name automatically get suffixes, and only one copy of the join key is kept.
Spark: df.join() does the merging and supports multi-column joins. Columns with the same name do not get automatic suffixes, and rows are kept only when the key values match exactly.

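A sketch of the duplicate-name behavior described above (in recent Spark versions the on argument may also be a column name or a list of names, which keeps a single copy of the key):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sqlContext = SQLContext(SparkContext.getOrCreate())
df1 = sqlContext.createDataFrame([("Alice", 1)], ["name", "a"])
df2 = sqlContext.createDataFrame([("Alice", 2)], ["name", "b"])

df1.join(df2, df1["name"] == df2["name"]).show()   # result keeps both 'name' columns
df1.join(df2, "name").show()                       # joining on the name keeps one copy
df1.join(df2, ["name"]).show()                     # the list form supports multi-column joins
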
Missing data handling
Pandas: missing data is automatically filled in as NaN.
Spark: missing data is not automatically filled in as NaN, and no error is thrown.
Pandas: fillna function: df.fillna(); dropna function: df.dropna()
Spark: fillna function: df.na.fill(); dropna function: df.na.drop()

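A sketch (the per-column dict form is one that pyspark's na.fill also accepts):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sqlContext = SQLContext(SparkContext.getOrCreate())
df = sqlContext.createDataFrame([("Alice", 25), ("Bob", None), (None, 19)], ["name", "age"])

df.na.fill(0).show()                              # fills numeric nulls with 0
df.na.fill({"name": "unknown", "age": 0}).show()  # per-column fill values
df.na.drop().show()                               # drops rows containing any null
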
SQL statements
Pandas:
import sqlite3
pd.read_sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19", conn)  # conn is an open database connection
Spark:
Table registration: register the DataFrame as a table that SQL statements can use:
df.registerTempTable("people") or sqlContext.registerDataFrameAsTable(df, "people")
sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
Function registration: register a function that SQL statements can use:
sqlContext.registerFunction("stringLengthString", lambda x: len(x))
sqlContext.sql("SELECT stringLengthString('test')")
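A runnable sketch of the Spark side, using the Spark 1.x names shown above (later versions renamed registerTempTable to createOrReplaceTempView and registerFunction to spark.udf.register):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sqlContext = SQLContext(SparkContext.getOrCreate())
df = sqlContext.createDataFrame([("Alice", 15), ("Bob", 30)], ["name", "age"])

df.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19").show()

sqlContext.registerFunction("stringLengthString", lambda x: len(x))
sqlContext.sql("SELECT stringLengthString('test')").show()
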
Mutual conversion
pandas_df = spark_df.toPandas()
spark_df = sqlContext.createDataFrame(pandas_df)

Function application
Pandas: df.apply(f) applies the function f to each column of df.
Spark: df.foreach(f) or df.rdd.foreach(f) applies the function f to each row of df; df.foreachPartition(f) or df.rdd.foreachPartition(f) applies f to each partition of df.

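A sketch; the printing inside f and g happens on the executors, so on a real cluster the output appears in the executor logs rather than on the driver console:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sqlContext = SQLContext(SparkContext.getOrCreate())
df = sqlContext.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])

def f(row):
    print(row.name)        # called once per Row

def g(rows):
    for row in rows:       # called once per partition, with an iterator of Rows
        print(row.name)

df.foreach(f)
df.foreachPartition(g)
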
Map-reduce operations
Pandas: Python's map(func, list) and reduce(func, list), which return sequences.
Spark: df.map(func) and df.reduce(func) (operating on the underlying RDD), which return RDDs.

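A sketch contrasting the two; df.map(func) delegated to the underlying RDD in Spark 1.x, so the RDD is used explicitly here:

from functools import reduce
from pyspark import SparkContext
from pyspark.sql import SQLContext

sqlContext = SQLContext(SparkContext.getOrCreate())
df = sqlContext.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])

list(map(lambda x: x * 2, [1, 2, 3]))                        # plain Python: [2, 4, 6]
reduce(lambda a, b: a + b, [1, 2, 3])                        # plain Python: 6

df.rdd.map(lambda row: row.age * 2).collect()                # Spark: distributed map
df.rdd.map(lambda row: row.age).reduce(lambda a, b: a + b)   # Spark: distributed reduce
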
diff operation
Pandas: has a diff operation for processing time-series data (each row is compared with the previous row).
Spark: has no diff operation (rows are independent of one another and stored in a distributed fashion).

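A sketch of pandas diff plus a common Spark substitute (not a built-in diff) using the lag window function, assuming a Spark version with window-function support:

import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext, Window, functions

pd.Series([1, 3, 6]).diff()          # pandas: NaN, 2.0, 3.0

sqlContext = SQLContext(SparkContext.getOrCreate())
df = sqlContext.createDataFrame([(1, 1), (2, 3), (3, 6)], ["t", "v"])

w = Window.orderBy("t")              # defines the row order for lag
df.withColumn("diff", df["v"] - functions.lag("v").over(w)).show()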

Source: http://www.lining0806.com/spark and Pandas in dataframe contrast/
