Basic operations:
Get the Spark version number (for example, in Spark 2.0.0) at run time:
from pyspark.sql import SparkSession
sparksn = SparkSession.builder.appName("Pythonsql").getOrCreate()
print(sparksn.version)
Create and convert formats:
Convert between a Pandas DataFrame and a Spark DataFrame:
pandas_df = spark_df.toPandas()
spark_df = sqlContext.createDataFrame(pandas_df)
Convert to and from a Spark RDD:
rdd_df = df.rdd
df = rdd_df.toDF()
Note: toDF() only works when every element of the RDD is a Row type.
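For example, a minimal sketch (assuming sparksn is the SparkSession created above): an RDD whose elements are Row objects converts cleanly with toDF():
from pyspark.sql import Row
row_rdd = sparksn.sparkContext.parallelize([Row(name='Alice', age=11), Row(name='Bob', age=12)])
df = row_rdd.toDF()   # schema is inferred from the Row fields
df.show()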
Add:
Add a new column:
df.withColumn("xx", 0).show() raises an error, because withColumn expects a Column object rather than a plain literal; wrap the value with functions.lit:
from pyspark.sql import functions
df = df.withColumn("xx", functions.lit(0))
df.show()
fillna function:
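A minimal sketch of fillna (the column names 'age' and 'name' are assumptions for illustration):
df = df.fillna(0)                        # replace null/NA in numeric columns with 0
df = df.fillna({'age': 0, 'name': ''})   # per-column replacement values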
Add a column derived from an existing column:
df = df.withColumn('count20', df['count'] - 20)  # the new column is the original column minus 20
Delete:
Drop a column:
df.drop('age').collect()
df.drop(df.age).collect()
dropna function:
df = df.na.drop()  # drop any row that contains NA
df = df.dropna(subset=['col_name1', 'col_name2'])  # drop rows where col_name1 or col_name2 contains NA
Modify:
Modify all values of the existing column df["xx"]:
df = df.withColumn("xx", functions.lit(1))  # again, the literal must be wrapped with functions.lit
Modify the type of a column (type cast):
df = df.withColumn("year2", df["year1"].cast("int"))
Merge two tables with the join method:
df_join = df_left.join(df_right, df_left.key == df_right.key, "inner")
The join type can be: 'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'.
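For instance (a sketch reusing df_left and df_right from above), a 'left_outer' join keeps every row of df_left and fills the right-side columns with null where the key has no match:
df_join = df_left.join(df_right, df_left.key == df_right.key, "left_outer")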
Aggregate with the groupBy method:
grouped_data = df.groupBy("age")
Apply a single aggregate function (group rows by column A and compute the mean of column B within each group):
df.groupBy("A").avg("B").show()
Apply multiple aggregate functions:
from pyspark.sql import functions
df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show()
Methods available on the GroupedData object after grouping (all of them return a DataFrame):
avg(*cols)   -- computes the average of one or more columns in each group
count()      -- counts the rows in each group, returning a DataFrame with 2 columns: the group key and count
max(*cols)   -- computes the maximum of one or more columns in each group
mean(*cols)  -- computes the average of one or more columns in each group
min(*cols)   -- computes the minimum of one or more columns in each group
sum(*cols)   -- computes the sum of one or more columns in each group
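For example, using the grouped_data object created above (a small sketch):
grouped_data.count().show()   # one row per age value, with columns age and count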
"Function application" applies the function f to each row of the DataFrame:
df.foreach(f) or df.rdd.foreach(f)
"Function Application" applies the function f to each block of DF:
Df.foreachpartition (f) or df.rdd.foreachPartition (f) |
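A minimal sketch of both forms; f is any Python function you supply, and the 'name' column is an assumption for illustration:
def f(row):
    print(row.name)              # called once per Row, on the executors
df.foreach(f)

def f_part(rows):
    for row in rows:             # rows is an iterator over one partition
        print(row.name)
df.foreachPartition(f_part)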
"Map and reduce application" return type Seqrdds
Df.map (func) Df.reduce (func) |
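For example (a sketch assuming an integer 'age' column), summing a column through the RDD API:
total_age = df.rdd.map(lambda row: row.age).reduce(lambda a, b: a + b)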
Working around the toDF() exception where the schema cannot be inferred from a single Row type: either unify every element of the Row into one format, or branch on the format explicitly. This also solves the problem of converting to a DataFrame when None values are present:
from pyspark.sql import Row

@staticmethod
def map_convert_none_to_str(row):
    dict_row = row.asDict()
    for key in dict_row:
        if key != 'some_column_name':
            value = dict_row[key]
            if value is None:
                value_in = str("")
            else:
                value_in = str(value)
            dict_row[key] = value_in
    columns = dict_row.keys()
    v = dict_row.values()
    row = Row(*columns)
    return row(*v)
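A hypothetical usage sketch, assuming the method above is defined in a helper class named DataCleaner:
clean_df = df.rdd.map(DataCleaner.map_convert_none_to_str).toDF()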
Query:
Row element query operations:
Print the first 20 rows in SQL-table style (pass an int to the show function to specify the number of rows to print):
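df.show()     # prints the first 20 rows by default
df.show(30)   # prints the first 30 rows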
Print the schema as a tree:
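df.printSchema()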
Get the first few rows to the local driver:
list = df.head(3)  # Example: [Row(a=1, b=1), Row(a=2, b=2), ...]
list = df.take(5)  # Example: [Row(a=1, b=1), Row(a=2, b=2), ...]
Output a Python list in which each element is a Row object:
Note: this method pulls all the data to the local driver.
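list = df.collect()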
Query total number of rows:
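df.count()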
Query rows where a column is null:
from pyspark.sql.functions import isnull
df = df.filter(isnull("col_a"))
Column element operations: get all field names of a Row element:
from pyspark.sql import Row
r = Row(age=11, name='Alice')
print(r.__fields__)  # ['age', 'name']
Select one or more columns:
Df.select ("name") Df.select (df[' name '), df[' age ']+1) df.select (DF.A, df.b, DF.C) # Select a, B, C Df.select (df["A"], df["B"] , df["C"]) # Select a, B, c three columns |
Sort:
df = df.sort("age", ascending=False)
Filter data (the filter and where methods are equivalent):
df = df.filter(df['age'] > 21)
df = df.where(df['age'] > 21)
# Filter on null or NaN data:
from pyspark.sql.functions import isnan, isnull
df = df.filter(isnull("a"))  # keep rows where column a is null (Python's None)
df = df.filter(isnan("a"))   # keep rows where column a is NaN (not a number, non-numeric data)
SQL operations: register a DataFrame as an SQL temporary view:
df.createOrReplaceTempView("tbl1")
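The registered view can then be queried through the SparkSession (a sketch, assuming sparksn from above and an age column):
df_new = sparksn.sql("SELECT * FROM tbl1 WHERE age > 21")
df_new.show()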