A summary of PySpark DataFrame handling methods: adding, deleting, modifying, and querying

Tags: pyspark, spark, rdd
Basic operations:
Get the Spark version number at run time (Spark 2.0.0, for example):
from pyspark.sql import SparkSession
sparksn = SparkSession.builder.appName("PythonSQL").getOrCreate()
print(sparksn.version)


Creation and format conversion:
Converting between a pandas DataFrame and a Spark DataFrame:
pandas_df = spark_df.toPandas()
spark_df = sqlContext.createDataFrame(pandas_df)
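A minimal round-trip sketch (assuming a Spark 2.x SparkSession named spark, where spark.createDataFrame can stand in for sqlContext.createDataFrame; the sample data is hypothetical):
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ConversionExample").getOrCreate()
pandas_df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [11, 12]})   # hypothetical data
spark_df = spark.createDataFrame(pandas_df)   # pandas -> Spark
pandas_df2 = spark_df.toPandas()              # Spark -> pandas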

Converting to and from a Spark RDD:
rdd_df = df.rdd
df = rdd_df.toDF()
Note: toDF() works only if every element of the RDD is of the Row type
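A small sketch of that requirement (the column names name and age are hypothetical):
from pyspark.sql import Row

# every element of the RDD is a Row, so toDF() can infer the column names and types
rdd = spark.sparkContext.parallelize([Row(name="Alice", age=11), Row(name="Bob", age=12)])
df = rdd.toDF()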

Add:
Adding a new column:
df.withColumn("xx", 0).show() will raise an error, because the second argument must be a Column expression rather than a plain literal; wrap the constant in functions.lit():

from pyspark.sql import functions
df = df.withColumn("xx", functions.lit(0))
df.show()

The fillna function (df.na.fill and df.fillna are equivalent):
df = df.na.fill(0)   # replace null values with 0, for example
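fill also accepts a dict mapping column names to fill values; a short sketch (the column names are hypothetical):
df = df.na.fill({"age": 0, "name": "unknown"})   # a different fill value per column
df = df.fillna(0, subset=["age"])                # fill only the listed columns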

Adding a column derived from an existing column:
df = df.withColumn('count20', df['count'] - 20)   # the new column is the original count column minus 20

Delete:
Dropping a column:
df.drop('age').collect()
df.drop(df.age).collect()

The dropna function:
df = df.na.drop()                                   # drop every row that contains a null value
df = df.dropna(subset=['col_name1', 'col_name2'])   # drop rows where col_name1 or col_name2 contains a null value


Modify:
Modifying all values of the existing column df["xx"] (again, the constant must be wrapped in a Column expression):
df = df.withColumn("xx", functions.lit(1))

Modifying the type of a column (type casting):
df = df.withColumn("year2", df["year1"].cast("int"))

Merging 2 tables with the join method:
df_join = df_left.join(df_right, df_left.key == df_right.key, "inner")
The join type can be one of: 'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'.
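A minimal sketch of an inner join (the key/name/score columns and the sample rows are hypothetical):
df_left = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["key", "name"])
df_right = spark.createDataFrame([(1, 88), (3, 99)], ["key", "score"])
df_join = df_left.join(df_right, df_left.key == df_right.key, "inner")
df_join.show()   # only key 1 appears in both tables, so only that row survives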
Aggregating with the groupBy method:
grouped_data = df.groupBy("age")

Applying a single function (group by column A, then take the mean of column B within each group):
df.groupBy("A").avg("B").show()

Applying multiple functions:
from pyspark.sql import functions
df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show()
Methods available on the GroupedData type after grouping (all of them return a DataFrame):
avg(*cols)   -- computes the average of one or more columns in each group
count()      -- computes the number of rows in each group, returning a DataFrame with 2 columns: the grouping column and the count
max(*cols)   -- computes the maximum of one or more columns in each group
mean(*cols)  -- computes the average of one or more columns in each group
min(*cols)   -- computes the minimum of one or more columns in each group
sum(*cols)   -- computes the sum of one or more columns in each group
"Function application" applies the function f to each row of the DataFrame:
df.foreach(f) or df.rdd.foreach(f)

"Function Application" applies the function f to each block of DF:
Df.foreachpartition (f) or df.rdd.foreachPartition (f)
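A small sketch of a per-partition function (the function body is hypothetical); foreachPartition hands the function an iterator over the Rows of one partition, which is handy for opening one external connection per partition instead of per row:
def handle_partition(rows):
    # rows is an iterator over the Row objects of a single partition
    for row in rows:
        pass   # e.g. write each row to an external store

df.foreachPartition(handle_partition)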

"Map and reduce application" return type Seqrdds
Df.map (func)
Df.reduce (func)

Resolving the exception thrown by toDF() when the schema cannot be inferred from Row elements of mixed types: either unify every element of the Row into a single format, or handle each format explicitly. The helper below handles the case where rows contain None by converting those values to strings before the conversion to a DataFrame:

@staticmethod
def map_convert_none_to_str(row):
    dict_row = row.asDict()
    for key in dict_row:
        if key != 'some_column_name':
            value = dict_row[key]
            if value is None:
                value_in = str("")
            else:
                value_in = str(value)
            dict_row[key] = value_in
    columns = dict_row.keys()
    v = dict_row.values()
    row = Row(*columns)
    return row(*v)
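A hedged usage sketch of the helper above (the enclosing class name SomeHelper is hypothetical): map it over the underlying RDD, then call toDF():
rdd_clean = df.rdd.map(SomeHelper.map_convert_none_to_str)
df_clean = rdd_clean.toDF()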


Query:
Row element query operations. Print the first 20 rows, like a SQL result set (pass an int to the show function to specify how many rows to print):
df.show()
df.show(30)

Print the schema as a tree:
df.printSchema()

Get the first few rows to the local driver:
list = df.head(3)   # Example: [Row(a=1, b=1), Row(a=2, b=2), ...]
list = df.take(5)   # Example: [Row(a=1, b=1), Row(a=2, b=2), ...]

Output a Python list whose elements are Row objects:
list = df.collect()
Note: this method pulls all of the data to the local driver
Query the total number of rows:
int_num = df.count()

Query the rows where a column is null:
from pyspark.sql.functions import isnull
df = df.filter(isnull("col_a"))


Column element operations. Get all column names of a Row element:
from pyspark.sql import Row
r = Row(age=11, name='Alice')
print(r.__fields__)   # ['age', 'name']

Select one or more columns:
df.select("name")
df.select(df['name'], df['age'] + 1)
df.select(df.a, df.b, df.c)             # select the three columns a, b, c
df.select(df["a"], df["b"], df["c"])    # select the three columns a, b, c

Sort:
df = df.sort("age", ascending=False)

Filter data (the filter and where methods are equivalent):
df = df.filter(df['age'] > 21)
df = df.where(df['age'] > 21)

# Filter on null or NaN data:
from pyspark.sql.functions import isnan, isnull
df = df.filter(isnull("a"))   # keep the rows where column a is null (Python's None)
df = df.filter(isnan("a"))    # keep the rows where column a is NaN (not a number)


SQL operations. Register the DataFrame as a SQL temporary table:
df.createOrReplaceTempView("tbl1")
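Once the view is registered it can be queried with spark.sql; a minimal sketch (assuming a SparkSession named spark and a column named age):
result_df = spark.sql("SELECT * FROM tbl1 WHERE age > 21")
result_df.show()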
