Basic operations:
Get the Spark version number (for example, in Spark 2.0.0) at run time:
from pyspark.sql import SparkSession
sparksn = SparkSession.builder.appName("Pythonsql").getOrCreate()
print(sparksn.version)
Create and convert formats:
Convert between a Pandas DataFrame and a Spark DataFrame:
pandas_df = spark_df.toPandas()
spark_df = sqlContext.createDataFrame(pandas_df)
Convert to and from a Spark RDD:
rdd_df = df.rdd
df = rdd_df.toDF()
Note: toDF() only works when every element of the RDD is a Row type.
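For example, a minimal sketch (assuming sparksn is the SparkSession created above): an RDD whose elements are Row objects converts cleanly with toDF():
from pyspark.sql import Row
row_rdd = sparksn.sparkContext.parallelize([Row(name='Alice', age=11), Row(name='Bob', age=12)])
df = row_rdd.toDF()   # schema is inferred from the Row fields
df.show()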
Add:
Add a new column:
df.withColumn("xx", 0).show() raises an error, because withColumn expects a Column object rather than a plain literal; wrap the value with functions.lit:
from pyspark.sql import functions
df = df.withColumn("xx", functions.lit(0))
df.show()
fillna function:
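A minimal sketch of fillna (the column names 'age' and 'name' are assumptions for illustration):
df = df.fillna(0)                        # replace null/NA in numeric columns with 0
df = df.fillna({'age': 0, 'name': ''})   # per-column replacement values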
Add a column derived from an existing column:
df = df.withColumn('count20', df['count'] - 20)  # the new column is the original column minus 20
Delete:
Drop a column:
df.drop('age').collect()
df.drop(df.age).collect()
dropna function:
df = df.na.drop()  # drop any row that contains NA
df = df.dropna(subset=['col_name1', 'col_name2'])  # drop rows where col_name1 or col_name2 contains NA
Modify:
Modify all values of the existing column df["xx"]:
df = df.withColumn("xx", functions.lit(1))  # again, the literal must be wrapped with functions.lit
Modify the type of a column (type cast):
df = df.withColumn("year2", df["year1"].cast("int"))
Merge two tables with the join method:
df_join = df_left.join(df_right, df_left.key == df_right.key, "inner")
The join type can be: 'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'.
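For instance (a sketch reusing df_left and df_right from above), a 'left_outer' join keeps every row of df_left and fills the right-side columns with null where the key has no match:
df_join = df_left.join(df_right, df_left.key == df_right.key, "left_outer")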
Aggregate with the groupBy method:
grouped_data = df.groupBy("age")
Apply a single aggregate function (group rows by column A and compute the mean of column B within each group):
df.groupBy("A").avg("B").show()
Apply multiple aggregate functions:
from pyspark.sql import functions
df.groupBy("A").agg(functions.avg("B"), functions.min("B"), functions.max("B")).show()
Methods available on the GroupedData object after grouping (all of them return a DataFrame):
avg(*cols)   -- computes the average of one or more columns in each group
count()      -- counts the rows in each group, returning a DataFrame with 2 columns: the group key and count
max(*cols)   -- computes the maximum of one or more columns in each group
mean(*cols)  -- computes the average of one or more columns in each group
min(*cols)   -- computes the minimum of one or more columns in each group
sum(*cols)   -- computes the sum of one or more columns in each group
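For example, using the grouped_data object created above (a small sketch):
grouped_data.count().show()   # one row per age value, with columns age and count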
"Function application" applies the function f to each row of the DataFrame:
df.foreach(f) or df.rdd.foreach(f)
"Function Application" applies the function f to each block of DF:
Df.foreachpartition (f) or df.rdd.foreachPartition (f) |
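A minimal sketch of both forms; f is any Python function you supply, and the 'name' column is an assumption for illustration:
def f(row):
    print(row.name)              # called once per Row, on the executors
df.foreach(f)

def f_part(rows):
    for row in rows:             # rows is an iterator over one partition
        print(row.name)
df.foreachPartition(f_part)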
"Map and reduce application" return type Seqrdds
Df.map (func) Df.reduce (func) |
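For example (a sketch assuming an integer 'age' column), summing a column through the RDD API:
total_age = df.rdd.map(lambda row: row.age).reduce(lambda a, b: a + b)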
Working around the toDF() exception where the schema cannot be inferred from a single Row type: either unify every element of the Row into one format, or branch on the format explicitly. This also solves the problem of converting to a DataFrame when None values are present:
from pyspark.sql import Row

@staticmethod
def map_convert_none_to_str(row):
    dict_row = row.asDict()
    for key in dict_row:
        if key != 'some_column_name':
            value = dict_row[key]
            if value is None:
                value_in = str("")
            else:
                value_in = str(value)
            dict_row[key] = value_in
    columns = dict_row.keys()
    v = dict_row.values()
    row = Row(*columns)
    return row(*v)
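A hypothetical usage sketch, assuming the method above is defined in a helper class named DataCleaner:
clean_df = df.rdd.map(DataCleaner.map_convert_none_to_str).toDF()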
Query:
Row element query operations:
Print the first 20 rows in SQL-table style (pass an int to the show function to specify the number of rows to print):
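df.show()     # prints the first 20 rows by default
df.show(30)   # prints the first 30 rows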
Print the schema as a tree:
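df.printSchema()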
Get the first few rows to the local driver:
list = df.head(3)  # Example: [Row(a=1, b=1), Row(a=2, b=2), ...]
list = df.take(5)  # Example: [Row(a=1, b=1), Row(a=2, b=2), ...]
Output a Python list in which each element is a Row object:
Note: this method pulls all the data to the local driver.
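list = df.collect()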
Query total number of rows:
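df.count()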
Query rows where a column is null:
from pyspark.sql.functions import isnull
df = df.filter(isnull("col_a"))
Column element operations: get all field names of a Row element:
from pyspark.sql import Row
r = Row(age=11, name='Alice')
print(r.__fields__)  # ['age', 'name']
Select one or more columns:
Df.select ("name") Df.select (df[' name '), df[' age ']+1) df.select (DF.A, df.b, DF.C) # Select a, B, C Df.select (df["A"], df["B"] , df["C"]) # Select a, B, c three columns |
Sort:
df = df.sort("age", ascending=False)
Filter data (the filter and where methods are equivalent):
df = df.filter(df['age'] > 21)
df = df.where(df['age'] > 21)
# Filter on null or NaN data:
from pyspark.sql.functions import isnan, isnull
df = df.filter(isnull("a"))  # keep rows where column a is null (Python's None)
df = df.filter(isnan("a"))   # keep rows where column a is NaN (not a number, non-numeric data)
SQL operations: register a DataFrame as an SQL temporary view:
df.createOrReplaceTempView("tbl1")
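The registered view can then be queried through the SparkSession (a sketch, assuming sparksn from above and an age column):
df_new = sparksn.sql("SELECT * FROM tbl1 WHERE age > 21")
df_new.show()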