DataFrame is not a concept invented by Spark SQL; it already exists in Pandas (and R)
Dataset: a distributed collection of data
DataFrame: a distributed dataset organized into named columns (conceptually an RDD with a schema)
Can be constructed from a variety of sources, such as RDDs, Hive/SQL tables, NoSQL stores, etc.
A higher-level abstraction than the RDD
DataFrame vs. RDD
A DataFrame carries explicit column (schema) information; an RDD does not
Execution efficiency:
RDD: Java/Scala code runs on the JVM, while Python code runs in Python's own runtime, so performance depends on the language
DataFrame: no matter which language is used, the code is compiled into the same logical plan, so performance is consistent across languages (see the explain() sketch below)
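For example, explain() can be used to inspect the plan Spark SQL generates; a minimal sketch, assuming a DataFrame named peopleDF as in the examples below:
//Print the parsed, analyzed and optimized logical plans plus the physical plan
peopleDF.filter(peopleDF.col("age") > 24).explain(true)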
DataFrame API:
printSchema() prints the schema as a tree structure
show() prints the contents; the number of rows can be limited by passing a count in the parentheses
select(COLUMN_NAME) queries all the data in a column (see the short sketch below)
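A minimal sketch of these calls, assuming a SparkSession named spark and a JSON file people.json with name and age fields (the path and file name are only for illustration):
//Read a JSON file into a DataFrame
val peopleDF = spark.read.json("file:///usr/local/mycode/people.json")
//Print the schema as a tree
peopleDF.printSchema()
//Show at most 5 rows
peopleDF.show(5)
//Query a single column
peopleDF.select("name").show()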
Comprehensive application:
peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 5).as("age after 5 years")).show()
Selects two columns, adds 5 to one of them, and renames the resulting column
filter:
filter()
peopleDF.filter(peopleDF.col("age") > 24).show()
Grouping:
groupBy()
peopleDF.groupBy("age").count().show()
Convert to a temporary view (for SQL operations):
createOrReplaceTempView() registers the DataFrame as a temporary view so it can be queried through the SQL API (see the sketch below)
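A minimal sketch, reusing peopleDF; the view name "people" is chosen only for illustration:
//Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
//Query the view through the SQL API
spark.sql("SELECT name, age FROM people WHERE age > 24").show()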
DataFrame and RDD interoperability:
Two approaches
Both require a SparkSession as the entry point
val spark = SparkSession.builder().appName("DataFrameRDD").master("local[2]").getOrCreate()
The first: reflection
The code is concise; the precondition is that the structure of the schema is known in advance
A case class is used to define the fields that correspond to the schema
Steps: create the case class according to the schema
Generate the RDD: use textFile of SparkContext to read the file and convert it into an RDD of Strings
Import the implicit conversion package with import spark.implicits._
Split the RDD with the split method; map each resulting String array to the case class (i.e. pass the corresponding values into the class, remembering to convert the types first)
Call the toDF method to generate the DataFrame
Code:
//Define case class
case class Info(id: Int, name: String, age: Int)
//Import implicit conversions (required for toDF)
import spark.implicits._
//Generate RDD
val rdd = spark.sparkContext.textFile("file:///usr/local/mycode/info.txt")
//Cut, classify, convert
val infoDF = rdd.map(_.split(",")).map(line => Info(line(0).toInt, line(1), line(2).toInt)).toDF()
Note: if the separator is | or another regex metacharacter, it may need to be escaped, e.g. split("\\|")
The second: build the schema programmatically (dynamic)
Used when the schema is not known in advance
The records are first converted to Rows and combined with a StructType; this requires more code
Steps: generate the RDD
Split the RDD as in the first method, then convert it into an RDD of Rows
Define the StructType with an Array; each field's name, type and nullability are given by a StructField
Use the createDataFrame method to associate the RDD of Rows with the StructType
Code:
//Import Row and the schema types
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
//Generate RDD
val rdd = spark.sparkContext.textFile("file:///usr/local/mycode/info.txt")
//Split, turn into rowRDD
val rowRdd = rdd.map(_.split(",")).map(line => Row(line(0).toInt, line(1), line(2).toInt))
//Define StructType
val structType = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("age", IntegerType, true)))
//Associate rowRDD and StructType
val infoDF = spark.createDataFrame(rowRdd, structType)
DataFrame API details:
show method:
Only the first 20 rows are displayed by default; a larger number can be specified
Long values are truncated by default; setting the truncate argument to false disables truncation (see the sketch below)
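A minimal sketch of both options, reusing the studentDF name from the examples below:
//Show up to 30 rows and do not truncate long values
studentDF.show(30, false)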
take method:
take(n) returns the first n rows of records
take(n).foreach(println) prints them one per line
first and head methods:
Return the first row (head(n) returns the first n rows); see the sketch below
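A minimal sketch, again using studentDF as an illustration:
//First 3 rows as an array, printed one per line
studentDF.take(3).foreach(println)
//First row
studentDF.first()
//First 3 rows
studentDF.head(3)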
select method:
Can select multiple columns
filter method:
SQL functions can be used in the condition, e.g. substr to match rows whose value at a given position equals a specified string
studentDF.filter("substr(name, 0, 1) = 'M'").show
Sort method:
Descending order is available via desc
studentDF.sort(studentDF.col("name").desc, studentDF.col("id").desc).show
As method:
studentDF.select(studentDF.col("name").as("studentName")).show
Join method:
studentDF.join(studentDF2, studentDF.col("id") === studentDF2.col("id"))
Use the triple-equals operator (===) when comparing columns for equality
Dataset:
First appeared in Spark 1.6. It benefits from Spark SQL's optimizer and supports lambda expressions, but the Dataset API is not available in Python.
DataFrame = Dataset[Row]
Dataset: strongly typed, each record maps to a case class
DataFrame: weakly typed, each record is a generic Row (see the sketch below)
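A minimal sketch of the typed difference, assuming the Sales case class and the salesDF/salesDS values defined below:
//Dataset: fields are accessed as case class members and checked at compile time
salesDS.filter(sale => sale.amountPaid > 100.0).show()
//DataFrame: columns are referenced by name and only checked at runtime
salesDF.filter(salesDF.col("amountPaid") > 100.0).show()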
Method to read csv file into DataFrame:
val salesDF = spark.read.option("header", "true").option("inferSchema", "true").csv(path)
header tells the reader to parse the first line as the column names
inferSchema makes the reader infer the data type of each column
DF to DS method:
Create case class
as method
case class Sales(transactionId: Int, customerId: Int, itemId: Int, amountPaid: Double)
val salesDS = salesDF.as[Sales]
Select a single column and output it:
salesDS.map(line => line.itemId).show()
The difference between SQL, DF, and DS:
They differ in when errors are detected; DS is the strictest and catches errors earliest, including wrong column names.
(At compile time, SQL strings report neither command errors nor column-name errors; with the DF API, wrong method calls are reported at compile time but wrong column names are not; whatever is not caught at compile time only fails at runtime. A sketch follows.)
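A minimal sketch of the three cases, reusing salesDF/salesDS, assuming salesDF has been registered as a temporary view named "sales", and deliberately misspelling itemId as itemIdd (the failing lines are commented out):
//SQL string: compiles fine, the wrong column name only fails when the query runs
//spark.sql("SELECT itemIdd FROM sales").show()
//DataFrame: the method call is checked at compile time, but the column name is a string, so it also fails only at runtime
//salesDF.select("itemIdd").show()
//Dataset: the field is resolved on the Sales case class, so this does not even compile
//salesDS.map(sale => sale.itemIdd)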