DataFrame is not a concept invented by Spark SQL; it already exists in Pandas (and R)
Dataset: a distributed collection of data
DataFrame: a distributed dataset organized into named columns (conceptually an RDD with a schema)
Can be constructed from a variety of sources, such as RDDs, Hive/SQL tables, NoSQL stores, etc.
A higher-level abstraction than the RDD
DataFrame vs. RDD
A DataFrame carries explicit column (schema) information; an RDD does not
Execution efficiency:
RDD: Java/Scala code runs on the JVM, while Python code runs in Python's own runtime, so performance depends on the language
DataFrame: no matter which language is used, the code is compiled into the same logical plan, so performance is consistent across languages (see the explain() sketch below)
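For example, explain() can be used to inspect the plan Spark SQL generates; a minimal sketch, assuming a DataFrame named peopleDF as in the examples below:
//Print the parsed, analyzed and optimized logical plans plus the physical plan
peopleDF.filter(peopleDF.col("age") > 24).explain(true)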
DataFrame API:
printSchema() prints the schema as a tree structure
show() prints the contents; the number of rows can be limited by passing a count in the parentheses
select(COLUMN_NAME) queries all the data in a column (see the short sketch below)
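A minimal sketch of these calls, assuming a SparkSession named spark and a JSON file people.json with name and age fields (the path and file name are only for illustration):
//Read a JSON file into a DataFrame
val peopleDF = spark.read.json("file:///usr/local/mycode/people.json")
//Print the schema as a tree
peopleDF.printSchema()
//Show at most 5 rows
peopleDF.show(5)
//Query a single column
peopleDF.select("name").show()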
Comprehensive application:
peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 5).as("age after 5 years")).show()
Selects two columns, adds 5 to one of them, and renames the resulting column
filter:
filter()
peopleDF.filter(peopleDF.col("age") > 24).show()
Grouping:
groupBy()
peopleDF.groupBy("age").count().show()
Convert to a temporary view (for SQL operations):
createOrReplaceTempView() registers the DataFrame as a temporary view so it can be queried through the SQL API (see the sketch below)
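A minimal sketch, reusing peopleDF; the view name "people" is chosen only for illustration:
//Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")
//Query the view through the SQL API
spark.sql("SELECT name, age FROM people WHERE age > 24").show()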
DataFrame and RDD interoperability:
Two approaches
Both require a SparkSession as the entry point
val spark = SparkSession.builder().appName("DataFrameRDD").master("local[2]").getOrCreate()
The first: reflection
The code is concise; the precondition is that the structure of the schema is known in advance
A case class is used to define the fields that correspond to the schema
Steps: create the case class according to the schema
Generate the RDD: use textFile of SparkContext to read the file and convert it into an RDD of Strings
Import the implicit conversion package with import spark.implicits._
Split the RDD with the split method; map each resulting String array to the case class (i.e. pass the corresponding values into the class, remembering to convert the types first)
Call the toDF method to generate the DataFrame
Code:
//Define case class
case class Info(id: Int, name: String, age: Int)
//Import implicit conversions (required for toDF)
import spark.implicits._
//Generate RDD
val rdd = spark.sparkContext.textFile("file:///usr/local/mycode/info.txt")
//Cut, classify, convert
val infoDF = rdd.map(_.split(",")).map(line => Info(line(0).toInt, line(1), line(2).toInt)).toDF()
Note: if the separator is | or another regex metacharacter, it may need to be escaped, e.g. split("\\|")
The second: build the schema programmatically (dynamic)
Used when the schema is not known in advance
The records are first converted to Rows and combined with a StructType; this requires more code
Steps: generate the RDD
Split the RDD as in the first method, then convert it into an RDD of Rows
Define the StructType with an Array; each field's name, type and nullability are given by a StructField
Use the createDataFrame method to associate the RDD of Rows with the StructType
Code:
//Import Row and the schema types
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
//Generate RDD
val rdd = spark.sparkContext.textFile("file:///usr/local/mycode/info.txt")
//Split, turn into rowRDD
val rowRdd = rdd.map(_.split(",")).map(line => Row(line(0).toInt, line(1), line(2).toInt))
//Define StructType
val structType = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("age", IntegerType, true)))
//Associate rowRDD and StructType
val infoDF = spark.createDataFrame(rowRdd, structType)
DataFrame API details:
show method:
Only the first 20 rows are displayed by default; a larger number can be specified
Long values are truncated by default; setting the truncate argument to false disables truncation (see the sketch below)
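A minimal sketch of both options, reusing the studentDF name from the examples below:
//Show up to 30 rows and do not truncate long values
studentDF.show(30, false)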
take method:
take(n) returns the first n rows of records
take(n).foreach(println) prints them one per line
first and head methods:
Return the first row (head(n) returns the first n rows); see the sketch below
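A minimal sketch, again using studentDF as an illustration:
//First 3 rows as an array, printed one per line
studentDF.take(3).foreach(println)
//First row
studentDF.first()
//First 3 rows
studentDF.head(3)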
select method:
Can select multiple columns
filter method:
SQL functions can be used in the condition, e.g. substr to match rows whose value at a given position equals a specified string
studentDF.filter("substr(name, 0, 1) = 'M'").show
Sort method:
Descending order is available via desc
studentDF.sort(studentDF.col("name").desc, studentDF.col("id").desc).show
As method:
studentDF.select(studentDF.col("name").as("studentName")).show
Join method:
studentDF.join(studentDF2, studentDF.col("id") === studentDF2.col("id"))
Use the triple-equals operator (===) when comparing columns for equality
Dataset:
First appeared in Spark 1.6. It benefits from Spark SQL's optimizer and supports lambda expressions, but the Dataset API is not available in Python.
DataFrame = Dataset[Row]
Dataset: strongly typed, each record maps to a case class
DataFrame: weakly typed, each record is a generic Row (see the sketch below)
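A minimal sketch of the typed difference, assuming the Sales case class and the salesDF/salesDS values defined below:
//Dataset: fields are accessed as case class members and checked at compile time
salesDS.filter(sale => sale.amountPaid > 100.0).show()
//DataFrame: columns are referenced by name and only checked at runtime
salesDF.filter(salesDF.col("amountPaid") > 100.0).show()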
Method to read csv file into DataFrame:
val salesDF = spark.read.option("header", "true").option("inferSchema", "true").csv(path)
header tells the reader to parse the first line as the column names
inferSchema makes the reader infer the data type of each column
DF to DS method:
Create case class
as method
case class Sales(transactionId: Int, customerId: Int, itemId: Int, amountPaid: Double)
val salesDS = salesDF.as[Sales]
Select a single column and output it:
salesDS.map(line => line.itemId).show()
The difference between SQL, DF, and DS:
They differ in when errors are detected; DS is the strictest and catches errors earliest, including wrong column names.
(At compile time, SQL strings report neither command errors nor column-name errors; with the DF API, wrong method calls are reported at compile time but wrong column names are not; whatever is not caught at compile time only fails at runtime. A sketch follows.)
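A minimal sketch of the three cases, reusing salesDF/salesDS, assuming salesDF has been registered as a temporary view named "sales", and deliberately misspelling itemId as itemIdd (the failing lines are commented out):
//SQL string: compiles fine, the wrong column name only fails when the query runs
//spark.sql("SELECT itemIdd FROM sales").show()
//DataFrame: the method call is checked at compile time, but the column name is a string, so it also fails only at runtime
//salesDF.select("itemIdd").show()
//Dataset: the field is resolved on the Sales case class, so this does not even compile
//salesDS.map(sale => sale.itemIdd)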