DataFrame Understanding

The DataFrame concept was not invented by Spark SQL; it already exists in Pandas.
Dataset: a distributed collection of data.
DataFrame: a distributed collection of data organized into named columns (essentially an RDD with a schema).
It can be built from many sources, such as existing RDDs, structured files, and SQL/NoSQL databases.
It is a higher-level abstraction than the RDD.
 
DataFrame vs. RDD

A DataFrame carries explicit column (schema) information; an RDD does not.
 
Execution efficiency:
RDD: Java/Scala code runs on the JVM, while Python RDD code runs in Python's own runtime, so performance differs between languages.
DataFrame: regardless of the language used, the query is compiled into the same logical plan, so performance is consistent across languages.
 
 
DataFrame API:
 
printSchema() prints the schema as a tree structure.
show() prints the contents; the number of rows displayed can be limited by passing a count in parentheses.
select(COLUMN_NAME) returns all values of a column.
Combined example:
peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 5).as("age after 5 years")).show()
This selects two columns, adds 5 to one of them, and renames the computed column.
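A minimal sketch of these basic calls, assuming a people.json file with name and age fields (the file name and path are only illustrative):
Code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DataFrameAPI").master("local[2]").getOrCreate()
//Load a JSON file into a DataFrame; each JSON object becomes one row
val peopleDF = spark.read.json("file:///usr/local/mycode/people.json")
//Print the inferred schema as a tree
peopleDF.printSchema()
//Show only the first 5 rows
peopleDF.show(5)
//Return all values of the "name" column
peopleDF.select("name").show()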
 
filter:
filter()
peopleDF.filter(peopleDF.col("age") > 24).show()
 
Grouping:
groupBy()
peopleDF.groupBy("age").count().show()
 
Converting to a temporary view (for SQL-style operation):
createOrReplaceTempView() registers the DataFrame as a temporary view, so it can then be queried through the SQL API.
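For example, a short sketch (the view name and query are illustrative):
Code:
//Register the DataFrame as a temporary view named "people"
peopleDF.createOrReplaceTempView("people")
//Query the view with ordinary SQL through the SparkSession
spark.sql("SELECT name, age FROM people WHERE age > 24").show()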
 
 
DataFrame and RDD interoperability:
There are two approaches.

Both start by creating a SparkSession as the entry point:
val spark = SparkSession.builder().appName("DataFrameRDD").master("local[2]").getOrCreate()
 
The first approach: reflection
The code is concise, but the schema must be known in advance.
It relies on a case class whose fields correspond to the columns of the schema.
Create the case class according to the schema.
Generate an RDD: use SparkContext's textFile to read the file into an RDD of Strings.
Import spark.implicits._ to bring the implicit conversions into scope.
Split the RDD with the split method; after splitting each line into a String array, map it onto the case class (that is, pass the values into the class, converting their types first).
Call the toDF method to produce the DataFrame.
 
Code:
//Define the case class
case class Info(id: Int, name: String, age: Int)
//Import implicit conversions (required for toDF)
import spark.implicits._
//Generate the RDD
val rdd = spark.sparkContext.textFile("file:///usr/local/mycode/info.txt")
//Split each line, convert the field types, map to the case class, and call toDF
val infoDF = rdd.map(_.split(",")).map(line => Info(line(0).toInt, line(1), line(2).toInt)).toDF()
ps: If the separator is | or another regex metacharacter, remember that split takes a regular expression, so the character must be escaped, e.g. split("\\|").
 
 
The second approach: programmatically specify the schema (dynamic)
Use this when the schema is not known in advance.
The lines are first converted to Rows and combined with a StructType; this takes more code.
Generate the RDD.
Split the RDD as in step 4 of the first approach, then convert it into an RDD of Rows.
Define the StructType using an Array; the type of each field is described by a StructField.
Use the createDataFrame method to combine the row RDD with the StructType.
 
Code:
//Imports needed for Row and the schema types
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
//Generate the RDD
val rdd = spark.sparkContext.textFile("file:///usr/local/mycode/info.txt")
//Split each line and turn it into an RDD of Rows
val rowRdd = rdd.map(_.split(",")).map(line => Row(line(0).toInt, line(1), line(2).toInt))
//Define the StructType (the third StructField argument marks the field as nullable)
val structType = StructType(Array(StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))
//Combine the row RDD with the StructType
val infoDF = spark.createDataFrame(rowRdd, structType)
 
 
 
 
DataFrame API details:
show method:
Only the first 20 rows are displayed by default; a larger number can be specified.
Long column values are truncated by default; if the truncate argument is set to false, they are displayed in full.
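For instance, a small sketch (studentDF stands for any DataFrame):
Code:
//Show up to 30 rows and do not truncate long column values
studentDF.show(30, false)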
 
take method:
take(n) returns the first n rows.
take(n).foreach(...) prints them row by row.
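For example:
Code:
//take(5) returns the first 5 rows as an Array[Row]; foreach prints each Row on its own line
studentDF.take(5).foreach(println)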
 
first and head methods:
first() and head() return the first row; head(n) returns the first n rows.
 
select method:
Multiple columns can be selected at once.
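For example (the column names are assumed):
Code:
//Select several columns at once by name
studentDF.select("id", "name").show()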
 
filter method:
The condition can also use functions such as substr, for example to find rows where the first character of a column equals a given value:
studentDF.filter("substr(name, 0, 1) = 'M'").show
 
sort method:
Descending order is available via desc:
studentDF.sort(studentDF.col("name").desc, studentDF.col("id").desc).show
 
as method:
studentDF.select(studentDF.col("name").as("studentName")).show
 
join method:
studentDF.join(studentDF2, studentDF.col("id") === studentDF2.col("id")).show
Use three equals signs (===) when testing column equality.
 
 
 
 
Dataset:
 
The Dataset first appeared in Spark 1.6. It benefits from Spark SQL's optimizations and supports lambda expressions, but the Dataset API is not available in Python.
 
DF = DS[Row] (a DataFrame is a Dataset of Rows)
DS: strongly typed, each row is a case class
DF: weakly typed, each row is a Row
 
 
Reading a CSV file into a DataFrame:
val salesDF = spark.read.option("header", "true").option("inferSchema", "true").csv(path)
header tells Spark to parse the first row as column names.
inferSchema tells Spark to infer the type of each column.
 
Converting a DataFrame (DF) to a Dataset (DS):
Create a case class.
Use the as method.

case class Sales(transactionId: Int, customerId: Int, itemId: Int, amountPaid: Double)
val salesDS = salesDF.as[Sales]
Selecting a single column for output:
salesDS.map(line => line.itemId).show()
 
The difference between SQL, DF, and DS:
The timing of error detection differs: DS is the strictest and catches errors earliest, including misspelled column names.
(At compile time, SQL statements and column names are not checked at all; DF API misuse is caught, but wrong column names are not; anything not caught at compile time only surfaces as an error at runtime.)
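A brief sketch of the contrast, reusing the salesDS and salesDF from above:
Code:
//Strongly typed access: the field name itemId is checked at compile time
salesDS.map(sale => sale.itemId).show()
//Weakly typed access: a misspelled column name such as "itemid" would still compile,
//but would only fail at runtime with an AnalysisException
salesDF.select("itemId").show()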