1. Overview
A DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database: each column has a name, a type, and values. The DataFrame API is available in four languages (Scala, Java, Python, R). In the Scala API it can be understood as: DataFrame = Dataset[Row]
Note:
The DataFrame API was introduced in Spark 1.3 (before 1.3 it was called SchemaRDD); Dataset was added in Spark 1.6.
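In the Scala API this equivalence is literal (`org.apache.spark.sql` defines `type DataFrame = Dataset[Row]`). A minimal spark-shell sketch of moving between the untyped DataFrame and a typed Dataset, using the people.json data from later in this post (the `Person` case class is illustrative):

```scala
// age is Option[Long] because the column is nullable
case class Person(name: String, age: Option[Long])

val df = spark.read.json("file:///home/hadoop/data/people.json") // DataFrame = Dataset[Row]
val ds = df.as[Person]                                           // typed Dataset[Person]
```

In spark-shell the implicits needed by `.as[Person]` are imported automatically; in a standalone program you would add `import spark.implicits._`.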
2. DataFrame vs RDD
Concept: both are distributed containers. A DataFrame is like a table: besides the data (as in an RDD) it also carries a Schema, and it supports complex column types (map, array, struct, ...).

API: the DataFrame API is richer than the RDD API. It still supports map, filter, flatMap, and so on, plus relational operations such as select, filter, and groupBy.

Data structure: an RDD knows the element type but has no column structure; a DataFrame carries Schema information, which enables optimization and gives better performance.

Execution: the runtime paths differ. RDD programs written in the Java/Scala API run as opaque code on the JVM, whereas a DataFrame query is compiled by Spark SQL into a logical plan (Logical Plan) and then a physical plan (Physical Plan), with self-optimization along the way, so the performance difference can be large.
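A minimal spark-shell sketch of the difference (the records and column names are illustrative): the same data as a schema-less RDD versus a DataFrame whose expressions the optimizer can see:

```scala
val rdd = sc.parallelize(Seq(("Andy", 30), ("Justin", 19)))  // RDD[(String, Int)]: types, but no column names

import spark.implicits._
val df = rdd.toDF("name", "age")   // DataFrame: column names + types = Schema
df.printSchema()

// The RDD filter is an opaque JVM closure; the DataFrame filter is an
// expression that Spark SQL can analyze and optimize:
rdd.filter(_._2 > 21)
df.filter($"age" > 21)
```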
3. json file operation
[hadoop@hadoop001 bin]$ ./spark-shell --master local[2] --jars ~/software/mysql-connector-java-5.1.34-bin.jar
- Read a JSON file
scala> val df = spark.read.json("file:///home/hadoop/data/people.json")
18/09/02 11:47:20 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
- Print schema information
scala> df.printSchema
root
 |-- age: long (nullable = true)     <- nullable = true: the field may be null
 |-- name: string (nullable = true)
- Show the data
scala> df.show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
- Select a single column
scala> df.select("name").show
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
- A single-quoted name ('name) is a Scala Symbol, turned into a Column by an implicit conversion
scala> df.select('name).show
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
- A stray double quote does not trigger any implicit conversion; without the closing quote it is just an unclosed string literal
scala> df.select("name).show
<console>:1: error: unclosed string literal
df.select("name).show
^
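For reference, the equivalent ways to refer to a column (all standard Spark SQL API; the `'` and `$` forms rely on the implicits that spark-shell imports automatically):

```scala
import org.apache.spark.sql.functions.col

df.select("name")       // plain string column name
df.select('name)        // Scala Symbol, via implicit conversion
df.select($"name")      // string interpolator, via implicit conversion
df.select(col("name"))  // explicit Column object
df.select(df("name"))   // Column bound to this DataFrame
```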
- Arithmetic on a column; note that null propagates, so null + 1 is still null
scala> df.select($"name",$"age" + 1).show
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+
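If null ages should take part in the calculation, a default can be substituted first. A sketch using the standard `na.fill` and `coalesce` functions (the default of 0 is arbitrary):

```scala
import org.apache.spark.sql.functions.{coalesce, lit}

// Option 1: replace null with 0 before computing
df.na.fill(Map("age" -> 0)).select($"name", $"age" + 1).show

// Option 2: coalesce inside the expression itself
df.select($"name", (coalesce($"age", lit(0)) + 1).as("age_plus_1")).show
```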
- Filter by age
scala> df.filter($"age" > 21).show
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
- Group by age
scala> df.groupBy("age").count.show
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+
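`groupBy(...).count` is a shortcut for `agg`; a sketch of richer aggregations on the same DataFrame (note that aggregate functions such as avg and max skip null values):

```scala
import org.apache.spark.sql.functions.{count, avg, max}

df.groupBy("age").agg(count("*").as("cnt")).show  // same result as groupBy("age").count
df.agg(avg("age"), max("age")).show               // nulls ignored: avg over 30 and 19 only
```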
- Create a temporary view
scala> df.createOrReplaceTempView("people")
scala> spark.sql("select * from people").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
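A temporary view created this way is scoped to the current SparkSession. For a view visible across sessions, Spark provides global temporary views, which live in the reserved `global_temp` database (the view name below is illustrative):

```scala
df.createGlobalTempView("people_g")

spark.sql("select name from global_temp.people_g where age is not null").show
spark.newSession().sql("select * from global_temp.people_g").show  // still visible in a new session
```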