RDD, DataFrame, and Dataset are Spark's abstractions for collections of data. An RDD works on individual objects, whereas a DataFrame or Dataset works on rows
RDD
Advantages:
Compile-time type safety
Type errors can be caught at compile time
Object-oriented programming style
Data is manipulated directly through the class name and dot notation
Disadvantages:
Performance overhead of serialization and deserialization
Both communication between cluster nodes and IO operations require serializing and deserializing the object's structure as well as its data
GC performance overhead
Frequent creation and destruction of objects inevitably increases GC overhead
DataFrame
DataFrame introduces a schema and off-heap storage
Schema: every row of data in an RDD has the same structure, and that structure is stored in the schema. Spark can read the data through the schema, so communication and IO only need to serialize and deserialize the data itself; the structure can be omitted
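The effect of keeping the structure in a schema can be sketched outside Spark. The following is a plain-Python illustration, not Spark code: the rows, the SCHEMA tuple, and the helper names are all hypothetical, JSON stands in for a self-describing object format, and the struct module stands in for a schema-based binary format.

```python
import json
import struct

# Hypothetical rows; every row has the same structure.
rows = [("alice", 30), ("bob", 25), ("carol", 41)]

# Self-describing encoding: the field names travel with every single row.
self_describing = [
    json.dumps({"name": name, "age": age}).encode("utf-8") for name, age in rows
]

# Schema-based encoding: the structure is stored once, rows carry values only.
SCHEMA = (("name", "8s"), ("age", "i"))  # (column name, struct format code)
ROW_FORMAT = "<" + "".join(code for _, code in SCHEMA)


def encode_row(row):
    """Pack one row into bytes; no field names are written."""
    name, age = row
    return struct.pack(ROW_FORMAT, name.encode("utf-8"), age)


def decode_row(data):
    """Unpack one row using the shared schema."""
    name, age = struct.unpack(ROW_FORMAT, data)
    return name.rstrip(b"\x00").decode("utf-8"), age


schema_based = [encode_row(row) for row in rows]

size_self_describing = sum(len(b) for b in self_describing)
size_schema_based = sum(len(b) for b in schema_based)
print(size_self_describing, size_schema_based)
```

In this sketch the schema-based rows are smaller because the structure is never repeated, which is the saving the schema is meant to provide during communication and IO.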
Off-heap: memory outside the JVM heap, managed directly by the operating system rather than by the JVM. Spark can serialize the data (without the structure) in binary form into off-heap memory and, when the data needs to be manipulated, operate on that memory directly. Because Spark understands the schema, it knows how to interpret the bytes
Off-heap memory is like a territory and the schema is like a map. With its own territory and a map of it, Spark makes its own decisions: it is no longer restricted by the JVM and no longer troubled by GC
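The idea of operating directly on binary data with the help of a schema can be sketched as follows. This is only an analogy under stated assumptions: a Python bytearray still lives on Python's own heap and merely stands in for a raw memory region, and the two-column layout (an int32 id followed by a float64 score) and all function names are made up for the illustration.

```python
import struct

# Hypothetical schema: (id: int32, score: float64), fixed width, no padding.
ROW = struct.Struct("<id")
SCORE = struct.Struct("<d")
SCORE_OFFSET = 4  # bytes taken by the int32 column that precedes score


def write_rows(rows):
    """Lay all rows out back to back in one flat buffer (the 'territory')."""
    buf = bytearray(ROW.size * len(rows))
    for index, (row_id, score) in enumerate(rows):
        ROW.pack_into(buf, index * ROW.size, row_id, score)
    return buf


def read_score(buf, row_index):
    """Read one column of one row by offset (the 'map'); no row object is built."""
    return SCORE.unpack_from(buf, row_index * ROW.size + SCORE_OFFSET)[0]


def add_bonus(buf, row_index, bonus):
    """Update a value in place inside the buffer."""
    offset = row_index * ROW.size + SCORE_OFFSET
    (score,) = SCORE.unpack_from(buf, offset)
    SCORE.pack_into(buf, offset, score + bonus)


buf = write_rows([(1, 10.0), (2, 20.5), (3, 30.0)])
add_bonus(buf, 1, 1.5)
print(read_score(buf, 1))
```

No per-row objects are created when reading or updating a score, so in this sketch there is nothing for a garbage collector to track beyond the single buffer.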
With the schema and off-heap storage, DataFrame solves the shortcomings of RDD, but it loses RDD's advantages: DataFrame is not type-safe, and its API is not object-oriented in style
Dataset
Dataset combines the benefits of RDD and DataFrame, and introduces a new concept: the encoder
When serializing data, the encoder generates bytecode that interacts with off-heap memory, so that data can be accessed on demand without deserializing the entire object
Spark does not yet provide an API for custom encoders, but one is planned for a future release
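The on-demand access that an encoder enables can be illustrated with a toy version in plain Python. This is not Spark's encoder and does not generate bytecode; it precomputes field offsets instead. The Person class, the Encoder class, and the type-to-format mapping are all hypothetical names chosen for the sketch.

```python
import struct
from dataclasses import dataclass, fields


@dataclass
class Person:  # hypothetical typed record
    id: int
    age: int
    score: float


# Assumed mapping for this sketch: int -> int64, float -> float64.
_CODES = {int: "q", float: "d"}


class Encoder:
    """Toy encoder that derives a fixed binary layout from a dataclass."""

    def __init__(self, cls):
        self.cls = cls
        self.offsets = {}
        layout = "<"
        offset = 0
        for f in fields(cls):
            code = _CODES[f.type]
            field_struct = struct.Struct("<" + code)
            self.offsets[f.name] = (offset, field_struct)
            offset += field_struct.size
            layout += code
        self.row = struct.Struct(layout)

    def encode(self, obj):
        """Serialize a typed object into its binary form."""
        return self.row.pack(*(getattr(obj, f.name) for f in fields(self.cls)))

    def decode(self, data):
        """Rebuild the whole object (the expensive path)."""
        return self.cls(*self.row.unpack(data))

    def get(self, data, name):
        """Read a single field from the bytes without rebuilding the object."""
        offset, field_struct = self.offsets[name]
        return field_struct.unpack_from(data, offset)[0]


encoder = Encoder(Person)
data = encoder.encode(Person(id=7, age=42, score=3.5))
print(encoder.get(data, "age"))
```

The caller still works with a typed Person class, as with an RDD, while get reads one field straight from the binary form, as a DataFrame would.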