The Difference Between RDD, DataFrame, and Dataset in Spark SQL

Source: Internet
Author: User
Tags: serialization

RDD, DataFrame, and Dataset are Spark's data collection abstractions. An RDD is a collection of individual objects, whereas a DataFrame (DF) and a Dataset (DS) are organized as rows.

RDD

Advantages:
Compile-time type safety
Type errors are caught at compile time
Object-oriented programming style
Data is manipulated directly through dot notation on the object (for example, object.field)

Disadvantages:
Serialization and deserialization overhead
Both communication between cluster nodes and IO operations require serializing and deserializing each object's structure as well as its data
GC overhead
Objects are created and destroyed frequently, which inevitably increases garbage collection (GC) overhead

DataFrame
DataFrame introduces two concepts: schema and off-heap

Schema: every row of data in an RDD has the same structure. That structure is stored once in the schema, and Spark reads the data through the schema. As a result, communication and IO only need to serialize and deserialize the data itself; the structural part can be omitted
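The saving described above can be sketched with a toy example. This is not Spark code: it uses only the Python standard library, the `SCHEMA` layout and the helper names are hypothetical, and JSON merely stands in for any format in which every record repeats its own structure. The fixed-width name field is a simplification chosen to keep the sketch short.

```python
import json
import struct

# Hypothetical schema, stated once for the whole collection rather than once per row.
SCHEMA = (("name", "16s"), ("age", "i"))
ROW = struct.Struct("<" + "".join(fmt for _, fmt in SCHEMA))  # 16 + 4 = 20 bytes per row

people = [(f"user{i}", 20 + i % 50) for i in range(1000)]


def encode_self_describing(person):
    """Structure + data: every record carries its own field names."""
    name, age = person
    return json.dumps({"name": name, "age": age}).encode("utf-8")


def encode_with_schema(person):
    """Data only: the field names and types live in SCHEMA."""
    name, age = person
    return ROW.pack(name.encode("utf-8"), age)


def decode_with_schema(buf):
    """The reader needs the schema to interpret the raw bytes."""
    raw_name, age = ROW.unpack(buf)
    return raw_name.rstrip(b"\x00").decode("utf-8"), age


self_describing_bytes = sum(len(encode_self_describing(p)) for p in people)
schema_bytes = sum(len(encode_with_schema(p)) for p in people)

print("self-describing:", self_describing_bytes, "bytes")
print("schema-based:   ", schema_bytes, "bytes")
```

In this sketch the schema-based rows are smaller because the field names are never repeated, and `decode_with_schema` can only make sense of the bytes because it knows the layout in advance.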

Off-heap: memory outside the JVM heap, allocated from the operating system and not managed by the JVM garbage collector. Spark can serialize the data (without the structure) in binary form into off-heap memory and, when the data needs to be manipulated, operate on that off-heap memory directly. Because Spark understands the schema, it knows how to interpret and modify those bytes

Off-heap memory is like a territory, and the schema is like a map. With its own map and its own territory, Spark can make its own decisions: it is no longer constrained by the JVM and no longer troubled by GC
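As a loose analogy only (Python has no JVM heap, and this is not how Spark is implemented), the idea of operating directly on schema-described binary data can be sketched with an anonymous memory mapping. The two-column layout, `SCORE_OFFSET`, and `add_bonus` are hypothetical names introduced for illustration.

```python
import mmap
import struct

# Hypothetical schema: id (int32) followed by score (float64), no padding.
ROW = struct.Struct("<id")          # 4 + 8 = 12 bytes per row
SCORE_OFFSET = 4                    # score starts right after the 4-byte id
NUM_ROWS = 4

# Anonymous mapping: a block of raw memory obtained from the operating system.
# The rows stored in it are plain bytes, not individual language-level objects.
buf = mmap.mmap(-1, ROW.size * NUM_ROWS)

for i in range(NUM_ROWS):
    ROW.pack_into(buf, i * ROW.size, i, i * 1.5)


def add_bonus(buffer, row_index, bonus):
    """Update one field in place; no row object is ever materialized."""
    pos = row_index * ROW.size + SCORE_OFFSET
    (score,) = struct.unpack_from("<d", buffer, pos)
    struct.pack_into("<d", buffer, pos, score + bonus)


add_bonus(buf, 2, 10.0)

# Materialize the rows only to inspect the result.
rows = [ROW.unpack_from(buf, i * ROW.size) for i in range(NUM_ROWS)]
print(rows)
buf.close()
```

The schema (the "map") is what lets `add_bonus` find the right bytes in the buffer (the "territory") without creating and discarding an object for each row.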

With schema and off-heap, DataFrame overcomes the shortcomings of RDD, but it also loses RDD's advantages: DataFrame is not type-safe, and its API is not object-oriented

Dataset
Dataset combines the benefits of RDD and DataFrame and introduces a new concept: the encoder

When serializing data, the encoder generates bytecode that interacts with off-heap memory, so that individual fields can be accessed on demand without deserializing the entire object
Spark does not yet provide an API for custom encoders, but one is planned for a future release
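On-demand access of the kind an encoder provides can be illustrated with a toy class. This is a minimal sketch under stated assumptions: `FixedRowEncoder` is a hypothetical name, it precomputes byte offsets instead of generating bytecode, and it is not Spark's encoder API.

```python
import struct


class FixedRowEncoder:
    """Toy encoder: maps named, fixed-width fields to a packed binary row."""

    def __init__(self, fields):
        # fields: sequence of (name, struct format character) pairs.
        self.fields = list(fields)
        self.row = struct.Struct("<" + "".join(fmt for _, fmt in self.fields))
        self.offsets = {}
        pos = 0
        for name, fmt in self.fields:
            self.offsets[name] = (pos, "<" + fmt)
            pos += struct.calcsize("<" + fmt)

    def encode(self, record):
        """Serialize a dict into bytes; only values are written."""
        return self.row.pack(*(record[name] for name, _ in self.fields))

    def decode(self, buf):
        """Deserialize the whole row back into a dict."""
        names = [name for name, _ in self.fields]
        return dict(zip(names, self.row.unpack(buf)))

    def read_field(self, buf, name):
        """Read a single field at its known offset, leaving the rest untouched."""
        pos, fmt = self.offsets[name]
        return struct.unpack_from(fmt, buf, pos)[0]


encoder = FixedRowEncoder([("id", "q"), ("age", "i"), ("score", "d")])
row = encoder.encode({"id": 42, "age": 31, "score": 88.5})

age = encoder.read_field(row, "age")   # touches 4 bytes at offset 8
print(age)
```

`read_field` is the part that mirrors the on-demand idea: it reads only the bytes of the requested field, while `decode` shows the full deserialization that it avoids.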
