Compared with an RDD, a DataFrame carries more information about the structure of the data, namely the schema. An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects.
The DataFrame's structural information lets Spark SQL know exactly which columns the dataset contains, and what each column's name and type is.
Beyond offering more operators than the RDD API, the more important benefit is improved execution efficiency, reduced data reads, and execution-plan optimizations such as filter pushdown and column pruning.
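As a hedged sketch of the schema difference (assuming Spark is available and a hypothetical people.json file exists; all names here are illustrative, not from the text above):

```scala
// Sketch only: requires Spark on the classpath; people.json is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("schema-demo").getOrCreate()

// An RDD of strings: to Spark these are opaque Java objects.
val rdd = spark.sparkContext.textFile("people.json")

// A DataFrame: Spark SQL knows the column names and types.
val df = spark.read.json("people.json")
df.printSchema()   // prints the inferred column names and types
```

Because the schema is known, Spark SQL can validate column references and plan the query before any data is processed, which an RDD of opaque objects does not allow.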
Improved execution efficiency
The RDD API is functional and emphasizes immutability: in most scenarios it prefers creating new objects over modifying existing ones. While this
makes for a clean and tidy API, it also makes Spark applications prone to creating large numbers of temporary objects at runtime, which puts pressure on the GC.
With the existing RDD API we can, of course, use the mapPartitions method to reuse mutable objects when producing data within a single partition,
reducing the cost of object allocation and GC, but this sacrifices code readability and requires the developer to have some understanding of Spark's
runtime mechanics, so the bar is high. Spark SQL, on the other hand, reuses objects as much as possible inside the framework; this breaks
immutability internally, but the data is converted back to an immutable form when it is returned to the user. Anyone who
develops with the DataFrame API gets these optimizations for free.
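To make the mapPartitions object-reuse idea concrete, here is a minimal sketch in plain Scala. A plain Iterator stands in for one RDD partition, and MutableRow and sumPartition are invented names for illustration, not Spark API:

```scala
// Illustrative only: MutableRow is a hypothetical mutable holder; a plain
// Iterator stands in for one RDD partition.
final class MutableRow(var key: String = "", var value: Long = 0L)

// Processes a whole partition while reusing a single MutableRow instance,
// instead of allocating a fresh object for every element.
def sumPartition(partition: Iterator[(String, Long)]): Iterator[Long] = {
  val row = new MutableRow()          // one allocation per partition
  var total = 0L
  while (partition.hasNext) {
    val (k, v) = partition.next()
    row.key = k                       // mutate in place, no new object
    row.value = v
    total += row.value
  }
  Iterator.single(total)
}
// With a real RDD this would be invoked as rdd.mapPartitions(sumPartition).
```

This is exactly the trade-off the text describes: fewer allocations and less GC pressure, at the cost of code that is harder to read than the immutable style.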
Reduced data reads
The quickest way to analyze big data is to ignore it. "Ignoring" here does not mean turning a blind eye; it means pruning based on the query conditions.
The partition pruning mentioned earlier for partitioned tables is one example: when a query's filter conditions involve partition columns, we can,
according to those conditions, cut out the partition directories that definitely do not contain the target data, thereby reducing IO.
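A hedged, Spark-free sketch of partition pruning (the Hive-style "date=..." directory names and the prunePartitions helper are made up for illustration):

```scala
// Illustrative only: partition directories encoded as "date=YYYY-MM-DD",
// mimicking how Hive-style partitioned tables are laid out on disk.
def prunePartitions(dirs: Seq[String], wantedDate: String): Seq[String] =
  dirs.filter { dir =>
    // Keep only directories whose partition value matches the filter;
    // pruned directories are never read at all, saving IO.
    dir.stripPrefix("date=") == wantedDate
  }
```

For example, with a filter on date = 2024-01-02, only the "date=2024-01-02" directory survives; every other partition directory is skipped without touching its files.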
For some "smart" data formats, Spark SQL can also prune based on statistics that ship with the data files. Simply put,
in these formats the data is stored in segments, and each segment carries basic statistics such as the maximum value, the minimum value, and the number of nulls.
When a segment's statistics show that it cannot contain any data matching the query conditions, the segment can be
skipped outright (for example, when the maximum value of an integer column within a segment is 100 and the query condition requires a > 200).
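A minimal sketch of statistics-based segment skipping (the Segment class and its min/max fields are hypothetical, mimicking the per-segment metadata the text describes):

```scala
// Illustrative only: each Segment carries min/max statistics for one
// integer column, alongside the rows themselves.
final case class Segment(min: Int, max: Int, rows: Seq[Int])

// For a predicate "col > threshold", any segment whose max is at or below
// the threshold cannot contain matching rows, so its data is never read.
def scanGreaterThan(segments: Seq[Segment], threshold: Int): Seq[Int] =
  segments
    .filter(_.max > threshold)                // segment-level skip via stats
    .flatMap(_.rows.filter(_ > threshold))    // row-level filter on survivors
```

With the example from the text, a segment whose maximum is 100 is discarded by the first filter for the condition a > 200 before any of its rows are examined.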
In addition, Spark SQL can take advantage of columnar storage formats such as RCFile, ORC, and Parquet to scan only the columns the query actually references, ignoring the data in the remaining columns.
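The column-pruning benefit can be sketched without Spark by modeling columnar storage directly (ColumnStore and scanColumns are invented stand-ins for formats like Parquet, for illustration only):

```scala
// Illustrative only: a columnar "file" stores each column contiguously,
// keyed by column name, so one column can be read without touching others.
final case class ColumnStore(columns: Map[String, Vector[Any]])

// Reads only the requested columns; data for the remaining columns is
// never accessed, which is what makes columnar scans cheap.
def scanColumns(store: ColumnStore, wanted: Seq[String]): Map[String, Vector[Any]] =
  wanted.flatMap(name => store.columns.get(name).map(name -> _)).toMap
```

In a row-oriented format, by contrast, every row must be read in full even when the query needs only one column.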
A DataFrame can be regarded as a special case of Dataset (in Spark 2.x, DataFrame is simply an alias for Dataset[Row]); the main difference is that each record in a Dataset stores a strongly typed value rather than
a Row. Therefore, they have the following three characteristics:
A Dataset checks types at compile time and provides an object-oriented programming interface.
A DataFrame is an interface oriented toward Spark SQL.
DataFrames and Datasets can be converted to each other.
df.as[ElementType] converts a DataFrame into a Dataset, and ds.toDF() converts a Dataset back into a DataFrame.
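A hedged end-to-end sketch of the two conversions (assumes a local SparkSession; the Person case class is illustrative, not from the text):

```scala
// Sketch only: requires Spark on the classpath; Person is an invented type.
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Long)

val spark = SparkSession.builder().master("local[*]").appName("conversions").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(Person("Alice", 30), Person("Bob", 25)).toDF()

// DataFrame -> Dataset, via df.as[ElementType]:
val ds: Dataset[Person] = df.as[Person]

// Dataset -> DataFrame, via ds.toDF():
val back: DataFrame = ds.toDF()
```

After the round trip, ds carries strongly typed Person records checked at compile time, while back is again an untyped collection of Row objects.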
This article, "DataFrame Learning Summary in Spark SQL", is from the "Star Moon Love" (xuegodxingyue) blog; please keep the source: http://xuegodxingyue.blog.51cto.com/5989753/1964917