DataFrame Learning Summary in Spark SQL


Compared with an RDD, a DataFrame carries extra information about the structure of the data, namely the schema.

An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects.

This detailed structural information lets Spark SQL know exactly which columns a dataset contains, and what each column's name and type are.

Besides offering richer operators than the RDD, the more important advantages of the DataFrame are improved execution efficiency, reduced data reads, and execution-plan optimizations such as filter pushdown and column pruning.

Improve execution efficiency

The RDD API is functional and emphasizes immutability: in most scenarios it prefers creating new objects over modifying old ones. This trait yields a clean, tidy API, but it also makes Spark applications prone to creating large numbers of temporary objects at run time, which puts pressure on the GC. On top of the existing RDD API we can, of course, use the mapPartitions method to override how data is created within a single RDD partition, reusing mutable objects to reduce the cost of object allocation and GC; but this sacrifices code readability and requires the developer to have some understanding of Spark's runtime mechanism, so the bar is high. Spark SQL, on the other hand, already reuses objects inside the framework wherever possible. This breaks immutability internally, but the data is converted back to an immutable form before it is returned to the user. By developing with the DataFrame API, you get these optimizations for free.
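For illustration, here is a minimal sketch of the mapPartitions trick described above, reusing one mutable buffer per partition; the app name, sample data, and labeling logic are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reuse-sketch").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000000)

val labeled = rdd.mapPartitions { iter =>
  // One StringBuilder per partition, reused for every record instead of
  // allocating a fresh object per element.
  val buf = new StringBuilder
  iter.map { n =>
    buf.setLength(0)                 // reset the buffer rather than reallocate
    buf.append("row-").append(n)
    buf.toString                     // hand an immutable String back to the caller
  }
}

labeled.take(3).foreach(println)     // row-1, row-2, row-3
```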

Reduce data reads

The quickest way to analyze big data is to ignore it. "Ignoring" here does not mean turning a blind eye; it means pruning away irrelevant data based on the query conditions.

The partition pruning mentioned above for partitioned tables is one example: when the query's filter condition involves partition columns, we can use the query criteria to cut out the partition directories that definitely do not contain the target data, reducing I/O.
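A minimal sketch of what this looks like in practice, assuming a Parquet table laid out in date= partition directories (the path and column names are invented for the example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-pruning").master("local[*]").getOrCreate()
import spark.implicits._

// Assumed layout: /data/events/date=2017-09-01/..., /data/events/date=2017-09-02/...
val events = spark.read.parquet("/data/events")

// Because `date` is a partition column, Spark resolves this filter against
// directory names alone and never reads files under non-matching partitions.
val oneDay = events.filter($"date" === "2017-09-01")
```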

For some "smart" data formats, Spark SQL can also be pruned based on the statistics that are included with the data file. Simple to

In this type of data format, the data is stored in segments, with each piece of data with the maximum, minimum, and number of null values, and some basic

Statistical information. When a data segment of a statistic table name does not include the target data that meets the query criteria, the data segment can be directly

Skip (for example, the maximum value for a segment of an integer column is 100, and the query condition requires a > 200).
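Nothing special is required to benefit from this; a plain pushed-down filter is enough. A sketch matching the a > 200 example above (the path is illustrative, and the exact row-group skipping behavior depends on the format and Spark version):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("segment-skipping").master("local[*]").getOrCreate()
import spark.implicits._

val nums = spark.read.parquet("/data/numbers")

// The predicate is pushed down to the Parquet reader, which consults each row
// group's min/max statistics for column `a`; a row group whose recorded max
// is 100 can be skipped without ever being decompressed or decoded.
val big = nums.filter($"a" > 200)
```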

In addition, Spark SQL can take advantage of columnar storage formats such as RCFile, ORC, and Parquet to scan only the columns the query actually touches, ignoring the data in the remaining columns.
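For instance, a minimal sketch assuming a Parquet table with more columns than we need (the path and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("column-pruning").master("local[*]").getOrCreate()

// Even if /data/users has dozens of columns, only the column chunks for
// `name` and `age` are read from disk; the rest are never touched.
val namesAndAges = spark.read.parquet("/data/users").select("name", "age")
```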

A Dataset can be considered a special case of a DataFrame; the main difference is that each record in a Dataset stores a strongly typed value rather than a Row. It therefore has the following characteristics:

Datasets are type-checked at compile time.

Datasets provide an object-oriented programming interface.


A DataFrame, by contrast, is an interface oriented toward Spark SQL.

DataFrames and Datasets can be converted to each other:

df.as[ElementType] converts a DataFrame into a Dataset, and ds.toDF() converts a Dataset back into a DataFrame.
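A minimal round-trip sketch; the Person case class and the sample rows are assumptions for the example:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("df-ds-roundtrip").master("local[*]").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// DataFrame -> Dataset: as[T] attaches a strongly typed encoder, so field
// names and types are checked when the Dataset is used.
val ds: Dataset[Person] = df.as[Person]

// Dataset -> DataFrame: toDF() drops back to untyped Row records.
val back: DataFrame = ds.toDF()
```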


This article is from the "Star Moon Love" blog; please be sure to keep this source: http://xuegodxingyue.blog.51cto.com/5989753/1964917
