DataFrame Learning Summary in Spark SQL


Compared with an RDD, a DataFrame carries extra information about the structure of the data, namely the schema.

An RDD is a distributed collection of Java objects, while a DataFrame is a distributed collection of Row objects.

This detailed structural information lets Spark SQL know exactly which columns a dataset contains, and what each column's name and type are.

Besides offering richer operators than the RDD, the more important advantages of the DataFrame are improved execution efficiency, reduced data reads, and execution-plan optimizations such as filter pushdown and column pruning.

Improve execution efficiency

The RDD API is functional and emphasizes immutability: in most scenarios it prefers creating new objects over modifying old ones. This trait yields a clean, tidy API, but it also makes Spark applications prone to creating large numbers of temporary objects at run time, which puts pressure on the GC. On top of the existing RDD API we can, of course, use the mapPartitions method to override how data is created within a single RDD partition, reusing mutable objects to reduce the cost of object allocation and GC; but this sacrifices code readability and requires the developer to have some understanding of Spark's runtime mechanism, so the bar is high. Spark SQL, on the other hand, already reuses objects inside the framework wherever possible. This breaks immutability internally, but the data is converted back to an immutable form before it is returned to the user. By developing with the DataFrame API, you get these optimizations for free.
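For illustration, here is a minimal sketch of the mapPartitions trick described above, reusing one mutable buffer per partition; the app name, sample data, and labeling logic are made up for the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reuse-sketch").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000000)

val labeled = rdd.mapPartitions { iter =>
  // One StringBuilder per partition, reused for every record instead of
  // allocating a fresh object per element.
  val buf = new StringBuilder
  iter.map { n =>
    buf.setLength(0)                 // reset the buffer rather than reallocate
    buf.append("row-").append(n)
    buf.toString                     // hand an immutable String back to the caller
  }
}

labeled.take(3).foreach(println)     // row-1, row-2, row-3
```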

Reduce data reads

The quickest way to analyze big data is to ignore it. "Ignoring" here does not mean turning a blind eye; it means pruning away irrelevant data based on the query conditions.

The partition pruning mentioned above for partitioned tables is one example: when the query's filter condition involves partition columns, we can use the query criteria to cut out the partition directories that definitely do not contain the target data, reducing I/O.
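A minimal sketch of what this looks like in practice, assuming a Parquet table laid out in date= partition directories (the path and column names are invented for the example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-pruning").master("local[*]").getOrCreate()
import spark.implicits._

// Assumed layout: /data/events/date=2017-09-01/..., /data/events/date=2017-09-02/...
val events = spark.read.parquet("/data/events")

// Because `date` is a partition column, Spark resolves this filter against
// directory names alone and never reads files under non-matching partitions.
val oneDay = events.filter($"date" === "2017-09-01")
```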

For some "smart" data formats, Spark SQL can also be pruned based on the statistics that are included with the data file. Simple to

In this type of data format, the data is stored in segments, with each piece of data with the maximum, minimum, and number of null values, and some basic

Statistical information. When a data segment of a statistic table name does not include the target data that meets the query criteria, the data segment can be directly

Skip (for example, the maximum value for a segment of an integer column is 100, and the query condition requires a > 200).
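Nothing special is required to benefit from this; a plain pushed-down filter is enough. A sketch matching the a > 200 example above (the path is illustrative, and the exact row-group skipping behavior depends on the format and Spark version):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("segment-skipping").master("local[*]").getOrCreate()
import spark.implicits._

val nums = spark.read.parquet("/data/numbers")

// The predicate is pushed down to the Parquet reader, which consults each row
// group's min/max statistics for column `a`; a row group whose recorded max
// is 100 can be skipped without ever being decompressed or decoded.
val big = nums.filter($"a" > 200)
```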

In addition, Spark SQL can take advantage of columnar storage formats such as RCFile, ORC, and Parquet to scan only the columns the query actually touches, ignoring the data in the remaining columns.
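For instance, a minimal sketch assuming a Parquet table with more columns than we need (the path and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("column-pruning").master("local[*]").getOrCreate()

// Even if /data/users has dozens of columns, only the column chunks for
// `name` and `age` are read from disk; the rest are never touched.
val namesAndAges = spark.read.parquet("/data/users").select("name", "age")
```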

A Dataset can be considered a special case of a DataFrame; the main difference is that each record in a Dataset stores a strongly typed value rather than a Row. It therefore has the following characteristics:

Datasets are type-checked at compile time.

Datasets provide an object-oriented programming interface.


A DataFrame, by contrast, is an interface oriented toward Spark SQL.

DataFrames and Datasets can be converted to each other:

df.as[ElementType] converts a DataFrame into a Dataset, and ds.toDF() converts a Dataset back into a DataFrame.
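A minimal round-trip sketch; the Person case class and the sample rows are assumptions for the example:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("df-ds-roundtrip").master("local[*]").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

// DataFrame -> Dataset: as[T] attaches a strongly typed encoder, so field
// names and types are checked when the Dataset is used.
val ds: Dataset[Person] = df.as[Person]

// Dataset -> DataFrame: toDF() drops back to untyped Row records.
val back: DataFrame = ds.toDF()
```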


This article is from the "Star Moon Love" blog; please be sure to keep this source: http://xuegodxingyue.blog.51cto.com/5989753/1964917
