Spark: DataFrame and RDD

Source: Internet
Author: User
Tags: memory usage in Python


DataFrame and RDD are confusing concepts for Spark beginners. The following are learning notes from a Berkeley Spark course that record the similarities and differences between DataFrame and RDD.


First, look at the explanations from the official documentation:

DataFrame: in Spark, a DataFrame is a distributed dataset organized into named columns. It is equivalent to a table in a relational database, or to a data frame in R/Python (but with more optimizations under the hood).

RDD: an RDD (Resilient Distributed Dataset) is a distributed collection of elements scattered across the machines of a cluster.


The following diagram illustrates the difference between the two structures. On the left is an RDD[Person]: although Person is the type parameter, the Spark framework itself does not understand the internal structure of the Person class. The DataFrame on the right provides detailed structural information, so Spark SQL knows exactly which columns the dataset contains, and the name and type of each column. In other words, a DataFrame carries more information about the structure of the data, namely the schema.


First, the similarities

1. Immutable after creation: Spark emphasizes immutability and, in most scenarios, prefers creating new objects to modifying old ones.
2. By tracking the record of data operations (the lineage), lost data can be recalculated efficiently.
3. Both support operations on distributed data.
4. Both a DataFrame and an RDD can be created from the following data sources:
   (1) a Python collection distributed across the cluster;
   (2) a pandas DataFrame or another Spark dataset;
   (3) files in HDFS or other file systems.
5. Both a DataFrame and an RDD have two types of operations: transformations and actions.
Second, the differences
For an RDD, the number of partitions of a dataset can be set programmatically (when not set, the default is the number of CPU cores assigned to the program). More partitions generally means the work is better distributed across the cluster.



Advantages of DataFrame

Relative to the RDD, Spark optimizes the DataFrame in both execution time and memory usage. The Catalyst optimization engine reduces DataFrame execution time by 75%, and Project Tungsten's off-heap memory management reduces memory usage by 75%.

Catalyst optimization engine


The upper-right corner of the figure shows the time taken to aggregate a million integer pairs using Python and Scala RDDs and the DataFrame. We can see that executing with an RDD, in either Python or Scala, is significantly slower than with the DataFrame, and that Scala is faster than Python when executing an RDD, because Python is an interpreted language while Scala is a compiled one. But for the DataFrame there is no difference between the two languages. As the green section shows, Catalyst compiles both the Scala and the Python DataFrame operations into a physical plan and generates JVM bytecode, so the Python version incurs no interpretation overhead. As a result, the two languages have essentially the same performance; both run about four times faster than the Python RDD implementation and about twice as fast as the Scala RDD implementation. The table in the lower left shows the effect of Catalyst optimization: Catalyst-optimized code is significantly faster than interpreted code and close in speed to hand-written code.



Project Tungsten off-heap memory management



The green portion in the upper-right corner of the image above is the memory space occupied when using a DataFrame, three-quarters less than with an RDD. The lower-left corner shows the results of the Tungsten optimization.

As mentioned above, the DataFrame has clear advantages over the RDD in both time and space, so how do we choose between them?

1. When to use a DataFrame
   The data is structured or semi-structured; you operate on it with transformations and actions; you need faster execution and less memory.
2. When to use an RDD
   The data is unstructured, such as audio and video media or text data streams; few operations are required on the data; you want to work with the data in a functional programming style.

In general, try to choose the DataFrame.
