Spark: DataFrame and RDD

Source: Internet
Author: User
Tags: memory usage in Python


DataFrame and RDD are confusing concepts for Spark beginners. The following are learning notes from a Berkeley Spark course that record the similarities and differences between DataFrame and RDD.


First, look at the explanations from the official documentation:

DataFrame: in Spark, a DataFrame is a distributed dataset organized into named columns. It is equivalent to a table in a relational database, or to a data frame in R/Python (but with more optimizations under the hood).

RDD: an RDD (Resilient Distributed Dataset) is a distributed collection of elements scattered across the machines of a cluster.


The following diagram illustrates the difference between the two structures. On the left is an RDD[Person]: although Person is the type parameter, the Spark framework itself does not understand the internal structure of the Person class. The DataFrame on the right provides detailed structural information, so Spark SQL knows exactly which columns the dataset contains, and the name and type of each column. In other words, a DataFrame carries more information about the structure of the data, namely the schema.


First, the similarities

1. Immutable after creation: Spark emphasizes immutability and, in most scenarios, prefers creating new objects to modifying old ones.
2. By tracking the record of data operations (the lineage), lost data can be recalculated efficiently.
3. Both support operations on distributed data.
4. Both a DataFrame and an RDD can be created from the following data sources:
   (1) a Python collection distributed across the cluster;
   (2) a pandas DataFrame or another Spark dataset;
   (3) files in HDFS or other file systems.
5. Both a DataFrame and an RDD have two types of operations: transformations and actions.
Second, the differences
For an RDD, the number of partitions of a dataset can be set programmatically (when not set, the default is the number of CPU cores assigned to the program). More partitions generally means the work is better distributed across the cluster.



Advantages of DataFrame

Relative to the RDD, Spark optimizes the DataFrame in both execution time and memory usage. The Catalyst optimization engine reduces DataFrame execution time by 75%, and Project Tungsten's off-heap memory management reduces memory usage by 75%.

Catalyst optimization engine


The upper-right corner of the figure shows the time taken to aggregate a million integer pairs using Python and Scala RDDs and the DataFrame. We can see that executing with an RDD, in either Python or Scala, is significantly slower than with the DataFrame, and that Scala is faster than Python when executing an RDD, because Python is an interpreted language while Scala is a compiled one. But for the DataFrame there is no difference between the two languages. As the green section shows, Catalyst compiles both the Scala and the Python DataFrame operations into a physical plan and generates JVM bytecode, so the Python version incurs no interpretation overhead. As a result, the two languages have essentially the same performance; both run about four times faster than the Python RDD implementation and about twice as fast as the Scala RDD implementation. The table in the lower left shows the effect of Catalyst optimization: Catalyst-optimized code is significantly faster than interpreted code and close in speed to hand-written code.



Project Tungsten off-heap memory management



The green portion in the upper-right corner of the image above is the memory space occupied when using a DataFrame, three-quarters less than with an RDD. The lower-left corner shows the results of the Tungsten optimization.

As mentioned above, the DataFrame has clear advantages over the RDD in both time and space, so how do we choose between them?

1. When to use a DataFrame
   The data is structured or semi-structured; you operate on it with transformations and actions; you need faster execution and less memory.
2. When to use an RDD
   The data is unstructured, such as audio and video media or text data streams; few operations are required on the data; you want to work with the data in a functional programming style.

In general, try to choose the DataFrame.
