Spark RDD Principle: How to Operate to Improve Efficiency
RDD, or "Resilient Distributed Dataset", is a basic data structure of
Spark. It is an immutable distributed collection of objects. Each data set in RDD is divided into several different logical partitions, which can be calculated according to different cluster nodes. RDD can contain any type of Python, Java or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. An RDD can be created from data in stable storage or through deterministic operations on other RDDs. It is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create an RDD: parallelize an existing collection in your driver program, or reference a data set in an external storage system, such as a shared file system, HDFS, HBase, or any data source that offers a Hadoop InputFormat.
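As a minimal Scala sketch of both approaches (the master setting, the HDFS path, and the sample values are placeholders, not taken from the original article):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddCreationExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // 1. Parallelize an existing collection in the driver program.
        val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

        // 2. Reference a data set in external storage; any Hadoop InputFormat source works.
        //    The HDFS path below is a placeholder.
        val lines = sc.textFile("hdfs:///data/input.txt")

        println(numbers.sum())   // actions trigger the actual computation
        println(lines.count())

        sc.stop()
      }
    }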
Spark uses the RDD concept to achieve faster and more efficient MapReduce operations. In this article, let us first discuss how MapReduce operations take place and why they are inefficient.
MapReduce data sharing is too slow
MapReduce is widely adopted for processing and generating large data sets with a parallel, distributed algorithm on a cluster. It lets users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most of today's popular big data processing frameworks, the only way to reuse data between computations (for example, between two MapReduce jobs) is to write it to an external stable storage system such as HDFS. Although these frameworks provide many abstractions for accessing a cluster's computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing in MapReduce is slow due to replication, serialization, and disk I/O. As for the storage system, most Hadoop applications spend more than 90% of their time performing HDFS read-write operations.
Iterative operations in MapReduce
In a multi-stage application, intermediate results are reused across multiple computations. Because each stage writes its output to stable storage and the next stage reads it back, the frequent data replication, disk I/O, and serialization add substantial overhead and make the system run slower.
In addition, users run ad-hoc queries on the same subset of data. Each query performs disk I/O against stable storage, which can dominate the application's execution time.
This is how interactive queries on MapReduce work in today's popular big data frameworks: every query goes through stable storage.
Use Spark RDD to share data
As noted earlier, data sharing in MapReduce is slow because of replication, serialization, and disk I/O: most Hadoop applications spend more than 90% of their time performing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD); it supports in-memory computation. This means the state of memory is stored as an object across jobs, and that object can be shared between jobs. Sharing data in memory is 10 to 100 times faster than sharing it over the network or through disk.
Next, let us analyze how iterative and interactive operations are performed in Spark RDD.
Iterative operations on Spark RDD
In this scheme, intermediate results are stored in distributed memory instead of stable storage (usually disk), which makes the system run faster.
[Note] If the distributed memory is not sufficient to hold the intermediate results (the state of the job), those results are stored on disk.
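A rough Scala sketch of such an iterative job, assuming an existing SparkContext named sc; the input path, the update rule, and the number of iterations are invented for illustration:

    // Keep the working set in distributed memory so every iteration reuses it
    // instead of rereading it from stable storage.
    val points = sc.textFile("hdfs:///data/points.txt")   // placeholder path
      .map(_.toDouble)
      .cache()

    var estimate = 0.0
    for (_ <- 1 to 10) {                                   // hypothetical iteration count
      // Each pass reads the cached RDD from memory; nothing is written back to HDFS.
      estimate = points.map(p => (p + estimate) / 2).mean()
    }
    println(s"final estimate: $estimate")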
If different queries need to be run repeatedly on the same set of data, that particular data can be kept in memory to achieve better execution times.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you can also persist an RDD in memory, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
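A brief sketch of these persistence options, again assuming an existing SparkContext sc and a placeholder input path:

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs:///data/logs.txt")            // placeholder path

    // Keep partitions in memory; anything that does not fit is spilled to disk.
    logs.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action materializes and caches the RDD ...
    println(logs.filter(_.contains("ERROR")).count())
    // ... and later queries read the cached copy instead of recomputing it from HDFS.
    println(logs.filter(_.contains("WARN")).count())

    // Other storage levels keep data only on disk or replicate partitions across
    // nodes, for example StorageLevel.DISK_ONLY or StorageLevel.MEMORY_ONLY_2.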