Spark RDD Principle: How to Operate to Improve Efficiency
RDD, or "Resilient Distributed Dataset", is a basic data structure of
Spark. It is an immutable distributed collection of objects. Each data set in RDD is divided into several different logical partitions, which can be calculated according to different cluster nodes. RDD can contain any type of Python, Java or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. An RDD can be created from data in stable storage or through deterministic operations on other RDDs. It is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create an RDD: parallelize an existing collection in your driver program, or reference a data set in an external storage system, such as a shared file system, HDFS, HBase, or any data source that offers a Hadoop InputFormat.
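As a minimal Scala sketch of both approaches (the master setting, the HDFS path, and the sample values are placeholders, not taken from the original article):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddCreationExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
        val sc   = new SparkContext(conf)

        // 1. Parallelize an existing collection in the driver program.
        val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

        // 2. Reference a data set in external storage; any Hadoop InputFormat source works.
        //    The HDFS path below is a placeholder.
        val lines = sc.textFile("hdfs:///data/input.txt")

        println(numbers.sum())   // actions trigger the actual computation
        println(lines.count())

        sc.stop()
      }
    }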
Spark uses the RDD concept to achieve faster and more efficient MapReduce operations. In this article, let us first discuss how MapReduce operations take place and why they are inefficient.
MapReduce data sharing is too slow
MapReduce is widely adopted for processing and generating large data sets with a parallel, distributed algorithm on a cluster. It lets users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance.
Unfortunately, in most of today's popular big data processing frameworks, the only way to reuse data between computations (for example, between two MapReduce jobs) is to write it to an external stable storage system such as HDFS. Although these frameworks provide many abstractions for accessing a cluster's computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing in MapReduce is slow due to replication, serialization, and disk I/O. As for the storage system, most Hadoop applications spend more than 90% of their time performing HDFS read-write operations.
Iterative operations in MapReduce
In a multi-stage application, intermediate results are reused across multiple computations. Because each stage writes its output to stable storage and the next stage reads it back, the frequent data replication, disk I/O, and serialization add substantial overhead and make the system run slower.
In addition, users run ad-hoc queries on the same subset of data. Each query performs disk I/O against stable storage, which can dominate the application's execution time.
This is how interactive queries on MapReduce work in today's popular big data frameworks: every query goes through stable storage.
Use Spark RDD to share data
As noted earlier, data sharing in MapReduce is slow because of replication, serialization, and disk I/O: most Hadoop applications spend more than 90% of their time performing HDFS read-write operations.
Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD); it supports in-memory computation. This means the state of memory is stored as an object across jobs, and that object can be shared between jobs. Sharing data in memory is 10 to 100 times faster than sharing it over the network or through disk.
Next, let us analyze how iterative and interactive operations are performed in Spark RDD.
Iterative operations on Spark RDD
In this scheme, intermediate results are stored in distributed memory instead of stable storage (usually disk), which makes the system run faster.
[Note] If the distributed memory is not sufficient to hold the intermediate results (the state of the job), those results are stored on disk.
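A rough Scala sketch of such an iterative job, assuming an existing SparkContext named sc; the input path, the update rule, and the number of iterations are invented for illustration:

    // Keep the working set in distributed memory so every iteration reuses it
    // instead of rereading it from stable storage.
    val points = sc.textFile("hdfs:///data/points.txt")   // placeholder path
      .map(_.toDouble)
      .cache()

    var estimate = 0.0
    for (_ <- 1 to 10) {                                   // hypothetical iteration count
      // Each pass reads the cached RDD from memory; nothing is written back to HDFS.
      estimate = points.map(p => (p + estimate) / 2).mean()
    }
    println(s"final estimate: $estimate")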
If different queries need to be run repeatedly on the same set of data, that particular data can be kept in memory to achieve better execution times.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you can also persist an RDD in memory, in which case Spark keeps the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
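A brief sketch of these persistence options, again assuming an existing SparkContext sc and a placeholder input path:

    import org.apache.spark.storage.StorageLevel

    val logs = sc.textFile("hdfs:///data/logs.txt")            // placeholder path

    // Keep partitions in memory; anything that does not fit is spilled to disk.
    logs.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action materializes and caches the RDD ...
    println(logs.filter(_.contains("ERROR")).count())
    // ... and later queries read the cached copy instead of recomputing it from HDFS.
    println(logs.filter(_.contains("WARN")).count())

    // Other storage levels keep data only on disk or replicate partitions across
    // nodes, for example StorageLevel.DISK_ONLY or StorageLevel.MEMORY_ONLY_2.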