The core abstraction in Spark is the RDD (Resilient Distributed Dataset). As data volumes have continued to grow in recent years, distributed cluster parallel computing frameworks (such as MapReduce and Dryad) have been widely adopted to process that data. Most of these computational models offer good fault tolerance, strong scalability, load balancing, and a simple programming model, which is why they are favored by many enterprises and used for large-scale data processing.
However, MapReduce and similar parallel computing frameworks are mostly built on an acyclic data-flow model: a job reads its input from a shared file system, performs its computation, and writes the results back to shared storage, while maintaining a high degree of parallelism across compute nodes. Such a data-flow model makes it impossible for iterative algorithms, which need to reuse a particular working set of data, to run efficiently, as sketched below.
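A rough sketch of that pattern is shown here. The readFromSharedStorage and writeToSharedStorage helpers are hypothetical placeholders standing in for distributed-storage I/O (e.g. HDFS reads and writes); they are not part of any real MapReduce API. The point is only that each iteration round-trips through storage even though the underlying dataset never changes.

```scala
// Hypothetical helpers standing in for shared-storage I/O (e.g. HDFS).
def readFromSharedStorage(path: String): Seq[Double] =
  scala.io.Source.fromFile(path).getLines().map(_.toDouble).toSeq

def writeToSharedStorage(path: String, data: Seq[Double]): Unit = {
  val writer = new java.io.PrintWriter(path)
  try data.foreach(d => writer.println(d)) finally writer.close()
}

// In an acyclic data-flow model, every iteration re-reads its input from
// shared storage and writes its output back before the next iteration starts.
var input = "data/iter-0.txt"   // placeholder path
for (i <- 1 to 10) {
  val data    = readFromSharedStorage(input)   // read from shared storage
  val updated = data.map(_ * 0.9)              // the actual computation
  val output  = s"data/iter-$i.txt"
  writeToSharedStorage(output, updated)        // write results back
  input = output                               // next iteration reads them again
}
```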
Spark, and the RDD abstraction it is built on, were developed to solve this problem. An important property of an RDD is that the distributed dataset it represents can be kept in memory and reused across multiple parallel operations, which distinguishes Spark from acyclic data-flow frameworks such as MapReduce.
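The sketch below illustrates that reuse with the standard RDD API. The file path and the toy gradient-style update are assumptions made purely for illustration; the essential part is that cache() keeps the RDD in memory, so the loop does not re-read the input from the file system on every iteration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddReuseSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-reuse-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Load a dataset once; "data/points.txt" is a placeholder path with
    // comma-separated numeric pairs per line.
    val points = sc.textFile("data/points.txt")
      .map(_.split(",").map(_.toDouble))
      .cache()   // keep the parsed RDD in memory for reuse

    // Reuse the same cached RDD across several iterations without
    // re-reading or re-parsing it each time.
    var weight = 0.0
    for (_ <- 1 to 10) {
      val gradient = points.map(p => p(0) * (p(1) - weight)).sum()
      weight += 0.01 * gradient
    }

    println(s"final weight: $weight")
    sc.stop()
  }
}
```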