What are the characteristics/attributes of an RDD?
1) An RDD consists of a list of partitions, and, as in Hadoop, data that can be split this way can be computed in parallel.
A partition (shard) is the basic unit of the dataset. Each partition of an RDD is processed by one compute task, so the number of partitions determines the granularity of parallelism. The user can specify the number of partitions when creating the RDD; if none is specified, a default is used, which is the number of CPU cores allocated to the program. Storage for each partition is handled by the BlockManager: each partition is logically mapped to a BlockManager block, and each block is computed by one task.
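As a rough illustration (a sketch in plain Python, not Spark's actual code), slicing a local collection into a user-specified number of partitions, the way `parallelize` does, can look like this; `split_into_partitions` and `num_slices` are hypothetical names:

```python
def split_into_partitions(data, num_slices):
    """Split a list into num_slices contiguous partitions, roughly
    the way Spark slices a local collection for parallelize."""
    n = len(data)
    return [data[n * i // num_slices : n * (i + 1) // num_slices]
            for i in range(num_slices)]

# ten elements split into three partitions; each partition can
# then be handled by an independent task
parts = split_into_partitions(list(range(10)), 3)
# parts == [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Every element lands in exactly one partition, and the partition count directly caps how many tasks can run in parallel on this dataset.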
2) Each partition is computed by a function, namely the compute function described here.
Computation over an RDD in Spark proceeds partition by partition, and each RDD implements a compute function for this purpose. The compute function composes iterators, so the intermediate result of each transformation does not need to be materialized.
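The iterator-composition idea can be sketched in plain Python (a toy model, not Spark's implementation; `compute_map` is a hypothetical name standing in for a mapped RDD's compute function):

```python
def compute_map(parent_iter, f):
    """compute() of a mapped RDD: wrap the parent's iterator with a
    generator instead of materializing an intermediate collection."""
    return (f(x) for x in parent_iter)

# two chained transformations share one lazy pipeline
source = iter([1, 2, 3])
pipeline = compute_map(compute_map(source, lambda x: x + 1),
                       lambda x: x * 2)
# nothing runs until the pipeline is consumed
result = list(pipeline)  # [4, 6, 8]
```

Because each stage only wraps its parent's iterator, a chain of transformations streams each element through the whole pipeline without saving per-stage results.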
3) An RDD keeps a list of dependencies on other RDDs. Dependencies are further classified as wide or narrow, though not every RDD has dependencies.
Each transformation of an RDD produces a new RDD, so RDDs form a pipeline-like lineage of dependencies. When the data of some partitions is lost, Spark can recompute just the lost partitions through this lineage, instead of recomputing all partitions of the RDD.
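A minimal sketch of lineage-based recovery, assuming a toy `MiniRDD` class (a hypothetical stand-in, not Spark's API): each RDD remembers its parent and the per-element function used to derive it, so a single lost partition can be rebuilt without touching the others:

```python
class MiniRDD:
    """Toy lineage model: parent + derivation function per RDD."""
    def __init__(self, partitions=None, parent=None, fn=None):
        self.cache = partitions   # computed partition data, if any
        self.parent = parent      # the RDD this one was derived from
        self.fn = fn              # element-wise derivation function

    def compute(self, i):
        # use cached data when the partition is still available
        if self.cache is not None and self.cache[i] is not None:
            return self.cache[i]
        # otherwise recompute only partition i via the lineage
        return [self.fn(x) for x in self.parent.compute(i)]

base = MiniRDD(partitions=[[1, 2], [3, 4]])
mapped = MiniRDD(parent=base, fn=lambda x: x * 10)
mapped.cache = [[10, 20], None]   # simulate losing partition 1
recovered = mapped.compute(1)     # [30, 40], rebuilt from the parent
```

Only the lost partition is recomputed; partition 0 is served straight from its cached data.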
4) Optional: a key-value RDD has a partitioner, hash-based by default, similar to the Partitioner interface in MapReduce, which controls which reduce task each key is sent to.
The Partitioner is the sharding function of the RDD. Spark currently implements two kinds: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a partitioner; for non-key-value RDDs the partitioner is None. The Partitioner determines not only the number of partitions of the RDD itself, but also the number of partitions in the shuffle output of the parent RDD.
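The HashPartitioner's placement rule can be sketched in plain Python (an illustration of the idea, not Spark's code; `hash_partition` is a hypothetical name):

```python
def hash_partition(key, num_partitions):
    """HashPartitioner-style placement: partition id is the key's
    hash modulo the partition count."""
    return hash(key) % num_partitions

# group pairs by target partition; identical keys always land
# in the same partition, which is what makes shuffles work
pairs = [("a", 1), ("b", 2), ("a", 3)]
by_partition = {}
for k, v in pairs:
    by_partition.setdefault(hash_partition(k, 4), []).append((k, v))
```

The guarantee that matters is determinism per key: both `("a", 1)` and `("a", 3)` end up in the same partition, so a downstream reduce task sees all values for a key together.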
5) Optional: each partition has a list of preferred locations, such as the locations of a block in HDFS, which should be the preferred places to run the computation.
This list stores the preferred location(s) of each partition. For an HDFS file, it holds the locations of the block where each partition resides. Following the principle that "moving computation is cheaper than moving data," Spark schedules each task as close as possible to the storage location of the data block it processes.
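A simplified sketch of locality-aware placement, assuming hypothetical host names and a `pick_executor` helper (not Spark's scheduler API): the task goes to a host that already stores the partition's block when one is available, and falls back to any host otherwise:

```python
def pick_executor(preferred_locs, available_executors):
    """Prefer a data-local host for the task; otherwise take any
    available executor (a crude stand-in for locality fallback)."""
    for host in preferred_locs:
        if host in available_executors:
            return host
    return available_executors[0]  # no data-local host available

locs = ["node2", "node3"]          # hosts holding the HDFS block
executors = ["node1", "node3"]     # hosts with free task slots
chosen = pick_executor(locs, executors)  # "node3": data-local
```

The real scheduler also distinguishes locality levels (process-local, node-local, rack-local) and waits briefly for a better slot, but the core idea is the same preference order.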