RDD is the most important abstraction that Spark provides. It is a special data collection with a built-in fault-tolerance mechanism, and it has the following five characteristics.
1) A list of partitions. An RDD can be divided into pieces and, as in Hadoop, data that can be divided can be computed in parallel.
A partition is the basic unit of the data set. Each partition of an RDD is processed by one computing task, so the number of partitions determines the granularity of parallelism. The user can specify the number of partitions when creating an RDD; if none is given, a default is used, namely the number of CPU cores allocated to the program. Partition storage is implemented by the BlockManager: each partition is logically mapped to a Block of the BlockManager, and that Block is computed by one Task.
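As a quick illustration, here is a minimal sketch of controlling the partition count at creation time (the app name and the local[4] master are just assumptions for the example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCountDemo {
  def main(args: Array[String]): Unit = {
    // Ask for 4 local cores, so defaultParallelism is 4 in this example.
    val sc = new SparkContext(new SparkConf().setAppName("partitions").setMaster("local[4]"))

    // Explicit partition count: split the collection into 8 partitions.
    val explicit = sc.parallelize(1 to 100, numSlices = 8)
    println(explicit.getNumPartitions)   // 8

    // No count given: falls back to the default (the 4 cores requested above).
    val defaulted = sc.parallelize(1 to 100)
    println(defaulted.getNumPartitions)  // 4

    sc.stop()
  }
}
```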
2) A function that computes each partition. This is the compute function described next.
In Spark, RDDs are computed partition by partition, and every RDD implements a compute function for this purpose. The compute function composes iterators, so the result of each intermediate computation does not have to be saved.
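The partition-at-a-time, iterator-in/iterator-out style is easy to see with mapPartitions (reusing the sc from the sketch above):

```scala
// Each partition is handed to the function as a single Iterator, and the
// function returns an Iterator again, so transformations compose lazily
// and no intermediate collection has to be materialized.
val doubled = sc.parallelize(1 to 10, 2).mapPartitions { iter =>
  iter.map(_ * 2)  // still an iterator; nothing is buffered here
}
doubled.collect().foreach(println)
```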
3) A list of dependencies on other RDDs. Dependencies are divided into wide and narrow dependencies, but not every RDD has them; an RDD read directly from a data source has an empty dependency list.
Every transformation of an RDD generates a new RDD, so RDDs form pipeline-like dependency chains. When some partitions' data is lost, Spark can use these dependencies to recompute just the lost partitions instead of recomputing all partitions of the RDD.
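The lineage can be inspected directly; in this small sketch, map creates a narrow dependency and reduceByKey a wide (shuffle) dependency:

```scala
val pairs  = sc.parallelize(Seq("a", "b", "a"), 2).map(w => (w, 1)) // narrow dependency
val counts = pairs.reduceByKey(_ + _)                               // wide (shuffle) dependency

// Prints the dependency chain Spark would replay to rebuild a lost partition.
println(counts.toDebugString)
```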
4) Optional: a key-value RDD is partitioned by a hash of its keys, much like the Partitioner interface in MapReduce, which controls which reducer a key is sent to.
A Partitioner is the RDD's partitioning function. Spark currently ships two: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for non-key-value RDDs the Partitioner is None. The Partitioner determines not only the number of partitions of the RDD itself but also the number of partitions of the parent RDD's shuffle output.
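For instance (a small sketch, again reusing sc):

```scala
import org.apache.spark.HashPartitioner

val kv = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
println(kv.partitioner)      // None: a freshly created RDD has no Partitioner

// Hash-partition into 4 partitions; the key's hash picks the target partition.
val hashed = kv.partitionBy(new HashPartitioner(4))
println(hashed.partitioner)  // Some(HashPartitioner)

// sortByKey uses a RangePartitioner under the hood.
val ranged = kv.sortByKey()
println(ranged.partitioner)  // Some(RangePartitioner)
```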
5) Optional: a preferred computing location for each partition; for example, the location of an HDFS block is the preferred place to compute the partition backed by it.
A list stores the preferred location for accessing each partition. For an HDFS file, this list holds the locations of the blocks backing each partition. Following the principle that moving computation is cheaper than moving data, Spark schedules each computing task as close as possible to the storage location of the data block it processes.
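The preferred locations are queryable on any RDD; a sketch (the HDFS path is hypothetical):

```scala
// For an HDFS-backed RDD, a partition's preferred locations are the hosts
// holding the replicas of the corresponding block.
val logs = sc.textFile("hdfs:///data/events.log")
println(logs.preferredLocations(logs.partitions(0)))
```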
Features of RDD:
It is an immutable, partitioned collection of objects spread across cluster nodes.
It is created by parallel transformations such as map, filter, and join.
It is rebuilt automatically on failure.
Its storage level (memory, disk, etc.) can be controlled for reuse (see the sketch after this list).
It must be serializable.
It is statically typed.
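A sketch of controlling the storage level for reuse (the path is hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs:///data/events.log").flatMap(_.split(" "))

// MEMORY_AND_DISK: keep partitions in memory, spilling those that don't fit to disk.
words.persist(StorageLevel.MEMORY_AND_DISK)

words.count()  // first action computes and caches the partitions
words.count()  // second action reuses the cache instead of re-reading the file
```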
Going a step further:
A worker runs many Executors, and it is the Executor that actually performs the computation, in memory.
An Executor holds partitions; a partition's data is put into memory when memory is large enough, and otherwise it is read in bit by bit.
Such a distributed data set is what we call an RDD.
RDD has 5 characteristics:
1. A list of partitions: an RDD consists of many partitions (say three here). One partition sits on one machine, in practice in that machine's memory, and a single machine can host multiple partitions.
2. A function acting on each partition. For example, in rdd1.map(_ * 10) each element of the RDD is taken out and multiplied by 10; the map function is applied to every partition.
3. rdd1.map(_ * 10).flatMap(..).map(..).reduceByKey(...) builds a series of dependencies between RDDs, forming a DAG. The DAG is cut into multiple stages, and there are dependencies between the stages; later ones are built on the results of earlier ones. If data is lost, Spark remembers these dependencies and recovers from the front. Each operator generates a new RDD; textFile followed by flatMap, for example, already yields two RDDs (see the sketch after this list).
4. A Partitioner: roughly (key.hashCode & Integer.MAX_VALUE) % numPartitions decides which partition a record lands in. It is optional and exists only when the RDD holds key-value pairs.
5. Best locations. The task runs on the machine where the data already sits, so the data is read locally without crossing the network; only when results are finally aggregated does data have to travel over the network. (Think of the blocks of an HDFS file.)
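A word-count chain makes points 3 to 5 concrete (the path is hypothetical); the shuffle in reduceByKey is where the DAG is cut into stages:

```scala
val counts = sc.textFile("hdfs:///data/events.log") // first RDD
  .flatMap(_.split(" "))                            // second RDD
  .map(word => (word, 1))                           // third RDD, still the same stage
  .reduceByKey(_ + _)                               // shuffle: a new stage begins here

// The indentation in the printed lineage marks the stage boundary.
println(counts.toDebugString)
```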
To sum up, RDD has five defining points:
1. RDD is the core abstraction provided by Spark. Its full name is Resilient Distributed Dataset.
2. Abstractly, an RDD is a collection of elements containing data. It is partitioned into multiple partitions, each distributed on a different node in the cluster, so that the data in the RDD can be operated on in parallel. (A distributed data set.)
3. RDDs are usually created from files on Hadoop, i.e. HDFS files or Hive tables; they can also be created from collections inside the application.
4. The most important feature of RDD is fault tolerance: it recovers automatically from node failures. If the partitions of an RDD on some node are lost because that node failed, the RDD automatically recomputes those partitions from its own data source, transparently to the user.
5. RDD data is held in memory by default; when memory is insufficient, Spark can write RDD data to disk, for example under the MEMORY_AND_DISK storage level. (Elasticity.)