The core methods of an RDD:
First, look at the source code of the getPartitions method:
getPartitions returns the RDD's collection of partitions, an array of type Partition.
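To make the contract concrete, here is a minimal sketch of the relevant members of RDD, simplified from Spark's RDD.scala (the real class also handles checkpointing and more):

```scala
// Minimal sketch of the getPartitions contract (simplified from RDD.scala).
trait Partition extends Serializable {
  def index: Int  // this partition's position within its parent RDD
}

abstract class RDD[T] {
  // Subclasses implement this to enumerate the RDD's partitions.
  protected def getPartitions: Array[Partition]

  @transient private var partitions_ : Array[Partition] = null

  // Public accessor: computes the partition array once, then caches it.
  final def partitions: Array[Partition] = {
    if (partitions_ == null) partitions_ = getPartitions
    partitions_
  }
}
```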
Let's drill into the HadoopRDD implementation (a condensed sketch of the whole flow follows the numbered steps below):
1. getJobConf(): used to obtain the job configuration. It can be obtained in clone or non-clone mode, but clone mode is not thread-safe and is disabled by default. In non-clone mode the configuration is fetched from a cache; if it is not in the cache, a new one is created and then put into the cache.
2. Enter the getInputFormat(jobConf) method:
3. Enter the InputFormat.getSplits(jobConf, minPartitions) method:
4. Enter the getSplits method of the FileInputFormat class:
5. Enter HadoopPartition:
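Putting the five steps together, this is roughly what HadoopRDD.getPartitions looks like, condensed from the Spark 1.x sources (exact code varies between versions):

```scala
// Condensed from HadoopRDD.getPartitions (Spark 1.x; details vary by version).
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()                   // step 1: cached, non-clone mode by default
  val inputFormat = getInputFormat(jobConf)    // step 2: instantiate the InputFormat
  val inputSplits = inputFormat.getSplits(jobConf, minPartitions)  // steps 3-4: compute the splits
  val array = new Array[Partition](inputSplits.size)
  for (i <- 0 until inputSplits.size) {
    array(i) = new HadoopPartition(id, i, inputSplits(i))  // step 5: wrap each split
  }
  array
}
```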
getDependencies expresses the dependencies between RDDs, as follows:
getDependencies returns a Seq collection of Dependency objects; the underscore in Dependency[_] is a wildcard type placeholder.
We enter the getDependencies method of the ShuffledRDD class:
We enter the ShuffleDependency class:
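For reference, ShuffledRDD.getDependencies is roughly the following; the ShuffleDependency constructor arguments differ across Spark versions:

```scala
// Condensed from ShuffledRDD.getDependencies (Spark 1.x).
override def getDependencies: Seq[Dependency[_]] = {
  // A ShuffledRDD has exactly one dependency: a ShuffleDependency on its parent RDD.
  // Dependency[_] is the wildcard ("underscore") type placeholder mentioned above.
  List(new ShuffleDependency(prev, part, serializer))
}
```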
Each RDD has a compute function, as follows:
We enter the compute method of HadoopMapPartitionsWithSplitRDD:
The compute method computes each partition of the RDD; the source code of the TaskContext parameter is as follows:
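Condensed, that compute method looks roughly like this (based on the Spark 1.x sources; the signature def compute(split: Partition, context: TaskContext): Iterator[T] is fixed by the RDD base class):

```scala
// Condensed from HadoopMapPartitionsWithSplitRDD.compute (Spark 1.x).
override def compute(split: Partition, context: TaskContext): Iterator[U] = {
  val partition = split.asInstanceOf[HadoopPartition]
  val inputSplit = partition.inputSplit.value
  // f is the user function; it receives the Hadoop InputSplit plus an
  // iterator over the parent partition's records.
  f(inputSplit, firstParent[T].iterator(split, context))
}
```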
getPreferredLocations finds the preferred locations of a partition:
We enter NewHadoopRDD's getPreferredLocations:
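It is roughly the following, condensed from the Spark 1.x sources (newer versions also consult the HDFS cache):

```scala
// Condensed from NewHadoopRDD.getPreferredLocations (Spark 1.x).
override def getPreferredLocations(split: Partition): Seq[String] = {
  val theSplit = split.asInstanceOf[NewHadoopPartition]
  // Ask the underlying Hadoop InputSplit which hosts hold its data blocks,
  // dropping "localhost" entries, which carry no useful locality information.
  theSplit.serializableHadoopSplit.value.getLocations.filter(_ != "localhost")
}
```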
In fact, an RDD also has an optional partitioning strategy:
The source code of Partitioner is as follows:
As can be seen, the default is HashPartitioner; note the special case where the key is an array (array keys do not hash consistently, since Java arrays use identity-based hashCode).
spark.default.parallelism should be set explicitly; otherwise the number of partitions is inherited from the parent RDD with the most partitions, which can easily lead to OOM.
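The two points above come together in the following sketch, condensed from Partitioner.defaultPartitioner and HashPartitioner in the Spark 1.x sources:

```scala
// Condensed from Partitioner.defaultPartitioner and HashPartitioner (Spark 1.x).
object Partitioner {
  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
    // Reuse an existing partitioner from the parent with the most partitions, if any.
    for (r <- bySize if r.partitioner.isDefined) {
      return r.partitioner.get
    }
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)  // explicit setting wins
    } else {
      new HashPartitioner(bySize.head.partitions.size)     // else inherit the largest parent's count
    }
  }
}

class HashPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  // Arrays inherit identity-based hashCode, so equal array keys can land in
  // different partitions; hence the warning about array keys above.
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _    => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }
}
```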