Spark Kernel Secrets (10): RDD Source Code Analysis

Source: Internet
Author: User

The core methods of an RDD are getPartitions, getDependencies, compute, getPreferredLocations, and the optional partitioner.

First, look at the getPartitions method. It returns the partitions of the RDD as an array of type Partition (Array[Partition]).
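As a minimal sketch of what "an array of type Partition" means, the following self-contained Scala models a partition as an object that knows its index. The names RangePartition and SeqRDD are hypothetical stand-ins for illustration, not Spark classes; only the shape of the Partition trait and of getPartitions follows Spark.

```scala
// Simplified stand-ins for illustration; these are not Spark's real classes.
object GetPartitionsSketch {
  // Same shape as Spark's Partition trait: each partition knows its index.
  trait Partition extends Serializable {
    def index: Int
  }

  // Hypothetical partition covering the half-open range [start, end) of a sequence.
  final case class RangePartition(index: Int, start: Int, end: Int) extends Partition

  // Hypothetical RDD-like class that cuts `data` into `numSlices` partitions.
  class SeqRDD[T](data: IndexedSeq[T], numSlices: Int) {
    def getPartitions: Array[Partition] =
      Array.tabulate[Partition](numSlices) { i =>
        // Long arithmetic avoids overflow for large sequences.
        val start = ((i.toLong * data.length) / numSlices).toInt
        val end = (((i + 1).toLong * data.length) / numSlices).toInt
        RangePartition(i, start, end)
      }
  }

  def main(args: Array[String]): Unit = {
    val parts = new SeqRDD(Vector(1, 2, 3, 4, 5), numSlices = 2).getPartitions
    parts.foreach(println)
    // prints RangePartition(0,0,2) and then RangePartition(1,2,5)
  }
}
```

The array index and the partition's own index agree, which is what lets a scheduler address a partition by position.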

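Take the HadoopRDD implementation as an example. Its getPartitions method goes through the following steps:

1. getJobConf(): Obtains the job configuration, in either clone or non-clone mode. Clone mode is disabled by default; because constructing a Hadoop Configuration is not thread-safe, cloning has to be synchronized. In non-clone mode the JobConf is fetched from a cache; if it is not cached yet, a new one is created and put into the cache.

2. getInputFormat(jobConf): Creates the InputFormat instance for the job.

3. inputFormat.getSplits(jobConf, minPartitions): Computes the input splits, with minPartitions passed as the requested number of splits.

4. For file-based input, this call enters the getSplits method of the FileInputFormat class.

5. HadoopPartition: Each input split is wrapped in a HadoopPartition, and the resulting array is returned.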
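The steps above can be sketched in plain Scala as follows. This is a simplified model under stated assumptions, not the Spark or Hadoop source: JobConf, FileSplit and HadoopPartition are hypothetical case classes, the input is a single file, and the path is made up. The split-size rule max(minSize, min(goalSize, blockSize)) is the one used by Hadoop's old-API FileInputFormat.

```scala
import scala.collection.mutable

// Simplified model of the steps above; names are hypothetical stand-ins.
object HadoopPartitionsSketch {
  final case class JobConf(path: String)
  final case class FileSplit(path: String, start: Long, length: Long)
  final case class HadoopPartition(rddId: Int, index: Int, split: FileSplit)

  // Step 1 (non-clone mode): look the JobConf up in a cache, creating it on a miss.
  private val confCache = mutable.Map.empty[String, JobConf]
  def getJobConf(cacheKey: String, path: String): JobConf =
    confCache.synchronized { confCache.getOrElseUpdate(cacheKey, JobConf(path)) }

  // Split-size rule: max(minSize, min(goalSize, blockSize)).
  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  // Steps 3-4: cut one file of `fileLength` bytes into splits.
  // (Hadoop also lets the last split run up to 10% over; omitted here.)
  def getSplits(conf: JobConf, fileLength: Long, blockSize: Long,
                minPartitions: Int): Seq[FileSplit] = {
    val goalSize = fileLength / math.max(minPartitions, 1)
    // A minimum size of 1 byte keeps the step of the range below positive.
    val splitSize = computeSplitSize(goalSize, 1L, blockSize)
    (0L until fileLength by splitSize).map { start =>
      FileSplit(conf.path, start, math.min(splitSize, fileLength - start))
    }
  }

  // Step 5: wrap every split in a partition whose index is its position.
  def getPartitions(rddId: Int, fileLength: Long, blockSize: Long,
                    minPartitions: Int): Array[HadoopPartition] = {
    val conf = getJobConf(s"rdd_$rddId", "/hypothetical/input.txt")
    getSplits(conf, fileLength, blockSize, minPartitions)
      .zipWithIndex
      .map { case (split, i) => HadoopPartition(rddId, i, split) }
      .toArray
  }

  def main(args: Array[String]): Unit = {
    // 250 bytes, 100-byte blocks, 2 requested splits: goalSize 125, splitSize 100.
    getPartitions(rddId = 7, fileLength = 250L, blockSize = 100L, minPartitions = 2)
      .foreach(println)
  }
}
```

In this model minPartitions is only a hint: with a 100-byte block size the 250-byte file yields three partitions (100, 100 and 50 bytes) even though two were requested.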



getDependencies expresses the dependencies between RDDs. It returns a Seq of dependencies, Seq[Dependency[_]], in which the underscore in Dependency[_] is a type placeholder.

In the ShuffledRDD class, getDependencies returns a single ShuffleDependency on the parent RDD. The ShuffleDependency class represents a dependency on the output of a shuffle stage.
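To make the Seq[Dependency[_]] return type concrete, here is a small self-contained sketch. MiniRDD and the reduced Dependency hierarchy are hypothetical stand-ins that only mimic the shape of Spark's classes; the real ShuffleDependency carries more fields (partitioner, serializer, aggregator and so on).

```scala
// Simplified stand-ins for illustration; not Spark's real classes.
object DependencySketch {
  // Hypothetical minimal RDD: an id plus the list of things it depends on.
  class MiniRDD[T](val id: Int, deps: Seq[Dependency[_]]) {
    // Dependency[_] means "a dependency on a parent of some element type".
    def getDependencies: Seq[Dependency[_]] = deps
  }

  // Base class: a dependency points at a parent RDD with element type T.
  abstract class Dependency[T] { def rdd: MiniRDD[T] }

  // Narrow dependency: child partition i reads only parent partition i.
  class OneToOneDependency[T](val rdd: MiniRDD[T]) extends Dependency[T] {
    def getParents(partitionId: Int): Seq[Int] = Seq(partitionId)
  }

  // Wide dependency: the child reads the shuffled output of the parent.
  class ShuffleDependency[K, V](val rdd: MiniRDD[(K, V)], val numPartitions: Int)
    extends Dependency[(K, V)]

  val source = new MiniRDD[(String, Int)](0, Nil)
  val mapped = new MiniRDD[(String, Int)](
    1, Seq(new OneToOneDependency[(String, Int)](source)))
  val shuffled = new MiniRDD[(String, Int)](
    2, Seq(new ShuffleDependency[String, Int](mapped, 4)))

  // True when the RDD sits directly behind a shuffle boundary.
  def isShuffle(rdd: MiniRDD[_]): Boolean =
    rdd.getDependencies.exists(_.isInstanceOf[ShuffleDependency[_, _]])

  def main(args: Array[String]): Unit = {
    println(isShuffle(mapped))   // false
    println(isShuffle(shuffled)) // true
  }
}
```

Because the element type of each parent may differ, the list can only be typed with a placeholder; callers that need more detail pattern-match on the concrete dependency class.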


Each RDD has a compute function. For example, the compute method of HadoopMapPartitionsWithSplitRDD applies the user-supplied function to the input split of the partition and to the iterator of the parent RDD.

compute is invoked once for each partition of the RDD, and its TaskContext parameter provides contextual information about the running task.
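The per-partition nature of compute can be illustrated with a short sketch. Everything here is a hypothetical stand-in: MiniRDD, SourceRDD and MappedRDD are not Spark classes, and the TaskContext case class holds just two made-up fields. Only the signature compute(split, context): Iterator[T] follows Spark.

```scala
// Simplified stand-ins for illustration; not Spark's real classes.
object ComputeSketch {
  final case class Partition(index: Int)
  // Hypothetical stand-in for TaskContext: two ids a task might expose.
  final case class TaskContext(stageId: Int, partitionId: Int)

  abstract class MiniRDD[T] {
    def getPartitions: Array[Partition]
    // Produces the elements of exactly one partition, lazily.
    def compute(split: Partition, context: TaskContext): Iterator[T]
  }

  // Leaf RDD: each inner vector is the data of one partition.
  class SourceRDD(chunks: Vector[Vector[Int]]) extends MiniRDD[Int] {
    def getPartitions: Array[Partition] =
      Array.tabulate(chunks.length)(i => Partition(i))
    def compute(split: Partition, context: TaskContext): Iterator[Int] =
      chunks(split.index).iterator
  }

  // Derived RDD: pulls the parent's iterator for the same partition and maps it.
  class MappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
    def getPartitions: Array[Partition] = parent.getPartitions
    def compute(split: Partition, context: TaskContext): Iterator[U] =
      parent.compute(split, context).map(f)
  }

  // Runs compute once per partition and concatenates the results.
  def collect[T](rdd: MiniRDD[T]): Vector[T] =
    rdd.getPartitions.toVector.flatMap { p =>
      rdd.compute(p, TaskContext(stageId = 0, partitionId = p.index))
    }

  def main(args: Array[String]): Unit = {
    val source = new SourceRDD(Vector(Vector(1, 2), Vector(3)))
    val mapped = new MappedRDD[Int, Int](source, _ * 10)
    println(collect(mapped)) // Vector(10, 20, 30)
  }
}
```

Since compute returns an iterator, a chain of such RDDs is evaluated element by element within a partition instead of materializing each intermediate result.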


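getPreferredLocations returns the preferred locations of a partition, which the scheduler uses to place tasks close to their data. In NewHadoopRDD, getPreferredLocations takes these locations from the input split of the partition.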
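A rough sketch of the idea, assuming a hypothetical InputSplit that simply lists the hosts holding its data. The filtering of "localhost" mirrors what Spark's Hadoop-backed RDDs do, but the types and function names below are illustrative stand-ins rather than the real API.

```scala
// Simplified stand-ins for illustration; not Spark's or Hadoop's real classes.
object PreferredLocationsSketch {
  // Hypothetical split that records which hosts hold its data.
  final case class InputSplit(path: String, hosts: Seq[String])
  final case class HadoopPartition(index: Int, split: InputSplit)

  // Default behaviour of an RDD: no preference at all.
  def defaultPreferredLocations(p: HadoopPartition): Seq[String] = Nil

  // Hadoop-backed behaviour: report the split's hosts, dropping "localhost",
  // which carries no useful placement information.
  def getPreferredLocations(p: HadoopPartition): Seq[String] =
    p.split.hosts.filter(_ != "localhost")

  def main(args: Array[String]): Unit = {
    val p = HadoopPartition(0, InputSplit("/hypothetical/part-0", Seq("node1", "localhost", "node2")))
    println(getPreferredLocations(p)) // List(node1, node2)
  }
}
```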



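An RDD also has an optional partitioner, which defines its partitioning strategy.

Partitioner is an abstract class with two members, numPartitions and getPartition(key), and its companion object provides the defaultPartitioner method.

As defaultPartitioner shows, HashPartitioner is used by default. Note the case where the key is an array: arrays are hashed by identity rather than by content, so HashPartitioner cannot partition array keys correctly.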
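The hash partitioning rule is short enough to sketch in full. The version below is self-contained and modelled on the way Spark's HashPartitioner maps a key to a partition (hash code modulo the partition count, shifted to be non-negative, with null keys going to partition 0); the enclosing object name is made up for the example.

```scala
// Self-contained model of hash partitioning; the object name is illustrative.
object HashPartitionerSketch {
  // Keeps the result in [0, mod) even when the hash code is negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  class HashPartitioner(val numPartitions: Int) {
    def getPartition(key: Any): Int = key match {
      case null => 0
      case _    => nonNegativeMod(key.hashCode, numPartitions)
    }
  }

  def main(args: Array[String]): Unit = {
    val p = new HashPartitioner(4)
    println(nonNegativeMod(-7, 4))        // 1
    println(p.getPartition(null))         // 0
    // Equal collections hash by content, so they land in the same partition.
    println(p.getPartition(Seq(1, 2)) == p.getPartition(Seq(1, 2))) // true
    // Arrays hash by identity, so two arrays with the same content are not
    // guaranteed to land in the same partition.
  }
}
```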

spark.default.parallelism should be set explicitly. Otherwise, the number of partitions used for shuffled data follows the partition count of the parent RDDs, which can also make the job prone to OOM errors.
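The selection logic can be sketched as follows. This is a simplified model based on the defaultPartitioner of Spark 1.x (later versions refine the rule): an existing partitioner on the largest input wins, then spark.default.parallelism if it is configured, and otherwise the largest parent partition count. MiniRDD and the plain Map used as configuration are hypothetical stand-ins.

```scala
// Simplified model of default partitioner selection; names are stand-ins.
object DefaultPartitionerSketch {
  final case class HashPartitioner(numPartitions: Int)
  // Hypothetical RDD summary: partition count and optional partitioner.
  final case class MiniRDD(numPartitions: Int,
                           partitioner: Option[HashPartitioner] = None)

  // Requires at least one RDD in `rdds`.
  def defaultPartitioner(conf: Map[String, String], rdds: MiniRDD*): HashPartitioner = {
    // Largest RDD first.
    val bySize = rdds.sortBy(_.numPartitions).reverse
    // 1. Reuse the partitioner of the largest RDD that already has one.
    bySize.flatMap(_.partitioner).headOption.getOrElse {
      conf.get("spark.default.parallelism") match {
        // 2. Otherwise use the configured parallelism, if any.
        case Some(n) => HashPartitioner(n.toInt)
        // 3. Otherwise fall back to the largest parent partition count.
        case None    => HashPartitioner(bySize.head.numPartitions)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val small = MiniRDD(8)
    val large = MiniRDD(200)
    println(defaultPartitioner(Map.empty, small, large))
    // HashPartitioner(200)
    println(defaultPartitioner(Map("spark.default.parallelism" -> "64"), small, large))
    // HashPartitioner(64)
  }
}
```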

