1. Partitioning
A partition is the basic unit of parallel computation inside an RDD: the RDD's data set is logically divided into multiple shards, each of which is called a partition. The partitioning scheme determines the granularity of the parallel computation, and the computation over each partition runs in a single task, so the number of tasks in a job is determined by the number of partitions of the job's final RDD.
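To make this concrete, here is a minimal Scala sketch (the local[4] master and the explicit partition count of 4 are illustrative choices, not part of the original discussion) that creates an RDD and inspects its partition count, which is also the number of tasks an action on this RDD will launch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionBasics {
  def main(args: Array[String]): Unit = {
    // local[4] is a hypothetical master chosen for illustration
    val conf = new SparkConf().setAppName("partition-basics").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Each partition is computed by one task, so an action on this RDD
    // runs as many tasks as the RDD has partitions.
    val rdd = sc.parallelize(1 to 100, 4) // explicitly request 4 partitions
    println(s"partitions = ${rdd.getNumPartitions}") // prints 4

    sc.stop()
  }
}
```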
2. Number of partitions
The guiding principle for RDD partitioning: make the number of partitions equal, as far as possible, to the number of cores in the cluster.
Below we discuss only Spark's default number of partitions, specifically analyzing how the default is determined for parallelize and textFile.
Whether in local mode, standalone mode, YARN mode, or Mesos mode, we can configure the default number of partitions through spark.default.parallelism. If this value is not set, it is determined by the cluster environment:
Local mode: the default is the number of CPU cores on the local machine; if local[N] is set, the default is N.
Apache Mesos: the default number of partitions is 8.
Standalone or YARN: the default is the larger of the total number of cores in the cluster and 2, i.e. max(total cores, 2).
Conclusions: For parallelize, if no partition count is specified in the method call, the default is spark.default.parallelism. For textFile, if no partition count is specified in the method call, the default is min(defaultParallelism, 2), where defaultParallelism corresponds to spark.default.parallelism. If the file is read from HDFS, the number of partitions equals the number of file splits (one split per 128 MB block).
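As a quick check of these conclusions, the sketch below (assuming a local[4] master, an arbitrary parallelism value of 8, and a hypothetical input path /tmp/input.txt) sets spark.default.parallelism explicitly and prints the partition counts that parallelize and textFile pick up when no count is passed:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DefaultPartitionsDemo {
  def main(args: Array[String]): Unit = {
    // "8" is an arbitrary illustrative value and local[4] a hypothetical master
    val conf = new SparkConf()
      .setAppName("default-partitions-demo")
      .setMaster("local[4]")
      .set("spark.default.parallelism", "8")
    val sc = new SparkContext(conf)

    // parallelize with no partition argument uses spark.default.parallelism
    val fromCollection = sc.parallelize(1 to 100)
    println(s"parallelize: ${fromCollection.getNumPartitions}") // prints 8

    // textFile with no partition argument uses min(defaultParallelism, 2)
    // as its minimum partition count; /tmp/input.txt is a hypothetical path,
    // and the actual count also depends on the file's splits.
    val fromFile = sc.textFile("/tmp/input.txt")
    println(s"textFile: ${fromFile.getNumPartitions}")

    sc.stop()
  }
}
```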