Spark Basics Essay: Partition summary

Source: Internet
Author: User
Tags apache mesos
1. Partitioning

A partition is the unit of parallel computation inside an RDD. The data set of an RDD is logically divided into multiple shards, and each shard is called a partition. The way the data is partitioned determines the granularity of the parallel computation, and the computation on each partition is performed in one task, so the number of tasks is determined by the number of partitions of the RDD (more precisely, by the number of partitions of the last RDD in the job).

2. Number of partitions

A guiding principle for RDD partitioning: as far as possible, make the number of partitions equal to the number of cores in the cluster.
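To make this principle concrete, here is a minimal pure-Python sketch of how the partition count interacts with the core count. It does not call Spark. The helper names split_into_partitions and task_waves are hypothetical, and the even-slicing scheme and the "one task per core at a time, equal task durations" model are assumptions chosen for illustration, not a description of Spark's scheduler.

```python
import math


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous, nearly equal slices."""
    n = len(data)
    return [
        data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


def task_waves(num_partitions, num_cores):
    """One task per partition; assume each core runs one task at a time."""
    return math.ceil(num_partitions / num_cores)


if __name__ == "__main__":
    # Ten records divided into four partitions means four tasks.
    print(split_into_partitions(list(range(10)), 4))
    # Compare different partition counts on a hypothetical 8-core cluster.
    for partitions in (4, 8, 9):
        print(partitions, "partitions on 8 cores:", task_waves(partitions, 8), "wave(s)")
```

Under these assumptions, 4 partitions on 8 cores finish in one wave but leave half the cores idle, 8 partitions keep every core busy for exactly one wave, and 9 partitions need a second wave in which only one core has work.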

Below we discuss only Spark's default number of partitions, looking specifically at the defaults for parallelize and textFile.

Whether Spark runs in local mode, standalone mode, YARN mode, or Mesos mode, the default number of partitions can be configured with spark.default.parallelism. If this value is not set, the default depends on the cluster environment.

Local mode: the default is the number of CPU cores on the local machine; if local[N] is set, the default is N.

Apache Mesos: the default number of partitions is 8.

Standalone or YARN: the default is the total number of cores in the cluster or 2, whichever is larger.

Conclusions:

For parallelize, if the number of partitions is not specified in the method call, the default is spark.default.parallelism.

For textFile, if the number of partitions is not specified in the method call, the default (minimum) number of partitions is min(defaultParallelism, 2), where defaultParallelism corresponds to spark.default.parallelism. If the file is read from HDFS, the number of partitions is the number of file splits (128 MB per split).
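The default rules above can be summarized in a small pure-Python model. This is only a sketch of the rules as stated in this article, not Spark code: all function names are hypothetical, and the HDFS calculation is a simplification that assumes one partition per 128 MB slice and ignores other factors that can influence how a real input is split.

```python
import math

HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the slice size quoted in the text


def default_parallelism(mode, total_cores, configured=None):
    """Model of the default partition count described in the text.

    mode: "local", "mesos", "standalone" or "yarn"
    total_cores: local CPU count (or N for local[N]) / sum of cluster cores
    configured: value of spark.default.parallelism, if it was set
    """
    if configured is not None:
        return configured  # an explicit setting wins in every mode
    if mode == "local":
        return total_cores
    if mode == "mesos":
        return 8
    if mode in ("standalone", "yarn"):
        return max(total_cores, 2)
    raise ValueError(f"unknown mode: {mode}")


def parallelize_partitions(default_par, num_slices=None):
    """parallelize: use the caller's value, else the default parallelism."""
    return num_slices if num_slices is not None else default_par


def textfile_min_partitions(default_par, min_partitions=None):
    """textFile: use the caller's value, else min(defaultParallelism, 2)."""
    return min_partitions if min_partitions is not None else min(default_par, 2)


def hdfs_split_count(file_size_bytes, block_size=HDFS_BLOCK_SIZE):
    """Simplified assumption: one partition per block-sized slice."""
    return math.ceil(file_size_bytes / block_size)


if __name__ == "__main__":
    par = default_parallelism("yarn", total_cores=48)
    print("default parallelism:", par)
    print("parallelize default:", parallelize_partitions(par))
    print("textFile default minimum:", textfile_min_partitions(par))
    print("300 MB HDFS file:", hdfs_split_count(300 * 1024 * 1024), "partitions")
```

For a hypothetical 48-core YARN cluster with nothing configured, this model gives 48 partitions for parallelize, a minimum of 2 for textFile, and 3 partitions for a 300 MB file on HDFS.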



