Spark Partition Details! Explained by teacher Liaoliang of DT Big Data Dream Factory!

Source: Internet
Author: User



http://www.tudou.com/home/_79823675/playlist?qq-pf-to=pcqq.group


First, what is the difference between a shard and a partition?

Sharding looks at the data from the storage perspective, while partitioning looks at it from the computation perspective. In both cases, the idea is the same: split a large whole into smaller pieces.


Second, understanding Spark partitions

An RDD, as a distributed dataset, is spread across multiple worker nodes. As shown in the figure, RDD1 has five partitions distributed over four worker nodes, while RDD2 has three partitions distributed over three worker nodes.

[Figure: RDD1's five partitions spread across four worker nodes; RDD2's three partitions across three worker nodes]
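The layout described above can be reproduced in a small local-mode sketch (the 4-thread local master and the sample data here are assumptions for illustration, not from the original article):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local-mode sketch: create an RDD with 5 partitions, like RDD1
// in the figure, and inspect how its elements are laid out.
val sc = new SparkContext(
  new SparkConf().setMaster("local[4]").setAppName("partition-inspect"))

val rdd1 = sc.parallelize(1 to 10, 5)   // 5 partitions
println(rdd1.partitions.length)          // prints 5

// glom() collects each partition into an array, making the layout visible
rdd1.glom().collect().foreach(part => println(part.mkString(", ")))

sc.stop()
```

`glom()` is a convenient way to see exactly which elements landed in which partition.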

Third, default partitioning

In general, when reading a file from HDFS, the number of partitions equals the number of blocks the file occupies. However, a record can span a block boundary, in which case a partition may end up slightly larger or smaller than the 128 MB block size.
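As a sketch (the HDFS path, file size, and 128 MB block size are all assumptions for illustration):

```scala
// Reading from HDFS without specifying a partition count: the number of
// partitions follows the file's block count. The path is hypothetical.
val lines = sc.textFile("hdfs:///data/access.log")
println(lines.partitions.length)
// For a ~300 MB file with 128 MB blocks this is typically 3; individual
// partitions may be slightly above or below 128 MB when records span
// block boundaries.
```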


Fourth, repartitioning

There are two situations in which you may want to set an RDD's partition count: when creating the RDD, and when deriving a new RDD through a transformation.

For the former, you can specify the number of partitions when calling textFile or parallelize. For example, sc.parallelize(Array(1, 2, 3, 5, 6), 2) creates an RDD with 2 partitions.
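Both creation APIs take an explicit partition count; a quick sketch (local mode and sample data assumed):

```scala
// parallelize: the second argument is numSlices, the partition count
val a = sc.parallelize(Array(1, 2, 3, 5, 6), 2)
println(a.partitions.length)   // prints 2

// textFile: the second argument is minPartitions (a lower bound, not an
// exact count); the path here is hypothetical
val b = sc.textFile("hdfs:///data/input.txt", 8)
```

Note that textFile's second argument is only a minimum: Spark may create more partitions than requested, but not fewer.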


For the latter, call rdd.repartition directly. If you want finer control over which records are distributed to which partitions, you can also pass in an Ordering. For example, to distribute the data randomly into 10 partitions:

class MyOrdering[T] extends Ordering[T] {
  def compare(x: T, y: T): Int = math.random.compare(math.random)
}

Assuming the RDD's elements are of type Int:
rdd.repartition(10)(new MyOrdering[Int])



In fact, the partition count of a new RDD is determined by the dependency between it and its parent RDD(s) in the corresponding transformation. With a narrow dependency, the child RDD's partition count is determined by the parent's: a map operation, for example, yields a child RDD with exactly as many partitions as its parent. With a shuffle dependency, the partitioner (Partitioner) decides: groupByKey(new HashPartitioner(2)), or simply groupByKey(2), produces a new RDD with 2 partitions.
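The two rules above can be checked directly (local mode and the sample key-value data are assumptions):

```scala
import org.apache.spark.HashPartitioner

val parent = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 4)

// Narrow dependency: map preserves the parent's partition count
val mapped = parent.map { case (k, v) => (k, v * 2) }
println(mapped.partitions.length)   // prints 4, same as parent

// Shuffle dependency: the partitioner decides the child's count
val grouped = parent.groupByKey(new HashPartitioner(2))
println(grouped.partitions.length)  // prints 2

// groupByKey(2) is shorthand for the same HashPartitioner(2)
println(parent.groupByKey(2).partitions.length)  // prints 2
```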





