Spark Partitioning in Detail! Personally explained by teacher Liaoliang of DT Big Data Dream Factory!
Http://www.tudou.com/home/_79823675/playlist?qq-pf-to=pcqq.group
First, what is the difference between a shard and a partition?
Sharding looks at the split from the data's point of view, while partitioning looks at it from the computation's point of view; in essence, both split something large into smaller pieces.
Second, understanding Spark partitions
An RDD, as a distributed dataset, is spread across multiple worker nodes. As shown in the figure, RDD1 has five partitions distributed over four worker nodes, and RDD2 has three partitions distributed over three worker nodes.
[Figure: RDD partitions distributed across worker nodes, https://pic3.zhimg.com/20049c7cecf2107389107e42881b844e_b.jpg]
Third, default partitioning
In general, for a file stored on HDFS, the number of blocks determines the number of partitions. Sometimes, however, a record straddles two blocks; in that case the data handled by a partition will be slightly more or less than the 128 MB block size.
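As a quick illustration (a minimal sketch; the HDFS path and file size are hypothetical), you can check how many partitions a file-based RDD received:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PartitionDemo"))
// Hypothetical path: with a 128 MB block size, a ~300 MB file would
// normally be stored as 3 blocks and thus yield 3 partitions.
val lines = sc.textFile("hdfs:///data/input.txt")
println(lines.partitions.length) // roughly the number of HDFS blocks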
Fourth, repartitioning
There are two situations in which you may want to set an RDD's partitioning: when creating the RDD, and when producing a new RDD through a transformation.
For the former, you can manually specify the number of partitions when calling the textFile and parallelize methods. For example, sc.parallelize(Array(1, 2, 3, 5, 6), 2) creates an RDD with 2 partitions.
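A minimal sketch of both creation-time options (the file path is hypothetical; note that textFile's second argument is a minimum, so Spark may create more partitions than requested):

val fromCollection = sc.parallelize(Array(1, 2, 3, 5, 6), 2) // exactly 2 partitions
val fromFile = sc.textFile("hdfs:///data/input.txt", 4)      // at least 4 partitions
println(fromCollection.partitions.length)                    // prints 2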
For the latter, call the rdd.repartition method directly. If you want finer control over which records end up in which partitions, you can also pass in an Ordering. For example, to distribute the data randomly into 10 partitions:

class MyOrdering[T] extends Ordering[T] {
  // Compares two fresh random numbers, so the resulting "order" is random
  def compare(x: T, y: T): Int = math.random compare math.random
}

Assuming the data is of type Int:

rdd.repartition(10)(new MyOrdering[Int])
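Assuming the MyOrdering class above and an existing SparkContext sc, a usage sketch might look like this; the ordering goes in repartition's second (implicit) parameter list, and glom() is used only to inspect how many elements landed in each partition:

val rdd = sc.parallelize(1 to 100)
val randomly = rdd.repartition(10)(new MyOrdering[Int])
randomly.glom().map(_.length).collect().foreach(println) // sizes of the 10 partitions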
In fact, the number of partitions of a new RDD is determined by the dependency between the RDDs involved in the transformation. For a narrow dependency, the child RDD's partition count is determined by the parent RDD's; for example, with a map operation, parent and child have the same number of partitions. For a shuffle (wide) dependency, the partition count is determined by the partitioner (Partitioner); for example, groupByKey(new HashPartitioner(2)), or simply groupByKey(2), produces a new RDD with 2 partitions.
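A minimal sketch contrasting the two cases (assuming an existing SparkContext sc):

import org.apache.spark.HashPartitioner

val parent = sc.parallelize(1 to 10, 4)                // parent RDD with 4 partitions
val mapped = parent.map(_ * 2)                         // narrow dependency
println(mapped.partitions.length)                      // 4, same as the parent

val pairs = parent.map(x => (x % 3, x))
val grouped = pairs.groupByKey(new HashPartitioner(2)) // shuffle dependency
println(grouped.partitions.length)                     // 2, decided by the partitioner
println(pairs.groupByKey(2).partitions.length)         // 2, equivalent shorthand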