Data partitioning for Spark key-value pair operations (II)

1. Data partitioning

To reduce communication costs in a distributed application, you can control how data is partitioned so as to minimize network transfers. In Spark, any RDD of key-value pairs can be partitioned.

Consider an application that counts how often users visit pages on topics they have not subscribed to, so that content can be recommended to them better. A large table of user information, (UserID, UserInfo) pairs, makes up one RDD, where UserInfo contains the list of topics the user subscribes to. The application periodically joins this table with a small file of (UserID, LinkInfo) pairs that records the users' visits to the site in the last five minutes.

Scala
// Initialization code: read user information from a Hadoop SequenceFile on HDFS.
// The elements of userData are distributed according to where they were read,
// i.e., the nodes holding the HDFS blocks. At this point Spark has no way of
// knowing which node holds the record for a particular UserID.
val sc = new SparkContext(...)
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...").persist()

// Called periodically to process the event log generated in the last five
// minutes; assume it is a SequenceFile of (UserID, LinkInfo) pairs.
def processNewLogs(logFileName: String) {
    val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
    val joined = userData.join(events)  // RDD of (UserID, (UserInfo, LinkInfo)) pairs
    val offTopicVisits = joined.filter {
        case (userId, (userInfo, linkInfo)) => !userInfo.topics.contains(linkInfo.topic)
    }.count()
    println("Number of visits to non-subscribed topics: " + offTopicVisits)
}

Problems with the code:
Every time processNewLogs() is called, the join performs the following steps: it computes the hash of every key in both datasets, sends records with the same hash across the network to the same machine, and then joins the records with identical keys on that machine. Because the userData RDD is large and does not change between calls, it is wasteful to hash and shuffle it again on every invocation.
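To make the shuffle concrete: a hash partitioner routes every record to the partition given by a non-negative modulo of its key's hash, so all records sharing a key, from either dataset, end up on the same machine. A minimal sketch in plain Python (no Spark involved; `get_partition` is an illustrative name, not a Spark API):

```python
def get_partition(key, num_partitions):
    # Same idea as Spark's HashPartitioner: hash modulo the number of
    # partitions. Python's % with a positive modulus is already non-negative.
    return hash(key) % num_partitions

# Records with equal keys always map to the same partition index, so a join
# must move both datasets' records for a key onto one machine.
num_partitions = 4
user_record = ("user42", "subscribes to sports")
event_record = ("user42", "visited /politics")
assert get_partition(user_record[0], num_partitions) == \
       get_partition(event_record[0], num_partitions)
```

Because this routing is recomputed on every join, a large unchanging dataset pays the full hashing-and-transfer cost each time unless it is partitioned once up front.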

Solution
Hash-partition the userData table once, using partitionBy():

// Construct 100 partitions
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
                 .partitionBy(new HashPartitioner(100))
                 .persist()

Many other Spark operations automatically set known partitioning information on the resulting RDD, and many operations take advantage of existing partitioning information, for example sortByKey() and groupByKey(). By contrast, an operation such as map() can cause the new RDD to lose the parent RDD's partitioning information.

1.1 Getting an RDD's partitioning method

val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
pairs.partitioner
val partitioned = pairs.partitionBy(new org.apache.spark.HashPartitioner(2))
partitioned.partitioner

If you intend to use partitioned in later operations, the persist() call is necessary; otherwise every subsequent RDD operation will re-evaluate the entire lineage of partitioned, repartitioning the data again and again.

1.2 Operations that benefit from partitioning

Operations that shuffle data across nodes benefit from data partitioning:
cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), lookup()
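One reason reduceByKey() benefits: it can combine values locally within each partition before anything crosses the network, so at most one record per key per partition is shuffled. A plain-Python sketch of that map-side combine (no Spark; `reduce_by_key_with_local_combine` is an illustrative name, not a Spark API):

```python
def reduce_by_key_with_local_combine(partitions, reduce_fn):
    # Step 1 (map-side combine): reduce within each partition locally.
    locally_combined = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = reduce_fn(acc[key], value) if key in acc else value
        locally_combined.append(list(acc.items()))

    # Step 2 (shuffle + final reduce): only the locally combined records
    # would cross the network; count them to see the saving.
    result = {}
    shuffled = 0
    for part in locally_combined:
        for key, value in part:
            shuffled += 1
            result[key] = reduce_fn(result[key], value) if key in result else value
    return result, shuffled

parts = [[("a", 1), ("a", 1), ("b", 2)], [("a", 3), ("b", 1), ("b", 1)]]
totals, shuffled = reduce_by_key_with_local_combine(parts, lambda x, y: x + y)
# totals == {"a": 5, "b": 4}; 4 records shuffled instead of the original 6
```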

reduceByKey() reduces values locally on each node and only sends the final reduced value per key back to the driver, so its network overhead is small. For binary operations such as cogroup() and join(), pre-partitioning the data ensures that at least one of the RDDs is not shuffled.

1.3 Operations that affect partitioning methods

Spark knows internally how each operation affects partitioning, and it automatically sets the appropriate partitioner on the RDD produced by an operation on partitioned data. For example, if you use join() to connect two RDDs, elements with the same key are hashed to the same machine, so Spark knows the output is also hash-partitioned; a subsequent operation such as reduceByKey() on the join result will therefore run significantly faster.

However, because map() may change the keys, its result has no fixed partitioning method. Instead, mapValues() and flatMapValues(), which leave keys unchanged, can be used as alternatives that preserve the parent RDD's partitioner.
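The distinction can be illustrated without Spark: a value-only transformation leaves every key's partition assignment valid, while a transformation that rewrites keys may not. A sketch (plain Python; `get_partition` is an illustrative stand-in for a hash partitioner):

```python
def get_partition(key, num_partitions):
    # Illustrative hash partitioner: hash modulo partition count.
    return hash(key) % num_partitions

records = [("user1", 10), ("user2", 20)]
n = 4
placement = {key: get_partition(key, n) for key, _ in records}

# mapValues-style transform: keys untouched, the old placement is still valid.
doubled = [(key, value * 2) for key, value in records]
assert all(get_partition(key, n) == placement[key] for key, _ in doubled)

# map-style transform: keys rewritten, so the old placement can no longer be
# trusted -- which is why Spark drops the partitioner after map().
rekeyed = [(key.upper(), value) for key, value in records]
```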

The following operations set a partitioner on the resulting RDD:
cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), mapValues()

For a binary operation, how the output is partitioned depends on how the parent RDDs are partitioned. By default, the result is hash-partitioned, with the number of partitions equal to the parallelism of the operation. However, if one of the parent RDDs already has a partitioner set, the result uses that partitioner; and if both parent RDDs have partitioners set, the result uses the first parent's partitioner.

1.4 Customizing the partitioning method

http://www.cnn.com/WORLD and http://www.cnn.com/US may be hashed to different nodes even though they belong to the same domain. To keep pages from the same domain together and avoid shuffles, we can customize the partitioning method to hash only the domain:

Python
import urlparse  # Python 2 module; on Python 3, use urllib.parse

def hash_domain(url):
    return hash(urlparse.urlparse(url).netloc)

rdd.partitionBy(20, hash_domain)  # create 20 partitions
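We can verify outside Spark that this partitioner achieves its goal: both CNN URLs map to the same partition because they share the netloc www.cnn.com (this sketch uses Python 3's urllib.parse, the successor of the urlparse module above):

```python
from urllib.parse import urlparse

def hash_domain(url):
    # Hash only the domain part, so all pages of one site stay together.
    return hash(urlparse(url).netloc)

def partition_for(url, num_partitions=20):
    return hash_domain(url) % num_partitions

# Both pages share the netloc "www.cnn.com", hence the same partition.
assert partition_for("http://www.cnn.com/WORLD") == \
       partition_for("http://www.cnn.com/US")
```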
