1. Data partitioning
To reduce the communication cost of a distributed application, control data partitioning so as to minimize network transmission.
In Spark, any RDD of key-value pairs can be partitioned by key.
As an example, suppose an application needs to count how often users visit pages on topics they have not subscribed to, in order to recommend better content. There is a large user-information table of (UserID, UserInfo) pairs making up an RDD, where UserInfo contains the list of topics the user subscribes to. The application periodically combines this table with a small file of (UserID, LinkInfo) pairs recording which links each user visited in the last five minutes.
Scala
// Initialization code: read the user info from a Hadoop SequenceFile on HDFS.
// The elements of userData are distributed according to where they were read,
// i.e. the nodes holding the HDFS blocks. At this point Spark has no way of
// knowing on which node the record for a particular UserID is located.
val sc = new SparkContext(...)
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...").persist()

// Function called periodically to process a log file of events from the past
// five minutes; we assume it is a SequenceFile of (UserID, LinkInfo) pairs.
def processNewLogs(logFileName: String) {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events)  // RDD of (UserID, (UserInfo, LinkInfo)) pairs
  val offTopicVisits = joined.filter {
    case (userId, (userInfo, linkInfo)) =>
      !userInfo.topics.contains(linkInfo.topic)
  }.count()
  println("Number of visits to non-subscribed topics: " + offTopicVisits)
}
Problems with this code:
Each time processNewLogs() is called, the join performs the following work: it computes the hash of every key in both datasets, sends records with the same hash across the network to the same machine, and then joins the records with matching keys on that machine. Although the userData RDD is quite large and does not change between calls, it is nevertheless hashed and shuffled across the network on every single call.
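To make the shuffle concrete, here is a minimal pure-Python sketch (my own illustration with invented names, not Spark code) of what a hash partitioner does: every record of both datasets is routed by the hash of its key, which is why an unpartitioned join ships the entire large table on every call:

```python
def partition_of(key, num_partitions):
    # Mimic a hash partitioner: the same key always lands in the same partition.
    return hash(key) % num_partitions

user_data = [("alice", ["sports"]), ("bob", ["news"]), ("carol", ["tech"])]
events = [("alice", "news"), ("bob", "sports")]

# A plain join must hash and route EVERY record of BOTH datasets, on every call.
partitions = {}
for key, value in user_data + events:
    partitions.setdefault(partition_of(key, 4), []).append((key, value))

records_moved = sum(len(v) for v in partitions.values())  # all 5 records routed
```

Records for the same UserID from both datasets land in the same partition, which is what makes the local join possible; the routing itself is the shuffle cost paid on every call.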
Solution
Use partitionBy() on the userData table to convert it to a hash-partitioned RDD, here with 100 partitions:
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
                 .partitionBy(new HashPartitioner(100))  // Create 100 partitions
                 .persist()
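Continuing the plain-Python sketch (invented names, not Spark internals): once the large table is bucketed by key ahead of time, each periodic call only needs to route the small events batch, because the matching user records already sit in the right bucket:

```python
NUM_PARTITIONS = 4

def partition_of(key):
    return hash(key) % NUM_PARTITIONS

# One-time cost: bucket the large user table by key and keep it in place.
user_partitions = {p: [] for p in range(NUM_PARTITIONS)}
for key, info in [("alice", ["sports"]), ("bob", ["news"]), ("carol", ["tech"])]:
    user_partitions[partition_of(key)].append((key, info))

def process_new_logs(events):
    """Only the small events batch is hashed and routed on each call."""
    moved = 0
    joined = []
    for key, link in events:
        moved += 1  # one event record crosses the network
        for user_key, info in user_partitions[partition_of(key)]:
            if user_key == key:  # local join inside the partition
                joined.append((key, info, link))
    return joined, moved

joined, moved = process_new_logs([("alice", "news"), ("bob", "news")])
```

Here only two event records are routed per call instead of all five records, which is exactly the saving partitionBy() plus persist() buys in Spark.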
Many other Spark operations automatically set known partitioning information on the resulting RDD, and many operations take advantage of existing partitioning information, for example sortByKey() and groupByKey(). In contrast, map() causes the new RDD to forget the parent RDD's partitioning information, because map() is allowed to change the keys.

1.1 Getting an RDD's partitioner
val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
pairs.partitioner        // None: no partitioner set yet
val partitioned = pairs.partitionBy(new org.apache.spark.HashPartitioner(2))
partitioned.partitioner  // Some(org.apache.spark.HashPartitioner)
If you intend to reuse partitioned in subsequent operations, calling persist() on it is essential; otherwise each later operation on partitioned will re-evaluate its entire lineage, repartitioning the data from scratch every time.

1.2 Operations that benefit from partitioning
All of the operations that shuffle data by key across nodes can benefit from partitioning:
cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), lookup()
For an operation on a single RDD, such as reduceByKey(), pre-partitioning lets all values for each key be reduced locally on one machine, with only the final, locally reduced result sent back to the master, so network overhead is small. For binary operations such as cogroup() and join(), pre-partitioning ensures that at least one of the RDDs (the one with a known partitioner) is not shuffled.

1.3 Operations that affect partitioning
Spark knows internally how each of its operations affects partitioning, and automatically sets the appropriate partitioner on RDDs produced by operations that partition the data. For example, if you use join() to connect two RDDs, elements with the same key are hashed to the same machine, so Spark knows the output is also hash-partitioned; a subsequent reduceByKey() on the join result then runs much faster.
However, because a transformation like map() may change the keys, its result has no fixed partitioner. When only the values need to change, use mapValues() or flatMapValues() instead, which guarantee that each element's key stays the same.
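The key-stability argument can be checked with a small pure-Python sketch (an illustration only, not Spark internals): transforming values leaves every key in the partition its hash points to, while rewriting keys invalidates the placement, which is why Spark must drop the partitioner after map():

```python
def partition_of(key, num_partitions=4):
    return hash(key) % num_partitions

data = [("a", 1), ("bb", 2), ("ccc", 3)]
# Record which partition each key belongs to under the current partitioning.
placement = {key: partition_of(key) for key, _ in data}

# mapValues-style transform: keys untouched, so the placement stays valid.
scaled = [(key, value * 10) for key, value in data]
still_valid = all(partition_of(key) == placement[key] for key, _ in scaled)

# map-style transform that rewrites keys: the placement is no longer reliable.
renamed = [(key.upper(), value) for key, value in data]
now_valid = all(key in placement and partition_of(key) == placement[key]
                for key, _ in renamed)
```

Here still_valid is True and now_valid is False: only the value-preserving transform keeps the partitioning trustworthy.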
Here are the operations that set a partitioner on the resulting RDD:
cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), mapValues()
For binary operations, the partitioner of the output depends on the parent RDDs' partitioners. By default, the result is hash-partitioned, with the number of partitions equal to the operation's degree of parallelism. However, if one of the parent RDDs already has a partitioner set, the result uses that partitioner; if both parents have a partitioner set, the result uses the first parent's partitioner.

1.4 Custom partitioning
For example, http://www.cnn.com/WORLD and http://www.cnn.com/US may be assigned to different nodes when hashed, even though they belong to the same domain. To keep pages from the same domain together and avoid shuffling them apart, we can supply a custom partitioning function (in Python, partitionBy() accepts a hash function directly):
from urllib.parse import urlparse

def hash_domain(url):
    # Hash only the domain (netloc), so all URLs from one site share a partition
    return hash(urlparse(url).netloc)

rdd.partitionBy(20, hash_domain)  # Create 20 partitions keyed by domain
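A quick runnable check of the idea (plain Python 3, no Spark required): because both CNN URLs share the same netloc, any scheme that hashes the domain assigns them to the same partition:

```python
from urllib.parse import urlparse

def hash_domain(url):
    return hash(urlparse(url).netloc)

num_partitions = 20
p_world = hash_domain("http://www.cnn.com/WORLD") % num_partitions
p_us = hash_domain("http://www.cnn.com/US") % num_partitions
# Both URLs parse to netloc "www.cnn.com", so the two partitions match.
```

A join between RDDs keyed by these URLs would therefore find both records on the same node, with no cross-node shuffle between them.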