Table Join Operations Based on Spark
1. Self-join
Assume that the following file exists:
[root@bluejoe0 ~]# cat categories.csv
1,daily necessities,0
2,digital products,1
3,mobile phone,2
4,Huawei Mate7,3
The format of each row is: category ID, category name, parent category ID.
Now we want to output, for each category, the name of its parent category, much like a SQL self-join. Note that the join key here is actually the parent category ID.
First, generate "parent category ID -> (subcategory ID, subcategory name)":
val categories = sc.textFile("/root/categories.csv")
val left = categories.map(_.split(",")).map(x => (x(2) -> Map("id" -> x(0), "name" -> x(1))))
The content of left is:
Array((0,Map(id -> 1, name -> daily necessities)), (1,Map(id -> 2, name -> digital products)), (2,Map(id -> 3, name -> mobile phone)), (3,Map(id -> 4, name -> Huawei Mate7)))
Then generate "parent category ID -> (parent category ID, parent category name)":
val right = categories.map(_.split(",")).map(x => (x(0) -> Map("pid" -> x(0), "pname" -> x(1))))
The content of right is:
Array((1,Map(pid -> 1, pname -> daily necessities)), (2,Map(pid -> 2, pname -> digital products)), (3,Map(pid -> 3, pname -> mobile phone)), (4,Map(pid -> 4, pname -> Huawei Mate7)))
Next, merge the two RDDs and reduce by key (the parent category ID):
val merged = (left++right).reduceByKey(_++_)
The content of merged is:
Array((4,Map(pid -> 4, pname -> Huawei Mate7)), (0,Map(id -> 1, name -> daily necessities)), (1,Map(id -> 2, name -> digital products, pid -> 1, pname -> daily necessities)), (2,Map(id -> 3, name -> mobile phone, pid -> 2, pname -> digital products)), (3,Map(id -> 4, name -> Huawei Mate7, pid -> 3, pname -> mobile phone)))
Done!!
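The whole pipeline can be sketched with plain Scala collections (no SparkContext required), which also shows the step the example stops short of: pulling each category's name and its parent's name out of the merged maps. The `rows` data below is hard-coded to mirror categories.csv, and `groupBy` + `reduce` stands in for `reduceByKey(_ ++ _)`.

```scala
// Plain-Scala sketch of the pipeline above; rows mirrors categories.csv.
val rows = Seq(
  Array("1", "daily necessities", "0"),
  Array("2", "digital products", "1"),
  Array("3", "mobile phone", "2"),
  Array("4", "Huawei Mate7", "3"))

val left  = rows.map(x => x(2) -> Map("id" -> x(0), "name" -> x(1)))
val right = rows.map(x => x(0) -> Map("pid" -> x(0), "pname" -> x(1)))

// groupBy + reduce plays the role of reduceByKey(_ ++ _)
val merged = (left ++ right)
  .groupBy(_._1)
  .map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ ++ _) }

// Keep only the maps that got both a child side and a parent side,
// i.e. the categories for which the "join" actually matched.
val result = merged.values
  .filter(m => m.contains("name") && m.contains("pname"))
  .map(m => m("name") -> m("pname"))
  .toSet
```

Categories whose merged map lacks either side (the root's child, the leaf's parent entry) simply drop out, which is exactly the inner-join behavior we want.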
You can use flatMap to simplify the code above:
val merged = categories.map(_.split(",")).flatMap(x => Array(x(2) -> Map("id" -> x(0), "name" -> x(1)), x(0) -> Map("pid" -> x(0), "pname" -> x(1)))).reduceByKey(_ ++ _)
The result is the same!! Of course, the code is much less readable ~~~
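For comparison, Spark's pair-RDD `join` operator expresses the same self-join more directly (key the children by parent ID, key the parents by their own ID, then `children.join(parents)`). Here is a minimal sketch of what that inner join computes, again using plain collections so it runs without a cluster:

```scala
// What children.join(parents) would compute on pair RDDs,
// sketched with plain collections. rows mirrors categories.csv.
val rows = Seq(
  Array("1", "daily necessities", "0"),
  Array("2", "digital products", "1"),
  Array("3", "mobile phone", "2"),
  Array("4", "Huawei Mate7", "3"))

val children = rows.map(x => x(2) -> x(1))       // parent ID -> child name
val parents  = rows.map(x => x(0) -> x(1)).toMap // ID -> name

// Inner join: a child pair survives only if its key matches some parent's ID.
val joined = children.collect {
  case (pid, name) if parents.contains(pid) => name -> parents(pid)
}.toSet
```

With `join`, the "only keep rows that matched" filtering is built in, whereas the reduceByKey version has to check for the presence of both sides afterwards.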
2. Joining two tables