1. Self-join
Suppose the following file exists:
[root@bluejoe0 ~]# cat categories.csv
1,Life products,0
2,Digital products,1
3,Mobile phones,2
4,Huawei Mate7,3
The format of each row is: category ID, category name, parent category ID.
Now we want to output, for each category, the name of its parent category — much like a SQL self-join. Notice that the join key is actually the parent category ID.
First generate "parent ID -> (child ID, child name)" pairs:
val categories = sc.textFile("/root/categories.csv")
val left = categories.map(_.split(",")).map(x => (x(2) -> Map("id" -> x(0), "name" -> x(1))))
The contents of left are:
Array((0,Map(id -> 1, name -> Life products)), (1,Map(id -> 2, name -> Digital products)), (2,Map(id -> 3, name -> Mobile phones)), (3,Map(id -> 4, name -> Huawei Mate7)))
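As a quick local check — a sketch that needs no Spark cluster, assuming the same four CSV lines as above — the same map can be run over a plain Scala Seq:

```scala
// Local sketch of the `left` mapping, using a plain Seq instead of an RDD.
// The lines mirror the categories.csv example above.
val lines = Seq(
  "1,Life products,0",
  "2,Digital products,1",
  "3,Mobile phones,2",
  "4,Huawei Mate7,3"
)

// parent ID -> Map(id -> child ID, name -> child name)
val left = lines.map(_.split(",")).map(x => x(2) -> Map("id" -> x(0), "name" -> x(1)))

left.foreach(println)
```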
Then generate "parent ID -> (parent ID, parent name)" pairs:
val right = categories.map(_.split(",")).map(x=>(x(0)->Map("pid"->x(0),"pname"->x(1))))
The contents of right are:
Array((1,Map(pid -> 1, pname -> Life products)), (2,Map(pid -> 2, pname -> Digital products)), (3,Map(pid -> 3, pname -> Mobile phones)), (4,Map(pid -> 4, pname -> Huawei Mate7)))
Next, union the two RDDs and reduce by key (the key is the parent category ID):
val merged = (left ++ right).reduceByKey(_ ++ _)
The contents of merged are:
Array((4,Map(pid -> 4, pname -> Huawei Mate7)), (0,Map(id -> 1, name -> Life products)), (1,Map(id -> 2, name -> Digital products, pid -> 1, pname -> Life products)), (2,Map(id -> 3, name -> Mobile phones, pid -> 2, pname -> Digital products)), (3,Map(id -> 4, name -> Huawei Mate7, pid -> 3, pname -> Mobile phones)))
Done!!
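The whole pipeline can be sanity-checked locally without Spark. In the sketch below (assuming the same four CSV lines), a `groupBy` plus a fold over each group stands in for `reduceByKey(_ ++ _)`, which is what reduceByKey computes per key:

```scala
// Local sketch (no Spark): emulate union (++) and reduceByKey(_ ++ _) on plain Seqs.
val lines = Seq(
  "1,Life products,0",
  "2,Digital products,1",
  "3,Mobile phones,2",
  "4,Huawei Mate7,3"
)

val left  = lines.map(_.split(",")).map(x => x(2) -> Map("id" -> x(0), "name" -> x(1)))
val right = lines.map(_.split(",")).map(x => x(0) -> Map("pid" -> x(0), "pname" -> x(1)))

// reduceByKey on an RDD groups pairs by key and folds the values;
// locally that is groupBy + reduce over each group's values.
val merged = (left ++ right)
  .groupBy(_._1)
  .map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ ++ _) }

// Category 4 (keyed by its parent ID 3) now carries both its own fields
// and its parent's fields:
println(merged("3"))
```

Keys that appear on only one side (the root's key 0, and the leaf's key 4) keep just that side's fields, matching the merged output shown above.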
You can use flatMap to simplify the above:
val merged = categories.map(_.split(",")).flatMap(x => Array(x(2) -> Map("id" -> x(0), "name" -> x(1)), x(0) -> Map("pid" -> x(0), "pname" -> x(1)))).reduceByKey(_ ++ _)
The result is the same!! Of course, the readability of the code takes a big hit ~~~
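The flatMap variant can be checked the same way — one pass emits both pair shapes per row, and the same local groupBy-plus-fold stands in for reduceByKey (again assuming the four CSV lines from the example):

```scala
// Local sketch of the flatMap variant: each row emits two pairs at once.
val lines = Seq(
  "1,Life products,0",
  "2,Digital products,1",
  "3,Mobile phones,2",
  "4,Huawei Mate7,3"
)

val merged = lines.map(_.split(","))
  .flatMap(x => Seq(
    x(2) -> Map("id" -> x(0), "name" -> x(1)),   // child record keyed by parent ID
    x(0) -> Map("pid" -> x(0), "pname" -> x(1))  // parent record keyed by its own ID
  ))
  .groupBy(_._1)
  .map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ ++ _) }

println(merged("2"))
```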
2. Two-table join
Copyright notice: this is the blogger's original article; do not reproduce without the blogger's permission.