Problem
Spark's DataFrame provides a powerful join operation, but it is easy to end up with duplicate columns after a join. If you don't notice, any later operation that references one of those columns will fail.
If the same field exists on both sides, the error looks like this: org.apache.spark.sql.AnalysisException: Reference 'key2' is ambiguous
Example
1. Create two demo DataFrames
val df = sc.parallelize(Array(("yuwen", "zhangsan", 80), ("yuwen", "lisi", 90), ("shuxue", "zhangsan", 85), ("shuxue", "lisi", 95))).toDF("course", "name", "score")  // scores are illustrative; the original values were garbled in translation
Display it: df.show()
val df2 = sc.parallelize(Array(("yuwen", "zhangsan", 90), ("shuxue", "zhangsan", 88))).toDF("course", "name", "score")  // scores are illustrative; the original values were garbled in translation
Display it: df2.show()
Join query:
val joined = df.join(df2, df("course") === df2("course") && df("name") === df2("name"), "left_outer")
Result:
This is where the problem arises: the joined result contains three pairs of identically named columns (course, name, and score), and as soon as you reference one of them, the ambiguity error above is thrown.
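Conceptually, the ambiguity arises because the expression-based join simply concatenates the columns of both sides. The following is a simplified Python model of a left outer join (illustrative only, not Spark code) showing why both copies of course and name survive in the combined schema:

```python
# Simplified model of a DataFrame left outer join (illustrative, not Spark).
# Rows are dicts; the joined row pairs up one row from each side, so the
# combined schema contains both sides' columns -- join keys included twice.

def left_outer_join(left, right, predicate):
    joined = []
    for l in left:
        matches = [r for r in right if predicate(l, r)]
        if matches:
            for r in matches:
                joined.append((l, r))
        else:
            joined.append((l, None))  # unmatched left rows keep a null right side
    return joined

df = [
    {"course": "yuwen", "name": "zhangsan", "score": 80},
    {"course": "shuxue", "name": "lisi", "score": 95},
]
df2 = [{"course": "yuwen", "name": "zhangsan", "score": 90}]

rows = left_outer_join(
    df, df2,
    lambda l, r: l["course"] == r["course"] and l["name"] == r["name"],
)
l, r = rows[0]
print(sorted(l) + sorted(r))  # 'course' and 'name' each appear twice
```

Referencing "course" against this combined schema is ambiguous, which is exactly the situation Spark's AnalysisException reports.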
Solutions
1. When selecting, specify which DataFrame's column you mean
joined.select(df("course"), df("name")).show()
Results:
2. Drop the redundant columns. In practice you rarely join two completely identical tables; usually only a few column names overlap, so you can drop the duplicates you don't need:
joined.drop(df2("name"))
Results:
3. Avoid the problem entirely by expressing the join differently, passing the join keys as a Seq of column names:
df.join(df2, Seq("course", "name")).show()
Results:
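The Seq variant works because Spark treats the listed names as shared join keys and emits them only once in the output. A simplified Python sketch of this "USING-style" join (illustrative only; the _2 rename of clashing non-key columns is my own choice, since Python dicts need unique keys — real Spark keeps both score columns named score):

```python
# Simplified model of df.join(df2, Seq("course", "name")) -- the listed key
# columns are emitted once, so no duplicate key columns appear (inner join
# only, for brevity; illustrative, not Spark code).

def join_using(left, right, keys):
    out = []
    for l in left:
        for r in right:
            if all(l[k] == r[k] for k in keys):
                row = dict(l)  # join keys plus left-side columns, once
                # add the right side's non-key columns, renamed to avoid clashes
                for k, v in r.items():
                    if k not in keys:
                        row[k + "_2"] = v
                out.append(row)
    return out

df = [{"course": "yuwen", "name": "zhangsan", "score": 80}]
df2 = [{"course": "yuwen", "name": "zhangsan", "score": 90}]

result = join_using(df, df2, ["course", "name"])
print(result[0])  # {'course': 'yuwen', 'name': 'zhangsan', 'score': 80, 'score_2': 90}
```

Note that in real Spark the non-key columns keep their original names, so selecting "score" from the Seq-joined result would still be ambiguous; only the join keys themselves are deduplicated.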
Translated from: https://www.cnblogs.com/chushiyaoyue/p/6927488.html
Spark: duplicate columns after join (org.apache.spark.sql.AnalysisException: Reference '*' is ambiguous)