When you join two DataFrames in Spark SQL, the join-key column may contain null values. Because NULL represents an unknown value, comparing NULL with any other value in SQL (even another NULL) never evaluates to true. So when the join evaluates the condition, NULL = NULL is not true, the record finds no match, and in a left join the right-side columns for that record come back as null. An example:
Table_a:
Date       | ServerID | lvsection
2018-03-04 | 1        | 10
2018-03-05 | Null     | 9
2018-03-07 | 5        | 15
Table_b:
Date       | ServerID | Num
2018-03-04 | 1        | 13
2018-03-05 | Null     | 4
2018-03-07 | 5        | 6
Dataset<Row> table_c = table_a.join(table_b, ScalaUtils.getScalaSeq(Lists.newArrayList("date", "serverId")), "left");
Table_c:
Date       | ServerID | lvsection | Num
2018-03-04 | 1        | 10        | 13
2018-03-05 | Null     | 9         | Null
2018-03-07 | 5        | 15        | 6
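This behavior is standard SQL, not something specific to Spark. As a minimal, self-contained sketch of the same tables, here is the left join reproduced in SQLite via Python's sqlite3 module; the 2018-03-05 row gets a null Num because NULL = NULL does not evaluate to true:

```python
import sqlite3

# Rebuild table_a and table_b from the example above in an in-memory database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table_a (date TEXT, serverId INTEGER, lvsection INTEGER);
CREATE TABLE table_b (date TEXT, serverId INTEGER, num INTEGER);
INSERT INTO table_a VALUES ('2018-03-04', 1, 10), ('2018-03-05', NULL, 9), ('2018-03-07', 5, 15);
INSERT INTO table_b VALUES ('2018-03-04', 1, 13), ('2018-03-05', NULL, 4), ('2018-03-07', 5, 6);
""")

# The join condition NULL = NULL evaluates to NULL (not true),
# so the 2018-03-05 row finds no match on the right side.
rows = con.execute("""
SELECT a.date, a.serverId, a.lvsection, b.num
FROM table_a a
LEFT JOIN table_b b ON a.date = b.date AND a.serverId = b.serverId
ORDER BY a.date
""").fetchall()

for r in rows:
    print(r)
# ('2018-03-04', 1, 10, 13)
# ('2018-03-05', None, 9, None)   <- Num is null: the NULL key did not match
# ('2018-03-07', 5, 15, 6)
```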
Therefore, when you use joins, watch out for null values in the join-key columns.
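If you do want null keys to match each other, Spark SQL provides a null-safe equality operator, `<=>` (Column.eqNullSafe in the DataFrame API), which treats NULL as equal to NULL. SQLite's IS operator behaves the same way, so the fix can be sketched in the same self-contained style as above (the eqNullSafe join condition in Spark would be an analogous change to the example code):

```python
import sqlite3

# Same tables as the example above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE table_a (date TEXT, serverId INTEGER, lvsection INTEGER);
CREATE TABLE table_b (date TEXT, serverId INTEGER, num INTEGER);
INSERT INTO table_a VALUES ('2018-03-04', 1, 10), ('2018-03-05', NULL, 9), ('2018-03-07', 5, 15);
INSERT INTO table_b VALUES ('2018-03-04', 1, 13), ('2018-03-05', NULL, 4), ('2018-03-07', 5, 6);
""")

# IS is null-safe equality in SQLite (NULL IS NULL is true),
# analogous to Spark SQL's <=> operator, so the NULL key now matches.
rows = con.execute("""
SELECT a.date, a.serverId, a.lvsection, b.num
FROM table_a a
LEFT JOIN table_b b ON a.date = b.date AND a.serverId IS b.serverId
ORDER BY a.date
""").fetchall()

for r in rows:
    print(r)
# ('2018-03-04', 1, 10, 13)
# ('2018-03-05', None, 9, 4)   <- the NULL keys matched, Num is filled in
# ('2018-03-07', 5, 15, 6)
```

Whether matching null keys is correct depends on your data: if NULL means "unknown", two unknown server IDs are not necessarily the same server, and the default non-matching behavior may be what you want.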
Resources:
[1]. Difference between === null and isNull in Spark DataFrame
[2]. Dataframe JOIN operation in Spark SQL column with null values