Recently I came across an interesting example, which I have reproduced here.
Analyze data using Spark SQL
In this step, we use Spark SQL to group the 20 million (2000W) records by zodiac sign, to see which sign's people most like to book hotel rooms.
Of course, plain Spark could do this analysis as well, since Spark SQL is ultimately executed by Spark.
Testing shows that this data does not fully adhere to a single schema: some rows are malformed, and some fields hold bad values. The code therefore needs to filter out such noisy records. Since we are only playing with this data set, there is no strict requirement to sort out every kind of bad record.
Look at the code first:
```scala
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Customer(name: String, gender: String, ctfId: String, birthday: String, address: String)

val customer = sc.textFile("/mnt/share/2000W/*.csv")
  .map(_.split(","))
  .filter(line => line.length > 7)
  .map(p => Customer(p(0), p(5), p(4), p(6), p(7)))
  .distinct()
customer.registerTempTable("customer")

def toInt(s: String): Int = {
  try {
    s.toInt
  } catch {
    case e: Exception => 9999
  }
}

def myfun(birthday: String): String = {
  var rt = "未知" // unknown
  if (birthday.length == 8) {
    val md = toInt(birthday.substring(4))
    if (md >= 120 & md <= 219)
      rt = "水瓶座" // Aquarius
    else if (md >= 220 & md <= 320)
      rt = "双鱼座" // Pisces
    else if (md >= 321 & md <= 420)
      rt = "白羊座" // Aries
    else if (md >= 421 & md <= 521)
      rt = "金牛座" // Taurus
    else if (md >= 522 & md <= 621)
      rt = "双子座" // Gemini
    else if (md >= 622 & md <= 722)
      rt = "巨蟹座" // Cancer
    else if (md >= 723 & md <= 823)
      rt = "狮子座" // Leo
    else if (md >= 824 & md <= 923)
      rt = "处女座" // Virgo
    else if (md >= 924 & md <= 1023)
      rt = "天秤座" // Libra
    else if (md >= 1024 & md <= 1122)
      rt = "天蝎座" // Scorpio
    else if (md >= 1123 & md <= 1222)
      rt = "射手座" // Sagittarius
    else if ((md >= 1223 & md <= 1231) | (md >= 101 & md <= 119))
      rt = "摩蝎座" // Capricorn
    else
      rt = "未知" // unknown
  }
  rt
}

sqlContext.registerFunction("constellation", (x: String) => myfun(x))

var result = sqlContext.sql("SELECT constellation(birthday), count(constellation(birthday)) FROM customer GROUP BY constellation(birthday)")
result.collect().foreach(println)
```
In order to use Spark SQL, you need to import sqlContext.createSchemaRDD. A core abstraction in Spark SQL is SchemaRDD, and this import enables the implicit conversion of an RDD into a SchemaRDD.
Then we define the Customer class, which maps the fields of each row; we only use a few of the available columns, such as gender, birthday and address.
Next, we read all the CSV files under the 2000W folder, create an RDD from them, and register it as the temporary table customer.
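The parsing and filtering step can be sketched without Spark, using plain Scala collections (the sample rows below are made up; the real files have more fields):

```scala
case class Customer(name: String, gender: String, ctfId: String, birthday: String, address: String)

// Two well-formed rows (at least 8 comma-separated fields) and one truncated row.
val lines = Seq(
  "Alice,a,b,c,ID001,F,19900215,Beijing",
  "Bob,a,b,c,ID002,M,19851023,Shanghai",
  "broken,row,too,short"
)

val customers = lines
  .map(_.split(","))
  .filter(_.length > 7)                             // drop rows missing fields
  .map(p => Customer(p(0), p(5), p(4), p(6), p(7))) // same column positions as above
  .distinct

customers.foreach(println)
```

With an RDD the chain is identical; only sc.textFile replaces the Seq.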
Since no built-in function can map a birthday to a zodiac sign, we define the mapping function myfun and register it with the SQLContext, which lets us call it inside SQL statements. Similarly, a string length function is not supported yet; you could add one in the same way. Because some dates are wrong, an extra "未知" (unknown) category is added for them. There are two kinds of bad records: rows whose date value is simply invalid, and rows so badly formatted that some other field ends up in the birthday position. We just ignore both in the analysis.
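The chain of if/else comparisons in myfun can equivalently be written as a lookup over (start, end, sign) ranges. This is a standalone sketch with the same behavior, not the article's exact code; note Capricorn appears as two ranges because it wraps the year boundary:

```scala
def toInt(s: String): Int =
  try s.toInt catch { case _: Exception => 9999 } // sentinel falls outside every range

// Inclusive MMdd ranges for each sign.
val ranges = Seq(
  (120, 219, "水瓶座"), (220, 320, "双鱼座"), (321, 420, "白羊座"),
  (421, 521, "金牛座"), (522, 621, "双子座"), (622, 722, "巨蟹座"),
  (723, 823, "狮子座"), (824, 923, "处女座"), (924, 1023, "天秤座"),
  (1024, 1122, "天蝎座"), (1123, 1222, "射手座"),
  (1223, 1231, "摩蝎座"), (101, 119, "摩蝎座")
)

def constellation(birthday: String): String =
  if (birthday.length != 8) "未知"
  else {
    val md = toInt(birthday.substring(4))
    ranges.collectFirst { case (lo, hi, sign) if md >= lo && md <= hi => sign }
      .getOrElse("未知")
  }
```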
Then we execute the grouped SQL statement. The result of the query is again a SchemaRDD, which inherits all the operations of an RDD. Finally, the results are printed:
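What the GROUP BY query computes can be sketched with plain Scala collections (the signs below are made up, not the real data):

```scala
// A toy stand-in for SELECT sign, count(sign) ... GROUP BY sign.
val signs = Seq("水瓶座", "双鱼座", "水瓶座", "未知", "水瓶座")

val counts = signs.groupBy(identity).view.mapValues(_.size).toMap
counts.toSeq.sortBy(-_._2).foreach { case (s, n) => println(s"[$s,$n]") }
```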
```
[双子座,1406018]
[双鱼座,1509839]
[摩蝎座,2404812]
[金牛座,1406225]
[水瓶座,1635358]
[巨蟹座,1498077]
[处女座,1666009]
[天秤座,1896544]
[白羊座,1409838]
[射手座,1614915]
[未知,160483]
[狮子座,1613529]
```
It seems that Capricorns (摩蝎座) book hotel rooms the most, clearly more than people of any other sign.
We can also analyze the male-to-female ratio of the guests:
```scala
result = sqlContext.sql("SELECT gender, count(gender) FROM customer WHERE gender = 'F' OR gender = 'M' GROUP BY gender")
result.collect().foreach(println)
```
The results show that about twice as many men as women booked rooms, roughly a 2:1 ratio:
```
[F,6475461]
[M,12763926]
```
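The 2:1 claim can be checked directly from the two counts:

```scala
// Counts reported by the query above.
val f = 6475461L
val m = 12763926L
val ratio = m.toDouble / f
println(f"male/female ratio: $ratio%.2f") // close to 2
```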
Spark SQL statistics: which zodiac sign's people most like to book hotel rooms