Spark SQL Statistics which constellation of people most like to open room

Source: Internet
Author: User

Recently, it's interesting to see an example that has been specially reproduced.

Analyze data using Spark SQL

In this step, we use Spark SQL to group the 2000W data by constellation to see which constellation people prefer to open the room.
Of course, using pure spark can also do our analysis, because the actual spark SQL is ultimately done with Spark.
The actual test found that this data is not fully adhere to a schema, some of the data format is not correct, some data items are also wrong. In the code we want to eliminate that disturbing data.
Anyway, we use this data tester to play, and there is no strict requirements to organize which error data.

Look at the code first:

val sqlContext =  new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD case class  Customer(name: String, gender: String, ctfId: String, birthday: String, address: String) val customer = sc.textFile( "/mnt/share/2000W/*.csv" ).map(_.split( "," )).filter(line => line.length >  7 ).map(p => Customer(p( 0 ), p( 5 ), p( 4 ), p( 6 ), p( 7 ))).distinct() customer.registerTempTable( "customer" ) def toInt(s: String):Int = { try { s.toInt catch { case e:Exception =>  9999 } } def myfun(birthday: String) : String = { var rt =  "未知" if (birthday.length ==  8 ) { val md = toInt(birthday.substring( 4 )) if (md >=  120 & md <=  219 ) rt =  "水瓶座" else if  (md >=  220 & md <=  320 ) rt =  "双鱼座" else if  (md >=  321 & md <=  420 ) rt =  "白羊座" else if  (md >=  421 & md <=  521 ) rt =  "金牛座" else if  (md >=  522 & md <=  621 ) rt =  "双子座" else if  (md >=  622 & md <=  722 ) rt =  "巨蟹座" else if  (md >=  723 & md <=  823 ) rt =  "狮子座" else if  (md >=  824 & md <=  923 ) rt =  "处女座" else if  (md >=  924 & md <=  1023 ) rt =  "天秤座" else if  (md >=  1024 & md <=  1122 ) rt =  "天蝎座" else if  (md >=  1123 & md <=  1222 ) rt =  "射手座" else if  ((md >=  1223 & md <=  1231 ) | (md >=  101 & md <=  119 )) rt =  "摩蝎座" else rt =  "未知" } rt } sqlContext.registerFunction( "constellation" , (x:String) => myfun(x)) var result = sqlContext.sql( "SELECT constellation(birthday), count(constellation(birthday)) FROM customer group by constellation(birthday)" ) result.collect().foreach(println)

In order to use Spark SQL, you need to introduce sqlContext.createSchemaRDD . A core object of Spark SQL is SchemaRDD . The above import can implicitly convert an RDD to Schemardd.
Then we define the Customer class, which is used to map the data of each row, we use only a little information, such as address, email and so on.
Next, read all the CSV files from the 2000W folder, create an RDD and register the customer.
Since none of the built-in functions can map birth together as constellations, we need to define a mapping function myfun and register it in Sparkcontext. This allows us to use this function in SQL statements. Similarly, the length function of the string is not currently supported, and you can add one of these functions. Since some dates are incorrect, all of them have added an "unknown" constellation in particular. There may be two kinds of error data, one is a date error, but this line is not formatted, and the other fields are mapped to the date of birth. We ignore them when we analyze them.

Then execute a grouped SQL statement. The SQL statement query result type is Schemardd, and it inherits all the actions of the RDD.
Finally, the results are printed out.

[双子座,1406018][双鱼座,1509839][摩蝎座,2404812][金牛座,1406225][水瓶座,1635358][巨蟹座,1498077][处女座,1666009][天秤座,1896544][白羊座,1409838][射手座,1614915][未知,160483][狮子座,1613529]

It seems that the scorpion people like to open a room, obviously more than the other constellations of people.

We can also analyze the ratio of men to women in the open room:

?
123 ......result = sqlContext.sql("SELECT gender, count(gender) FROM customer where gender = ‘F‘ or gender = ‘M‘ group by gender")result.collect().foreach(println)

The results show that the number of men and women who open rooms is about 2:1

?
12 [F,6475461][M,12763926]

Spark SQL Statistics which constellation of people most like to open room

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.