Spark Cultivation Path (Advanced)--Spark from Beginner to Mastery: Section 10, Spark SQL Case Study (I)


Zhou Zhihu L.

It's a holiday, and I finally have some spare time to update the blog....

1. Get Data

This article gives a detailed introduction to Spark SQL, using the git log of the Spark project on GitHub as the data set.
The command to acquire the data is as follows:

[root@master spark]# git log --pretty=format:'{"commit":"%H","author":"%an","author_email":"%ae","date":"%ad","message":"%f"}' > sparktest.json

The formatted log output looks like this:

[root@master spark]# head -1 sparktest.json
{"commit":"30b706b7b36482921ec04145a0121ca147984fa8","author":"Josh Rosen","author_email":"[email protected]","date":"Fri Nov 6 18:17:3... 2015 -0800","message":"SPARK-11389-CORE-Add-support-for-off-heap-memory-to-MemoryManager"}

Then upload the sparktest.json file to HDFS with the following command:

[root@master spark]# hadoop dfs -put sparktest.json /data/
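To confirm the upload succeeded, you can list the target directory; a quick sanity check, assuming the same /data/ path as above (newer Hadoop releases prefer hdfs dfs over hadoop dfs, but both work here):

[root@master spark]# hadoop dfs -ls /data/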
2. Create a DataFrame

Create a DataFrame from the data:

scala> val df = sqlContext.read.json("/data/sparktest.json")
16/02/05 09:59:56 INFO json.JSONRelation: Listing hdfs://ns1/data/sparktest.json on driver

To view its schema:

scala> df.printSchema
root
 |-- author: string (nullable = true)
 |-- author_email: string (nullable = true)
 |-- commit: string (nullable = true)
 |-- date: string (nullable = true)
 |-- message: string (nullable = true)
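Everything in this article runs inside spark-shell, where the sc and sqlContext values are created automatically. As a rough sketch, assuming a Spark 1.x build matching this article, the same steps in a standalone application would look like this (the object and app names are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CommitLogApp {
  def main(args: Array[String]): Unit = {
    // spark-shell creates these for you; a standalone app must build them itself
    val conf = new SparkConf().setAppName("CommitLogApp")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // same call as in the shell session above
    val df = sqlContext.read.json("/data/sparktest.json")
    df.printSchema()
  }
}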
3. DataFrame Methods in Action

(1) Display the first two rows of data

scala> df.show(2)
+----------------+--------------------+--------------------+--------------------+--------------------+
|          author|        author_email|              commit|                date|             message|
+----------------+--------------------+--------------------+--------------------+--------------------+
|      Josh Rosen|   [email protected]|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...|
|Michael Armbrust|   [email protected]|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|
+----------------+--------------------+--------------------+--------------------+--------------------+

(2) Count the total number of commits

scala> df.count
res4: Long = 13507

GitHub reports the same number of commits for the repository; as you can see, the two results are consistent.
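count combines naturally with filter when you only want a subset; for example, a hypothetical check of a single author's commit count (the author name is just an illustration):

scala> df.filter($"author" === "Josh Rosen").count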

(3) Sort authors by number of commits in descending order

scala> df.groupBy("author").count.sort($"count".desc).show
+--------------------+-----+
|              author|count|
+--------------------+-----+
|       Matei Zaharia| 1590|
|         Reynold Xin| 1071|
|     Patrick Wendell|  857|
|       Tathagata Das|  416|
|          Josh Rosen|  348|
|  Mosharaf Chowdhury|  290|
|           Andrew Or|  287|
|       Xiangrui Meng|  285|
|          Davies Liu|  281|
|          Ankur Dave|  265|
|          Cheng Lian|  251|
|    Michael Armbrust|  243|
|             zsxwing|  200|
|           Sean Owen|  197|
|     Prashant Sharma|  186|
|  Joseph E. Gonzalez|  185|
|            Yin Huai|  177|
|Shivaram Venkatar...|  173|
|      Aaron Davidson|  164|
|      Marcelo Vanzin|  142|
+--------------------+-----+
only showing top 20 rows
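The same aggregation can also be written with agg and the helpers in org.apache.spark.sql.functions, which becomes handy once you need more than one aggregate per group; a minimal sketch:

scala> import org.apache.spark.sql.functions._
scala> df.groupBy("author").agg(count("*").alias("count")).orderBy(desc("count")).show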
4. Registering the DataFrame as a Temporary Table

Use the following statement to register the DataFrame as a temporary table (registerTempTable returns Unit, so there is no need to capture the result in a val):

df.registerTempTable("commitlog")
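A table registered this way lives only as long as the SQLContext that created it; nothing is persisted to disk. When you are done with it, Spark 1.x also lets you drop it explicitly:

sqlContext.dropTempTable("commitlog")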

(1) Display the first two rows of data

scala> sqlContext.sql("SELECT * FROM commitlog").show(2)+----------------+--------------------+--------------------+--------------------+--------------------+|          author|        author_email|              commit|                date|             message|+----------------+--------------------+--------------------+--------------------+--------------------+|      Josh Rosen|[email protected]|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...||Michael Armbrust|[email protected]|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|+----------------+--------------------+--------------------+--------------------+--------------------+

(2) Count the total number of commits

scala> sqlContext.sql("SELECT count(*) as TotalCommitNumber  FROM commitlog").show+-----------------+|TotalCommitNumber|+-----------------+|            13507|+-----------------+

(3) Sort authors by number of commits in descending order

scala> sqlContext.sql("SELECT author,count(*) as CountNumber FROM commitlog GROUP BY author ORDER BY CountNumber DESC").show
+--------------------+-----------+
|              author|CountNumber|
+--------------------+-----------+
|       Matei Zaharia|       1590|
|         Reynold Xin|       1071|
|     Patrick Wendell|        857|
|       Tathagata Das|        416|
|          Josh Rosen|        348|
|  Mosharaf Chowdhury|        290|
|           Andrew Or|        287|
|       Xiangrui Meng|        285|
|          Davies Liu|        281|
|          Ankur Dave|        265|
|          Cheng Lian|        251|
|    Michael Armbrust|        243|
|             zsxwing|        200|
|           Sean Owen|        197|
|     Prashant Sharma|        186|
|  Joseph E. Gonzalez|        185|
|            Yin Huai|        177|
|Shivaram Venkatar...|        173|
|      Aaron Davidson|        164|
|      Marcelo Vanzin|        142|
+--------------------+-----------+
only showing top 20 rows

You can try more complex queries yourself; this article only demonstrates the difference in usage between the DataFrame methods and SQL statements against the temporary table, so as to give a holistic picture of both. A sketch of one such query follows below.
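As one example of a "more complex" query, here is the same question asked both ways: which authors have more than 1000 commits (the threshold is arbitrary, chosen only for illustration):

scala> df.groupBy("author").count.filter($"count" > 1000).sort($"count".desc).show

scala> sqlContext.sql("SELECT author, count(*) AS CountNumber FROM commitlog GROUP BY author HAVING count(*) > 1000 ORDER BY CountNumber DESC").show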

