Zhou Zhihu L.
It's the holidays, and I can finally spare some time to update the blog...
1. Get Data
This article gives a detailed introduction to Spark SQL, using the git log of the Spark project on GitHub as its data.
The data acquisition command is as follows:
[root@master spark]# git log --pretty=format:'{"commit":"%H","author":"%an","author_email":"%ae","date":"%ad","message":"%f"}' > sparktest.json
The formatted log content looks like this:
[root@master spark]# head -1 sparktest.json
{"commit":"30b706b7b36482921ec04145a0121ca147984fa8","author":"Josh Rosen","author_email":"[email protected]","date":"Fri Nov 6 18:17:3... 2015 -0800","message":"SPARK-11389-CORE-Add-support-for-off-heap-memory-to-MemoryManager"}
Then upload the sparktest.json file to HDFS:
[root@master spark]# hadoop dfs -put sparktest.json /data/
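As a quick sanity check, you can read the file straight back from HDFS in the Spark shell. This is a minimal sketch, not part of the original walkthrough; sc is the SparkContext that spark-shell provides:

// Read the raw JSON lines back from HDFS and peek at the first record.
val raw = sc.textFile("/data/sparktest.json")
raw.take(1).foreach(println)  // one JSON object per line
println(raw.count)            // total number of commit records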
2. Create a DataFrame
Create a DataFrame from the data:
scala> val df = sqlContext.read.json("/data/sparktest.json")
16/02/05 09:59:56 INFO json.JSONRelation: Listing hdfs://ns1/data/sparktest.json on driver
View its schema:
scala> df.printSchema
root
 |-- author: string (nullable = true)
 |-- author_email: string (nullable = true)
 |-- commit: string (nullable = true)
 |-- date: string (nullable = true)
 |-- message: string (nullable = true)
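Note that the date column is just a string. If you want to sort or range-filter by time, the following minimal sketch (my addition, assuming the default git %ad layout shown in step 1; unix_timestamp is available since Spark 1.5) converts it into a real timestamp:

import org.apache.spark.sql.functions.{col, unix_timestamp}

// Parse "Fri Nov 6 18:17:30 2015 -0800"-style strings into a timestamp column;
// the pattern string follows java.text.SimpleDateFormat.
val withTs = df.withColumn("date_ts",
  unix_timestamp(col("date"), "EEE MMM d HH:mm:ss yyyy Z").cast("timestamp"))
withTs.select("author", "date_ts").show(2)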
3. DataFrame Methods in Action
(1) Display the first two rows of data
scala> df.show(2)
+----------------+--------------------+--------------------+--------------------+--------------------+
|          author|        author_email|              commit|                date|             message|
+----------------+--------------------+--------------------+--------------------+--------------------+
|      Josh Rosen|   [email protected]|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...|
|Michael Armbrust|   [email protected]|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|
+----------------+--------------------+--------------------+--------------------+--------------------+
(2) Calculate the total number of commits
scala> df.count
res4: Long = 13507
This matches the number of commits shown on GitHub; as you can see, the results are consistent.
(3) Sort by number of commits in descending order
scala> df.groupBy("author").count.sort($"count".desc).show
+--------------------+-----+
|              author|count|
+--------------------+-----+
|       Matei Zaharia| 1590|
|         Reynold Xin| 1071|
|     Patrick Wendell|  857|
|       Tathagata Das|  416|
|          Josh Rosen|  348|
|  Mosharaf Chowdhury|  290|
|           Andrew Or|  287|
|       Xiangrui Meng|  285|
|          Davies Liu|  281|
|          Ankur Dave|  265|
|          Cheng Lian|  251|
|    Michael Armbrust|  243|
|             zsxwing|  200|
|           Sean Owen|  197|
|     Prashant Sharma|  186|
|  Joseph E. Gonzalez|  185|
|            Yin Huai|  177|
|Shivaram Venkatar...|  173|
|      Aaron Davidson|  164|
|      Marcelo Vanzin|  142|
+--------------------+-----+
only showing top 20 rows
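The same API also composes with filters. As a small sketch (my addition; the author name is only an illustration), you can count the commits of a single contributor:

// === compares column values; the $-syntax works because spark-shell
// auto-imports the implicits (import sqlContext.implicits._ in standalone code).
df.filter($"author" === "Josh Rosen").count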
4. Registering a DataFrame as a Temporary Table
Use the following statement to register the DataFrame as a temporary table (registerTempTable returns Unit, so there is no need to capture a result):
df.registerTempTable("commitlog")
(1) Display the first two rows of data
scala> sqlContext.sql("SELECT * FROM commitlog").show(2)+----------------+--------------------+--------------------+--------------------+--------------------+| author| author_email| commit| date| message|+----------------+--------------------+--------------------+--------------------+--------------------+| Josh Rosen|[email protected]|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...||Michael Armbrust|[email protected]|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|+----------------+--------------------+--------------------+--------------------+--------------------+
(2) Calculate the total number of commits
scala> sqlContext.sql("SELECT count(*) as TotalCommitNumber FROM commitlog").show+-----------------+|TotalCommitNumber|+-----------------+| 13507|+-----------------+
(3) Sort by number of commits in descending order
scala> sqlContext.sql("SELECT author, count(*) AS countNumber FROM commitlog GROUP BY author ORDER BY countNumber DESC").show
+--------------------+-----------+
|              author|countNumber|
+--------------------+-----------+
|       Matei Zaharia|       1590|
|         Reynold Xin|       1071|
|     Patrick Wendell|        857|
|       Tathagata Das|        416|
|          Josh Rosen|        348|
|  Mosharaf Chowdhury|        290|
|           Andrew Or|        287|
|       Xiangrui Meng|        285|
|          Davies Liu|        281|
|          Ankur Dave|        265|
|          Cheng Lian|        251|
|    Michael Armbrust|        243|
|             zsxwing|        200|
|           Sean Owen|        197|
|     Prashant Sharma|        186|
|  Joseph E. Gonzalez|        185|
|            Yin Huai|        177|
|Shivaram Venkatar...|        173|
|      Aaron Davidson|        164|
|      Marcelo Vanzin|        142|
+--------------------+-----------+
You can try more complex queries on your own; this article only contrasts the DataFrame methods with SQL statements over a temporary table, so as to give a holistic picture of both. One example of such a query follows below.
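As a hedged sketch of a slightly more complex query (my addition), the statement below counts commits per month name by slicing the raw date string; the substr positions assume the default git date layout shown in step 1:

// substr(date, 5, 3) picks "Nov" out of "Fri Nov 6 18:17:30 2015 -0800".
sqlContext.sql(
  "SELECT substr(date, 5, 3) AS month, count(*) AS commits " +
  "FROM commitlog GROUP BY substr(date, 5, 3) ORDER BY commits DESC"
).show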
Spark Cultivation Path (Advanced), Spark from Getting Started to Mastery, Section 10: Spark SQL Case Study (1)