Zhou Zhihu L.
It's the holidays, and I can finally spare some time to update the blog...
1. Get Data
This article gives a detailed introduction to Spark SQL, using the git log of the Spark project on GitHub as its data.
The data acquisition command is as follows:
[root@master spark]# git log --pretty=format:'{"commit":"%H","author":"%an","author_email":"%ae","date":"%ad","message":"%f"}' > sparktest.json
The formatted log content looks like this:
[root@master spark]# head -1 sparktest.json
{"commit":"30b706b7b36482921ec04145a0121ca147984fa8","author":"Josh Rosen","author_email":"[email protected]","date":"Fri Nov 6 18:17:3... 2015 -0800","message":"SPARK-11389-CORE-Add-support-for-off-heap-memory-to-MemoryManager"}
Then upload the sparktest.json file to HDFS:
[root@master spark]# hadoop dfs -put sparktest.json /data/
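As a quick sanity check, you can read the file straight back from HDFS in the Spark shell. This is a minimal sketch, not part of the original walkthrough; sc is the SparkContext that spark-shell provides:

// Read the raw JSON lines back from HDFS and peek at the first record.
val raw = sc.textFile("/data/sparktest.json")
raw.take(1).foreach(println)  // one JSON object per line
println(raw.count)            // total number of commit records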
2. Create a DataFrame
Create a DataFrame from the data:
scala> val df = sqlContext.read.json("/data/sparktest.json")
16/02/05 09:59:56 INFO json.JSONRelation: Listing hdfs://ns1/data/sparktest.json on driver
View its schema:
scala> df.printSchema
root
 |-- author: string (nullable = true)
 |-- author_email: string (nullable = true)
 |-- commit: string (nullable = true)
 |-- date: string (nullable = true)
 |-- message: string (nullable = true)
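Note that the date column is just a string. If you want to sort or range-filter by time, the following minimal sketch (my addition, assuming the default git %ad layout shown in step 1; unix_timestamp is available since Spark 1.5) converts it into a real timestamp:

import org.apache.spark.sql.functions.{col, unix_timestamp}

// Parse "Fri Nov 6 18:17:30 2015 -0800"-style strings into a timestamp column;
// the pattern string follows java.text.SimpleDateFormat.
val withTs = df.withColumn("date_ts",
  unix_timestamp(col("date"), "EEE MMM d HH:mm:ss yyyy Z").cast("timestamp"))
withTs.select("author", "date_ts").show(2)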
3. DataFrame Methods in Action
(1) Display the first two rows of data
scala> df.show(2)
+----------------+--------------------+--------------------+--------------------+--------------------+
|          author|        author_email|              commit|                date|             message|
+----------------+--------------------+--------------------+--------------------+--------------------+
|      Josh Rosen|   [email protected]|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...|
|Michael Armbrust|   [email protected]|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|
+----------------+--------------------+--------------------+--------------------+--------------------+
(2) Calculate the total number of commits
scala> df.count
res4: Long = 13507
This matches the number of commits shown on GitHub; as you can see, the results are consistent.
(3) Sort by number of commits in descending order
scala> df.groupBy("author").count.sort($"count".desc).show
+--------------------+-----+
|              author|count|
+--------------------+-----+
|       Matei Zaharia| 1590|
|         Reynold Xin| 1071|
|     Patrick Wendell|  857|
|       Tathagata Das|  416|
|          Josh Rosen|  348|
|  Mosharaf Chowdhury|  290|
|           Andrew Or|  287|
|       Xiangrui Meng|  285|
|          Davies Liu|  281|
|          Ankur Dave|  265|
|          Cheng Lian|  251|
|    Michael Armbrust|  243|
|             zsxwing|  200|
|           Sean Owen|  197|
|     Prashant Sharma|  186|
|  Joseph E. Gonzalez|  185|
|            Yin Huai|  177|
|Shivaram Venkatar...|  173|
|      Aaron Davidson|  164|
|      Marcelo Vanzin|  142|
+--------------------+-----+
only showing top 20 rows
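The same API also composes with filters. As a small sketch (my addition; the author name is only an illustration), you can count the commits of a single contributor:

// === compares column values; the $-syntax works because spark-shell
// auto-imports the implicits (import sqlContext.implicits._ in standalone code).
df.filter($"author" === "Josh Rosen").count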
4. Registering a DataFrame as a Temporary Table
Use the following statement to register the DataFrame as a temporary table (registerTempTable returns Unit, so there is no need to capture a result):
df.registerTempTable("commitlog")
(1) Display the first two rows of data
scala> sqlContext.sql("SELECT * FROM commitlog").show(2)+----------------+--------------------+--------------------+--------------------+--------------------+| author| author_email| commit| date| message|+----------------+--------------------+--------------------+--------------------+--------------------+| Josh Rosen|[email protected]|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...||Michael Armbrust|[email protected]|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|+----------------+--------------------+--------------------+--------------------+--------------------+
(2) Calculate the total number of commits
scala> sqlContext.sql("SELECT count(*) as TotalCommitNumber FROM commitlog").show+-----------------+|TotalCommitNumber|+-----------------+| 13507|+-----------------+
(3) Sort by number of commits in descending order
scala> sqlContext.sql("SELECT author, count(*) AS countNumber FROM commitlog GROUP BY author ORDER BY countNumber DESC").show
+--------------------+-----------+
|              author|countNumber|
+--------------------+-----------+
|       Matei Zaharia|       1590|
|         Reynold Xin|       1071|
|     Patrick Wendell|        857|
|       Tathagata Das|        416|
|          Josh Rosen|        348|
|  Mosharaf Chowdhury|        290|
|           Andrew Or|        287|
|       Xiangrui Meng|        285|
|          Davies Liu|        281|
|          Ankur Dave|        265|
|          Cheng Lian|        251|
|    Michael Armbrust|        243|
|             zsxwing|        200|
|           Sean Owen|        197|
|     Prashant Sharma|        186|
|  Joseph E. Gonzalez|        185|
|            Yin Huai|        177|
|Shivaram Venkatar...|        173|
|      Aaron Davidson|        164|
|      Marcelo Vanzin|        142|
+--------------------+-----------+
You can try more complex queries on your own; this article only contrasts the DataFrame methods with SQL statements over a temporary table, so as to give a holistic picture of both. One example of such a query follows below.
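As a hedged sketch of a slightly more complex query (my addition), the statement below counts commits per month name by slicing the raw date string; the substr positions assume the default git date layout shown in step 1:

// substr(date, 5, 3) picks "Nov" out of "Fri Nov 6 18:17:30 2015 -0800".
sqlContext.sql(
  "SELECT substr(date, 5, 3) AS month, count(*) AS commits " +
  "FROM commitlog GROUP BY substr(date, 5, 3) ORDER BY commits DESC"
).show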
Spark Cultivation Path (Advanced), Spark from Getting Started to Mastery, Section 10: Spark SQL Case Study (1)