Basic Usage of the Spark Shell


Last updated: 2018-08-21. Source: Internet


Basics

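The Spark shell is a powerful tool for interactive data analysis and offers a simple way to learn the API. It is available in Scala (which runs on the Java VM and is therefore a good way to use existing Java libraries) or Python. Start it by running the following from the Spark directory: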

./bin/spark-shell

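In the Spark shell, a dedicated SparkContext has already been created for you in a variable named sc. A SparkContext that you create yourself will not work. Use the --master argument to set which cluster the SparkContext connects to, and --jars to add JAR files to the CLASSPATH; separate multiple JARs with commas. For example, to run spark-shell locally on four cores, use: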

./bin/spark-shell --master local[4]

Or, to also add code.jar to the CLASSPATH, use:

./bin/spark-shell --master local[4] --jars code.jar

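Run spark-shell --help for the complete list of options.

Spark's primary abstraction is a distributed collection called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's create a new RDD from the README.md text file in the Spark source directory.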

scala> val textFile = sc.textFile("file:///home/hadoop/hadoop/spark/README.md")
16/07/24 03:30:53 INFO storage.MemoryStore: ensureFreeSpace(217040) called with curMem=321016, maxMem=280248975
16/07/24 03:30:53 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 212.0 KB, free 266.8 MB)
16/07/24 03:30:53 INFO storage.MemoryStore: ensureFreeSpace(20024) called with curMem=538056, maxMem=280248975
16/07/24 03:30:53 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.6 KB, free 266.7 MB)
16/07/24 03:30:53 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:43303 (size: 19.6 KB, free: 267.2 MB)
16/07/24 03:30:53 INFO spark.SparkContext: Created broadcast 2 from textFile at <console>:21
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at textFile at <console>:21

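Notes: 1. Lines 2 to 7 above are log output and can be ignored for now; the last line is the one that matters. Log output is omitted from the examples that follow. To suppress it, go to the conf folder under the Spark directory, where there is a log4j.properties.template file. Run the following commands to copy it to log4j.properties and then edit the copy.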

cp log4j.properties.template log4j.properties
vim log4j.properties

In log4j.properties, change INFO to WARN so that the INFO-level log messages are no longer printed.

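2. In file:///home/hadoop/hadoop/spark/README.md, the leading file: scheme indicates a path on the local file system; note the three slashes (/) after file:. The /home/hadoop/hadoop/spark portion is my Spark installation directory; replace it with your own.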

RDD actions return values from an RDD, while transformations produce a new RDD and return a reference to it. Here are a few actions:

scala> textFile.count()
res0: Long = 98
scala> textFile.first()
res1: String = # Apache Spark

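Here, count returns the total number of items in the RDD (one per line of the file), and first returns the first item, which is the first line.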

Next, a transformation. We use the filter function on the textFile RDD to keep only the lines that contain the string "Spark", which returns a new RDD:

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23

Actions and transformations can also be chained together:

scala> textFile.filter(line => line.contains("Spark")).count()
res2: Long = 19

This statement counts how many lines contain the string "Spark".
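To see what count, first, and filter compute without starting Spark, the same steps can be tried on an ordinary Scala collection. This is only a sketch using Scala's standard collections, not Spark's RDD API; the object name FilterCountSketch and the sample lines are hypothetical and do not reproduce the real README.md.

```scala
// Plain Scala collections standing in for an RDD; no Spark required.
object FilterCountSketch {
  // Hypothetical sample lines (assumption: not the actual README.md content).
  val sampleLines: Seq[String] = Seq(
    "# Apache Spark",
    "",
    "Spark is a fast and general cluster computing system.",
    "It provides high-level APIs in Scala, Java, and Python."
  )

  // Local analogue of textFile.filter(line => line.contains("Spark")).
  def linesWithSpark(lines: Seq[String]): Seq[String] =
    lines.filter(line => line.contains("Spark"))

  def main(args: Array[String]): Unit = {
    println(sampleLines.size)                 // analogue of textFile.count()
    println(sampleLines.head)                 // analogue of textFile.first()
    println(linesWithSpark(sampleLines).size) // analogue of filter(...).count()
  }
}
```

One difference worth keeping in mind: a Seq filter runs immediately, whereas an RDD transformation is only evaluated when an action such as count is called.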

More RDD Operations

RDD actions and transformations can be used for more complex computations. For example, to find the largest number of words in any single line:

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res3: Int = 14

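This first maps each line to an integer (its word count), producing a new RDD. Calling reduce on that RDD then finds the largest word count of any line. The arguments to map and reduce are Scala function literals (closures), and they can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. Here we use Math.max() to make the code easier to read: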

scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res4: Int = 14
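The map-then-reduce step can be checked by hand on a small input. The sketch below applies the same two function literals to a plain Scala Seq rather than an RDD; the object name MaxWordsSketch and its input are hypothetical.

```scala
// Local analogue of textFile.map(...).reduce(...) using Scala collections.
object MaxWordsSketch {
  // Maps each line to its word count, then keeps the largest count.
  // Note: "".split(" ") yields one empty string, so an empty line counts as 1,
  // and reduce throws on an empty input sequence.
  def maxWordsPerLine(lines: Seq[String]): Int =
    lines
      .map(line => line.split(" ").size)
      .reduce((a, b) => Math.max(a, b))

  def main(args: Array[String]): Unit = {
    // Word counts per line are 3, 1, and 2, so the maximum is 3.
    println(maxWordsPerLine(Seq("a b c", "", "d e")))
  }
}
```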

MapReduce is a common data flow pattern popularized by Hadoop. Spark can implement MapReduce flows easily:

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:24

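Here we combine flatMap, map, and reduceByKey to count how many times each word occurs in the file. The result is an RDD of (String, Int) key-value pairs. We can use the collect action to gather the word counts: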

scala> wordCounts.collect()
res5: Array[(String, Int)] = Array((package,1), (For,2), (Programs,1), (processing.,1), (Because,1), (The,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (have,1), (Try,1), (computation,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), (page](http://spark.apache.org/documentation.html),1), (Once,1), (application,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,2), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (given.,1), (if,4), (build,3), (when,1), (be,2), (Tests,1), (Apache,1), (all,1), (./bin/run-example,2), (programs,,1), (including,3), (Spark.,1), (package.,1), (1000).count(),1), (Versions,1), (HDFS,1), (Data.,1), (>...
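The three stages of this word count can be traced on a small in-memory input. The sketch below is a local approximation with Scala's standard collections: reduceByKey is a Spark operation, so it is emulated here with groupBy followed by a per-key reduce. The object name WordCountSketch and the sample input are hypothetical.

```scala
// Local approximation of flatMap + map + reduceByKey using Scala collections.
object WordCountSketch {
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // one element per word
      .map(word => (word, 1))             // pair each word with the count 1
      .groupBy { case (word, _) => word } // gather the pairs that share a key
      .map { case (word, pairs) =>
        // Sum the 1s per key with the same function passed to reduceByKey.
        (word, pairs.map(_._2).reduce((a, b) => a + b))
      }

  def main(args: Array[String]): Unit = {
    // "the" appears twice; "cat" and "dog" appear once each.
    println(wordCounts(Seq("the cat", "the dog")))
  }
}
```

Because the split is on single spaces only, punctuation stays attached to words, which is why the collected output above contains keys such as processing. and documentation, as separate entries.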

Caching

Spark supports caching datasets in memory, which is very useful when data is accessed repeatedly. A simple example:

scala> linesWithSpark.cache()
res6: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:23
scala> linesWithSpark.count()
res7: Long = 19
scala> linesWithSpark.count()
res8: Long = 19
scala> linesWithSpark.count()
res9: Long = 19

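We first cache the linesWithSpark dataset and then call count on it repeatedly. With a file this small there is no noticeable difference in query speed, but on large datasets caching significantly speeds up iterative operations.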
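The reason caching helps is that an uncached RDD is recomputed from its source each time an action runs, while a cached one is computed once and then reused. The sketch below imitates that difference with plain Scala, counting how often the filter predicate is evaluated. It is an analogy rather than Spark code, and the names CacheSketch, predicateCallsUncached, and predicateCallsCached are hypothetical.

```scala
// Analogy for RDD caching using plain Scala; no Spark required.
object CacheSketch {
  // Re-runs the filter for every count, like repeated actions on an uncached RDD.
  // Returns how many times the predicate was evaluated.
  def predicateCallsUncached(lines: Seq[String], repeats: Int): Int = {
    var calls = 0
    // A def builds a fresh filtered iterator on every use.
    def matching: Iterator[String] =
      lines.iterator.filter { line => calls += 1; line.contains("Spark") }
    (1 to repeats).foreach(_ => matching.size)
    calls
  }

  // Materializes the filtered lines on first use and reuses them afterwards,
  // similar to how cache() takes effect when the first action runs.
  def predicateCallsCached(lines: Seq[String], repeats: Int): Int = {
    var calls = 0
    lazy val cached: Vector[String] =
      lines.iterator.filter { line => calls += 1; line.contains("Spark") }.toVector
    (1 to repeats).foreach(_ => cached.size)
    calls
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("a", "Spark", "b")
    println(predicateCallsUncached(lines, 3)) // 3 lines x 3 counts = 9 evaluations
    println(predicateCallsCached(lines, 3))   // 3 evaluations, all on the first count
  }
}
```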

