Tags: hadoop spark terminal ubuntu
1. Copy the file to HDFS:

hadoop@Mhadoop:/usr/local/hadoop$ bin/hdfs dfs -mkdir /user
hadoop@Mhadoop:/usr/local/hadoop$ bin/hdfs dfs -mkdir /user/hadoop
hadoop@Mhadoop:/usr/local/hadoop$ bin/hdfs dfs -copyFromLocal /usr/local/spark/spark-1.3.1-bin-hadoop2.4/README.md /user/hadoop/
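To confirm the upload landed, the same hdfs CLI can list the target directory and read the file back; a quick check, assuming the cluster from step 1 is running (paths follow this post):

```shell
# List the HDFS home directory to confirm README.md arrived
bin/hdfs dfs -ls /user/hadoop
# Print the first few lines straight from HDFS as a sanity check
bin/hdfs dfs -cat /user/hadoop/README.md | head -n 5
```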
2. Run spark-shell
3. Read the file and count how often the word "spark" appears

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@...

scala> val file = sc.textFile("hdfs://Mhadoop:9000/user/hadoop/README.md")
file: org.apache.spark.rdd.RDD[String] = hdfs://Mhadoop:9000/user/hadoop/README.md MapPartitionsRDD[1] at textFile at <console>:21
The file variable is a MapPartitionsRDD. Next, filter for lines containing the word "spark":
scala> val sparks = file.filter(line => line.contains("spark"))
sparks: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23
Count the lines containing "spark"; the result is 11:

scala> sparks.count

Open another terminal and double-check with Ubuntu's built-in wc command:

hadoop@Mhadoop:/usr/local/spark/spark-1.3.1-bin-hadoop2.4$ grep spark README.md | wc
     11      50     761
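The three numbers wc prints are the lines, words, and bytes of grep's output; only the first (11) corresponds to sparks.count, since the filter keeps whole lines. A small local demo (hypothetical sample file, no Hadoop needed) shows the correspondence:

```shell
# Build a tiny sample file (hypothetical content, just for illustration)
cat > /tmp/wc_demo.md <<'EOF'
spark is fast
hadoop runs elsewhere
use the spark shell
EOF

# wc on grep's output reports: lines, words, bytes
grep spark /tmp/wc_demo.md | wc

# grep -c counts matching lines directly, mirroring sparks.count
grep -c spark /tmp/wc_demo.md
```

Here two of the three sample lines contain "spark", so grep -c prints 2.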
4. Run Spark's cache and look at the efficiency gain

scala> sparks.cache
res3: sparks.type = MapPartitionsRDD[2] at filter at <console>:23
Open the web console:
http://192.168.85.10:4040/stages/
After caching, the elapsed time for the same count drops from seconds to milliseconds, a clear performance improvement.
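The before/after timing can also be reproduced non-interactively by piping the same Scala session into spark-shell; a hedged sketch, assuming the cluster and HDFS URI from this post are reachable (stage times show up on the :4040 page as above):

```shell
# Feed a short Scala session to spark-shell on stdin (requires the running cluster)
bin/spark-shell <<'EOF'
val file = sc.textFile("hdfs://Mhadoop:9000/user/hadoop/README.md")
val sparks = file.filter(line => line.contains("spark"))
sparks.count()   // first action: lines are read from HDFS
sparks.cache()   // mark the filtered RDD for in-memory caching
sparks.count()   // this action materializes the cache
sparks.count()   // served from memory; compare stage times at :4040
EOF
```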
A first taste of spark-shell