(Updated to reflect the latest changes)
Without question, Spark has become the hottest big-data tool. This post walks through installing SparkR in detail, so that you can have it running locally within five minutes.
Environment requirements: Java 7+, R, and RStudio,
plus Rtools (https://cran.r-project.org/bin/windows/Rtools/)
Step 1: Download Spark
Open http://spark.apache.org/ in a browser and click the green "Download Spark" button on the right.
You will see a page like the following:
Follow items 1 through 3 on that page to build the download link.
Under "2. Choose a package type", select a pre-built type.
Since we plan to run locally on Windows, choose "Pre-built package for Hadoop 2.6 and later".
Under "3. Choose a download type", select "Direct Download".
Once that is done, a download link is generated under "4. Download Spark".
Download this archive to your computer.
Step 2: Unpack the archive
Unpack it to a path such as "C:/Apache/Spark-1.4.1".
Step 3: Run from the command line (this step only works after R and the other environment variables have been configured; if you don't need a command-line session, you can skip this step)
Open a command prompt (Start menu, then type cmd in the search box) and change to the Spark directory, e.g. cd C:\Apache\Spark-1.4.1. Then
enter the command ".\bin\sparkR"
If it works, you will see some log output, and after roughly 15 seconds the message "Welcome to SparkR!" appears.
Setting the environment variables:
Right-click "My Computer" and choose "Properties":
Select "Advanced system settings".
Click "Environment Variables", find Path under "System variables", and append "C:\ProgramData\Oracle\Java\javapath;".
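As a sketch, the same PATH change can also be made from a command prompt with setx instead of the dialog. The Java directory below is the one assumed in this step; adjust it to your own installation:

```shell
:: Windows cmd sketch: append the Java directory to the user PATH.
:: Note: %PATH% here expands to the combined user+system PATH, so this
:: is a rough one-shot sketch rather than a careful PATH editor.
setx PATH "%PATH%;C:\ProgramData\Oracle\Java\javapath"
```

After changing PATH, open a new command prompt so the change takes effect.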
Step 4: Run in RStudio
# (An example)
# Set the system environment variables
Sys.setenv(SPARK_HOME = "C:/Apache/spark-1.6.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# Note: copy the SparkR folder from the spark-1.6.1/R/lib directory into R's library directory, otherwise the SparkR package cannot be loaded directly.
You can inspect R's library paths as follows:
.libPaths() — by default, new packages are installed into the first path listed (the default library).
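The copy described in the note above can also be done from R itself. A minimal sketch, assuming SPARK_HOME has already been set as in the example:

```r
# Copy the bundled SparkR package from Spark's R/lib directory into the
# first (default) library path reported by .libPaths()
sparkr_pkg <- file.path(Sys.getenv("SPARK_HOME"), "R", "lib", "SparkR")
file.copy(sparkr_pkg, .libPaths()[1], recursive = TRUE)
```

With recursive = TRUE, file.copy copies the whole SparkR directory, after which library(SparkR) can find the package without modifying .libPaths().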
# Load the SparkR library
library(SparkR)

# Create a Spark context and a SQL context
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Create a SparkR DataFrame
DF <- createDataFrame(sqlContext, faithful)
head(DF)

# Create a simple local data.frame
localDF <- data.frame(name = c("John", "Smith", "Sarah"), age = c(19, 23, 18))

# Convert the local data frame to a SparkR DataFrame
df <- createDataFrame(sqlContext, localDF)

# Print its schema
printSchema(df)
# root
# |-- name: string (nullable = true)
# |-- age: double (nullable = true)

# Create a DataFrame from a JSON file
path <- file.path(Sys.getenv("SPARK_HOME"), "examples/src/main/resources/people.json")
peopleDF <- jsonFile(sqlContext, path)
printSchema(peopleDF)

# Register this DataFrame as a table
registerTempTable(peopleDF, "people")

# SQL statements can be run by using the sql methods provided by sqlContext
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")

# Call collect to get a local data.frame
teenagersLocalDF <- collect(teenagers)

# Print the teenagers in our dataset
print(teenagersLocalDF)

# Stop the SparkContext now
sparkR.stop()
# Another example: wordcount
# Source: http://www.cnblogs.com/hseagle/p/3998853.html
sc <- sparkR.init(master = "local", "RwordCount")
lines <- textFile(sc, "README.md")
Note: the textFile function no longer works as of SparkR 1.4; later versions of SparkR must load data through a SQLContext, as shown here:
people <- read.df(sqlContext,"./examples/src/main/resources/people.json", "json")
Besides JSON, sources such as CSV, Parquet, and Hive tables are also supported.
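For illustration, loading those other sources might look like the sketch below. The file names and table name are hypothetical; in Spark 1.x, CSV requires the third-party spark-csv package on the classpath, and Hive tables require a HiveContext:

```r
# Parquet: a built-in data source (hypothetical file name)
parquetDF <- read.df(sqlContext, "people.parquet", source = "parquet")

# CSV via the external spark-csv package (hypothetical file name;
# the package must be on Spark's classpath)
csvDF <- read.df(sqlContext, "people.csv",
                 source = "com.databricks.spark.csv", header = "true")

# Hive: create a HiveContext, then query a table with SQL
# ("my_table" is a hypothetical table name)
hiveContext <- sparkRHive.init(sc)
hiveDF <- sql(hiveContext, "SELECT * FROM my_table")
```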
words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}
Original post: http://www.r-bloggers.com/installing-and-starting-sparkr-locally-on-windows-os-and-rstudio/
References:
1. Installation: http://blog.csdn.net/jediael_lu/article/details/45310321
2. Installation: http://thinkerou.com/2015-05/How-to-Build-Spark-on-Windows/
3. 徽滬一郎's blog: http://www.cnblogs.com/hseagle/p/3998853.html
4. Learning: http://www.r-bloggers.com/a-first-look-at-spark/
5. Learning: http://www.danielemaasit.com/getting-started-with-sparkr/
6. Troubleshooting: http://stackoverflow.com/questions/10077689/r-cmd-on-windows-7-error-r-is-not-recognized-as-an-internal-or-external-comm
7. Official SparkR guide: http://spark.apache.org/docs/latest/sparkr.html#from-local-data-frames (Chinese translation: http://www.iteblog.com/archives/1385)
[Repost, with corrections] Installing SparkR locally on Windows with RStudio