Azure HDInsight and Spark Big Data in Practice (Part 2)

Last updated: 2015-08-04. Source: Internet

HDInsight cluster on Linux

Log in to the Azure portal (https://manage.windowsazure.com).

Click the NEW button in the lower-left corner, then click DATA SERVICES, click HDINSIGHT, and select HADOOP ON LINUX.

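Enter the cluster name, choose the cluster size and account, and set the cluster password and storage account. The table below explains each parameter and how to configure it.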

Name	Value
Cluster Name	Name of the cluster.
Cluster Size	Number of data nodes you want to deploy. The default value is 4. But the option to use 1 or 2 data nodes is also available from the drop-down. Any number of cluster nodes can be specified by using the Custom Create option. Pricing details on the billing rates for various cluster sizes are available. Click the ? symbol just above the drop-down box and follow the link on the pop-up.
Password	The password for the HTTP account (default user name: admin) and SSH account (default user name: hdiuser). Note that these are NOT the administrator accounts for the virtual machines on which the clusters are provisioned.
Storage Account	Select the Storage account you created from the drop-down box. Once a Storage account is chosen, it cannot be changed. If the Storage account is removed, the cluster will no longer be available for use. The HDInsight cluster is co-located in the same datacenter as the Storage account.

Click CREATE HDINSIGHT CLUSTER to create the Hadoop cluster running on Azure.

The steps above quickly create a Linux cluster running Hadoop, with the default SSH user name hdiuser and the default HTTP account name admin. To use custom options, such as creating the cluster with SSH key authentication or attaching additional storage, see Provision Hadoop Linux clusters in HDInsight using custom options (https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-provision-linux-clusters/).

Installing Spark

In HDInsight, click the Hadoop cluster you created (named Hadooponlinux in this example) to open its dashboard.

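Under quick glance, copy the value of Cluster Connection String. This is the address of Ambari, the configuration console for Hadoop on Linux. Paste the value into a browser, and a prompt for a user name and password appears. The user name is admin, the default HTTP user name from the quick-create step, and the password is the one you set when creating the Hadoop cluster.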

After you enter the correct user name and password, Ambari shows its own login prompt. Enter the user name admin and the password hadoop to open the Ambari management console.

The following steps show the Spark installation process using Ambari.

Select the Ambari "Services" tab.

In the Ambari "Actions" drop-down menu, select "Add Service." This starts the Add Service wizard.

Select "Spark", then click "Next".

(For HDP 2.2.4, Ambari will install Spark version 1.2.1, not 1.2.0.2.2.)

Ambari displays a warning message. Confirm that the cluster is running HDP 2.2.4 or later, then click "Proceed".

	Note
	You can reconfirm component versions in Step 6 before finalizing the upgrade.

Select a node for the Spark History Server, then click "Next" to continue.

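Specify the Spark slaves, then click "Next" to continue.
On the Customize Services screen, we recommend keeping the default values for your initial configuration, then click "Next" to continue.
Ambari displays a confirmation screen. Click "Deploy" to continue.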

	Important
	On the Review screen, make sure all HDP components are version 2.2.4 or later.

Ambari displays the Install, Start and Test screen, where the status bar and messages indicate progress.
When Ambari finishes, click "Complete" to end the Spark installation.
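
Once the wizard finishes, the state of the new service can also be checked from a script. The following is a minimal sketch, assuming the Ambari REST API is reachable at the same address as the web console and that the service is registered under the name SPARK. The base URL and password are placeholders, not values from this article, and a managed HDInsight gateway may expose the API differently.

```python
import base64
import json
import urllib.request


def build_service_request(base_url, cluster, service, user, password):
    """Build (but do not send) a GET request for one Ambari service."""
    url = "{}/api/v1/clusters/{}/services/{}".format(
        base_url.rstrip("/"), cluster, service)
    token = base64.b64encode("{}:{}".format(user, password).encode("utf-8"))
    request = urllib.request.Request(url)
    # HTTP basic authentication, the same credentials as the web login.
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


def service_state(response_body):
    """Extract the service state (for example STARTED) from a JSON reply."""
    payload = json.loads(response_body)
    return payload["ServiceInfo"]["state"]


if __name__ == "__main__":
    # Placeholder address and password: replace with your own values.
    req = build_service_request(
        "https://ambari.example.invalid", "Hadooponlinux", "SPARK",
        "admin", "your-password")
    print(req.full_url)
    # Sending the request needs a live cluster, so it is left commented out:
    # with urllib.request.urlopen(req) as resp:
    #     print(service_state(resp.read().decode("utf-8")))
```

Building the request separately from sending it keeps the sketch usable without network access; the parsing helper can be tried against a saved JSON response.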

Run Spark

Log in to the Hadoop Linux cluster over SSH and run the following Linux command to download a document that the Spark program will use later.

wget http://en.wikipedia.org/wiki/Hortonworks

Copy the data into the Hadoop cluster's HDFS:

hadoop fs -put ~/Hortonworks /user/guest/Hortonworks

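Many Spark examples are written as Scala or Java applications. This example uses PySpark to demonstrate how to use Spark from the Python language. Start the PySpark shell: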

pyspark

The first step is to create an RDD using the SparkContext, available as sc:

myLines = sc.textFile('hdfs://sandbox.hortonworks.com/user/guest/Hortonworks')
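
The string passed to sc.textFile is an HDFS URI made of a scheme, a host, and a path. The path matches the destination used in the hadoop fs -put command, while the host sandbox.hortonworks.com is the one given in this article and will likely differ on another cluster. The sketch below, which uses only the Python standard library, splits the URI apart; with_host is a hypothetical helper and the replacement host is a placeholder.

```python
from urllib.parse import urlsplit

ARTICLE_URI = "hdfs://sandbox.hortonworks.com/user/guest/Hortonworks"


def describe_hdfs_uri(uri):
    """Split an hdfs:// URI into scheme, host, and path."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "host": parts.hostname, "path": parts.path}


def with_host(uri, new_host):
    """Return the same URI pointed at a different host (hypothetical helper)."""
    parts = urlsplit(uri)
    return parts._replace(netloc=new_host).geturl()


if __name__ == "__main__":
    print(describe_hdfs_uri(ARTICLE_URI))
    # Placeholder host: substitute the address of your own cluster.
    print(with_host(ARTICLE_URI, "namenode.example.invalid"))
```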

Now that the RDD has been instantiated, we apply a transformation to it, using a Python lambda expression as the filter.

myLines_filtered = myLines.filter( lambda x: len(x) > 0 )

Note that the Python statement above does not trigger any RDD execution. Only an action such as the count() call below triggers the actual RDD computation.

myLines_filtered.count()

The Spark job returns the following result:

341.
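
The difference between a transformation such as filter() and an action such as count() can be imitated in plain Python. The class below is an analogy only, not Spark's implementation: LazyLines and its predicate_calls counter are hypothetical names, and the sample data is made up. It records the filter and evaluates it only when count() is called.

```python
class LazyLines:
    """Tiny stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, lines, predicates=()):
        self._lines = list(lines)
        self._predicates = tuple(predicates)
        self.predicate_calls = 0  # how many times a predicate has been evaluated

    def filter(self, predicate):
        # Transformation: return a new object with one more recorded step.
        return LazyLines(self._lines, self._predicates + (predicate,))

    def count(self):
        # Action: only now are the recorded predicates applied to the data.
        total = 0
        for line in self._lines:
            keep = True
            for predicate in self._predicates:
                self.predicate_calls += 1
                if not predicate(line):
                    keep = False
                    break
            if keep:
                total += 1
        return total


if __name__ == "__main__":
    sample = LazyLines(["Hortonworks", "", "Hadoop", "", "Spark"])
    filtered = sample.filter(lambda x: len(x) > 0)
    print(filtered.predicate_calls)  # the filter has not been evaluated yet
    print(filtered.count())          # evaluation happens here
    print(filtered.predicate_calls)
```

Calling filter() leaves the counter at zero; the predicate runs once per line only when count() is invoked, which mirrors the behavior described for the PySpark session.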

Data Science with Spark

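For data scientists, Spark is a highly effective data processing tool. Data scientists often use notebook-style tools (such as iPython, http://ipython.org/notebook.html) to prototype quickly and share their work. Many data scientists prefer the R language, and the integration of Spark with R, SparkR, has become an emerging Spark capability. Apache Zeppelin (https://zeppelin.incubator.apache.org/) is an emerging tool that provides Spark-based notebook functionality with an easy-to-use interface for Spark.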

雪松

Microsoft MVP -- Windows Platform Development,

Hortonworks Certified Apache Hadoop 2.0 Developer
