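Tags: Spark technology, cloud computing, Spark tutorial, big data, Spark hot topics
"Winning in the Era of Cloud Computing and Big Data"
Spark Asia-Pacific Research Institute, 100-Session Public Lecture Series [Session 15: Interactive Q&A]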
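Q1: What is the relationship between AppClient, Worker, and Master?
In Standalone mode, AppClient is the application's representative on the client machine when SparkContext.runJob is invoked; it carries out functions such as registerApplication.
Once the application has registered, the Master sends a message to the client via Akka to start the Driver.
The Driver manages Tasks and controls the Executors on the Workers so that they work together.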
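The flow described above can be pictured with a small toy model. This is only a sketch: the Python classes below borrow the role names from the answer (Master, Worker, AppClient, Driver), but every method, ID format, and the round-robin placement are assumptions made for illustration and are not Spark's actual classes or Akka messages.

```python
class Worker:
    def __init__(self, name):
        self.name = name
        self.executors = []  # ids of applications that have an executor here

    def launch_executor(self, app_id):
        self.executors.append(app_id)
        return f"{app_id}@{self.name}"


class Master:
    def __init__(self, workers):
        self.workers = workers
        self.apps = {}  # app_id -> application name

    def register_application(self, app_name):
        # Hypothetical id scheme: app-0, app-1, ...
        app_id = f"app-{len(self.apps)}"
        self.apps[app_id] = app_name
        # Simplification: start one executor per worker for the new application.
        executors = [w.launch_executor(app_id) for w in self.workers]
        return app_id, executors  # the reply sent back to the client


class AppClient:
    """Client-side representative of the application."""

    def __init__(self, master, app_name):
        self.master = master
        self.app_name = app_name
        self.app_id = None
        self.executors = []

    def start(self):
        self.app_id, self.executors = self.master.register_application(self.app_name)


class Driver:
    """Assigns tasks to the executors obtained through the AppClient."""

    def __init__(self, client):
        self.client = client

    def run_job(self, tasks):
        execs = self.client.executors
        # Round-robin placement; a real scheduler also weighs locality and free cores.
        return {task: execs[i % len(execs)] for i, task in enumerate(tasks)}


master = Master([Worker("w1"), Worker("w2")])
client = AppClient(master, "demo")
client.start()
plan = Driver(client).run_job(["t0", "t1", "t2"])
```

Here `plan` maps each task name to the executor chosen for it, which is the "Driver coordinates Executors on Workers" part of the answer in miniature.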
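Q2: Is there a big difference between Spark's shuffle and Hadoop's shuffle?
Spark's shuffle is a shuffle in a fairly strict sense: along the lineage of RDD dependencies, a shuffle occurs when the elements of each partition of a parent RDD are handed to multiple partitions of the child RDD.
In Hadoop, shuffle is a comparatively loose concept: once the Mapper stage finishes and hands its data to the Reducers, a shuffle takes place, and the first of the Reducer's three phases is the shuffle.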
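To make the Spark definition concrete, here is a minimal sketch in plain Python (no Spark required) of what it means for each parent partition to feed several child partitions. The function names, the toy byte-sum partitioner, and the sample records are all assumptions chosen for illustration; Spark's own partitioners work differently in detail.

```python
from collections import defaultdict


def hash_partition(key, num_partitions):
    # Deterministic stand-in for a hash partitioner
    # (Python's built-in str hash is salted per process, so it is avoided here).
    return sum(key.encode("utf-8")) % num_partitions


def shuffle(parent_partitions, num_child_partitions):
    """Redistribute (key, value) records so equal keys land in the same child partition."""
    children = [defaultdict(list) for _ in range(num_child_partitions)]
    fan_out = []  # fan_out[i] = set of child partitions that parent partition i writes to
    for records in parent_partitions:
        targets = set()
        for key, value in records:
            idx = hash_partition(key, num_child_partitions)
            children[idx][key].append(value)
            targets.add(idx)
        fan_out.append(targets)
    return [dict(c) for c in children], fan_out


parents = [
    [("a", 1), ("b", 1)],  # parent partition 0
    [("a", 1), ("d", 1)],  # parent partition 1
]
children, fan_out = shuffle(parents, 2)
```

With this input, both parent partitions write to both child partitions, and all records for key "a" end up together in one child partition. That fan-out from every parent partition is what distinguishes a shuffle dependency from a one-to-one (narrow) dependency.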
Q3: How does Spark handle HA?
In Standalone mode, Worker nodes are highly available automatically; for Master HA, ZooKeeper is generally used:
Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected "leader" and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master's state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications – applications that were already running during Master failover are unaffected.
In YARN and Mesos modes, the cluster manager (YARN's ResourceManager, or the Mesos master) generally also relies on ZooKeeper for HA.
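For the Standalone case, the settings involved can be sketched as follows. The property names (spark.deploy.recoveryMode, spark.deploy.zookeeper.url, spark.deploy.zookeeper.dir) come from Spark's standalone documentation, where they are typically passed to each Master through SPARK_DAEMON_JAVA_OPTS in spark-env.sh. The helper functions and the host names master1, master2, zk1, zk2, and zk3 are hypothetical, used only to show the shape of the values.

```python
def master_url(hosts, port=7077):
    """Build a spark:// URL listing every Master so clients can locate the current leader."""
    return "spark://" + ",".join(f"{h}:{port}" for h in hosts)


def recovery_java_opts(zk_hosts, zk_dir="/spark"):
    """Build the JVM options string that enables ZooKeeper recovery mode on a Master."""
    props = {
        "spark.deploy.recoveryMode": "ZOOKEEPER",
        "spark.deploy.zookeeper.url": ",".join(zk_hosts),
        "spark.deploy.zookeeper.dir": zk_dir,
    }
    return " ".join(f"-D{k}={v}" for k, v in props.items())


# Hypothetical hosts, for illustration only.
url = master_url(["master1", "master2"])
opts = recovery_java_opts(["zk1:2181", "zk2:2181", "zk3:2181"])
```

An application would then be pointed at `url` as its master address, so that it can register with whichever Master currently holds leadership.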