Code version: Spark 2.2.0

This article walks through how an application runs, in three parts: (1) SparkConf creation; (2) SparkContext creation; (3) task execution.

Suppose we write a Scala word count program that counts the words in a file:

package com.spark.myapp

import org.apache.spark.{SparkContext, SparkConf}

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)
    sc.textFile("README.md").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}

After compiling and packaging the jar, submit it to the standalone cluster by running the following command from the Spark installation directory on the submitting machine:

spark-submit --class com.spark.myapp.WordCount --master spark://master:7077 /home/xx/myapps/wordcount.jar

For how spark-submit gets from this command to the application code, see the earlier article on the spark-submit execution process.

1. SparkConf creation

SparkConf holds the configuration parameters of a Spark application as key-value pairs. The class documentation says:

Configuration for a Spark application. Used to set various Spark parameters as key-value pairs. Most of the time, you would create a SparkConf object with `new SparkConf()`, which will load values from any `spark.*` Java system properties set in your application as well. In this case, parameters you set directly on the `SparkConf` object take priority over system properties.

In short, new SparkConf() reads the Spark-related parameters from the Java system properties, and you can then override what was loaded with SparkConf's set functions. The most common setters are:

(1) Set the master URL
def setMaster(master: String): SparkConf

(2) Set the application name shown in the Spark web UI
def setAppName(name: String): SparkConf

(3) Set the jars to distribute
def setJars(jars: Seq[String]): SparkConf

(4) Set an executor environment variable
def setExecutorEnv(variable: String, value: String): SparkConf

(5) Set the Spark home (installation directory)
def setSparkHome(home: String): SparkConf

2. SparkContext creation

SparkContext is the central object in Spark development: it is the intermediary between the upper-level application and the lower-level APIs. Its constructor takes the SparkConf described above.

Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before creating a new one.

Its key members are SparkEnv, schedulerBackend, taskScheduler and dagScheduler.

(1) Create SparkEnv

// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

(2) Create schedulerBackend and taskScheduler

// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's constructor
_taskScheduler.start()

createTaskScheduler returns the schedulerBackend and taskScheduler that match the master string passed in, much like a factory pattern:

master match {
  case "local" => ...
  case LOCAL_N_REGEX(threads) => ...
  case LOCAL_N_FAILURES_REGEX(threads, maxFailures) => ...
  case SPARK_REGEX(sparkUrl) => ...            // standalone enters this branch
  case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) => ...
  case masterUrl => ...                        // other cluster managers such as YARN or Mesos enter this branch
}

(3) Create dagScheduler

_dagScheduler = new DAGScheduler(this)
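Putting sections 1 and 2 together, here is a minimal sketch (not from the original post) of configuring a SparkConf with the setters listed above and creating a single SparkContext from it. The jar path reuses the example above; the executor environment variable and Spark home values are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object ContextSetupSketch {
  def main(args: Array[String]): Unit = {
    // new SparkConf() loads any spark.* system properties; the setters below override them.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")                 // master URL of the standalone cluster
      .setAppName("WordCount")                          // name shown in the Spark web UI
      .setJars(Seq("/home/xx/myapps/wordcount.jar"))    // jars shipped to the executors
      .setExecutorEnv("MY_ENV_KEY", "my-value")         // placeholder executor environment variable
      .setSparkHome("/opt/spark")                       // placeholder Spark installation directory

    // Creating the context runs the steps described above: SparkEnv, then
    // schedulerBackend/taskScheduler (chosen by the master string), then DAGScheduler.
    val sc = new SparkContext(conf)
    try {
      println(s"master=${sc.master}, appName=${sc.appName}")
    } finally {
      // Only one SparkContext may be active per JVM, so stop() it when done.
      sc.stop()
    }
  }
}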
3. Task execution

The foundation of Spark task execution is the RDD (resilient distributed dataset). Spark operators compute over RDDs and produce new RDDs, and the final result is written to the console, a file, memory, and so on.

sc.textFile("README.md").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)

Step 1: sc.textFile("README.md")

def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

hadoopFile returns a HadoopRDD (2 partitions by default); the map then produces a new MapPartitionsRDD, shown in the code below. For the details see the earlier article on Spark file read/write code analysis. An RDD can only be created in two ways: by reading input from a file system or database, or by transforming a parent RDD into a new RDD.

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

Step 2: flatMap(_.split(" "))

/**
 * Return a new RDD by first applying a function to all elements of this RDD,
 * and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

The lines are split on spaces and the words are flattened into one collection; the result is again a MapPartitionsRDD.

Step 3: map((_, 1)) attaches a count of 1 to each element and returns a MapPartitionsRDD. For a line "hello world", step 2 produces (hello, world) and step 3 produces ((hello, 1), (world, 1)).

Step 4: reduceByKey(_ + _)

This function is not defined on RDD itself but on PairRDDFunctions; the code relies on an implicit RDD conversion:

/**
 * Defines implicit functions that provide extra functionalities on RDDs of specific types.
 *
 * For example, [[RDD.rddToPairRDDFunctions]] converts an RDD into a [[PairRDDFunctions]] for
 * key-value-pair RDDs, and enabling extra functionalities such as `PairRDDFunctions.reduceByKey`.
 */
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  new PairRDDFunctions(rdd)
}

def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

Combining values by key requires repartitioning the data; the default partitioner is HashPartitioner.

Step 5: collect().foreach(println)

These two are described together because at this point we have reached the action operators (collect and foreach are both actions), whereas everything before was a transformation operator. Transformations are lazily evaluated; only when an action is reached is the computation actually triggered, and a job is submitted to the executors.

def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

Both call SparkContext's runJob; let us look at the implementation:

def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
  runJob(rdd, func, 0 until rdd.partitions.length)
}

The overload that is ultimately called:

/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
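As a rough illustration of the lazy-evaluation point above, here is a small sketch assuming an already created SparkContext named sc, as in the word count example: the chain of transformations only builds up the RDD lineage, and nothing runs until the action calls sc.runJob, which hands the job to the DAGScheduler.

// Nothing is computed while this lineage is being built up.
val counts = sc.textFile("README.md")   // HadoopRDD mapped to a MapPartitionsRDD
  .flatMap(_.split(" "))                // MapPartitionsRDD
  .map((_, 1))                          // MapPartitionsRDD of (word, 1) pairs
  .reduceByKey(_ + _)                   // shuffle step, HashPartitioner by default

// toDebugString prints the recursive dependencies, the same string runJob logs
// when spark.logLineage is enabled.
println(counts.toDebugString)

// Only the action triggers sc.runJob -> dagScheduler.runJob and sends tasks to the executors.
counts.collect().foreach(println)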