Apache Spark Source Code Walkthrough, Part 4: Real-Time Stream Processing with DStream

Last updated: 2014-07-07 Source: Internet

Reposting is welcome; please credit the source, 徽滬一郎.

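Spark Streaming can process streaming data in near real time. It uses a processing model that differs from typical stream processing systems, which gives it very high processing speed and, compared with Storm, higher throughput.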

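This article briefly analyzes Spark Streaming's processing model, how the Spark Streaming system is initialized, and the processing steps that follow once external data is received.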

System overview

Characteristics of streaming data

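Compared with an ordinary file-based data source (whose content is already fixed), streaming data has the following characteristics:

The data is constantly changing
The data cannot be rewound
The data keeps arriving without end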

DStream

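To sum up Spark Streaming's approach in one sentence: "persist the continuous data, discretize it, then process it in batches."

Let's look closely at the reasons for doing it this way.

Persistence: data received from the network is first stored temporarily, which makes it possible to replay events when processing fails.
Discretization: data keeps pouring in with no end in sight, like the line from a Stephen Chow comedy: "my admiration is like the waters of the Yellow River, flowing endlessly and unstoppable once released." Since the data can never be exhausted, it is partitioned by time. For example, with a one-minute interval, the data collected during each consecutive minute is stored together.
Batch processing: the persisted data is processed batch by batch, reusing the existing RDD model.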
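The discretization step can be illustrated with a small toy model in plain Scala. This is only a sketch of the idea, not Spark code: Record and discretize are hypothetical names introduced here, timestamps are assumed to be non-negative milliseconds, and batches are plain Scala collections rather than RDDs.

```scala
// Toy model only: Record and discretize are hypothetical names, not Spark APIs.
object Discretize {
  // A record tagged with the time (in milliseconds) at which it was received.
  final case class Record(timestampMs: Long, payload: String)

  // Group records into batches keyed by batch start time.
  // The batch starting at s covers the half-open range [s, s + intervalMs).
  // Assumes non-negative timestamps.
  def discretize(records: Seq[Record], intervalMs: Long): Map[Long, Seq[String]] = {
    require(intervalMs > 0, "intervalMs must be positive")
    records
      .groupBy(r => r.timestampMs / intervalMs * intervalMs) // round down to batch start
      .map { case (start, rs) => start -> rs.map(_.payload) }
  }
}
```

With a one-minute interval (60000 ms), records received at 100 ms and 59999 ms land in the batch that starts at 0, while a record received at 60000 ms opens the next batch. Each batch can then be processed as an ordinary finite data set.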

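DStream can be seen as another layer of wrapping around RDD. If you open DStream.scala and RDD.scala, you will find that almost every operation on RDD has a corresponding definition in DStream.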

Operations on a DStream fall into two categories:

Transformation
Output: emits the results. Currently supported operations are print, saveAsObjectFiles, saveAsTextFiles, and saveAsHadoopFiles.

DStreamGraph

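Where there is input there must be output; without output, all the preceding work is meaningless. So how are inputs and outputs bound together? That is the job of DStreamGraph, which records the input streams and the output streams.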

  private val inputStreams = new ArrayBuffer[InputDStream[_]]()
  private val outputStreams = new ArrayBuffer[DStream[_]]()
  var rememberDuration: Duration = null
  var checkpointInProgress = false

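Elements of outputStreams are added to the DStreamGraph automatically whenever an Output operation is applied to a DStream.

An important difference between an outputStream and an inputStream is that the outputStream overrides generateJob.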
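The relationship between output operations, the graph, and generateJob can be sketched with a toy model. Everything below (ToyGraph, ToyStream, ToyInput, ToyOutput, ToyJob, foreachBatch) is a hypothetical stand-in written for illustration; batches are plain Seq[String] values and times are plain Long values, so this is not how Spark's classes are actually laid out.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model only: none of these classes are Spark APIs.
object GraphSketch {
  // A unit of work for one batch time; nothing runs until run() is called.
  final case class ToyJob(time: Long, body: () => Unit) { def run(): Unit = body() }

  class ToyGraph {
    private val outputStreams = ArrayBuffer.empty[ToyStream]
    def addOutputStream(s: ToyStream): Unit = outputStreams += s
    // Ask every registered output stream for a job for this batch time.
    def generateJobs(time: Long): Seq[ToyJob] =
      outputStreams.toList.flatMap(_.generateJob(time))
  }

  abstract class ToyStream(val graph: ToyGraph) {
    // The data of the batch at `time`, or None if there is no such batch.
    def compute(time: Long): Option[Seq[String]]
    // Default job: walk over the batch and discard every element.
    def generateJob(time: Long): Option[ToyJob] =
      compute(time).map(batch => ToyJob(time, () => batch.foreach(_ => ())))
    // An output operation: wrap this stream and register the wrapper with the graph.
    def foreachBatch(f: Seq[String] => Unit): Unit =
      graph.addOutputStream(new ToyOutput(this, f))
  }

  class ToyInput(graph: ToyGraph, data: Map[Long, Seq[String]]) extends ToyStream(graph) {
    def compute(time: Long): Option[Seq[String]] = data.get(time)
  }

  // The output stream overrides generateJob so that the job applies the user function.
  class ToyOutput(parent: ToyStream, f: Seq[String] => Unit) extends ToyStream(parent.graph) {
    def compute(time: Long): Option[Seq[String]] = None
    override def generateJob(time: Long): Option[ToyJob] =
      parent.compute(time).map(batch => ToyJob(time, () => f(batch)))
  }
}
```

In this sketch a graph with an input stream but no output operation produces no jobs at all, which mirrors the point that input without output does nothing. Calling foreachBatch registers an output stream, after which generateJobs returns one job per batch time that has data.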

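Initialization flow

StreamingContext

StreamingContext is the entry point for initializing Spark Streaming. Its main job is to create a JobScheduler from the arguments passed in.

Setting up the InputStream

If the streaming data comes from a socket, use socketStream. If it comes from files that keep changing, use fileStream.

Submitting and running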

StreamingContext.start()

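Data processing

Take socketStream as an example, where the data comes from a socket.

SocketInputDStream starts a thread, and that thread uses the receive function to receive data.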

  def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      logInfo("Stopped receiving")
      restart("Retrying connecting to " + host + ":" + port)
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case t: Throwable =>
        restart("Error receiving data", t)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
  }

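The received data is stored first, and the store call eventually reaches functions in BlockManager.scala. How does the BlockManager get passed to StreamingContext? It comes in through SparkEnv; note the parameters of the StreamingContext constructor.

Processing timer

Storing the data is triggered by the socket. So what triggers the actual processing of the data that has been stored?

Recall that when initializing StreamingContext we specified a time parameter. That parameter is used to construct a recurring timer, and each time the timer fires, the generateJobs function is called.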

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds, longTime => eventActor ! GenerateJobs(new Time(longTime)), "JobGenerator")
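The effect of such a recurring timer can be sketched without threads or real clocks. The code below is a toy model: TimerSketch, firstTick, and fire are hypothetical names, the post callback stands in for sending a message to eventActor, and aligning the first tick to a multiple of the period is an assumption of this sketch rather than a statement about RecurringTimer's internals.

```scala
// Toy model only: a simulated timer that posts one GenerateJobs event per period.
object TimerSketch {
  sealed trait Event
  final case class GenerateJobs(timeMs: Long) extends Event

  // First tick strictly after nowMs, aligned to a multiple of periodMs (sketch assumption).
  def firstTick(nowMs: Long, periodMs: Long): Long = {
    require(nowMs >= 0 && periodMs > 0, "nowMs must be >= 0 and periodMs > 0")
    (nowMs / periodMs + 1) * periodMs
  }

  // Post `ticks` events, one per period; `post` stands in for `eventActor ! event`.
  def fire(nowMs: Long, periodMs: Long, ticks: Int)(post: Event => Unit): Unit = {
    val start = firstTick(nowMs, periodMs)
    for (i <- 0 until ticks) post(GenerateJobs(start + i * periodMs))
  }
}
```

Starting at 2500 ms with a 1000 ms batch duration, the sketch posts GenerateJobs events for 3000, 4000, and 5000 ms. Each event carries the batch time, which is what the event handler later passes to generateJobs.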

Event handler

  /** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time) => doCheckpoint(time)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

generateJobs

  private def generateJobs(time: Time) {
    SparkEnv.set(ssc.env)
    Try(graph.generateJobs(time)) match {
      case Success(jobs) =>
        val receivedBlockInfo = graph.getReceiverInputStreams.map { stream =>
          val streamId = stream.id
          val receivedBlockInfo = stream.getReceivedBlockInfo(time)
          (streamId, receivedBlockInfo)
        }.toMap
        jobScheduler.submitJobSet(JobSet(time, jobs, receivedBlockInfo))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
    }
    eventActor ! DoCheckpoint(time)
  }

Following the chain generateJobs -> generateJob eventually leads to Job.run, and job.run in turn calls sc.runJob. The detailed call path is not listed step by step here.

  private class JobHandler(job: Job) extends Runnable {
    def run() {
      eventActor ! JobStarted(job)
      job.run()
      eventActor ! JobCompleted(job)
    }
  }

DStream.generateJob defines jobFunc, which is the jobFunc used in job.run().

  private[streaming] def generateJob(time: Time): Option[Job] = {
    getOrCompute(time) match {
      case Some(rdd) => {
        val jobFunc = () => {
          val emptyFunc = { (iterator: Iterator[T]) => {} }
          context.sparkContext.runJob(rdd, emptyFunc)
        }
        Some(new Job(time, jobFunc))
      }
      case None => None
    }
  }

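Throughout this flow DStreamGraph plays a key role, much like the graph in Storm Trident.

During generateJob, the DStream produces the corresponding RDD by calling the compute function, and SparkContext then turns the RDD-based abstraction into multiple stages and executes them.

An important conversion in StreamingContext is from DStream to RDD, while an important conversion in SparkContext is from RDD to Stage and Task. In the two abstract classes involved (DStream and RDD), pay attention to how getOrCompute and compute are implemented.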
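The division of labour between getOrCompute and compute can be sketched as a per-batch-time cache wrapped around a subclass hook. This is a toy model under stated assumptions: ToyDStream and Doubled are hypothetical classes, batches are Seq[Int] instead of RDDs, and computeCalls exists only so the caching can be observed.

```scala
import scala.collection.mutable

// Toy model only: a cache keyed by batch time in front of a subclass-defined compute.
object GetOrComputeSketch {
  abstract class ToyDStream {
    // Batches that have already been generated, keyed by batch time.
    private val generated = mutable.HashMap.empty[Long, Seq[Int]]
    // Counts cache misses, for illustration only.
    var computeCalls = 0

    // Subclasses say how the batch for a given time is produced.
    protected def compute(time: Long): Option[Seq[Int]]

    // Return the cached batch if present; otherwise compute it and remember it.
    final def getOrCompute(time: Long): Option[Seq[Int]] =
      generated.get(time).orElse {
        computeCalls += 1
        val batch = compute(time)
        batch.foreach(b => generated.put(time, b))
        batch
      }
  }

  // A derived stream that doubles every element of its source batch.
  class Doubled(source: Map[Long, Seq[Int]]) extends ToyDStream {
    protected def compute(time: Long): Option[Seq[Int]] =
      source.get(time).map(_.map(_ * 2))
  }
}
```

Asking the same stream twice for the same batch time runs compute only once in this sketch, so several downstream consumers of one batch would share a single generated result.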

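Summary

This article was written in some haste and is not as detailed as it could be. I hope to flesh out the specific call paths when time permits.

Fault tolerance is not covered here; it will be discussed in a separate article once I have studied it thoroughly.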
