Spark DataFrame/Dataset: reduceByKey usage

If this were an RDD, we would often use reduceByKey to keep the record with the latest timestamp per key:

```scala
case class Record(ts: Long, id: Int, value: Int)

def findLatest(records: RDD[Record])(implicit spark: SparkSession) = {
  records.keyBy(_.id).reduceByKey {
    (x, y) => if (x.ts > y.ts) x else y
  }.values
}
```

With a DataFrame, the same result can be obtained like this:

```scala
import org.apache.spark.sql.functions._

val newDF = df.groupBy('id)
  .agg(max(struct('ts, 'value)) as 'tmp)
  .select($"id", $"tmp.*")
```

Why does this work? For struct (or tuple) types, max orders values by the first field, breaking ties with the following fields in turn.

A more detailed example:

```scala
import org.apache.spark.sql.functions._

val data = Seq(
  ("michael", 1, "event 1"),
  ("michael", 2, "event 2"),
  ("reynold", 1, "event 3"),
  ("reynold", 3, "event 4")
).toDF("user", "time", "event")

val newestEventPerUser = data
  .groupBy('user)
  .agg(max(struct('time, 'event)) as 'event)
  .select($"user", $"event.*") // Unnest the struct into top-level columns.
```

```
scala> newestEventPerUser.show()
+-------+----+-------+
|   user|time|  event|
+-------+----+-------+
|reynold|   3|event 4|
|michael|   2|event 2|
+-------+----+-------+
```

For something more complex, a typed Dataset with groupByKey/reduceGroups can be used:

```scala
case class AggregateResultModel(id: String,
                                mtype: String,
                                healthScore: Int,
                                mortality: Float,
                                reimbursement: Float)

// assume that rawScores has been loaded beforehand from json/csv files
val groupedResultSet = rawScores.as[AggregateResultModel]
  .groupByKey(item => (item.id, item.mtype))
  .reduceGroups((x, y) => getMinHealthScore(x, y))
  .map(_._2)

// the binary function used in reduceGroups
def getMinHealthScore(x: AggregateResultModel, y: AggregateResultModel): AggregateResultModel = {
  // complex logic for deciding which row to keep
  if (x.healthScore > y.healthScore) y
  else if (x.healthScore < y.healthScore) x
  else if (x.mortality < y.mortality) y
  else if (x.mortality > y.mortality) x
  else if (x.reimbursement < y.reimbursement) x
  else y
}
```
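The lexicographic ordering that makes max over a struct work is the same field-by-field comparison that plain Scala tuples use, so the behavior can be sketched without a Spark cluster. The snippet below (names and sample values are illustrative, not from the original post) first shows tuple ordering, then mirrors the keyBy/reduceByKey "keep the newest record" logic locally with groupBy and reduce:

```scala
// Tuples order lexicographically: max compares the first element,
// then breaks ties on the second, and so on.
val newest = Seq((1, "event 1"), (3, "event 4"), (2, "event 2")).max
// newest is (3, "event 4")

// A local analogue of keyBy(_.id).reduceByKey(keep-the-newer) on a collection:
case class Record(ts: Long, id: Int, value: Int)

val records = Seq(
  Record(ts = 1L, id = 1, value = 10),
  Record(ts = 3L, id = 1, value = 30), // newest for id 1
  Record(ts = 2L, id = 2, value = 20)  // newest for id 2
)

val latest: Map[Int, Record] = records
  .groupBy(_.id)
  .map { case (id, rs) => id -> rs.reduce((x, y) => if (x.ts > y.ts) x else y) }
// latest(1) is the ts = 3 record; latest(2) is the ts = 2 record
```

The same pairwise "keep the newer" function is what reduceByKey applies per key across partitions, which is why it must be associative and commutative.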


Ref: https://stackoverflow.com/questions/41236804/spark-dataframes-reducing-by-key
