Case class record (Ts:long, Id:int, value:int) If it is an RDD, we often use Reducebykey to get a record of the latest timestamp, using the following method, Def findlatest (Records:rdd [Record]) (implicit spark:sparksession) = {records.keyby (_.id). reducebykey{(x, y) and if (X.ts > Y.ts) x else y}.val UES} In the dataset you can use the method: Import org.apache.spark.sql.functions._ val newdf = df.groupby (' id). Agg.max (' ts, ' Val ') As ' tmp '). Select ($ "id", $ "tmp.*") why this can be done. Because for structs, or tuple types, the Max method defaults to sorting by the first element as an example of a detailed point: Import org.apache.spark.sql.functions._ val data = Seq ("Michae L ", 1," Event 1 "), (" Michael ", 2," Event 2 "), (" Reynold ", 1," Event 3 "), (" Reynold ", 3," Event 4 "). TODF (" User "," Ti
Me "," event ") Val newesteventperuser = data. GroupBy (' user '). AGG (max (struct (' Time, ' event) ' as ' event)
. Select ($ "User", $ "event.*")//Unnest the struct into top-level columns. Scala> newesteventperuser.show () +-------+----+-------+ | user|time|
Event| +-------+----+-------+ |reynold| 3|event 4| |michael|
2|event 2| +-------+----+-------+ complex can be consulted as follows: Case class Aggregateresultmodel (id:string, Mtype:
String, Healthscore:int, Mortality:float, Reimbursement:float)//Assume that the Rawscores is loaded Behorehand from Json,cs V files val groupedresultset = Rawscores.as[aggregateresultmodel].groupbykey (item = (Item.id,item.mtype)). Re Ducegroups (x, y) = Getminhealthscore (x, y)). Map (_._2)//The binary function used in the Reducegroups def Getminhealt Hscore (X:aggregateresultmodel, Y:aggregateresultmodel): Aggregateresultmodel = {//complex logic for deciding bet Ween which row to keep if (X.healthscore > Y.healthscore) {return y} else if (X.healthscore < Y.healthscor
E) {return x} else {if (X.mortality < y.mortality) {return y} else if (x.mortality > Y.mortality) {return x} else {if (X.reimbursement < y.reimbursement)
return x else return y}}}
Ref:https://stackoverflow.com/questions/41236804/spark-dataframes-reducing-by-key