package com.latrobe.spark

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by Spark on 15-1-18.
 * countApproxDistinct: a method for counting the distinct elements of an RDD.
 * The count is approximate; the parameter relativeSD controls the accuracy.
 * The smaller relativeSD is, the more accurate the result.
 */
object CountApproxDistinct {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("spark-demo").setMaster("local")
    val sc = new SparkContext(conf)

    // Build a collection of the numbers 1 to 10000
    // (the partition count is illegible in the original text)
    val a = sc.parallelize(1 to 10000)
    // Replicate RDD a 5 times, giving 50000 elements
    val b = a ++ a ++ a ++ a ++ a
    // No argument: the default relativeSD of 0.05 is used; the result is 9760
    println(b.countApproxDistinct())
    // The result is 9760
    println(b.countApproxDistinct(0.05))
    // 8224
    println(b.countApproxDistinct(0.1))
    // 10000
    println(b.countApproxDistinct(0.001))
  }
}
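Why does a smaller relativeSD give a more accurate count? countApproxDistinct is backed by a HyperLogLog-style sketch, and relativeSD determines how many registers the sketch allocates: with m = 2^p registers, the expected relative error of a HyperLogLog estimate is roughly 1.04 / sqrt(m), so a tighter relativeSD forces a larger (and more memory-hungry) sketch. The sketch below is a plain-Scala illustration of that trade-off under this published HyperLogLog error formula; it does not reproduce Spark's exact internal precision selection, and the names are my own.

object RelativeSDDemo {
  // Expected relative standard deviation of a HyperLogLog estimate
  // that uses m = 2^p registers (the classic 1.04 / sqrt(m) bound).
  def expectedError(p: Int): Double = 1.04 / math.sqrt(math.pow(2, p))

  // Smallest register exponent p whose expected error
  // is at or below the requested relativeSD.
  def precisionFor(relativeSD: Double): Int =
    (4 to 24).find(p => expectedError(p) <= relativeSD).get

  def main(args: Array[String]): Unit = {
    for (sd <- Seq(0.1, 0.05, 0.001)) {
      val p = precisionFor(sd)
      println(f"relativeSD=$sd -> p=$p, registers=${1 << p}, " +
        f"expected error=${expectedError(p)}%.4f")
    }
  }
}

Running this shows the cost of precision: relativeSD = 0.1 needs only 2^7 = 128 registers, 0.05 needs 2^9 = 512, while 0.001 needs 2^21 (about two million), which is why very small relativeSD values are expensive.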
Spark RDD countApproxDistinct