For the data source, see my blog post: http://www.cnblogs.com/wwxbi/p/6063613.html
import org.apache.spark.sql.DataFrameStatFunctions
import org.apache.spark.sql.functions._
Correlation coefficient
val df = Range(0, 10, step = 1).toDF("id").withColumn("rand1", rand(seed = 10)).withColumn("rand2", rand(seed = 27))
df: org.apache.spark.sql.DataFrame = [id: int, rand1: double ... 1 more field]

df.show
+---+-------------------+-------------------+
| id|              rand1|              rand2|
+---+-------------------+-------------------+
|  0|0.41371264720975787|  0.714105256846827|
|  1| 0.7311719281896606| 0.8143487574232506|
|  2| 0.9031701155118229| 0.5282207324381174|
|  3|0.09430205113458567| 0.4420100497826609|
|  4|0.38340505276222947| 0.9387162206758006|
|  5| 0.5569246135523511| 0.6398126862647711|
|  6| 0.4977441406613893| 0.9895498513115722|
|  7| 0.2076666106201438| 0.3398720242725498|
|  8| 0.9571919406508957|0.15042237695815963|
|  9| 0.7429395461204413| 0.7302723457066639|
+---+-------------------+-------------------+

df.stat.corr("rand1", "rand2", "pearson")
res24: Double = -0.10993962467082698
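For readers who want to see what the "pearson" method computes, here is a minimal plain-Scala sketch of the Pearson correlation coefficient. It is not Spark's implementation: the names PearsonSketch and pearson are made up for illustration, and the sketch assumes both inputs fit in memory.

```scala
// Minimal sketch (not Spark's code): Pearson correlation of two sequences.
object PearsonSketch {
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.length == ys.length,
      "inputs must be non-empty and of equal length")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Sum of products of deviations (covariance numerator)
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    // Sums of squared deviations for each input
    val ssX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val ssY = ys.map(y => (y - meanY) * (y - meanY)).sum
    // Yields NaN if either input is constant (zero variance)
    cov / math.sqrt(ssX * ssY)
  }
}
```

Collecting the rand1 and rand2 columns shown above and passing them to pearson should give a value close to the one returned by df.stat.corr, up to floating-point rounding.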
View the statistical distribution of data
val colArray = Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating")

// View the statistical distribution of the data
val descrDF = data.describe("age", "yearsmarried", "religiousness", "education", "occupation", "rating")
descrDF: org.apache.spark.sql.DataFrame = [summary: string, age: string ... 5 more fields]

descrDF.selectExpr("summary",
  "round(age,2) as age",
  "round(yearsmarried,2) as yearsmarried",
  "round(religiousness,2) as religiousness",
  "round(education,2) as education",
  "round(occupation,2) as occupation",
  "round(rating,2) as rating").show(truncate = false)
+-------+-----+------------+-------------+---------+----------+------+
|summary|age  |yearsmarried|religiousness|education|occupation|rating|
+-------+-----+------------+-------------+---------+----------+------+
|count  |601.0|601.0       |601.0        |601.0    |601.0     |601.0 |
|mean   |32.49|8.18        |3.12         |16.17    |4.19      |3.93  |
|stddev |9.29 |5.57        |1.17         |2.4      |1.82      |1.1   |
|min    |17.5 |0.13        |1.0          |9.0      |1.0       |1.0   |
|max    |57.0 |15.0        |5.0          |20.0     |7.0       |5.0   |
+-------+-----+------------+-------------+---------+----------+------+
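The selectExpr call above repeats the same round(...) expression for every column. As a sketch, the expression strings could be generated from the column array instead. RoundExprs and roundedExprs are hypothetical names, not part of Spark.

```scala
// Sketch: build the "round(col,2) as col" expression strings programmatically.
object RoundExprs {
  // Column names as used in the article
  val colArray: Array[String] =
    Array("age", "yearsmarried", "religiousness", "education", "occupation", "rating")

  // Keeps "summary" first, then one rounding expression per column
  def roundedExprs(cols: Array[String], digits: Int = 2): Array[String] =
    "summary" +: cols.map(c => s"round($c,$digits) as $c")
}

// Assumed usage in spark-shell, with descrDF defined as in the article:
//   descrDF.selectExpr(RoundExprs.roundedExprs(RoundExprs.colArray): _*).show(truncate = false)
```

This keeps the list of columns in one place, so adding a column to colArray is enough to include it in the rounded summary.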
Count the number of elements in each field
// Count the number of elements in each field
val fi = data.stat.freqItems(colArray)
fi: org.apache.spark.sql.DataFrame = [age_freqItems: array<double>, yearsmarried_freqItems: array<double> ... 4 more fields]

fi.printSchema()
root
 |-- age_freqItems: array (nullable = true)
 |    |-- element: double (containsNull = false)
 |-- yearsmarried_freqItems: array (nullable = true)
 |    |-- element: double (containsNull = false)
 |-- religiousness_freqItems: array (nullable = true)
 |    |-- element: double (containsNull = false)
 |-- education_freqItems: array (nullable = true)
 |    |-- element: double (containsNull = false)
 |-- occupation_freqItems: array (nullable = true)
 |    |-- element: double (containsNull = false)
 |-- rating_freqItems: array (nullable = true)
 |    |-- element: double (containsNull = false)

val f = fi.selectExpr(
  "size(age_freqItems)",
  "size(yearsmarried_freqItems)",
  "size(religiousness_freqItems)",
  "size(education_freqItems)",
  "size(occupation_freqItems)",
  "size(rating_freqItems)")
f: org.apache.spark.sql.DataFrame = [size(age_freqItems): int, size(yearsmarried_freqItems): int ... 4 more fields]

f.show(truncate = false)
+-------------------+----------------------------+-----------------------------+-------------------------+--------------------------+----------------------+
|size(age_freqItems)|size(yearsmarried_freqItems)|size(religiousness_freqItems)|size(education_freqItems)|size(occupation_freqItems)|size(rating_freqItems)|
+-------------------+----------------------------+-----------------------------+-------------------------+--------------------------+----------------------+
|9                  |8                           |5                            |7                        |7                         |5                     |
+-------------------+----------------------------+-----------------------------+-------------------------+--------------------------+----------------------+
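Note that Spark documents freqItems as returning frequent items, possibly with false positives, with a default support of 1%, so the sizes above are not guaranteed to be exact distinct counts. The sketch below shows an exact version of the same idea in plain Scala on an in-memory sequence. ExactFreqItems and frequentItems are illustrative names, and treating support as an inclusive minimum relative frequency is an assumption.

```scala
// Sketch (plain Scala, not Spark): exact "frequent items" of a sequence.
object ExactFreqItems {
  // Returns the values whose relative frequency is at least `support`
  // (the inclusive threshold is an assumption made for this sketch).
  def frequentItems[A](values: Seq[A], support: Double): Set[A] = {
    require(values.nonEmpty, "values must not be empty")
    val n = values.length.toDouble
    values
      .groupBy(identity) // value -> all of its occurrences
      .collect { case (value, occurrences) if occurrences.length / n >= support => value }
      .toSet
  }
}
```

With a low threshold such as 0.01, every value in a small sample qualifies, which is consistent with freqItems returning what looks like the full set of values for these low-cardinality columns. When exact results are needed in Spark, distinct or countDistinct are the usual alternatives.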
List the elements of each field
// List the elements of each field
val f1 = data.stat.freqItems(Array("age", "yearsmarried", "religiousness"))
f1: org.apache.spark.sql.DataFrame = [age_freqItems: array<double>, yearsmarried_freqItems: array<double> ... 1 more field]

f1.show(truncate = false)
+------------------------------------------------------+-----------------------------------------------+-------------------------+
|age_freqItems                                         |yearsmarried_freqItems                         |religiousness_freqItems  |
+------------------------------------------------------+-----------------------------------------------+-------------------------+
|[32.0, 47.0, 22.0, 52.0, 37.0, 17.5, 27.0, 57.0, 42.0]|[0.75, 0.125, 1.5, 0.417, 4.0, 7.0, 10.0, 15.0]|[2.0, 5.0, 4.0, 1.0, 3.0]|
+------------------------------------------------------+-----------------------------------------------+-------------------------+

// Sort the elements of each array
f1.selectExpr("sort_array(age_freqItems)", "sort_array(yearsmarried_freqItems)", "sort_array(religiousness_freqItems)").show(truncate = false)
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------+
|sort_array(age_freqItems, true)                       |sort_array(yearsmarried_freqItems, true)       |sort_array(religiousness_freqItems, true)|
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------+
|[17.5, 22.0, 27.0, 32.0, 37.0, 42.0, 47.0, 52.0, 57.0]|[0.125, 0.417, 0.75, 1.5, 4.0, 7.0, 10.0, 15.0]|[1.0, 2.0, 3.0, 4.0, 5.0]                |
+------------------------------------------------------+-----------------------------------------------+-----------------------------------------+

// List the elements of each field
val f2 = data.stat.freqItems(Array("education", "occupation", "rating"))
f2: org.apache.spark.sql.DataFrame = [education_freqItems: array<double>, occupation_freqItems: array<double> ... 1 more field]

f2.show(truncate = false)
+-----------------------------------------+-----------------------------------+-------------------------+
|education_freqItems                      |occupation_freqItems               |rating_freqItems         |
+-----------------------------------------+-----------------------------------+-------------------------+
|[17.0, 20.0, 14.0, 16.0, 9.0, 18.0, 12.0]|[2.0, 5.0, 4.0, 7.0, 1.0, 3.0, 6.0]|[2.0, 5.0, 4.0, 1.0, 3.0]|
+-----------------------------------------+-----------------------------------+-------------------------+

// Sort the elements of each array
f2.selectExpr("sort_array(education_freqItems)", "sort_array(occupation_freqItems)", "sort_array(rating_freqItems)").show(truncate = false)
+-----------------------------------------+--------------------------------------+----------------------------------+
|sort_array(education_freqItems, true)    |sort_array(occupation_freqItems, true)|sort_array(rating_freqItems, true)|
+-----------------------------------------+--------------------------------------+----------------------------------+
|[9.0, 12.0, 14.0, 16.0, 17.0, 18.0, 20.0]|[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   |[1.0, 2.0, 3.0, 4.0, 5.0]         |
+-----------------------------------------+--------------------------------------+----------------------------------+
Exploratory statistical analysis of data with Spark 2 DataFrameStatFunctions