Summary measures of data
The characteristics of a data distribution can be described in three ways: 1) the central tendency of the distribution, reflecting how closely the data cluster around a central value; 2) the dispersion of the distribution, reflecting how far the data spread away from that central value; 3) the shape of the distribution, reflecting its skewness and kurtosis.
Measures of central tendency
Categorical data: mode
The mode is the most frequently occurring value in a data set. It is mainly used to measure the central tendency of categorical data, and it can also be used for ordinal and numerical data.
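As a minimal sketch of finding the most frequent value (the mode), the snippet below counts category frequencies with the standard library. The function name `modes` and the sample data are hypothetical, chosen only for illustration; ties are returned together because a data set can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs most often (there may be ties)."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

# Hypothetical categorical sample: preferred drink of 10 customers.
drinks = ["tea", "coffee", "tea", "juice", "tea",
          "coffee", "water", "tea", "juice", "coffee"]

print(modes(drinks))  # "tea" occurs 4 times, more than any other category
```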
Ordinal data: median and quantiles
Median, quartiles, deciles, percentiles
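The position-based measures listed above can be sketched with Python's `statistics` module. The sample scores are hypothetical. Note that several conventions exist for interpolating quantiles; this sketch simply uses the default ("exclusive") method of `statistics.quantiles`, which may differ slightly from the formula in a given textbook.

```python
import statistics

# Hypothetical ordered scores (already numeric so they can be interpolated).
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]

median = statistics.median(scores)              # middle value of the sorted data
quartiles = statistics.quantiles(scores, n=4)   # 3 cut points: Q1, Q2, Q3
deciles = statistics.quantiles(scores, n=10)    # 9 cut points

print(median, quartiles, deciles)
```

Passing `n=100` in the same way gives the 99 percentile cut points.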
Numerical data: mean
The mean applies mainly to numerical data. Depending on the form of the data, it is calculated in one of two ways:
Simple mean and weighted mean
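A rough illustration of the two forms of the average: the simple version applies to raw values, the weighted version to grouped data where each value carries a frequency. The function names and the grouped sample are assumptions made for this sketch.

```python
def simple_mean(values):
    """Arithmetic mean of ungrouped values."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Mean of values weighted by their frequencies (or other weights)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical grouped data: class midpoints and their frequencies.
midpoints = [10, 20, 30]
freqs = [2, 5, 3]

print(simple_mean(midpoints))           # ignores the frequencies
print(weighted_mean(midpoints, freqs))  # accounts for the frequencies
```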
Comparison of the mode, median, and mean
For unimodal data, the relationship between the mode, median, and mean is as follows: if the distribution is symmetric, the three are equal; if it is skewed to the left, typically mean < median < mode; if it is skewed to the right, typically mode < median < mean.
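To make the comparison concrete, the sketch below computes the three measures for one symmetric sample and one sample with a long right tail. Both samples and the helper name `summary` are hypothetical. In the symmetric case the three values coincide; in the right-tailed case the extreme values pull the mean above the median.

```python
import statistics

def summary(values):
    """Return (mode, median, mean) for a numeric sample."""
    return (statistics.mode(values),
            statistics.median(values),
            statistics.mean(values))

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]      # mirror image around 3
right_skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14]  # long tail toward large values

print(summary(symmetric))
print(summary(right_skewed))
```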
Measures of dispersion
Categorical data: variation ratio
The variation ratio is the ratio of the frequency of the non-modal categories to the total frequency. It is mainly used to measure how well the mode represents a data set. The larger the variation ratio, the larger the share of non-modal frequencies and the less representative the mode; the smaller the variation ratio, the smaller that share and the more representative the mode.
It is suitable for measuring the dispersion of categorical data.
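A minimal sketch of this ratio (the share of observations that do not fall in the most frequent category) is shown below. The function name `variation_ratio` and both samples are hypothetical.

```python
from collections import Counter

def variation_ratio(values):
    """Share of observations outside the most frequent category."""
    counts = Counter(values)
    modal_freq = max(counts.values())
    total = len(values)
    return (total - modal_freq) / total

# Hypothetical samples: one spread over several categories, one dominated by a single category.
mixed = ["tea", "coffee", "tea", "juice", "tea",
         "coffee", "water", "tea", "juice", "coffee"]
dominated = ["tea"] * 9 + ["coffee"]

print(variation_ratio(mixed))      # larger: the top category is less representative
print(variation_ratio(dominated))  # smaller: the top category is more representative
```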
Ordinal data: interquartile range
The interquartile range reflects the dispersion of the middle 50% of the data. The smaller its value, the more concentrated the middle of the data; the larger its value, the more spread out the middle of the data.
It is mainly used to measure the dispersion of ordinal data.
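As an illustration, the spread of the middle 50% can be computed as the distance between the upper and lower quartiles. This sketch assumes the default quantile method of Python's `statistics.quantiles`; other quartile conventions give slightly different numbers. The helper name `iqr` and the two samples are hypothetical; both samples have the same minimum and maximum but differ in how tightly the middle values cluster.

```python
import statistics

def iqr(values):
    """Interquartile range: upper quartile minus lower quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

tight = [1, 4, 5, 5, 5, 5, 5, 6, 9]   # middle values packed around 5
spread = [1, 1, 2, 5, 5, 5, 8, 9, 9]  # middle values scattered

print(iqr(tight), iqr(spread))
```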
Numerical data: variance and standard deviation
Range: susceptible to extreme values.
Mean absolute deviation: comprehensively reflects the dispersion of a data set.
Variance: reflects the dispersion of the data well; the most widely used measure in practice.
Standard deviation: more meaningful in practice than the variance, because it is expressed in the same units as the data.
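The four measures in this list can be sketched as follows. The sample is hypothetical, and the text does not say whether the population or the sample form of the variance is meant, so both are shown: `statistics.pvariance` divides by n, while `statistics.variance` divides by n - 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical numerical sample
mean = statistics.mean(data)

data_range = max(data) - min(data)                    # depends only on the two extremes
mad = sum(abs(x - mean) for x in data) / len(data)    # mean absolute deviation from the mean
pop_var = statistics.pvariance(data)                  # population variance (divide by n)
pop_std = statistics.pstdev(data)                     # square root of the population variance
sample_var = statistics.variance(data)                # sample variance (divide by n - 1)

print(data_range, mad, pop_var, pop_std, sample_var)
```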
Relative dispersion: coefficient of variation.
The coefficient of variation (also called the dispersion coefficient) is the ratio of the standard deviation of a data set to its mean. The larger the coefficient of variation, the greater the dispersion of the data; the smaller the coefficient, the smaller the dispersion.
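A short sketch of this ratio follows. It assumes the population standard deviation and a nonzero mean; the helper name `coefficient_of_variation` and the data are hypothetical. Rescaling the data by a constant factor changes the standard deviation but leaves the ratio unchanged, which is why it is useful for comparing data sets measured on different scales.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (mean must be nonzero)."""
    return statistics.pstdev(values) / statistics.mean(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]
rescaled = [x * 10 for x in data]  # same data expressed in a smaller unit

print(statistics.pstdev(data), statistics.pstdev(rescaled))  # differ by a factor of 10
print(coefficient_of_variation(data), coefficient_of_variation(rescaled))
```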
Measures of skewness and kurtosis
Skewness and its measurement
Skewness measures the symmetry of the data distribution and is expressed by the skewness coefficient. A skewness coefficient of 0 indicates that the data distribution is symmetric.
A skewness coefficient not equal to 0 indicates that the data distribution is asymmetric. If the skewness coefficient is greater than 1 or less than -1, the distribution is called highly skewed; if it is between 0.5 and 1 or between -1 and -0.5, the distribution is considered moderately skewed.
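The text does not give a formula for the coefficient, so the sketch below assumes the simple moment-based definition (third central moment divided by the 1.5 power of the second central moment); textbooks often use a sample-adjusted variant that gives slightly different values. The function names, the samples, and the "low skew" label for values below 0.5 in absolute value are assumptions; the other two labels follow the thresholds stated above.

```python
def skewness(values):
    """Moment-based skewness coefficient: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def classify(coef):
    """Label a coefficient using the thresholds 0.5 and 1 from the text."""
    size = abs(coef)
    if size > 1:
        return "highly skewed"
    if size >= 0.5:
        return "moderately skewed"
    return "low skew"  # assumption: this range is not labeled in the text

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14]

for sample in (symmetric, right_skewed):
    coef = skewness(sample)
    print(round(coef, 3), classify(coef))
```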
Kurtosis and its measurement
Kurtosis is measured relative to the standard normal distribution. If a data set follows a standard normal distribution, the kurtosis coefficient equals 0; if the kurtosis coefficient is not equal to 0, the distribution is either flatter or more sharply peaked than the normal distribution.
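Because the coefficient is defined so that the normal distribution scores 0, it corresponds to what is usually called excess kurtosis. The sketch below assumes the moment-based form (fourth central moment divided by the squared second central moment, minus 3); sample-adjusted textbook formulas differ slightly. The function name and both samples are hypothetical. Under this definition a negative value indicates a flatter shape than the normal distribution and a positive value a sharper peak.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # evenly spread, no pronounced peak
peaked = [5, 5, 5, 5, 5, 5, 5, 1, 9]  # most values at the center, two far out

print(excess_kurtosis(flat))    # negative: flatter than normal
print(excess_kurtosis(peaked))  # positive: more peaked than normal
```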
Data mining--statistical analysis (III: Summary measures of data)