Data mining: statistical analysis (III: Summary measures of data)

Source: Internet
Author: User

Summary measures of data

The characteristics of a data distribution can be described in three ways: 1) central tendency, which reflects how closely the data cluster around a central value; 2) dispersion, which reflects how far the data spread away from that central value; 3) shape, which reflects the skewness and kurtosis of the distribution.

Measures of central tendency

Categorical data: Mode

The mode is the value that occurs most frequently in a set of data. It is mainly used to measure the central tendency of categorical data, but it can also be used for ordinal and numerical data.
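As an illustration, the most frequent value (the mode) can be found by counting occurrences. This is a minimal sketch; the function name `modes` and the sample data are made up for the example, and ties are returned together rather than resolved.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    # Counter keeps first-seen order, so tied values come back in that order.
    return [value for value, count in counts.items() if count == top]

colors = ["red", "blue", "red", "green", "blue", "red"]  # hypothetical categorical data
print(modes(colors))           # ['red']
print(modes([1, 1, 2, 2, 3]))  # [1, 2] (a tie: two modes)
```

Because the function only counts occurrences, it works on categorical labels as well as on ordinal or numerical values.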

Ordinal data: Median and quantiles

Median, quartiles, deciles, percentiles
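These position-based measures can be sketched with Python's standard `statistics` module. The ratings below are made-up data, and the `"inclusive"` interpolation method is an assumption; other quantile conventions give slightly different cut points.

```python
import statistics

# Hypothetical ordinal ratings on a 1-5 scale.
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5]

median = statistics.median(ratings)
# n=4 gives the three quartile cut points, n=10 the nine decile cut points,
# and n=100 the ninety-nine percentile cut points.
quartiles = statistics.quantiles(ratings, n=4, method="inclusive")
deciles = statistics.quantiles(ratings, n=10, method="inclusive")
percentiles = statistics.quantiles(ratings, n=100, method="inclusive")

print(median)     # 3
print(quartiles)  # [2.5, 3.0, 4.5]
```

The middle quartile coincides with the median, which is why the median is sometimes described as the second quartile.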

Numerical data: Mean

The mean is mainly applicable to numerical data. Depending on the form of the data, it is calculated in one of two ways:

Simple mean (every value counts equally) and weighted mean (each value counts in proportion to its weight)
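The two forms of the mean can be sketched as follows. The function names and the score data are hypothetical; the weights here are taken to be frequencies, which is one common choice.

```python
def simple_mean(values):
    """Arithmetic mean: every value counts equally."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)

scores = [60, 70, 80, 90]  # hypothetical group scores
counts = [1, 2, 4, 3]      # hypothetical number of people with each score

print(simple_mean(scores))            # 75.0
print(weighted_mean(scores, counts))  # 79.0
```

The weighted mean is higher here because more people sit at the higher scores; with equal weights the two results coincide.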

Comparison of the mode, median, and mean

For data with a unimodal distribution, the relationship between the mode, median, and mean is as follows: if the distribution is symmetric, the mode, median, and mean are equal. If the distribution is skewed, the three values generally differ, with the mean pulled furthest toward the long tail.
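The symmetric case can be checked on a small sample. Both samples below are made up; the second, right-skewed one is added only for contrast and shows the commonly observed ordering for such data, which is a tendency rather than a rule.

```python
import statistics

# Hypothetical symmetric, single-peaked sample.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
# Hypothetical right-skewed sample (long tail toward large values).
right_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 10]

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(name,
          "mode:", statistics.mode(data),
          "median:", statistics.median(data),
          "mean:", round(statistics.mean(data), 2))
```

For the symmetric sample all three values are 3. For the right-skewed sample the mode is 1, the median is 2, and the mean is about 3.33.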

Measures of dispersion

Categorical data: Variation ratio

The variation ratio is the ratio of the frequency of the non-modal categories to the total frequency. It is mainly used to measure how well the mode represents a set of data. The larger the variation ratio, the larger the share of non-modal observations and the less representative the mode; the smaller the variation ratio, the smaller that share and the more representative the mode.

It is suitable for measuring the dispersion of categorical data.
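The ratio described above (the share of observations that do not fall in the most frequent category) can be sketched as follows. The function name `variation_ratio` and the sample data are assumptions for illustration.

```python
from collections import Counter

def variation_ratio(values):
    """Share of observations that do not belong to the most frequent category."""
    counts = Counter(values)
    modal_frequency = max(counts.values())
    return (len(values) - modal_frequency) / len(values)

# Hypothetical survey answers: 6 red, 3 blue, 1 green.
answers = ["red"] * 6 + ["blue"] * 3 + ["green"]
print(variation_ratio(answers))  # 0.4
```

A result of 0.4 means that 40% of the answers differ from the most frequent one. A sample in which every answer is the same would give 0.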

Ordinal data: Interquartile range

The interquartile range is the difference between the upper and lower quartiles and reflects the spread of the middle 50% of the data. The smaller the value, the more concentrated the middle 50% of the data; the larger the value, the more dispersed it is.

It is mainly used to measure the dispersion of ordinal data.
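A minimal sketch of this measure, computed as the upper quartile minus the lower quartile. The function name, the sample data, and the `"inclusive"` quantile method are assumptions; other quartile conventions can give slightly different values.

```python
import statistics

def interquartile_range(values):
    """Upper quartile minus lower quartile: the spread of the middle 50%."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Two hypothetical samples with the same minimum, maximum, and median.
concentrated = [1, 3, 3, 3, 3, 3, 3, 3, 5]
spread_out = [1, 1, 2, 2, 3, 4, 4, 5, 5]

print(interquartile_range(concentrated))  # 0.0
print(interquartile_range(spread_out))    # 2.0
```

Both samples run from 1 to 5, yet the measure separates them because it ignores the extremes and looks only at the middle half.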

Numerical data: Variance and standard deviation

Range: susceptible to extreme values.

Mean deviation: comprehensively reflects the dispersion of a set of data.

Variance: reflects the dispersion of the data better and is the most widely used measure in practice.

Standard deviation: more practical to interpret than the variance, because it is expressed in the same units as the data.
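The four measures listed above can be compared on one made-up sample. This is a sketch under stated assumptions: the function names are hypothetical, and because the text does not say whether the variance divides by n or by n - 1, the `sample` flag exposes both choices.

```python
import math

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def mean_deviation(values):
    """Average absolute distance from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

def variance(values, sample=True):
    """Average squared distance from the mean (n - 1 divisor if sample=True)."""
    mean = sum(values) / len(values)
    divisor = len(values) - 1 if sample else len(values)
    return sum((x - mean) ** 2 for x in values) / divisor

def std_dev(values, sample=True):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values, sample))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical numerical data, mean 5

print(value_range(data))             # 7
print(mean_deviation(data))          # 1.5
print(variance(data, sample=False))  # 4.0
print(std_dev(data, sample=False))   # 2.0
```

Replacing the 9 with 90 would change the range from 7 to 88, which illustrates why the range is described as susceptible to extreme values.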

Relative dispersion: Coefficient of variation

The coefficient of variation (also called the coefficient of dispersion) is the ratio of the standard deviation of a set of data to its mean. The larger the coefficient, the more dispersed the data; the smaller the coefficient, the less dispersed the data.
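Because this ratio has no units, it can compare variables measured on different scales. A minimal sketch, assuming the sample standard deviation and a nonzero mean; the function name and both data sets are made up.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (mean must not be zero)."""
    return statistics.stdev(values) / statistics.mean(values)

heights_cm = [160, 165, 170, 175, 180]  # hypothetical heights
weights_kg = [50, 60, 70, 80, 90]       # hypothetical weights

print(round(coefficient_of_variation(heights_cm), 4))
print(round(coefficient_of_variation(weights_kg), 4))
```

In this example the weights have the larger coefficient, so they are relatively more dispersed than the heights even though the two variables use different units.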

Measures of skewness and kurtosis

Skewness and its measurement

Skewness measures the symmetry of a data distribution and is expressed by the skewness coefficient. A skewness coefficient of 0 indicates that the distribution is symmetric.

A skewness coefficient not equal to 0 indicates that the distribution is asymmetric. If the coefficient is greater than 1 or less than -1, the distribution is called highly skewed; if it is between 0.5 and 1 or between -1 and -0.5, the distribution is considered moderately skewed. The closer the coefficient is to 0, the lower the degree of skewness.
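The text does not give a formula for the coefficient, so the sketch below assumes the common moment-based definition (third central moment divided by the 1.5 power of the second central moment). The function names and data are hypothetical, and the label for values between -0.5 and 0.5 is an assumption, since the text does not name that range.

```python
def skewness(values):
    """Moment-based skewness coefficient (an assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

def describe_skew(coefficient):
    """Apply the thresholds from the text to a skewness coefficient."""
    if coefficient == 0:
        return "symmetric"
    if abs(coefficient) > 1:
        return "highly skewed"
    if abs(coefficient) >= 0.5:
        return "moderately skewed"
    return "slightly skewed"  # label assumed; the text gives no name for this range

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 10]

print(describe_skew(skewness(symmetric)))     # symmetric
print(describe_skew(skewness(right_skewed)))  # highly skewed
```

A positive coefficient corresponds to a long tail toward large values and a negative one to a long tail toward small values. On real data an exact 0 is rare, so a tolerance around 0 is usually more practical than the strict equality used here.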

Kurtosis and its measurement

Kurtosis is measured relative to the normal distribution. If a set of data follows a normal distribution, the kurtosis coefficient equals 0; if the kurtosis coefficient is not equal to 0, the distribution is either flatter (coefficient less than 0) or more sharply peaked (coefficient greater than 0) than the normal distribution.
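A coefficient that equals 0 for normally distributed data corresponds to what is usually called excess kurtosis. The sketch below assumes the moment-based definition (fourth central moment divided by the squared second central moment, minus 3); the function name and both samples are made up.

```python
def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical evenly spread sample: flatter than a normal distribution.
flat = [1, 2, 3, 4, 5]
# Hypothetical sample with most values at the center and two far outliers.
peaked = [3, 3, 3, 3, 3, 3, 3, 3, -2, 8]

print(round(excess_kurtosis(flat), 2))    # -1.3
print(round(excess_kurtosis(peaked), 2))  # 2.0
```

Small samples give noisy estimates, so the sign of the coefficient is best read as a rough indication of shape.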
