Website Data Analysis: How to Measure the Degree of Data Dispersion

Source: Internet
Author: User
Keywords Website Data analysis


We usually use statistics such as the mean, median and mode to describe the central tendency of a dataset, but these statistics do not fully capture its characteristics: datasets with the same mean can be distributed in infinitely many ways. We therefore also need to look at how dispersed the data is. The statistics commonly used to measure dispersion are as follows:

Range

The range is the difference between the maximum value and the minimum value in the dataset:

$$R = x_{\max} - x_{\min}$$

The range is simple to calculate and reflects the dispersion of a dataset to some extent. However, because the maximum and minimum are both extreme values and none of the values in between are considered, the range is easily distorted by outliers and often fails to reflect the true dispersion of the data.
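The sensitivity of the range to a single anomalous value can be seen in a minimal Python sketch. The function name `value_range` and the daily visit counts are hypothetical and chosen only for illustration:

```python
def value_range(data):
    """Return the range: the maximum value minus the minimum value."""
    return max(data) - min(data)


# Hypothetical daily visit counts for a small website.
daily_visits = [120, 132, 128, 125, 130]
# The same data plus one anomalous day (for example, a traffic spike).
with_outlier = daily_visits + [900]

print(value_range(daily_visits))   # 12
print(value_range(with_outlier))   # 780
```

One unusual day moves the range from 12 to 780, even though the other five values are unchanged.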

Interquartile range (IQR)

We usually use a box plot to represent the distribution of a dataset:

[Figure: box plot of a dataset]

The top and bottom edges of the box are usually the upper quartile (75%, Q3) and the lower quartile (25%, Q1) of the dataset, and the horizontal line inside the box is the median (50%, Q2). The interquartile range is Q3 minus Q1:

$$\mathrm{IQR} = Q_3 - Q_1$$

In other words, if you sort the dataset in ascending order, the IQR is the value at the 3/4 position minus the value at the 1/4 position. Unlike the range, the IQR is not distorted by unusually large or small values in the dataset. However, it is still just the difference between two values and ignores all the others, so it does not fully represent the overall dispersion of the dataset.
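As a rough illustration, the IQR can be computed with the `statistics.quantiles` function from the Python standard library (Python 3.8+). The helper name `iqr` and the data are hypothetical. Quartile conventions differ between tools; this sketch assumes the "inclusive" interpolation method, so other software may return slightly different quartiles for the same data:

```python
import statistics


def iqr(data):
    """Return Q3 - Q1, using the 'inclusive' quartile method."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1


# Hypothetical daily visit counts, with and without one anomalous day.
daily_visits = [120, 132, 128, 125, 130]
with_outlier = daily_visits + [900]

print(iqr(daily_visits))   # 5.0
print(iqr(with_outlier))   # 5.75
```

The anomalous value barely changes the IQR, whereas it would dominate the range of the same data.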

Variance

Variance uses the mean as the reference point and takes into account the deviation of every value in the dataset from the mean. Each deviation is squared before averaging, so that positive and negative deviations do not cancel each other out. For N values x_1, ..., x_N with mean \mu:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Variance is the most common statistic used to measure data dispersion.
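A minimal sketch of the calculation, assuming the population form of the variance (dividing by N). The function name `population_variance` and the data are hypothetical. The standard library provides `statistics.pvariance` for this form, and `statistics.variance` for the sample form, which divides by N - 1 instead:

```python
import statistics


def population_variance(data):
    """Average of the squared deviations from the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean is 5

print(population_variance(values))   # 4.0
# The library function gives the same result for the population form.
print(population_variance(values) == statistics.pvariance(values))  # True
```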

Standard deviation

Because variance averages the squared deviations, its units are the square of the units of the data. To get a statistic on the same scale as the data, we use the standard deviation, which is the square root of the variance. For N values x_1, ..., x_N with mean \mu:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

Based on the mean and the standard deviation, we can roughly describe the center of the dataset and how much the values fluctuate around that center, and, for a normally distributed population, calculate confidence intervals.
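The sketch below computes the standard deviation as the square root of the population variance, then derives the interval mean ± 1.96 standard deviations. For normally distributed data, about 95% of the values fall inside that interval. The data and the variable names are hypothetical, and the interval is only meaningful under the normality assumption; it describes the spread of individual values, not a confidence interval for the mean itself:

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

mean = statistics.fmean(values)
std_dev = math.sqrt(statistics.pvariance(values))  # same as statistics.pstdev(values)

# Assumption: the data is roughly normal, so about 95% of values
# lie within 1.96 standard deviations of the mean.
z = 1.96
lower = mean - z * std_dev
upper = mean + z * std_dev

print(f"mean={mean}, std_dev={std_dev}")
print(f"about 95% of values expected in [{lower:.2f}, {upper:.2f}]")
```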

Mean deviation (mean absolute deviation)

Variance removes the sign of each deviation by squaring it; the mean deviation removes the sign by taking the absolute value instead. Either the mean or the median can serve as the reference point. Here the mean \mu is used, for N values x_1, ..., x_N:

$$MD = \frac{1}{N}\sum_{i=1}^{N}\lvert x_i - \mu \rvert$$

Compared with the standard deviation, the mean deviation is less affected by extreme values, because the standard deviation squares each deviation and therefore amplifies the large ones. On the other hand, taking an absolute value is in effect a conditional judgment rather than a direct arithmetic operation, so the standard deviation is more straightforward to calculate.
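A small sketch comparing the two statistics on the same data, before and after adding one extreme value. The function name `mean_absolute_deviation`, its optional `center` parameter and the data are hypothetical:

```python
import statistics


def mean_absolute_deviation(data, center=None):
    """Average absolute deviation from `center` (the mean by default)."""
    if center is None:
        center = statistics.fmean(data)
    return sum(abs(x - center) for x in data) / len(data)


values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
with_outlier = values + [50]        # the same data plus one extreme value

print(mean_absolute_deviation(values))   # 1.5
print(statistics.pstdev(values))         # 2.0

# How many times larger each statistic becomes once the outlier is added.
md_growth = mean_absolute_deviation(with_outlier) / mean_absolute_deviation(values)
sd_growth = statistics.pstdev(with_outlier) / statistics.pstdev(values)
print(md_growth < sd_growth)             # True
```

Passing `center=statistics.median(values)` uses the median as the reference point instead of the mean.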

Coefficient of variation (CV)

Variance, standard deviation and mean deviation are all absolute measures, so they cannot escape the influence of the scale and units of the data. They usually need to be read together with the mean or median to evaluate the dispersion of a dataset effectively. For example, a standard deviation of 10 may indicate little fluctuation in a dataset whose values are large, but very large fluctuation in a dataset whose values are small.

The coefficient of variation corrects this shortcoming by using a relative measure, the standard deviation \sigma divided by the mean \mu, to reflect the variation or dispersion of the dataset:

$$CV = \frac{\sigma}{\mu}$$

The advantage of the coefficient of variation is that, being dimensionless, it lets us compare the dispersion of datasets measured in different units. Its drawbacks are equally clear: it says nothing about the absolute level of the values, and it cannot be used for datasets whose mean is 0.
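The following sketch shows two datasets with the same standard deviation but very different coefficients of variation. The function name `coefficient_of_variation` and the data are hypothetical, and raising an error for a zero mean is one possible design choice, not the only one:

```python
import statistics


def coefficient_of_variation(data):
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(data)
    if mean == 0:
        # The ratio is undefined when the mean is 0.
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return statistics.pstdev(data) / mean


small_scale = [2, 4, 4, 4, 5, 5, 7, 9]          # mean 5
large_scale = [x + 1000 for x in small_scale]   # mean 1005, same spread

# Both datasets have the same standard deviation.
print(statistics.pstdev(small_scale), statistics.pstdev(large_scale))

# Relative to their means, the dispersion is very different.
print(f"{coefficient_of_variation(small_scale):.4f}")
print(f"{coefficient_of_variation(large_scale):.4f}")
```

Shifting every value by 1000 leaves the standard deviation unchanged but shrinks the coefficient of variation, which matches the point that the same absolute spread matters less when the values themselves are large.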

This article is just a summary of statistical basics that can be found in many sources. Many statistics books cover these measures in a "descriptive statistics" chapter, listing them alongside the mean and median, and rarely organize them by practical application. Some foreign books, by contrast, introduce the concepts from the point of view of practical application. I recommend the book "simple and simple Statistics": it also covers basic statistics, but it is more readable and accessible than some domestic textbooks, makes it easier to build a useful mental index of the material, and so makes the knowledge easier to apply in practice.
