Data Analysis, Part 2: Data Feature Analysis (Statistical Measures)

Last Update:2018-08-17 Source: Internet

Author: User



Successful data analysis starts with grasping the nature of the data as a whole. Descriptive statistics are used to examine the characteristics of the data, mainly its central tendency, its dispersion, and the shape of its distribution. These reveal important properties of the whole data set and serve as a valuable reference for subsequent analysis.

One, basic statistics

The basic statistics used to describe data fall into three categories: statistics of central tendency, statistics of dispersion, and statistics of distribution shape.

1, central tendency statistics

A central tendency statistic represents location: intuitively, given an attribute, where do most of its values fall?

(1) Mean

The mean, also called the arithmetic mean, describes the average position of the data. Mathematical expression: mean = ∑x/n.

In some cases, each value xi in a set of data is associated with a weight wi that reflects the importance or frequency of occurrence of that value. The result is the weighted mean: weighted mean = ∑(w·x)/∑w.

Although the mean is the most useful statistic for describing the center of a data set, it is not always the best measure of the center, because it is sensitive to extreme values (outliers). To counteract the effect of a few extreme values, we can use the trimmed mean (also called the truncated mean), which is the mean computed after dropping the extreme values.

(2) Median

For skewed (asymmetric) data, the median is a better measure of the center. The median is the middle value of the ordered data; it is not affected by extreme values and represents the typical level of the data. To find it, sort the values from smallest to largest: if the count is odd, take the middle value; if the count is even, take the average of the two middle values.

(3) Mode

The mode is the most frequently occurring value of a variable and is usually used for qualitative data. For example, for the variable user status (normal, suspended for non-payment, suspended on request, disconnected, account closed), a mode of "normal" indicates the expected situation.

2, statistics of dispersion

The main statistics for measuring the dispersion of data are the standard deviation and the interquartile range.

(1) Standard deviation (or variance)

The standard deviation measures how spread out the data are. A low standard deviation means that the observations tend to lie close to the mean; a high standard deviation means that the data are spread over a wide range of values.
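As a small illustration, the sketch below compares two made-up vectors (`tight` and `wide` are hypothetical names and values, not data from this article) that share the same mean but differ in spread. Note that R's sd() uses the sample formula with n - 1 in the denominator.

```r
# Two illustrative vectors with the same mean (50) but different spread.
tight <- c(48, 49, 50, 51, 52)
wide  <- c(10, 30, 50, 70, 90)

# Same center ...
c(mean_tight = mean(tight), mean_wide = mean(wide))

# ... but very different dispersion. sd() divides by n - 1.
sd_tight <- sd(tight)   # sqrt(10 / 4)
sd_wide  <- sd(wide)    # sqrt(4000 / 4)
c(sd_tight = sd_tight, sd_wide = sd_wide)
```

The vector with the smaller standard deviation is the one whose values cluster near the mean.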

(2) Range and interquartile range

The range is the difference between the maximum and minimum values in a set of data: range = max - min.

Percentiles divide the data, sorted from smallest to largest, into 100 equal parts. The median is the value in the middle of the data (the 50th percentile). The first quartile, written Q1, is the 25th percentile; the third quartile, written Q3, is the 75th percentile.

The interquartile range is IQR = Q3 - Q1, the distance between the first and third quartiles. It gives the range covered by the middle half of the data and is a simple measure of dispersion.
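The quartiles and the interquartile range can be sketched in R with quantile() and IQR(). The vector `x` below is made up for illustration; quantile() supports several interpolation rules, and this example relies on R's default (type 7).

```r
# Illustrative data: the integers 1 to 9.
x <- 1:9

# Q1, median and Q3 using R's default quantile rule (type 7).
q <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Interquartile range, computed by hand and with the built-in IQR().
iqr_manual  <- unname(q["75%"] - q["25%"])
iqr_builtin <- IQR(x)

# Range as defined above: max - min.
value_range <- max(x) - min(x)

c(Q1 = unname(q["25%"]), Q3 = unname(q["75%"]),
  IQR = iqr_manual, range = value_range)
```

Note that R's own range() function returns the minimum and maximum as a pair rather than their difference.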

3, statistics of distribution shape

The shape of a distribution is measured using the skewness and kurtosis coefficients.

Skewness is a statistic that measures the symmetry of the data distribution: it tells us the degree and direction of asymmetry.

For a normal distribution (or any strictly symmetric distribution), the skewness is equal to 0.
If the skewness is negative, the data to the left of the mean are more dispersed than the data to the right.
If the skewness is positive, the data to the left of the mean are less dispersed than the data to the right.

Kurtosis is a statistic that measures how steep or flat a distribution is; by measuring the kurtosis, we can determine whether the distribution of the data is steeper or flatter than the normal distribution.

The kurtosis of the normal distribution is 3.
When the peak of the data's distribution curve is higher than that of the normal distribution, the kurtosis is greater than 3.
When it is lower than that of the normal distribution, the kurtosis is less than 3.

(1) Coefficient of skewness

The skewness coefficient reflects how far the center of the data distribution is shifted. It is written SK and can be computed as SK = (mean - median)/standard deviation. It is a characteristic number describing how far the distribution departs from symmetry.

The skewness of the normal distribution is 0. Skewness < 0 means the distribution is negatively skewed (left-skewed): fewer observations lie to the left of the mean than to the right, and a long tail trails off to the left, indicating extreme values on the low side. Skewness > 0 means the distribution is positively skewed (right-skewed). A skewness close to 0 can be regarded as indicating a symmetric distribution. Because a clearly non-zero skewness signals a departure from the normal distribution, skewness can be used when checking the normality of a distribution. The larger the absolute value of the skewness, the more skewed the distribution.
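The sign behaviour of SK = (mean - median)/standard deviation can be sketched with a few made-up vectors. The helper name `pearson_sk` and the data are assumptions for illustration only.

```r
# Hypothetical helper: SK = (mean - median) / standard deviation.
pearson_sk <- function(x) (mean(x) - median(x)) / sd(x)

# A long right tail pulls the mean above the median.
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)

# Mirroring the data moves the long tail to the left.
left_skewed <- -right_skewed

# Symmetric data: mean and median coincide.
symmetric <- c(1, 2, 3, 4, 5)

sk_right <- pearson_sk(right_skewed)   # positive
sk_left  <- pearson_sk(left_skewed)    # negative, same magnitude
sk_sym   <- pearson_sk(symmetric)      # zero
c(right = sk_right, left = sk_left, symmetric = sk_sym)
```

Mirroring the data flips the sign of SK while leaving its magnitude unchanged, which matches the left-skewed and right-skewed cases described above.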

(2) Kurtosis coefficient

The kurtosis coefficient measures how concentrated the data are around the center. It is written K and describes how steep or flat the distribution of the values is compared with the normal distribution.

For example, the kurtosis of a normal distribution is 3. A kurtosis coefficient K > 3 indicates a sharper peak and heavier tails than the normal distribution, so extreme values are more likely; K < 3 indicates a flatter peak and lighter tails than the normal distribution.

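The kurtosis coefficient can be computed from the fourth standardized moment: K = ∑((x - mean)/s)^4 / n, where s is the standard deviation and n is the number of observations.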
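As a sketch, assuming the moment-based definition is the one intended (fourth central moment divided by the squared second central moment, both with n in the denominator), kurtosis can be computed as follows. The helper name `moment_kurtosis` and both vectors are made up for illustration.

```r
# Hypothetical helper: moment-based kurtosis, m4 / m2^2.
# Both moments divide by n, so this differs slightly from
# formulas built on sd(), which divides by n - 1.
moment_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

# Flat, evenly spread data with no tails.
flat <- 1:20

# Mostly identical values plus two extreme observations.
heavy <- c(rep(0, 18), -10, 10)

k_flat  <- moment_kurtosis(flat)    # below 3
k_heavy <- moment_kurtosis(heavy)   # well above 3
c(flat = k_flat, heavy = k_heavy)
```

The evenly spread data give a value below 3, while the data with a few extreme observations give a value far above 3.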

For example, this article uses the Arthritis dataset in the vcd package to demonstrate how to compute these statistics:

head(Arthritis)
  ID Treatment  Sex Age Improved
1  …   Treated Male   …     Some
2  …   Treated Male   …     None
3  …   Treated Male   …     None
4  …   Treated Male   …   Marked
5  …   Treated Male   …   Marked
6  …   Treated Male   …   Marked

The variables Improved and Sex are factors; ID and Age are numeric.

Two, measures of central tendency

Central tendency is measured by the mean, the median, and the mode.

1, Mean

The mean is the average of all the data; the mean() function calculates the mean of a vector:

age.mean <- mean(Arthritis$Age)

Sometimes, to reflect the different importance of the components of the mean, each element xi in the data is given a weight wi, which yields a weighted average. Use weighted.mean(x, w) to calculate it.

weighted.mean(x, w)

x is the data vector and w is the weight vector; each element of x corresponds to one weight in w.

Setting the weights according to sex, with a weight of 0.95 for the ages of males and 1.05 for the ages of females, gives the weighted average:

age.wt <- ifelse(Arthritis$Sex == "Male", 0.95, 1.05)
age.wt.mean <- weighted.mean(Arthritis$Age, age.wt)
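To make the calculation concrete, the sketch below applies the same weighting scheme to a small made-up sample (the `age` and `sex` vectors are illustrative, not taken from the Arthritis data) and checks weighted.mean() against the formula sum(w * x) / sum(w).

```r
# Illustrative sample: four people with made-up ages.
age <- c(30, 40, 50, 60)
sex <- c("Male", "Female", "Male", "Female")

# Same weighting rule as above: 0.95 for males, 1.05 for females.
w <- ifelse(sex == "Male", 0.95, 1.05)

# Built-in weighted mean and the explicit formula.
by_function <- weighted.mean(age, w)
by_hand     <- sum(w * age) / sum(w)

c(by_function = by_function, by_hand = by_hand)
```

Both values agree because weighted.mean() normalises by the sum of the weights, not by the number of observations.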

If the data contain extreme values or are skewed, the mean is not a good measure of central tendency. To reduce the impact of a few extreme values, use the trimmed (truncated) mean or the median instead. The trimmed mean is the average computed after the extreme values are removed.
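In R, a trimmed mean can be obtained with the trim argument of mean(), which drops the given fraction of observations from each end of the sorted data before averaging. The `ages` vector below is made up for illustration and contains one deliberate outlier.

```r
# Illustrative data: nine plausible ages and one extreme value.
ages <- c(23, 25, 27, 29, 31, 33, 35, 37, 39, 120)

# The ordinary mean is pulled upward by the outlier.
plain_mean <- mean(ages)

# trim = 0.1 removes the lowest 10% and the highest 10% of the values
# (here one value from each end) before averaging.
trimmed_mean <- mean(ages, trim = 0.1)

c(plain = plain_mean, trimmed = trimmed_mean, median = median(ages))
```

The trimmed mean and the median stay close to the bulk of the data, while the plain mean is shifted by the single extreme value.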

2, Median

The median is the middle value of a set of observations sorted from smallest to largest. Use median(x) to calculate it.

age.median <- median(Arthritis$Age)
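The odd and even cases, and the median's resistance to extreme values, can be sketched with a few made-up vectors (all values below are illustrative):

```r
# Odd number of values: the middle value after sorting (1 5 7).
odd_values <- c(7, 1, 5)
median_odd <- median(odd_values)

# Even number of values: the average of the two middle values (1 3 5 7).
even_values <- c(7, 1, 5, 3)
median_even <- median(even_values)

# One extreme value moves the mean a lot but leaves the median in place.
with_outlier <- c(1, 3, 5, 7, 1000)
c(mean = mean(with_outlier), median = median(with_outlier))
```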

3, Mode

The mode is the most frequently occurring value in a data set and is mainly used for qualitative data. R does not have a standard built-in function for calculating the statistical mode (the built-in mode() function returns the storage mode of an object), so we create a user-defined function for it.

The function takes a vector as input and returns its mode as output.

getmode <- function(v) {
  uniqv <- unique(v)                             # distinct values
  uniqv[which.max(tabulate(match(v, uniqv)))]    # value with the highest count
}
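An alternative sketch, not the approach used above, builds a frequency table with table() and picks the most frequent label. The `status` vector is made up for illustration. When several values tie for the highest count, which.max() returns only the first of them.

```r
# Illustrative qualitative data: user status labels.
status <- c("normal", "suspended", "normal", "closed", "normal", "suspended")

# Frequency of each distinct value.
counts <- table(status)

# Label with the highest frequency (first one in case of ties).
mode_value <- names(counts)[which.max(counts)]
mode_count <- max(counts)

mode_value
```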

Three, measures of dispersion

Four statistics measure dispersion:

Range: range = max - min
Standard deviation: measures the extent to which the data deviate from the mean
Coefficient of variation (CV): measures the standard deviation relative to the mean; the formula is CV = standard deviation/mean
Interquartile range (IQR): the difference between the upper quartile QU and the lower quartile QL; this interval contains the middle half of the observations. The larger the value, the greater the variation in the data and the more pronounced the dispersion.
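One reason to report the coefficient of variation alongside the standard deviation is that it does not depend on the unit of measurement. The sketch below uses a hypothetical helper `cv` and made-up heights to show this.

```r
# Hypothetical helper: coefficient of variation = sd / mean.
cv <- function(v) sd(v) / mean(v)

# The same made-up heights expressed in centimetres and in metres.
height_cm <- c(150, 160, 170, 180, 190)
height_m  <- height_cm / 100

# The standard deviation changes with the unit ...
c(sd_cm = sd(height_cm), sd_m = sd(height_m))

# ... but the coefficient of variation does not.
cv_cm <- cv(height_cm)
cv_m  <- cv(height_m)
c(cv_cm = cv_cm, cv_m = cv_m)
```

The coefficient of variation is only meaningful for data measured on a ratio scale with a mean that is not close to zero.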

To view the dispersion of the Arthritis data set:

get_stat <- function(v) {
  v.mean   <- mean(v)
  v.median <- median(v)
  v.range  <- max(v) - min(v)
  v.sd     <- sd(v)
  v.cv     <- v.sd / v.mean
  v.iqr    <- quantile(v, 0.75) - quantile(v, 0.25)
  # Return all statistics as a one-row data frame
  data.frame(mean = v.mean, median = v.median, range = v.range,
             sd = v.sd, cv = v.cv, iqr = v.iqr, row.names = NULL)
}
mystat <- get_stat(Arthritis$Age)

Four, skewness and kurtosis

The base installation has no functions for calculating skewness and kurtosis, but users can add their own:

mystats <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]          # optionally drop missing values
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3    # excess kurtosis: 0 for a normal distribution
  return(c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt))
}
myvars <- c("mpg", "hp", "wt")
sapply(mtcars[myvars], mystats)

We recommend the article "Some explorations on skewness and kurtosis" and quote its conclusions about what affects kurtosis:

The tails (or outliers) have a positive influence on kurtosis, and their influence is the greatest. The high-probability region also has a positive influence, but a smaller one, while the medium-probability region has a negative influence.

Reference Documentation:

Some explorations on skewness and kurtosis

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
