Successful data analysis starts with a grasp of the data as a whole. Descriptive statistics summarize the characteristics of a data set, chiefly its central tendency, its dispersion and the shape of its distribution. These measures reveal important properties of the whole data set and provide a useful reference for the analysis that follows.
One, basic statistics
The basic statistics used to describe data fall into three categories: measures of central tendency, measures of dispersion and measures of distribution shape.
1, central tendency statistics
A central tendency statistic describes location. Intuitively, given an attribute, it answers the question: where do most of the values fall?
(1) Mean
The mean, also called the arithmetic mean, describes the average position of the data. Its mathematical expression is: mean = ∑x/n.
In some cases each value x in a data set is associated with a weight w that reflects the importance or the frequency of occurrence of that value. The result is the weighted mean: weighted mean = ∑(w·x)/∑w.
Although the mean is the most useful statistic for describing the center of a data set, it is not always the best measure of the center, because it is sensitive to extreme values (outliers). To counteract the effect of a few extreme values we can use the trimmed mean, which is the mean computed after the extreme values have been dropped.
(2) Median
For skewed (asymmetric) data, the median is a better description of the center. The median is the middle value of the ordered data; it is not affected by extreme values and represents the typical level of the data. To find it, sort the values from smallest to largest: if the count is odd, take the middle value; if the count is even, take the average of the two middle values.
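To make the contrast between mean and median concrete, here is a minimal sketch in R, the language used for the examples later in this article. The vector `income` is made-up illustration data and is not taken from any data set discussed here.

```r
# Hypothetical sample: five typical values plus one extreme value
income <- c(30, 32, 35, 36, 38, 400)

income.mean   <- mean(income)    # pulled upward by the single extreme value
income.median <- median(income)  # even count: average of the two middle values

# Odd count: the median is simply the middle value
odd.median <- median(c(30, 32, 35, 36, 38))

print(c(mean = income.mean, median = income.median, odd = odd.median))
```

The single extreme value moves the mean far away from where most of the values lie, while the median stays among them.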
(3) Mode
The mode is the most frequently occurring value of a variable and is usually used to describe the typical value of qualitative data. For example, for a user-status variable (normal, suspended for arrears, suspended on request, line removed, account cancelled), the mode is "normal", which is the expected situation.
2, dispersion statistics
The main statistics for measuring the dispersion of data are the standard deviation and the interquartile range.
(1) Standard deviation (or variance)
The standard deviation measures how spread out the data are. A low standard deviation means the observations tend to lie close to the mean; a high standard deviation means the data are spread over a wide range of values.
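For reference, the usual sample definitions of variance and standard deviation are written out below. The text gives no formula, so this is the standard textbook form, stated here as an assumption; it uses the n − 1 denominator that R's var() and sd() functions use, with x̄ denoting the mean.

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}
```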
(2) Range and interquartile range
The range is the difference between the maximum and minimum values in a data set: range = max − min.
Percentiles are obtained by sorting the data values from smallest to largest and dividing them into 100 equal parts. The median is the value in the middle (the 50th percentile); the first quartile, Q1, is the 25th percentile, and the third quartile, Q3, is the 75th percentile.
The interquartile range is IQR = Q3 − Q1, the distance between the first and third quartiles. It gives the range covered by the middle half of the data and is a simple measure of dispersion.
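The definitions above can be sketched in R as follows. The vector `x` is made-up illustration data. Note that R's quantile() supports several interpolation methods, so other software may report slightly different quartiles for the same data.

```r
# Hypothetical sample of nine values
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 15)

x.range <- max(x) - min(x)        # range = max - min
q <- quantile(x, c(0.25, 0.75))   # Q1 and Q3, using R's default quantile method
x.iqr <- unname(q[2] - q[1])      # IQR = Q3 - Q1

# The built-in IQR() function computes the same quantity directly
print(c(range = x.range, IQR = x.iqr, builtin.IQR = IQR(x)))
```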
3, distribution shape statistics
The shape of a distribution is measured with the skewness and kurtosis coefficients.
Skewness measures the symmetry of the data distribution. From the skewness we can determine both the degree and the direction of the asymmetry.
- For a normal distribution (or any strictly symmetric distribution) the skewness is 0.
- If the skewness is negative, the data to the left of the mean are more dispersed than the data to the right.
- If the skewness is positive, the data to the left of the mean are less dispersed than the data to the right.
Kurtosis measures how steep or flat the data distribution is. From the kurtosis we can determine whether the distribution is steeper or flatter than the normal distribution.
- The kurtosis of the normal distribution is 3.
- When the peak of the data distribution is higher than that of the normal distribution, the kurtosis is greater than 3.
- When the peak is lower than that of the normal distribution, the kurtosis is less than 3.
(1) Skewness coefficient
The skewness coefficient, written SK, reflects how far the center of the data distribution is shifted: SK = (mean − median) / standard deviation. It is a characteristic number that describes how far the distribution departs from symmetry.
The skewness of the normal distribution is 0. Skewness < 0 means the distribution is negatively skewed (left-skewed): typically fewer observations lie to the left of the mean than to the right, and a long tail extends to the left, indicating extreme small values. Skewness > 0 means the distribution is positively skewed (right-skewed), with the long tail on the right. When the skewness is close to 0, the distribution can be regarded as symmetric. Because a clearly non-zero skewness indicates a departure from the normal distribution, skewness can be used when checking the normality of a distribution. The larger the absolute value of the skewness, the more strongly skewed the distribution.
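As a small illustration of the SK formula above, here is a sketch in R. The vector `x` is made-up data chosen to have a long right tail.

```r
# Hypothetical right-skewed sample: most values are small, one is large
x <- c(1, 2, 2, 3, 3, 3, 4, 20)

# SK = (mean - median) / standard deviation, as defined above
sk <- (mean(x) - median(x)) / sd(x)

# Mirroring the sample around zero flips the sign of the coefficient
sk.mirrored <- (mean(-x) - median(-x)) / sd(-x)

print(c(sk = sk, sk.mirrored = sk.mirrored))
```

The large value pulls the mean above the median, so SK is positive. The mirrored sample has its long tail on the left and gives the same magnitude with a negative sign.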
(2) Kurtosis coefficient
The kurtosis coefficient, written K, measures how strongly the data are concentrated around the center. It describes how steep or flat the distribution is compared with the peak of the normal distribution.
For example, the kurtosis of a normal distribution is 3. K > 3 indicates that the observations are more concentrated around the center, with a sharper peak and heavier tails than the normal distribution; K < 3 indicates that the observations are less concentrated, with a flatter peak and lighter tails than the normal distribution.
The formula for the kurtosis coefficient is:
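The expression itself does not appear in the text at this point. One common moment-based form, consistent with the reference value of 3 for the normal distribution used above, is given below as an assumption rather than as the author's original expression. Here x̄ is the mean, s the standard deviation and n the number of observations.

```latex
K = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^{4}
```

Subtracting 3 from K gives the excess kurtosis, which is 0 for a normal distribution.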
As an example, this article uses the Arthritis data set in the vcd package to demonstrate a descriptive statistical analysis:
    library(vcd)
    head(Arthritis)
      ID Treatment  Sex Age Improved
    1  …   Treated Male   …     Some
    2  …   Treated Male   …     None
    3  …   Treated Male   …     None
    4  …   Treated Male   …   Marked
    5  …   Treated Male   …   Marked
    6  …   Treated Male   …   Marked
The variables Improved and Sex are factors, while ID and Age are numeric.
Two, measuring central tendency
Central tendency is measured with the mean, the median and the mode.
1, mean
The mean is the average of all the data. The mean() function calculates the mean of a vector:
    age.mean <- mean(Arthritis$Age)
Sometimes, to reflect the different importance of individual values, each element xi of the data is given a weight wi, which yields a weighted mean. Use weighted.mean(x, w) to calculate it:
    weighted.mean(x, w)
Here x is the data vector and w is the weight vector; each element of x corresponds to one weight in w.
To set the weights by sex, suppose the age of a male has a weight of 95% and the age of a female a weight of 105%. The resulting weighted mean is:
    age.wt <- ifelse(Arthritis$Sex == "Male", 0.95, 1.05)
    age.wt.mean <- weighted.mean(Arthritis$Age, age.wt)
If the data contain extreme values or are skewed, the mean is not a good measure of central tendency. To remove the influence of a few extreme values, use the trimmed mean or the median instead. The trimmed mean is the mean computed after the extreme values have been removed.
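In R, the trimmed mean is available through the `trim` argument of mean(), which drops the given fraction of observations from each end of the sorted data before averaging. A minimal sketch, using made-up ages rather than the Arthritis data:

```r
# Hypothetical ages with one implausibly large value
ages <- c(23, 27, 29, 30, 32, 46, 58, 61, 64, 250)

plain.mean   <- mean(ages)              # affected by the extreme value
trimmed.mean <- mean(ages, trim = 0.1)  # drops the lowest and highest 10% first
age.median   <- median(ages)

print(c(mean = plain.mean, trimmed = trimmed.mean, median = age.median))
```

With ten values and `trim = 0.1`, one value is removed from each end, so the extreme value no longer contributes to the average.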
2, median
The median is the middle value of a set of observations sorted from smallest to largest. Use median(x) to calculate it:
    age.median <- median(Arthritis$Age)
3, mode
The mode is the most frequently occurring value in a data set and is mainly used for qualitative data. R has no standard built-in function for calculating the mode, so we create a user-defined function for it.
The function takes a vector as input and returns the mode as output.
    getmode <- function(v) {
      uniqv <- unique(v)                              # distinct values
      uniqv[which.max(tabulate(match(v, uniqv)))]     # the value with the highest count
    }
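For qualitative data, an alternative way to get the same result is to count the values with table(). The sketch below is an illustration only: the helper name `get_mode_tbl` and the status values are made up for this example.

```r
# Alternative sketch (hypothetical helper name): mode of a qualitative vector via table()
get_mode_tbl <- function(v) {
  counts <- table(v)                 # frequency of each distinct value
  names(counts)[which.max(counts)]   # label of the most frequent value (first one on ties)
}

# Made-up user-status values for illustration
status <- c("normal", "suspended", "normal", "cancelled", "normal")
status.mode <- get_mode_tbl(status)
print(status.mode)
```

This variant always returns the mode as a character string, so for numeric input the result would need converting back with as.numeric().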
Three, measuring dispersion
There are four measures of dispersion:
- Range: range = max − min
- Standard deviation: measures how far the data deviate from the mean
- Coefficient of variation (CV): measures the standard deviation relative to the mean; the formula is CV = standard deviation / mean
- Interquartile range (IQR): the difference between the upper quartile QU and the lower quartile QL; it covers the middle half of the observations. The larger the value, the greater the variation in the data and the more pronounced the dispersion.
To view the dispersion of the Arthritis data set:
    get_stat <- function(v) {
      v.mean <- mean(v)
      v.median <- median(v)
      v.range <- max(v) - min(v)
      v.sd <- sd(v)
      v.cv <- v.sd / v.mean
      v.iqr <- quantile(v, 0.75) - quantile(v, 0.25)
      # collect the statistics in a one-row data frame and return it
      data.frame(mean = v.mean, median = v.median, range = v.range,
                 sd = v.sd, cv = v.cv, iqr = v.iqr, row.names = NULL)
    }
    mystat <- get_stat(Arthritis$Age)
Four, skewness and kurtosis
The base installation of R has no function for calculating skewness and kurtosis, but users can add one themselves:
    mystats <- function(x, na.omit = FALSE) {
      if (na.omit)
        x <- x[!is.na(x)]                      # optionally drop missing values
      m <- mean(x)
      n <- length(x)
      s <- sd(x)
      skew <- sum((x - m)^3 / s^3) / n
      kurt <- sum((x - m)^4 / s^4) / n - 3     # excess kurtosis: 0 for a normal distribution
      return(c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt))
    }
    myvars <- c("mpg", "hp", "wt")
    sapply(mtcars[myvars], mystats)
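Note that the function above subtracts 3 from the kurtosis, so it reports excess kurtosis (0 for a normal distribution), whereas the earlier discussion uses 3 as the reference value. The sketch below reports both scales side by side; the helper name `shape_stats` and the sample are made up for illustration.

```r
# Hypothetical helper: skewness plus kurtosis on both scales
shape_stats <- function(x) {
  m <- mean(x)
  s <- sd(x)
  n <- length(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n   # a normal distribution is about 3 on this scale
  c(skew = skew, kurtosis = kurt, excess = kurt - 3)
}

# A perfectly symmetric made-up sample has zero skewness
sym <- shape_stats(c(-2, -1, 0, 1, 2))
print(sym)
```

When comparing results across packages, it is worth checking which of the two kurtosis scales a function reports.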
For further reading we recommend the article "Some explorations on skewness and kurtosis". Its findings on what drives kurtosis are summarized here:
Tails and outliers raise kurtosis and have the largest effect. The high-probability region also raises kurtosis, but to a lesser degree, while the medium-probability region lowers it.
References:
Some explorations on skewness and kurtosis