R Language Data Analysis Series VI --by Comaple.zhang
In the previous section we covered plotting in R. This section covers the first step of analysis once you have a data set: exploratory data analysis.
Summary statistics are the indicators of a data set that statistics is concerned with. The commonly used ones are: minimum, maximum, quartiles, mean, median, mode, variance, standard deviation, range, skewness, and kurtosis.
First, the meaning of these statistics. The simple ones need no explanation, so only the less common ones are covered here.
Mode: the value that occurs most often.
Variance: the average of the squared differences between each sample value and the mean (R's var divides by n - 1 rather than n).
Standard deviation: the square root of the variance, used to measure the dispersion of a data set.
Range: the difference between the maximum and the minimum.
Skewness: measured relative to the normal distribution. If the peak is on the left and the long tail is on the right, the distribution is right-skewed (positively skewed) and the skewness is > 0; the mirror image is left-skewed (negatively skewed), with skewness < 0.
Kurtosis: also measured relative to the normal distribution, whose kurtosis is 3. A kurtosis > 3 indicates heavier tails than the normal distribution (heavy-tailed); a kurtosis < 3 indicates lighter tails (thin-tailed).
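As a minimal sketch, the less common statistics above can be computed by hand in base R. The vector x and the helper stat_mode are made up for illustration; base R has no built-in function for the statistical mode.

```r
# Toy sample, invented for illustration only
x <- c(2, 3, 3, 5, 7, 3, 9, 40)

# Mode: the value that occurs most often (hypothetical helper)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

x_mode  <- stat_mode(x)      # most frequent value
x_range <- max(x) - min(x)   # range: maximum minus minimum

# Variance by hand; like var(), this divides by n - 1 (sample variance)
n     <- length(x)
x_var <- sum((x - mean(x))^2) / (n - 1)
x_sd  <- sqrt(x_var)         # standard deviation: square root of the variance

print(c(mode = x_mode, range = x_range, var = x_var, sd = x_sd))
```

The hand-computed variance and standard deviation should agree with var(x) and sd(x).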
Data set for this section:
We use the Insurance data set from the MASS package, which contains insurance data from one insurer.
"District" "Group" "Age" "Holders" "Claims"
Meaning of each column: policyholder's district of residence, engine size of the insured car, age of the insured, number of policyholders, number of claims
To install the package and load the data set:
install.packages('MASS')  # install the package
library(MASS)             # load the package
data(Insurance)           # load the data set
ins <- Insurance          # make a copy of the data
Exploratory data analysis
R's summary function gives a summary of the data:
summary(ins)
 District    Group       Age        Holders            Claims
 1:16     <1l   :16   <25  :16   Min.   :   3.00   Min.   :  0.00
 2:16     1-1.5l:16   25-29:16   1st Qu.:  46.75   1st Qu.:  9.50
 3:16     1.5-2l:16   30-35:16   Median : 136.00   Median : 22.00
 4:16     >2l   :16   >35  :16   Mean   : 364.98   Mean   : 49.23
                                 3rd Qu.: 327.50   3rd Qu.: 55.50
                                 Max.   :3582.00   Max.   :400.00
For factor columns, summary gives the frequency of each level. For continuous variables it gives the minimum, the first quartile (1/4), the median, the mean, the third quartile (3/4), and the maximum.
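A small sketch of what summary reports for each column type, using a made-up data frame df rather than the Insurance data: table gives the frequency of each factor level, and quantile gives the quartiles of a numeric vector.

```r
# Made-up data frame, for illustration only
df <- data.frame(
  grp = factor(c("a", "b", "a", "c", "b", "a")),
  val = c(1, 2, 3, 4, 5, 100)
)

freq <- table(df$grp)                          # frequency of each factor level
q    <- quantile(df$val, c(0.25, 0.5, 0.75))   # 1st quartile, median, 3rd quartile

print(freq)
print(q)
print(summary(df))   # combines both views, one per column
```

In this toy data the single large value pulls the mean well above the median, which is the usual sign of a right-skewed variable.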
From the results we can see that the median of the Holders column (136) is far below the mean (364.98), which means the data is skewed: three quarters of the values lie between 3 and 327.5. We can look further with a scatter plot:
plot(ins$Holders)
The scatter plot may not be very intuitive. To see the distribution of the data visually, we can use a histogram with a normal curve overlaid:
library(RColorBrewer)  # provides brewer.pal
col <- c(brewer.pal(9, 'YlOrRd')[1:9])
h <- hist(ins$Holders, breaks = 12, col = col)
xfit <- seq(min(ins$Holders), max(ins$Holders), length = 40)
yfit <- dnorm(xfit, mean = mean(ins$Holders), sd = sd(ins$Holders))
yfit <- yfit * diff(h$mids[1:2]) * length(ins$Holders)  # scale density to counts
lines(xfit, yfit, col = 'red', lwd = 2)
Variance and Standard deviation
To calculate the variance and standard deviation of the Holders column:
var(ins$Holders)
sd(ins$Holders)
On their own, the variance and standard deviation of a single variable are not very informative. They become useful when compared across data sets or groups.
To compute statistics after grouping the policyholders by age, the aggregate function is a convenient tool:
agg <- aggregate(ins[4:5], by = list(age = ins$Age), sd)
pie(agg$Claims, labels = agg$age)
agg
    age   Holders    Claims
1   <25  80.41797  16.55181
2 25-29 141.11414  22.63184
3 30-35 177.34353  24.23694
4   >35 941.66603 103.52228
These are the per-group standard deviations after grouping by the Age column.
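The same group-then-summarise pattern can be sketched on a made-up data frame. The names d, g and v are hypothetical, and tapply is shown only as an alternative to aggregate when a single column is summarised.

```r
# Made-up data, for illustration only
d <- data.frame(
  g = factor(c("x", "x", "x", "y", "y", "y")),
  v = c(1, 2, 3, 10, 20, 30)
)

# aggregate returns a data frame with one row per group
by_agg <- aggregate(d["v"], by = list(g = d$g), FUN = sd)

# tapply returns a named array with one element per group
by_tap <- tapply(d$v, d$g, sd)

print(by_agg)
print(by_tap)
```

aggregate is the better fit when several columns are summarised at once, as with ins[4:5]; tapply is shorter for one column.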
Skewness and Kurtosis:
In order to calculate skewness and kurtosis we can implement our own function stat as follows:
stat <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]   # optionally drop missing values
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3   # excess kurtosis (normal = 0)
  return(c(n = round(n), mean = m, stdev = s, skew = skew, kurtosis = kurt))
}
sapply(ins[4:5], stat)
            Holders    Claims
n         64.000000 64.000000
mean     364.984375 49.234375
stdev    622.770601 71.162399
skew       3.127833  2.877292
kurtosis  10.999610  9.377258
The skewness of both Holders and Claims is greater than 0, so both variables are positively skewed: most of the data sits on the left, with a long tail to the right. Because stat subtracts 3, the kurtosis values it reports are excess kurtosis (0 for a normal distribution), and both are very high, which suggests that the two variables contain outliers.
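As a rough sanity check of this interpretation, the same formulas can be applied to simulated data. The helper moments below restates the skewness and kurtosis formulas from stat so the block is self-contained; the simulated samples and the seed are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the run is reproducible

# Same formulas as stat: skewness and excess kurtosis (kurtosis - 3)
moments <- function(x) {
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  c(skew     = sum((x - m)^3 / s^3) / n,
    kurtosis = sum((x - m)^4 / s^4) / n - 3)
}

normal_sample <- rnorm(10000)   # symmetric: both values should be near 0
skewed_sample <- rexp(10000)    # long right tail: both should be clearly > 0

res <- rbind(normal      = moments(normal_sample),
             exponential = moments(skewed_sample))
print(round(res, 2))
```

A sample with a long right tail, like Holders and Claims, should show the same pattern as the exponential row: positive skewness and positive excess kurtosis.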
We can also use a box plot to observe this; it was introduced in the previous section and is not repeated here.