R Language Data Analysis Series, Part 6 -- by Comaple.zhang
In the previous section we talked about the R language. This section covers the first step in analyzing a data set once you have it: exploratory data analysis.
Statistics are the summary measures of a data set that statisticians care about. The commonly used ones are: minimum, maximum, quartiles, mean, median, mode, variance, standard deviation, range, skewness, and kurtosis.
First, a brief explanation of what these measures mean. The simple ones are skipped; here we mainly cover the less common ones.
Mode: The value that occurs most often.
Variance: The average of the squared differences between each sample value and the mean. (R's var divides by n - 1 rather than n.)
Standard deviation: Also known as the root mean square deviation; it is the square root of the variance. It measures how spread out a data set is around its mean.
Range: The difference between the maximum and minimum values.
Skewness: If, compared with the normal distribution, the peak sits on the left and the long tail extends to the right, the distribution is right-skewed (positively skewed) and the skewness value is > 0. The mirror-image case is left-skewed (negatively skewed), with a skewness value < 0.
Kurtosis: Also defined relative to the normal distribution, whose kurtosis is 3. If kurtosis > 3, the distribution has heavier tails than the normal distribution and is called thick-tailed. If kurtosis < 3, the tails are lighter, and the distribution is called thin-tailed.
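The simpler measures above can be tried out on a small vector. This is only a sketch: the sample values are made up for illustration, and since base R has no built-in function for the statistical mode, the most frequent value is read off a frequency table.

```r
# Hypothetical sample, chosen only for illustration
x <- c(2, 3, 3, 5, 7, 10)

# Mode: the most frequent value, taken from a frequency table
freq <- table(x)
mode_x <- as.numeric(names(freq)[which.max(freq)])

# Range: maximum minus minimum
range_x <- max(x) - min(x)

# Variance and standard deviation (R divides by n - 1)
var_x <- var(x)
sd_x <- sd(x)

print(c(mode = mode_x, range = range_x, var = var_x, sd = sd_x))
```

If several values tie for the highest frequency, which.max returns only the first of them, so this approach reports a single mode.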
Data set for this section:
We use the Insurance data set from the MASS package, which contains insurance data from an insurer.
"District" "Group" "Age" "Holders" "Claims"
In column order these are: district of the policyholder's home address, engine displacement of the insured vehicle, age of the insured, number of policyholders, and number of claims.
Install the package and load the data set:
install.packages('MASS')  # install the package
library(MASS)             # load the package
data(Insurance)           # load the data set
ins <- Insurance          # make a copy of the data
Exploratory data analysis
R's summary function gives an overview of the data:
summary(ins)
 District    Group       Age        Holders            Claims
 1:16     <1l   :16   <25  :16   Min.   :   3.00   Min.   :  0.00
 2:16     1-1.5l:16   25-29:16   1st Qu.:  46.75   1st Qu.:  9.50
 3:16     1.5-2l:16   30-35:16   Median : 136.00   Median : 22.00
 4:16     >2l   :16   >35  :16   Mean   : 364.98   Mean   : 49.23
                                 3rd Qu.: 327.50   3rd Qu.: 55.50
                                 Max.   :3582.00   Max.   :400.00
We find that summary gives the frequency distribution for factor-type vectors, and the minimum, first quartile, median, mean, third quartile, and maximum for continuous variables.
From the results we can see that the median of the Holders column is far below the mean, which indicates that the data are skewed; three quarters of the values lie between 3 and 327.5. We can look further with a scatter plot:
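The figures that summary reports for a continuous column can also be computed one at a time, which is handy when only some of them are needed. A minimal sketch, assuming the MASS package is available (it ships with standard R installations); the variable names are chosen for illustration:

```r
library(MASS)      # provides the Insurance data set
ins <- Insurance

# Individual summary statistics for the Holders column
holders_median <- median(ins$Holders)
holders_mean <- mean(ins$Holders)

# First and third quartiles (quantile's default method matches summary)
holders_q <- quantile(ins$Holders, probs = c(0.25, 0.75))

print(holders_median)
print(holders_mean)
print(holders_q)
```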
plot(ins$Holders)
A scatter plot may not be very intuitive. To see the distribution of the data more directly, we can draw a histogram and overlay a normal curve:
library(RColorBrewer)                        # provides brewer.pal
col <- c(brewer.pal(9, 'YlOrRd')[1:9])       # nine-colour palette for the bars
h <- hist(ins$Holders, breaks=12, col=col)   # histogram of Holders
xfit <- seq(min(ins$Holders), max(ins$Holders), length=40)
yfit <- dnorm(xfit, mean=mean(ins$Holders), sd=sd(ins$Holders))
yfit <- yfit * diff(h$mids[1:2]) * length(ins$Holders)   # scale the density to counts
lines(xfit, yfit, col='red', lwd=2)          # overlay the normal curve
Variance and Standard deviation
To calculate the variance and standard deviation of the Holders column:
var(ins$Holders)
sd(ins$Holders)
In fact, the variance and standard deviation of a single variable are not very meaningful on their own. Only by comparison can we see how data sets resemble or differ from each other.
Suppose we want to compute a statistic after grouping the policyholders by age. The aggregate function provides a convenient way to do this:
agg <- aggregate(ins[4:5], by=list(age=ins$Age), sd)   # standard deviation per age group
pie(agg$Claims, labels=agg$age)                         # pie chart of the Claims values
agg
    age   Holders    Claims
1   <25  80.41797  16.55181
2 25-29 141.11414  22.63184
3 30-35 177.34353  24.23694
4   >35 941.66603 103.52228
These are the standard deviations of Holders and Claims within each age group.
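Because the age groups differ greatly in size, one way to make their spreads comparable is to divide each group's standard deviation by its mean (the coefficient of variation). This is a sketch rather than part of the original analysis; it assumes MASS is available, and the names agg_mean, agg_sd, and cv are illustrative.

```r
library(MASS)      # provides the Insurance data set
ins <- Insurance

# Mean and standard deviation of Holders and Claims per age group
agg_mean <- aggregate(ins[4:5], by = list(age = ins$Age), FUN = mean)
agg_sd <- aggregate(ins[4:5], by = list(age = ins$Age), FUN = sd)

# Coefficient of variation: spread relative to the group mean
cv <- data.frame(
  age = agg_sd$age,
  Holders = agg_sd$Holders / agg_mean$Holders,
  Claims = agg_sd$Claims / agg_mean$Claims
)
print(cv)
```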
Skewness and Kurtosis:
To calculate skewness and kurtosis we can write a function stat ourselves, as follows. Note that it subtracts 3 from the kurtosis, so it reports excess kurtosis, which is 0 for a normal distribution:
stat <- function(x, na.omit=FALSE) {
  if (na.omit) x <- x[!is.na(x)]          # optionally drop missing values
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x-m)^3/s^3)/n              # skewness
  kurt <- sum((x-m)^4/s^4)/n - 3          # excess kurtosis
  return(c(n=round(n), mean=m, stdev=s, skew=skew, kurtosis=kurt))
}
sapply(ins[4:5], stat)
            Holders    Claims
n         64.000000 64.000000
mean     364.984375 49.234375
stdev    622.770601 71.162399
skew       3.127833  2.877292
kurtosis  10.999610  9.377258
We can see that the skewness of both Holders and Claims is greater than 0, which means that both variables are positively skewed: most of the data sits on the left, with a long tail to the right. The kurtosis values (reported here as the excess over the normal distribution's 3) are also very high, which suggests that both variables contain outliers.
Similarly, we can use a box plot to check for outliers. Box plots have already been introduced, so we will not repeat that here.
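To get a feel for how these two values behave, the same formulas can be applied to small made-up samples. This is a sketch only: skew_kurt is a hypothetical helper written for this illustration, and both vectors are invented.

```r
# Hypothetical helper using the same formulas as the function above
skew_kurt <- function(x) {
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  c(skew = sum((x - m)^3 / s^3) / n,
    excess_kurtosis = sum((x - m)^4 / s^4) / n - 3)
}

sym <- c(-2, -1, 0, 1, 2)        # symmetric sample
right <- c(1, 1, 2, 2, 3, 20)    # one large value creates a long right tail

res_sym <- skew_kurt(sym)
res_right <- skew_kurt(right)
print(res_sym)
print(res_right)
```

The symmetric sample gives a skewness of zero, while the sample with a single large value gives a positive skewness, matching the sign convention described earlier.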
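For reference, a box plot of the two columns, together with the points it flags as outliers, could be produced roughly as follows. A minimal sketch, assuming MASS is available; boxplot.stats applies the usual 1.5 x IQR rule, and the variable name out_holders is illustrative.

```r
library(MASS)      # provides the Insurance data set
ins <- Insurance

# Side-by-side box plots of the two continuous columns
boxplot(ins$Holders, ins$Claims, names = c("Holders", "Claims"))

# Values of Holders that fall outside the whiskers
out_holders <- boxplot.stats(ins$Holders)$out
print(out_holders)
```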
Copyright notice: This is an original article by the blog author and may not be reproduced without permission.