R Language Data Analysis Series VI


R Language Data Analysis Series VI --by Comaple.zhang

The previous section covered plotting in R. This section covers the first step of an analysis once you have a data set in hand: exploratory data analysis.

Descriptive statistics are the indicators of a data set that statistics is usually concerned with. The common ones are: minimum, maximum, quartiles, mean, median, mode, variance, standard deviation, range, skewness, kurtosis

First, let us explain what these statistics mean. The simple ones need no explanation, so we mainly cover the less common ones here

Mode: The value that occurs most often

Variance: The average of the squared differences between each sample value and the mean

Standard deviation: The square root of the variance, used to measure how spread out a data set is

Range: The difference between the maximum and minimum values

Skewness: Measured relative to the normal distribution. If the peak is on the left and the long tail extends to the right, the distribution is right-skewed (positively skewed) and the skewness is > 0; in the opposite case it is left-skewed (negatively skewed) and the skewness is < 0

Kurtosis: Also measured relative to the normal distribution, whose kurtosis is 3. Kurtosis > 3 indicates heavier tails than the normal distribution (a thick-tailed distribution); kurtosis < 3 indicates lighter tails (a thin-tailed distribution)
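The less common statistics above can be computed in a few lines of base R. The sketch below uses a small made-up vector x (not the data set from this article). The helper stat_mode is hypothetical: base R has no built-in function for the statistical mode, and this version simply returns the first value with the highest count if there are ties.

```r
# Made-up sample, for illustration only
x <- c(2, 3, 3, 5, 7, 3, 9, 5)

# Mode: the value that occurs most often.
# Hypothetical helper; base R's mode() returns the storage type instead.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

x_mode  <- stat_mode(x)       # most frequent value
x_range <- max(x) - min(x)    # range: maximum minus minimum
x_quart <- quantile(x)        # minimum, quartiles and maximum

print(x_mode)
print(x_range)
print(x_quart)
```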

Data set for this section:

We use the Insurance data set from the MASS package, which contains insurance data from an insurer.

"District" "Group" "Age" "Holders" "Claims"

Column meanings: policyholder's district of residence, engine displacement group of the insured car, age group of the insured, number of policyholders, number of claims

To install the package and load the data set:

install.packages('MASS')  # install the package
library(MASS)             # load the package
data(Insurance)           # load the data set
ins <- Insurance          # make a copy of the data


Exploratory Data Analysis

R's built-in summary function gives a summary of the data:

summary(ins)

 District    Group       Age        Holders            Claims
 1:16     <1l   :16   <25  :16   Min.   :   3.00   Min.   :  0.00
 2:16     1-1.5l:16   25-29:16   1st Qu.:  46.75   1st Qu.:  9.50
 3:16     1.5-2l:16   30-35:16   Median : 136.00   Median : 22.00
 4:16     >2l   :16   >35  :16   Mean   : 364.98   Mean   : 49.23
                                 3rd Qu.: 327.50   3rd Qu.: 55.50
                                 Max.   :3582.00   Max.   :400.00

For factor vectors, summary gives the frequency of each level. For continuous variables, it gives the minimum, first quartile, median, mean, third quartile, and maximum.
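To see where these numbers come from, the sketch below rebuilds both kinds of summary by hand on a small made-up data frame (toy and its columns grp and val are illustrative, not part of the Insurance data): table for a factor column, and quantile plus mean for a numeric column.

```r
# Made-up data frame, for illustration only
toy <- data.frame(
  grp = factor(c("a", "b", "a", "c", "b", "a")),
  val = c(3, 10, 4, 250, 8, 5)
)

# Factor column: summary() reports how often each level occurs
grp_counts <- table(toy$grp)

# Numeric column: summary() reports five quantiles plus the mean
val_quart <- quantile(toy$val)
val_mean  <- mean(toy$val)

print(summary(toy))
print(grp_counts)
print(val_quart)
print(val_mean)
```

In this toy column the single large value 250 pulls the mean well above the median, which is the usual sign of a right-skewed variable.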

From the results we can see that the median of the Holders column (136) is well below its mean (364.98), which indicates a skewed distribution: three quarters of the values lie between 3 and 327.5, while the maximum is 3582. We can look further with a scatter plot:

plot(ins$Holders)


The scatter plot may not be very intuitive. To see how the data is distributed, we can use a histogram with a normal curve drawn over it:

library(RColorBrewer)  # provides brewer.pal
col <- brewer.pal(9, 'YlOrRd')
h <- hist(ins$Holders, breaks = 12, col = col)
xfit <- seq(min(ins$Holders), max(ins$Holders), length = 40)
yfit <- dnorm(xfit, mean = mean(ins$Holders), sd = sd(ins$Holders))
yfit <- yfit * diff(h$mids[1:2]) * length(ins$Holders)  # rescale density to counts
lines(xfit, yfit, col = 'red', lwd = 2)
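The scaling step in the histogram code turns a density into expected counts: a density integrates to 1, while the bars of a count histogram cover an area equal to the bin width times the number of observations. A minimal sketch on simulated data (the vector x is made up for illustration, and plot = FALSE only suppresses drawing):

```r
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)   # simulated data, for illustration only

h <- hist(x, breaks = 12, plot = FALSE)
bin_width <- diff(h$mids[1:2])        # bins are equal-width, so any adjacent pair works
n <- length(x)

# Total area of the count histogram is bin width times sample size
hist_area <- sum(h$counts) * bin_width

# A normal density rescaled by the same factor is on the same scale as the bars
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x)) * bin_width * n

print(hist_area)
print(bin_width * n)
```

An alternative, under the same assumptions, is to draw the histogram with freq = FALSE so the bars are already on the density scale and the curve needs no rescaling.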


Variance and Standard deviation

To calculate the variance and standard deviation of the Holders column:

var(ins$Holders)
sd(ins$Holders)


In fact, the variance and standard deviation of a single variable are not very meaningful on their own. They become useful when comparing data sets, where they show how similar or different the spreads are.

If we want to compute statistics after grouping by age band, the aggregate function gives us a good way to do it:

agg <- aggregate(ins[4:5], by = list(age = ins$Age), sd)
pie(agg$Claims, labels = agg$age)
agg
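For reference, R's var and sd compute the sample versions, which divide by n - 1 rather than n. A small sketch on a made-up vector checks both by hand:

```r
x <- c(4, 8, 15, 16, 23, 42)   # made-up values, for illustration only
n <- length(x)

var_by_hand <- sum((x - mean(x))^2) / (n - 1)  # sample variance
sd_by_hand  <- sqrt(var_by_hand)               # standard deviation is its square root

print(c(by_hand = var_by_hand, builtin = var(x)))
print(c(by_hand = sd_by_hand,  builtin = sd(x)))
```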


    age   Holders    Claims
1   <25  80.41797  16.55181
2 25-29 141.11414  22.63184
3 30-35 177.34353  24.23694
4   >35 941.66603 103.52228

This is equivalent to computing the statistic per group after a GROUP BY on the Age column.
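The same group-by pattern can be written in more than one way. The sketch below uses a made-up data frame toy (its columns age, holders and claims are illustrative stand-ins, not the Insurance data) and compares aggregate with tapply:

```r
# Made-up data frame, for illustration only
toy <- data.frame(
  age     = factor(c("<25", "<25", "25-29", "25-29", ">35", ">35")),
  holders = c(10, 20, 30, 50, 100, 400),
  claims  = c(1, 3, 4, 8, 10, 60)
)

# One row per age band, standard deviation of every selected column
agg <- aggregate(toy[c("holders", "claims")], by = list(age = toy$age), FUN = sd)

# tapply gives the same numbers for a single column, as a named vector
sd_holders <- tapply(toy$holders, toy$age, sd)

print(agg)
print(sd_holders)
```

aggregate is convenient when several columns need the same statistic at once; tapply is lighter when only one column is involved.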


Skewness and Kurtosis:

To calculate skewness and kurtosis, we can implement our own function stat as follows. Note that it subtracts 3 from the kurtosis, so a normal distribution scores 0 rather than 3:

stat <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]  # optionally drop missing values
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  return(c(n = round(n), mean = m, stdev = s, skew = skew, kurtosis = kurt))
}
sapply(ins[4:5], stat)


            Holders    Claims
n         64.000000 64.000000
mean     364.984375 49.234375
stdev    622.770601 71.162399
skew       3.127833  2.877292
kurtosis  10.999610  9.377258

We can see that the skewness of both Holders and Claims is greater than 0, so both variables are positively skewed: most of the data sits on the left, with a long tail extending to the right. The kurtosis values are also very high, which suggests that both variables contain outliers.

We can also use a box plot to examine this. Box plots were introduced in the previous section, so we do not repeat them here.
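To get a feel for how to read these two numbers, the sketch below applies the same formulas to simulated data: a symmetric normal sample and a right-skewed exponential sample. The helper names skew_of and excess_kurt_of are hypothetical, and the samples are made up for illustration.

```r
set.seed(42)
norm_x <- rnorm(10000)   # symmetric, bell-shaped
exp_x  <- rexp(10000)    # bulk near zero, long tail to the right

# Moment-based skewness and excess kurtosis, split into two small helpers
skew_of <- function(x) sum((x - mean(x))^3 / sd(x)^3) / length(x)
excess_kurt_of <- function(x) sum((x - mean(x))^4 / sd(x)^4) / length(x) - 3

res <- rbind(
  normal      = c(skew = skew_of(norm_x), kurtosis = excess_kurt_of(norm_x)),
  exponential = c(skew = skew_of(exp_x),  kurtosis = excess_kurt_of(exp_x))
)
print(round(res, 2))
```

The normal sample should give values close to 0 for both statistics, while the exponential sample should give clearly positive values, the same pattern seen for Holders and Claims.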
