R Language Data Analysis series six


By Comaple.zhang

The previous section covered the R language itself. This section is about the first step in analyzing a data set once you have it: exploratory data analysis.

Summary statistics are the indicators of a data set that statistics is concerned with. The commonly used ones are: minimum, maximum, quartiles, mean, median, mode, variance, standard deviation, range, skewness, and kurtosis.

First, a word on what these statistics mean. The elementary ones are skipped; the focus here is on the less common ones.

Mode: the value that occurs most often.

Variance: the average of the squared differences between each sample value and the mean. (R's var divides by n - 1 rather than n, which gives the sample variance.)

Standard deviation: also known as the root-mean-square deviation, it is the square root of the variance. It measures the dispersion of a data set.

Range: the difference between the maximum and minimum values.

Skewness: if the peak sits to the left of where it would be in a normal distribution and the long tail extends to the right, the distribution is right-skewed (positively skewed) and skewness > 0. The mirror image is a left-skewed (negatively skewed) distribution, with skewness < 0.

Kurtosis: this is also defined relative to the normal distribution, whose kurtosis is 3. A kurtosis > 3 means the tails are heavier than those of the normal distribution (thick tails). A kurtosis < 3 means the tails are lighter (thin tails).
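As a rough illustration of the first four definitions, the sketch below computes them on a small made-up vector. The values in x and all variable names are assumptions chosen for illustration; they are not taken from the Insurance data. Base R has no built-in function for the statistical mode, so it is derived from a frequency table here.

```r
# Made-up sample, for illustration only
x <- c(2, 3, 3, 5, 7, 10)

# Mode: count occurrences and take the most frequent value
counts <- table(x)
mode_value <- as.numeric(names(counts)[which.max(counts)])

# Variance as defined in the text: average squared deviation (divides by n)
var_population <- mean((x - mean(x))^2)
# R's var() divides by n - 1 instead (sample variance)
var_sample <- var(x)

# Standard deviation: square root of the (sample) variance
sd_sample <- sd(x)

# Range: maximum minus minimum
range_width <- max(x) - min(x)

print(c(mode = mode_value, var_population = var_population,
        var_sample = var_sample, sd = sd_sample, range = range_width))
```

The two variance values differ only in the divisor, and the gap shrinks as the sample grows.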

The data set for this section:

We use the Insurance data set from the MASS package, which contains data from an insurance company.

"District" "Group" "Age" "Holders" "Claims"

In column order these are: the district where the policyholder lives, the engine displacement group of the insured car, the age group of the insured, the number of policyholders, and the number of claims.

Install the package and load the data set:

install.packages('MASS')  # install the package
library(MASS)             # load the package
data(Insurance)           # load the data set
ins <- Insurance          # make a copy of the data


Exploratory data analysis

The summary function in R gives an overview of the data:

summary(ins)

 District    Group       Age        Holders            Claims
 1:16     <1l   :16   <25  :16   Min.   :   3.00   Min.   :  0.00
 2:16     1-1.5l:16   25-29:16   1st Qu.:  46.75   1st Qu.:  9.50
 3:16     1.5-2l:16   30-35:16   Median : 136.00   Median : 22.00
 4:16     >2l   :16   >35  :16   Mean   : 364.98   Mean   : 49.23
                                 3rd Qu.: 327.50   3rd Qu.: 55.50
                                 Max.   :3582.00   Max.   :400.00

We can see that summary gives the frequency distribution for factor vectors, and for continuous variables it gives the minimum, first quartile, median, mean, third quartile, and maximum.
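The individual figures that summary prints can also be obtained one at a time. A minimal sketch, assuming the MASS package is installed so that Insurance can be loaded; the variable names are chosen here for illustration. table gives the frequencies of a factor column and quantile gives the quartiles of a numeric one.

```r
library(MASS)  # provides the Insurance data set

# Frequency table for a factor column (what summary shows for District)
district_counts <- table(Insurance$District)
print(district_counts)

# Quartiles for a numeric column (1st Qu., Median, 3rd Qu. in summary)
holder_quartiles <- quantile(Insurance$Holders, probs = c(0.25, 0.5, 0.75))
print(holder_quartiles)

# Interquartile range: third quartile minus first quartile
holder_iqr <- IQR(Insurance$Holders)
print(holder_iqr)
```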

From the results we can see that the median of the Holders column (136) is well below its mean (364.98), which indicates that the data are skewed; three quarters of the values fall between 3 and 327.5. We can look further with a scatter plot:

plot(ins$Holders)


The scatter plot may not be very intuitive. To see the distribution of the data visually, we can draw a histogram and overlay a normal curve:

library(RColorBrewer)  # provides brewer.pal
col <- brewer.pal(9, 'YlOrRd')
h <- hist(ins$Holders, breaks = 12, col = col)
xfit <- seq(min(ins$Holders), max(ins$Holders), length = 40)
yfit <- dnorm(xfit, mean = mean(ins$Holders), sd = sd(ins$Holders))
yfit <- yfit * diff(h$mids[1:2]) * length(ins$Holders)  # rescale density to counts
lines(xfit, yfit, col = 'red', lwd = 2)
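The multiplication by bin width and sample size in the code above moves the normal density onto the histogram's count scale. The sketch below checks that relationship numerically, assuming MASS is installed; plot = FALSE is used so that only the bin data are computed, and the variable names are chosen for illustration.

```r
library(MASS)

holders <- Insurance$Holders
n <- length(holders)

# Compute the bins without drawing anything
h <- hist(holders, breaks = 12, plot = FALSE)
bin_width <- diff(h$breaks)

# hist stores density = counts / (n * bin width), so a density curve
# has to be multiplied by n * bin width to sit on the count axis
counts_from_density <- h$density * bin_width * n
print(cbind(counts = h$counts, rebuilt = counts_from_density))
```

With equal-width bins, the distance between the first two midpoints (h$mids[1:2]) equals the bin width, which is why the original code can use it as the scaling factor.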


Variance and Standard deviation

To calculate the variance and standard deviation of the Holders column:

var(ins$Holders)
sd(ins$Holders)


In fact, the variance and standard deviation of a single variable are not very informative on their own. Only by comparison can we see the similarities and differences between data sets.

Suppose we want to group the policyholders by age group and then compute statistics for each group. The aggregate function provides a convenient way to do this:

agg <- aggregate(ins[4:5], by = list(age = ins$Age), sd)
pie(agg$Claims, labels = agg$age)
agg


    age   Holders    Claims
1   <25  80.41797  16.55181
2 25-29 141.11414  22.63184
3 30-35 177.34353  24.23694
4   >35 941.66603 103.52228

The pie chart shows the grouped statistic, here the standard deviation of Claims, for each age group.
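For the kind of comparison described above, it can help to put the mean next to the standard deviation for each group. A sketch using the formula interface of aggregate, assuming MASS is installed; the names group_mean, group_sd, and comparison are chosen here for illustration.

```r
library(MASS)

# Mean and standard deviation of Holders and Claims within each age group
group_mean <- aggregate(cbind(Holders, Claims) ~ Age, data = Insurance, FUN = mean)
group_sd   <- aggregate(cbind(Holders, Claims) ~ Age, data = Insurance, FUN = sd)

# Both results list the age groups in the same order, so the columns
# can be placed side by side directly
comparison <- data.frame(
  Age          = group_mean$Age,
  Holders_mean = group_mean$Holders,
  Holders_sd   = group_sd$Holders,
  Claims_mean  = group_mean$Claims,
  Claims_sd    = group_sd$Claims
)
print(comparison)
```

The formula form names the grouping column after the variable itself (Age), so no by = list(...) wrapper is needed.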


Skewness and Kurtosis:

To calculate skewness and kurtosis we can implement a function stat ourselves, as follows:

stat <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]  # optionally drop missing values
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3  # excess kurtosis: 0 for a normal distribution
  return(c(n = round(n), mean = m, stdev = s, skew = skew, kurtosis = kurt))
}
sapply(ins[4:5], stat)


            Holders    Claims
n         64.000000 64.000000
mean     364.984375 49.234375
stdev    622.770601 71.162399
skew       3.127833  2.877292
kurtosis  10.999610  9.377258

We can see that the skewness of both Holders and Claims is greater than 0, so both variables are positively skewed: most of the data sit on the left, with a long tail to the right. The kurtosis values, which stat reports as excess kurtosis (after subtracting 3), are also very high. This suggests that both variables contain outliers.
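As a sanity check on how the sign of skewness should be read, the sketch below applies the same skewness formula that stat uses to three made-up vectors. The data and the helper name skewness are assumptions for illustration only.

```r
# Same formula as the skew line inside stat
skewness <- function(x) {
  m <- mean(x)
  s <- sd(x)
  sum((x - m)^3 / s^3) / length(x)
}

symmetric    <- c(1, 2, 3, 4, 5)       # mirror image around its mean
right_tailed <- c(1, 1, 2, 2, 3, 20)   # one large value stretches the right tail
left_tailed  <- -right_tailed          # flipped, so the tail is on the left

print(skewness(symmetric))     # zero, up to floating-point rounding
print(skewness(right_tailed))  # positive
print(skewness(left_tailed))   # negative
```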

Similarly, we can use a box plot to observe this. Box plots have already been introduced, so they are not repeated here.
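The points that a box plot would draw as separate outliers can also be listed without plotting. A sketch using boxplot.stats from base R, assuming MASS is installed; by default it flags values lying more than 1.5 times the box length beyond the hinges. The variable names are chosen for illustration.

```r
library(MASS)

# boxplot.stats returns the five-number summary in $stats and the points
# lying beyond the whiskers in $out
holder_stats <- boxplot.stats(Insurance$Holders)
claim_stats  <- boxplot.stats(Insurance$Claims)

print(holder_stats$out)  # Holders values flagged as outliers
print(claim_stats$out)   # Claims values flagged as outliers
```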

Copyright notice: This is an original article by the blog author and may not be reproduced without permission.
