R Language Data Analysis series six

Last Update:2015-09-09 Source: Internet

Author: User


R Language Data Analysis series six --by Comaple.zhang

The previous section introduced the R language. This section covers the first step in analyzing a data set once you have it: exploratory data analysis.

Summary statistics are the indicators of a data set that statistics is most concerned with. The commonly used ones are: minimum, maximum, quartiles, mean, median, mode, variance, standard deviation, range, skewness, and kurtosis.

First, let us explain what these quantities mean. The simple ones need no explanation, so we focus on the less common ones.

Mode: The value that occurs most often.

Variance: The average of the squared differences between each sample value and the mean. (R's var function divides by n - 1 rather than n.)

Standard deviation: Also known as the root mean square deviation, it is the square root of the variance. It measures how tightly the data cluster around the mean.

Range: The difference between the maximum and minimum values.

Skewness: Measured relative to the normal distribution. If the peak is on the left and the long tail is on the right, the distribution is right-skewed (positively skewed) and the skewness value is > 0. The opposite case is left-skewed (negatively skewed), with a skewness value < 0.

Kurtosis: Also measured relative to the normal distribution, whose kurtosis is 3. A kurtosis > 3 indicates heavier tails than the normal distribution (thick-tailed). A kurtosis < 3 indicates lighter tails (thin-tailed).
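The definitions above can be checked on a small vector. The following is a minimal sketch, assuming a made-up sample x and a hypothetical helper stat_mode (base R has no function for the statistical mode; its mode() returns the storage type instead).

```r
# Hypothetical sample vector, used only for illustration.
x <- c(2, 4, 4, 5, 7, 9, 4, 30)

# Mode: the most frequent value, found by counting occurrences with table().
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

x_mode  <- stat_mode(x)       # most frequent value
x_range <- max(x) - min(x)    # range = maximum - minimum
x_var   <- var(x)             # sample variance (divides by n - 1)
x_sd    <- sd(x)              # standard deviation = sqrt(variance)

# Variance "by hand", to make the definition explicit.
n <- length(x)
x_var_manual <- sum((x - mean(x))^2) / (n - 1)

print(c(mode = x_mode, range = x_range, var = x_var, sd = x_sd))
```

Note that stat_mode returns only the first mode when several values tie for the highest count.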

This section data set:

We use the Insurance data set from the MASS package, which contains the insurance data of one insurer. Its columns are:

"District" "Group" "Age" "Holders" "Claims"

In order, the columns represent: the district where the policyholder lives, the engine displacement group of the insured car, the age group of the policyholder, the number of policyholders, and the number of claims.

Install the package and load the data set:

install.packages('MASS')  # install the package
library(MASS)             # load the package
data(Insurance)           # load the data set
ins <- Insurance          # make a copy of the data

Exploratory data analysis

R's built-in summary function gives an overview of the data:

summary(ins)

 District    Group       Age        Holders            Claims
 1:16     <1l   :16   <25  :16   Min.   :   3.00   Min.   :  0.00
 2:16     1-1.5l:16   25-29:16   1st Qu.:  46.75   1st Qu.:  9.50
 3:16     1.5-2l:16   30-35:16   Median : 136.00   Median : 22.00
 4:16     >2l   :16   >35  :16   Mean   : 364.98   Mean   : 49.23
                                 3rd Qu.: 327.50   3rd Qu.: 55.50
                                 Max.   :3582.00   Max.   :400.00

For factor columns, summary gives the frequency of each level. For continuous variables it gives the minimum, first quartile, median, mean, third quartile, and maximum.

From the results we can see that the median of the Holders column is much smaller than its mean, which indicates that the data are skewed. Most of the values lie between 3 and 327.5, that is, between the minimum and the third quartile. We can look further with a scatter plot:
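The quartiles and the gap between mean and median can also be computed directly. This is a small sketch, assuming the MASS package is available; the variable names q, holders_mean, holders_median, and share_below_q3 are chosen here only for illustration.

```r
library(MASS)          # provides the Insurance data set
ins <- Insurance

# Quartiles of Holders: the 1st Qu., Median, and 3rd Qu. that summary() reports.
q <- quantile(ins$Holders, probs = c(0.25, 0.5, 0.75))
print(q)

# A mean well above the median suggests a long right tail.
holders_mean   <- mean(ins$Holders)
holders_median <- median(ins$Holders)
print(c(mean = holders_mean, median = holders_median))

# Share of observations at or below the third quartile.
share_below_q3 <- mean(ins$Holders <= q[["75%"]])
print(share_below_q3)
```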

plot(ins$Holders)

The scatter plot may not be very intuitive. To see the distribution of the data visually, we can use a histogram with a normal curve drawn over it:

library(RColorBrewer)  # provides brewer.pal
col <- brewer.pal(9, 'YlOrRd')
h <- hist(ins$Holders, breaks = 12, col = col)
xfit <- seq(min(ins$Holders), max(ins$Holders), length = 40)
yfit <- dnorm(xfit, mean = mean(ins$Holders), sd = sd(ins$Holders))
yfit <- yfit * diff(h$mids[1:2]) * length(ins$Holders)  # scale density to counts
lines(xfit, yfit, col = 'red', lwd = 2)

Variance and Standard deviation

To calculate the variance and standard deviation of the Holders column:

var(ins$Holders)
sd(ins$Holders)

In fact, the variance and standard deviation of a single variable are not very informative on their own. Only by comparison can we see the similarities and differences between data sets.

Suppose we want to compute the statistics after grouping the policyholders by age group. The aggregate function provides a convenient way to do this:

agg <- aggregate(ins[4:5], by = list(age = ins$Age), sd)
pie(agg$Claims, labels = agg$age)
agg

    age   Holders    Claims
1   <25  80.41797  16.55181
2 25-29 141.11414  22.63184
3 30-35 177.34353  24.23694
4   >35 941.66603 103.52228

These are the standard deviations of Holders and Claims within each age group.
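The same grouping can also be written with the formula interface of aggregate, which makes it easy to swap in another statistic. This is a sketch, assuming the MASS package is available; agg_sd and agg_mean are names chosen here for illustration.

```r
library(MASS)          # provides the Insurance data set
ins <- Insurance

# Formula interface: standard deviation of Holders and Claims within each Age group.
agg_sd <- aggregate(cbind(Holders, Claims) ~ Age, data = ins, FUN = sd)
print(agg_sd)

# Same grouping with the mean, to compare the groups on level as well as spread.
agg_mean <- aggregate(cbind(Holders, Claims) ~ Age, data = ins, FUN = mean)
print(agg_mean)
```

Putting the two results side by side is one way to make the comparison between groups that a single standard deviation cannot provide.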


Skewness and Kurtosis:

To calculate skewness and kurtosis, we can implement a function stat ourselves, as follows:

stat <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]          # optionally drop missing values
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3    # excess kurtosis: normal distribution gives 0
  return(c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt))
}
sapply(ins[4:5], stat)

            Holders    Claims
n         64.000000 64.000000
mean     364.984375 49.234375
stdev    622.770601 71.162399
skew       3.127833  2.877292
kurtosis  10.999610  9.377258

We can see that the skewness of both Holders and Claims is greater than 0, which means both variables are positively skewed: the bulk of the data sits on the left, with a long tail to the right. The kurtosis values are also very high. Because stat subtracts 3, these are excess kurtosis values relative to the normal distribution, and they suggest heavy tails and outliers in both variables.
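To get a feel for the sign and size of these values, the same formulas can be applied to simulated data. This is a sketch under stated assumptions: the seed, the sample sizes, and the helper names skewness_of and excess_kurtosis_of are chosen here for illustration and are not part of the original analysis.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Same formulas as in stat: skewness and excess kurtosis (kurtosis minus 3).
skewness_of <- function(x) {
  m <- mean(x); s <- sd(x)
  sum((x - m)^3 / s^3) / length(x)
}
excess_kurtosis_of <- function(x) {
  m <- mean(x); s <- sd(x)
  sum((x - m)^4 / s^4) / length(x) - 3
}

normal_sample <- rnorm(10000)  # symmetric, bell-shaped
skewed_sample <- rexp(10000)   # peak on the left, long right tail

result <- rbind(
  normal      = c(skew = skewness_of(normal_sample),
                  kurtosis = excess_kurtosis_of(normal_sample)),
  exponential = c(skew = skewness_of(skewed_sample),
                  kurtosis = excess_kurtosis_of(skewed_sample))
)
print(result)
```

The normal sample should give values close to 0 for both measures, while the exponential sample should give clearly positive values, the same pattern seen for Holders and Claims.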

Similarly, we can use a box plot to observe this. Box plots were introduced in an earlier section, so we will not repeat the details here.
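As a brief reminder, a box plot check might look like the sketch below, assuming the MASS package is available. boxplot.stats lists the points that a box plot would draw as outliers (by default, those more than 1.5 times the box length beyond the box); holders_outliers is a name chosen here for illustration.

```r
library(MASS)          # provides the Insurance data set
ins <- Insurance

# Values of Holders that a box plot would mark as outliers.
holders_outliers <- boxplot.stats(ins$Holders)$out
print(holders_outliers)

# One box per age group, to compare the distributions visually.
boxplot(Holders ~ Age, data = ins, xlab = "Age", ylab = "Holders")
```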


This article is an English version of an article originally published in Chinese on aliyun.com.
