Statistical Analysis with R: Exploratory Data Analysis

Source: Internet
Author: User
Tags benchmark

Statistical analysis of data is divided into descriptive statistical analysis and statistical inference. The former, also known as exploratory statistical analysis, explores the main distributional characteristics of the data by drawing statistical graphs, compiling statistical tables, and computing statistics, and so reveals the patterns present in the data. Exploratory data analysis is the basis for later statistical inference.
This article focuses on the numerical exploration of data sets. The DAAG package ships a built-in data set, possum, which records 14 variables (such as age, tail length, and total length) for 104 possums from seven regions ranging from southern Victoria to Queensland. The analysis below uses this data set.

# Data overview
library(DAAG)
data(possum)
nrow(possum)        # number of rows
ncol(possum)        # number of columns
dim(possum)         # dimensions (rows, columns)
head(possum)        # first few rows of the data set
attributes(possum)  # list of the data set's attributes

str(possum)         # number of samples, number of variables, type and values of each variable

summary(possum)     # summary of each variable in the data set

# Variable details
library(Hmisc)
describe(possum[, 1:3])


Note: For each variable, the output gives the number of non-missing samples (n), the number of missing samples (missing), and the number of distinct values (unique), and lists the values together with their counts and proportions. Note that for the case variable, the output gives the 5 lowest and 5 highest values; when the distribution is skewed, these are the values most likely to be outliers.
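The quantities listed in the note can also be computed with base R, which may help when checking what each part of the output means. The following is a minimal sketch on a small made-up vector; the vector `x` and the helper name `describe_basic` are illustrative assumptions and are not part of Hmisc.

```r
# Hypothetical helper that mimics the main counts reported by describe();
# an illustration only, not the Hmisc implementation.
describe_basic <- function(x, k = 5) {
  obs <- x[!is.na(x)]                    # non-missing values
  sorted <- sort(obs)
  list(
    n       = length(obs),               # number of non-missing samples
    missing = sum(is.na(x)),             # number of missing samples
    unique  = length(unique(obs)),       # number of distinct values
    lowest  = head(sorted, k),           # k smallest values
    highest = tail(sorted, k),           # k largest values
    freq    = table(obs),                # count of each value
    prop    = prop.table(table(obs))     # proportion of each value
  )
}

# Made-up data with one missing value and one suspiciously large value
x <- c(3, 1, 4, 1, 5, 9, 2, 6, NA, 40)
res <- describe_basic(x, k = 3)
print(res$n)
print(res$missing)
print(res$highest)
```

In this made-up sample the value 40 appears among the highest values, which is how an outlier candidate would show up in the lowest/highest listing.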

library(fBasics)  # package for time-series statistics; also usable for general data sets
basicStats(possum$case)


Note: The output includes the number of samples (nobs), the number of missing values (NAs), the minimum and maximum, the first and third quartiles, the mean and median, the sum of the variable's values (Sum), the standard error of the mean (SE Mean), the lower and upper 95% confidence limits of the mean (LCL Mean, UCL Mean), the variance, the standard deviation (Stdev), and two distribution measures, skewness and kurtosis.
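To make the standard error of the mean and the 95% confidence limits concrete, here is a minimal base R sketch on a made-up sample. It assumes a t-based interval for the mean; whether this matches the package's internal computation exactly is an assumption, and the variable names are illustrative.

```r
# Standard error of the mean and a t-based 95% confidence interval,
# computed by hand on a made-up sample (assumed convention: Student t).
x <- c(12, 15, 11, 14, 13, 16, 10, 13)
n <- length(x)
se_mean <- sd(x) / sqrt(n)            # standard error of the mean
t_crit  <- qt(0.975, df = n - 1)      # two-sided 95% critical value
lcl <- mean(x) - t_crit * se_mean     # lower confidence limit
ucl <- mean(x) + t_crit * se_mean     # upper confidence limit
print(c(mean = mean(x), se = se_mean, lcl = lcl, ucl = ucl))
```

The interval is symmetric around the sample mean and narrows as the sample size grows, because the standard error shrinks with the square root of n.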

# Distribution measures (here we mainly introduce two important ones, skewness and kurtosis; common probability distributions are usually shown with histograms and other visual means)
library(timeDate)
skewness(possum[, 6:7])  # skewness of these two columns

kurtosis(possum[, 6:7])  # kurtosis of these two columns

Note: Skewness measures the asymmetry of the data distribution, that is, on which side the data pile up, with the normal distribution as the benchmark. When the data follow a normal distribution, the skewness is 0. When it lies in [-1, 1], the distribution is fairly symmetric. When its absolute value is greater than 1, the data are considered markedly skewed: a positive value indicates right skew and a negative value indicates left skew.
Kurtosis measures how peaked the data distribution is, again with the normal distribution as the benchmark. A value of 0 means the peak matches that of the normal distribution (standard kurtosis). When the kurtosis is greater than 0, the distribution is steeper than the normal distribution (a sharp peak). When the kurtosis is less than 0, the distribution is flatter than the normal distribution (a flat peak).
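The two measures can be sketched directly from their moment definitions. The helper names `skew_moment` and `kurt_excess` are illustrative assumptions; packages apply slightly different sample-size corrections, so their results may differ a little from this sketch.

```r
# Moment-based skewness and excess kurtosis (both are 0 for a normal
# distribution). Hypothetical helpers for illustration only.
skew_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)                 # deviations from the mean
  mean(d^3) / mean(d^2)^1.5        # third moment scaled by sd^3
}
kurt_excess <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3      # fourth moment scaled by sd^4, minus 3
}

sym   <- c(1, 2, 3, 4, 5)          # symmetric, evenly spread sample
right <- c(1, 1, 2, 2, 3, 20)      # sample with a long right tail
print(skew_moment(sym))
print(skew_moment(right))
print(kurt_excess(sym))
```

The symmetric sample has zero skewness and negative excess kurtosis (flatter than normal), while the sample with one large value has skewness above 1, which the rule of thumb above would call markedly right-skewed.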

# Missing values
library(mice)
md.pattern(possum)  # show the pattern of missing values in the data set


Note: In the leftmost column, 101 is the number of samples with no missing values, 2 is the number of samples in which age is missing, and 1 is the number of samples in which footlgth is missing. The bottom row gives the number of missing values for each variable, and its last entry, 3, is the total number of missing values. The rightmost column shows how many variables are missing in the corresponding row.
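The same counts can be cross-checked with base R. The sketch below uses a small made-up data frame (the values are invented for illustration and are not taken from possum) to show which base functions correspond to the bottom row, the rightmost column, and the complete-sample count.

```r
# Made-up data frame with gaps in two of its three columns
df <- data.frame(
  age      = c(8, NA, 3, 5, NA, 2),
  footlgth = c(74.5, 72.5, NA, 70.0, 68.1, 71.2),
  taill    = c(36, 36.5, 39, 38, 36, 35.5)
)

na_per_var <- colSums(is.na(df))       # missing values per variable
na_per_row <- rowSums(is.na(df))       # missing variables per sample
n_complete <- sum(complete.cases(df))  # samples with no missing values
total_na   <- sum(is.na(df))           # total number of missing values

print(na_per_var)
print(n_complete)
print(total_na)
```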

#相关性
cor(possum$case, possum$site)
vars <- 5:9
cor_matrix <- cor(possum[vars], use = "pairwise.complete.obs")  # pairwise correlation coefficients among the 5 variables
library(ellipse)  # visualize the correlation matrix
plotcorr(cor_matrix, col = rep(c("white", "black"), 5))


Note: The width of each ellipse reflects the strength of the correlation: the narrower the ellipse for a pair of variables, the stronger their correlation.
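Because the data contain missing values, the choice of the `use` argument matters. The following minimal sketch, on a made-up data frame with invented values, contrasts pairwise deletion with dropping every incomplete row.

```r
# Made-up data with one missing value in each of two columns
df <- data.frame(
  a = c(1, 2, 3, 4, 5, NA),
  b = c(2, 4, 6, 8, 10, 12),
  c = c(6, 5, NA, 3, 2.5, 1)
)

# Pairwise deletion: each pair of columns uses every row where both are present
cor_pair <- cor(df, use = "pairwise.complete.obs")
# Listwise deletion: rows with any missing value are dropped before computing
cor_comp <- cor(df, use = "complete.obs")

print(round(cor_pair, 3))
print(round(cor_comp, 3))
```

Pairwise deletion keeps more data for each coefficient, but each entry of the matrix may then be based on a different subset of rows.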
