Exploratory analysis of important R data sets: organizing the analysis and clarifying the concepts

Source: Internet
Author: User

1. Analyzing important data sets in R: which concepts need to be clarified?

The previous section covered plotting in R. This section covers what to do when you first get a data set: the first step of the analysis, exploratory data analysis.

Statistics are the indicators of a data set that statistical analysis focuses on. The commonly used ones are: minimum, maximum, quartiles, mean, median, mode, variance, standard deviation, range, skewness, and kurtosis.

First, the meaning of these statistics. The simple ones are skipped; only the less common ones are explained here.

Mode: the value that occurs most often

Variance: the average of the squared differences between each sample value and the mean

Standard deviation: the square root of the variance (also called the root-mean-square deviation), used to measure the dispersion of a data set

Range: the difference between the maximum and minimum values

Skewness: measured relative to the symmetric normal distribution. If the peak is on the left and the long tail is on the right, the distribution is right-skewed (positively skewed) and the skewness is > 0. In the opposite case the distribution is left-skewed (negatively skewed) and the skewness is < 0.

Kurtosis: also measured relative to the normal distribution, whose kurtosis is 3. A kurtosis > 3 indicates heavier tails than the normal distribution (thick-tailed, usually with a sharper peak); a kurtosis < 3 indicates lighter tails (thin-tailed, usually with a flatter peak).
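The less common statistics above can be computed in base R. The following is a minimal sketch, not the article's own code: the sample vector `x` and the helper `stat_mode` are hypothetical names introduced for illustration, and skewness and kurtosis are computed with the simple moment formulas (dividing by n), which is one of several conventions in use.

```r
# Small illustrative sample (hypothetical values, not from the article)
x <- c(2, 3, 3, 4, 5, 5, 5, 9)

# Mode: the most frequent value(s). Base R's mode() returns the storage
# type, not the statistical mode, so a small helper is needed.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

m <- mean(x)

sample_var  <- var(x)            # sample variance, divides by n - 1
sample_sd   <- sd(x)             # square root of var(x)
value_range <- max(x) - min(x)   # range as a single number

# Central moments (dividing by n)
m2 <- mean((x - m)^2)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)

skewness <- m3 / m2^1.5          # > 0 means right-skewed
kurtosis <- m4 / m2^2            # about 3 for a normal distribution

print(stat_mode(x))
print(c(variance = sample_var, sd = sample_sd, range = value_range))
print(c(skewness = skewness, kurtosis = kurtosis))
```

Note that `var()` and `sd()` divide by n - 1 (the sample versions), while the moment formulas here divide by n, so the two are not directly interchangeable on small samples.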

Data set for this section:

We use the Insurance data set from the MASS package, which contains insurance data from an insurer.

"District" "Group" "Age" "Holders" "Claims"

Column meanings: district of the policyholder's residence, engine displacement group of the insured car, age group of the insured, number of policyholders, number of claims

To install the package and load the data set:

install.packages("MASS") # install the package

library(MASS) # load the package

data(Insurance) # load the data set

INS <- Insurance # make a copy of the data
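Once the data set is loaded, a first exploratory pass might look like the sketch below. It assumes the MASS package is available (it ships with standard R installations) and uses only base R functions; `claim_rate` is a hypothetical name introduced here, not something the article defines.

```r
library(MASS)          # provides the Insurance data set
data(Insurance)
INS <- Insurance       # work on a copy

str(INS)               # column names and types
print(dim(INS))        # number of rows and columns

# Minimum, quartiles, median, mean, and maximum of the claim counts
print(summary(INS$Claims))
print(quantile(INS$Claims))

# Spread of the claim counts
print(c(variance = var(INS$Claims),
        sd       = sd(INS$Claims),
        range    = max(INS$Claims) - min(INS$Claims)))

# Overall claims per policyholder across all rows
claim_rate <- sum(INS$Claims) / sum(INS$Holders)
print(claim_rate)
```

`summary()` covers most of the common statistics in one call, which is why it is usually the first thing to run on a new column.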

2. Statistical analysis techniques in R: what are the types and techniques of principal component analysis?

    • What is principal component analysis

Principal component analysis (PCA) is a multivariate statistical method that condenses many indicators into a few composite indicators. It uses dimensionality reduction to convert multiple variables into a small number of principal components. These components retain most of the information in the original variables and are usually expressed as linear combinations of them. With principal component analysis, a large amount of data can be used effectively for quantitative analysis, revealing the intrinsic relationships between variables.

    • How to interpret principal component analysis
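To make "linear combinations of the original variables" concrete, the sketch below uses `prcomp()` from base R's stats package. The built-in `USArrests` data set is an assumption chosen purely as a stand-in with several numeric columns; the article itself does not use it.

```r
# USArrests is a built-in R data set, used here only as a stand-in example
X   <- scale(USArrests)                  # center and scale each variable
pca <- prcomp(USArrests, scale. = TRUE)  # PCA on the standardized variables

# Each column of the rotation matrix holds the coefficients of one
# principal component in terms of the original (standardized) variables
loadings <- pca$rotation
print(round(loadings, 3))

# Component scores are the linear combinations X %*% loadings
scores_manual <- X %*% loadings

# Largest absolute difference from the scores prcomp() reports
max_diff <- max(abs(scores_manual - pca$x))
print(max_diff)
```

The difference printed at the end should be zero up to floating-point error, which confirms that each component is just a weighted sum of the standardized variables.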

(1) Principal component analysis recombines the original variables into new composite indicators, and the aim is to select as few principal components as possible. Take the first principal component Y1 as an example: if the variance of Y1 is the largest among all linear combinations, it contains the most information. If the first principal component is not sufficient to represent all the information, we consider a second principal component Y2 and require that the information already in Y1 does not appear again in Y2, i.e. the two principal components are uncorrelated.

(2) The criterion for choosing principal components is to find the linear functions of x whose variances are as large as possible, and to keep enough components that their cumulative share of the total variance exceeds 80%.
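The two points above (decreasing variance with uncorrelated components, and the 80% cumulative-variance rule) can be checked numerically. This is a minimal sketch under the same assumption as before: the built-in `USArrests` data set stands in for real data, and the names `prop_var`, `cum_var`, and `k` are introduced here for illustration only.

```r
pca <- prcomp(USArrests, scale. = TRUE)

# Variance of each principal component, in decreasing order
variances <- pca$sdev^2

# Share of total variance explained by each component, and the running total
prop_var <- variances / sum(variances)
cum_var  <- cumsum(prop_var)
print(round(rbind(prop_var, cum_var), 3))

# Smallest number of components whose cumulative share reaches 80%
k <- which(cum_var >= 0.80)[1]
print(k)

# Component scores are mutually uncorrelated: off-diagonal entries are ~0
score_cor <- cor(pca$x)
print(round(score_cor, 3))
```

The 80% threshold is the rule of thumb stated in the text; other cutoffs are also used in practice, and `summary(pca)` prints the same proportions in a ready-made table.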

Note: the derivation of the principal components is fairly involved. This article focuses on the implementation in R; readers interested in the derivation can ask for it by private message.

    • The process of principal component analysis
      • Multivariate Statistical Analysis and R Language Modeling (fourth edition)

This book was revised on the basis of "Management Operations Research (fourth edition)", published by our press in 2010, and of "Multivariate Statistical Analysis and R Language Modeling" (third edition). It systematically presents the basic theories and methods of multivariate statistical analysis and pairs them with hands-on analysis in R. Readers with basic statistical knowledge can follow it, and its aim is to combine theory with practical application by introducing the fundamental knowledge, basic theory, and software application of multivariate statistical analysis.

The main contents are: an overview of multivariate statistical analysis, mathematical representation of multivariate data, graphical display of multivariate data, multiple linear correlation and regression analysis, cluster analysis, discriminant analysis, principal component analysis, factor analysis, correspondence analysis, canonical correlation analysis, and comprehensive evaluation methods. All data are analyzed using R.

Some basic theorems are given with the necessary, concise mathematical derivations. The book also pays attention to the diversity of data analysis methods: for each method it covers the background, the procedure, the calculation steps, application techniques, and connections to other methods in some detail, including some recent developments. It provides instructive cases and exercises, and the appendix at the end gives a good deal of supplementary material.
