"Data Analysis R language Combat" study notes the 11th chapter Correspondence analysis

Source: Internet
Author: User

11.2 Correspondence Analysis

In many cases we care not only about the row variables or the column variables themselves but also about the relationship between them, which factor analysis cannot address. In 1970 the French statistician Jean-Paul Benzécri proposed correspondence analysis, also called association analysis or R-Q type factor analysis, a multivariate technique for studying interdependent variables. It analyzes a cross-tabulation (contingency table) of qualitative variables to reveal the differences among the categories of one variable and the correspondence between the categories of different variables, which makes it well suited to analyzing questionnaire data.

Correspondence analysis is a visual method of data analysis. Its basic idea is to represent the proportional structure of the rows and columns of a contingency table as points in a low-dimensional space. Its advantage is that several groups of data with no obvious connection can be displayed together in an easily read map, which is intuitive, simple and convenient. It is therefore widely used in market segmentation, product positioning, geological research, computer engineering and other fields.

11.2.1 Theoretical basis

Correspondence analysis is a low-dimensional graphical method for finding the links between samples (rows) and variables (columns). The key is a data transformation that turns the original data matrix X, containing n sample observations on m variables, into another matrix Z. Z serves as a transition matrix for the subsequent calculations: it is through Z that the samples and the variables are tied together.
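The notes do not spell out the transformation, so the following is only a minimal sketch under an assumption: it uses the common definition z_ij = (x_ij - x_i. * x_.j / T) / sqrt(x_i. * x_.j), where x_i. and x_.j are the row and column totals and T is the grand total. The data are the small table from the example in section 11.2.2, and all variable names are illustrative.

```r
# Contingency table from the example in section 11.2.2
x <- matrix(c(47, 22, 10,    # column A
              31, 32, 11,    # column B
               2, 21, 25,    # column C
               1, 10, 20),   # column D
            nrow = 3,
            dimnames = list(c("Pure-Chinese", "Semi-Chinese", "Pure-English"),
                            c("A", "B", "C", "D")))

total   <- sum(x)        # grand total T
row_sum <- rowSums(x)    # row totals x_i.
col_sum <- colSums(x)    # column totals x_.j

# Assumed transformation: z_ij = (x_ij - x_i. * x_.j / T) / sqrt(x_i. * x_.j)
rc <- outer(row_sum, col_sum)
z  <- (x - rc / total) / sqrt(rc)

# Singular values of Z; their squares are the principal inertias
sv <- svd(z)$d
round(sv, 4)

# The squared entries of Z add up to chi-square / T (the total inertia)
sum(z^2)
```

If the assumed form of Z is the one intended, the first two singular values should agree with the canonical correlations that corresp() reports in section 11.2.2, and the third should be numerically zero.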

11.2.2 R Language Implementation

The MASS package in R provides two functions: corresp() performs simple correspondence analysis and mca() performs multiple correspondence analysis. The former is used more often, and its call format is

corresp(x, nf = 1, ...)

where x is the data matrix and nf is the number of factors to compute, usually 2.

Example

> # Rows: ability with Chinese characters; columns: math grades A to D
> ch <- data.frame(A = c(47, 22, 10), B = c(31, 32, 11), C = c(2, 21, 25), D = c(1, 10, 20))
> rownames(ch) <- c("Pure-Chinese", "Semi-Chinese", "Pure-English")
> library(MASS)
> ch.ca <- corresp(ch, nf = 2)
> options(digits = 4)
> ch.ca

First canonical correlation(s): 0.5521 0.1409

 Row scores:
                 [,1]    [,2]
Pure-Chinese   1.2069  0.6383
Semi-Chinese  -0.1368 -1.3079
Pure-English  -1.3051  0.9010

 Column scores:
     [,1]    [,2]
A  0.9325  0.9196
B  0.4573 -1.1655
C -1.2486 -0.5417
D -1.5346  1.2773

The output gives the scores (loadings) of the row and column categories on the two factors. Correspondence analysis is a visual multivariate method whose conclusions are drawn mainly from the plot. In R, the function biplot() draws the scatter plot of these scores, showing the relationship between the samples and the levels of each variable at a glance.
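A minimal sketch of the plotting step, repeating the data so that it runs on its own. The dotted reference lines added with abline() are an optional extra, not something the notes prescribe.

```r
library(MASS)

# Same table as in the example above
ch <- data.frame(A = c(47, 22, 10), B = c(31, 32, 11),
                 C = c(2, 21, 25), D = c(1, 10, 20))
rownames(ch) <- c("Pure-Chinese", "Semi-Chinese", "Pure-English")

ch.ca <- corresp(ch, nf = 2)

# Draw row and column categories in the same two-factor plane
biplot(ch.ca)

# Optional: dotted reference lines through the origin
abline(h = 0, v = 0, lty = 3)
```

Note that biplot() needs at least two factors, which is one reason to call corresp() with nf = 2 rather than the default nf = 1.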

When reading the plot, we mainly look at the horizontal distance between the two kinds of points; the vertical distance matters much less. The point "Pure-Chinese" lies closest to math grade A, suggesting that students who are good at math can read and write Chinese characters freely. "Pure-English" is very close to math grade D, suggesting that students who are poor at math know only English. "Semi-Chinese" lies between math grades B and C, suggesting that students who know some Chinese characters have average math results.

The functions in the MASS package are still limited, so R users have developed packages dedicated to correspondence analysis, such as the ca package, which computes and visualizes simple, multiple and joint correspondence analysis.

Correspondence analysis is widely used in market research, often together with questionnaires, and it is an important statistical technique in product positioning and market segmentation. In marketing, a company often needs to define its product positioning: what kind of consumers use its products, and which brands are more popular among different types of consumers? When the amount of data is small, a contingency table is enough to analyze how brand choice differs across consumer types. Contingency tables have a limitation, however: when there are many variables and each variable has many categories, the table becomes very large and the underlying relationships are hard to see by eye. Correspondence analysis is an effective solution in that situation.

> # Rows: brands A to F; columns: consumer income level
> brand <- data.frame(Low = c(2, 49, 4, 4, 15, 1), Medium = c(7, 7, 5, 49, 2, 7), High = c(16, 3, 23, 5, 5, 14))
> rownames(brand) <- c("A", "B", "C", "D", "E", "F")
> library(ca)
> options(digits = 3)
> brand.ca <- ca(brand)
> brand.ca

 Principal inertias (eigenvalues):
           1        2
Value      0.530966 0.343042
Percentage 60.75%   39.25%

 Rows:
              A      B      C      D     E       F
Mass     0.1147  0.271  0.147  0.266 0.101  0.1009
ChiDist  0.7704  1.026  0.906  1.029 0.738  0.7939
Inertia  0.0681  0.285  0.120  0.282 0.055  0.0636
Dim. 1  -0.7267  1.399 -0.581 -0.850 0.988 -0.8296
Dim. 2   0.9553 -0.200  1.368 -1.403 0.281  0.8786

 Columns:
            Low Medium   High
Mass     0.3440  0.353  0.303
ChiDist  1.0058  0.861  0.934
Inertia  0.3480  0.262  0.264
Dim. 1   1.3792 -0.778 -0.659
Dim. 2  -0.0663 -1.107  1.367

The output of ca() contains more information. Mass is the marginal proportion of each row or column; ChiDist is the chi-square distance of each row or column profile from the average profile; Inertia is the part of the total inertia contributed by each row or column, while the principal inertias at the top of the output are the eigenvalues (characteristic roots); Dim. 1 and Dim. 2 are the coordinates (loadings) of the row and column categories on the two extracted factors. The components of the object returned by ca() can be listed with names().

> names(brand.ca)
 [1] "sv"         "nd"         "rownames"   "rowmass"
 [5] "rowdist"    "rowinertia" "rowcoord"   "rowsup"
 [9] "colnames"   "colmass"    "coldist"    "colinertia"
[13] "colcoord"   "colsup"     "call"
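The Mass, ChiDist and Inertia rows of the ca() output can be reproduced by hand, which helps to see what they measure. The sketch below uses only base R and the standard definitions from correspondence analysis (mass = marginal proportion, inertia = mass times squared chi-square distance); the variable names are illustrative and are not part of the ca package.

```r
# Brand-by-income table from the example above
brand <- matrix(c( 2, 49,  4,  4, 15,  1,    # Low
                   7,  7,  5, 49,  2,  7,    # Medium
                  16,  3, 23,  5,  5, 14),   # High
                nrow = 6,
                dimnames = list(LETTERS[1:6], c("Low", "Medium", "High")))

n <- sum(brand)
p <- brand / n                 # correspondence matrix (relative frequencies)
row_mass <- rowSums(p)         # "Mass" of each brand
col_mass <- colSums(p)         # column masses = average row profile

# Row profiles: each brand's distribution over the income levels
row_profile <- p / row_mass

# Chi-square distance of each row profile from the average profile
chi_dist <- sqrt(colSums((t(row_profile) - col_mass)^2 / col_mass))

# Row inertia = mass * squared chi-square distance
row_inertia <- row_mass * chi_dist^2

# Total inertia computed two ways: sum of row inertias, and chi-square / n
total_inertia <- sum(row_inertia)
chi2_over_n   <- unname(chisq.test(brand)$statistic) / n

round(rbind(Mass = row_mass, ChiDist = chi_dist, Inertia = row_inertia), 4)
c(total_inertia = total_inertia, chi2_over_n = chi2_over_n)
```

The two totals should coincide and should also equal the sum of the two principal inertias printed by ca(), which is why the chi-square statistic of the table and the inertia are so closely related.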

For example, the following statement returns the standard coordinates of the rows on the two factors:

> brand.ca$rowcoord
    Dim1   Dim2
A -0.727  0.955
B  1.399 -0.200
C -0.581  1.368
D -0.850 -1.403
E  0.988  0.281
F -0.830  0.879
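Standard coordinates are not the only scaling. According to the ca documentation, the default symmetric map drawn by plot() uses principal coordinates, which are the standard coordinates multiplied by the singular values stored in the sv component. The following is a sketch of that rescaling, assuming the ca package is installed; row_principal and col_principal are illustrative names.

```r
library(ca)

brand <- data.frame(Low    = c(2, 49, 4, 4, 15, 1),
                    Medium = c(7, 7, 5, 49, 2, 7),
                    High   = c(16, 3, 23, 5, 5, 14))
rownames(brand) <- c("A", "B", "C", "D", "E", "F")
brand.ca <- ca(brand)

# Principal coordinates = standard coordinates scaled by the singular values
row_principal <- sweep(brand.ca$rowcoord, 2, brand.ca$sv, "*")
col_principal <- sweep(brand.ca$colcoord, 2, brand.ca$sv, "*")

# The mass-weighted sum of squares per dimension recovers the principal inertias
colSums(brand.ca$rowmass * row_principal^2)
brand.ca$sv^2
```

The last two lines should print the same pair of values, matching the "Value" row of the principal inertias in the ca() output.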

The plot() function draws the correspondence analysis scatter plot:

> plot(brand.ca)

The scatter plot is built from the factor coordinates of the brand categories and the income categories. It shows that low-income consumers tend to choose brands B and E, middle-income consumers tend to choose brand D, and high-income consumers tend to choose brands A, C and F, which gives the enterprise an initial market positioning.
