Analysis of categorical variables

Source: Internet
Author: User

The values of categorical variables are usually qualitative and descriptive. Categorical variables can be classified as ordered or unordered.

Unordered categorical variables are either binary, such as gender (male, female), or multi-category, such as blood type (O, A, B, AB).

Ordered categorical variables usually have three or more categories, and the categories differ in degree, so they can be ranked and compared.

Categorical variables sit at a relatively low level of measurement and carry limited information. Variable transformations therefore usually go from a higher level to a lower one (for example, grouping a continuous variable into categories), and rarely from a lower level to a higher one.

==================================================

The analysis of categorical variables mainly addresses the following points:

1. consistency between the different attributes (categories) of the same variable

2. independence and association between the attributes of multiple variables

3. strength of association between the attributes of multiple variables

=================================================

1. Consistency check between the categories of a single variable

A table of counts for the categories of one categorical variable can be called a one-way multinomial table. For example, a brand variable with three categories, A, B, and C, has one observed count per category.

By constructing a chi-square statistic, we can test whether the categories in a one-way multinomial table are consistent with one another, that is, whether their counts match a hypothesized distribution such as equal proportions.
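As a minimal sketch of the test described above, the Pearson chi-square statistic for a one-way table can be computed by hand. The brand counts, the function name `chi_square_gof`, and the equal-proportions null hypothesis are assumptions made for illustration; they do not come from the article.

```python
def chi_square_gof(observed, expected_probs=None):
    """Return (chi-square statistic, degrees of freedom) for a one-way table."""
    n = sum(observed)
    k = len(observed)
    if expected_probs is None:
        # Assumed null hypothesis: all k categories are equally likely.
        expected_probs = [1.0 / k] * k
    expected = [n * p for p in expected_probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, k - 1


# Hypothetical counts for brands A, B, C (not taken from the article).
brand_counts = [60, 50, 40]
stat, df = chi_square_gof(brand_counts)

# 5.991 is the 0.05 critical value of the chi-square distribution with 2 df.
CRITICAL_05_DF2 = 5.991
reject_equal_shares = stat > CRITICAL_05_DF2
print(stat, df, reject_equal_shares)
```

With these made-up counts the statistic is about 4.0 on 2 degrees of freedom, which stays below the 0.05 critical value, so equal brand shares would not be rejected.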

The multinomial distribution is an extension of the binomial distribution. It describes the outcome of a multinomial experiment, which has the following properties:

1. The experiment consists of n identical trials.

2. The trials are independent.

3. The outcome of each trial falls into exactly one of k categories.

4. The quantities of interest are n1, n2, ..., nk, where ni is the number of trials falling into category i. Note that n1 + n2 + ... + nk = n.

5. The probability pi that a trial falls into category i stays the same from trial to trial, and p1 + p2 + ... + pk = 1.
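The properties listed above lead to the multinomial probability n! / (n1! n2! ... nk!) × p1^n1 × ... × pk^nk. The sketch below evaluates it directly; the function name `multinomial_pmf` and the example numbers are illustrative assumptions.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Probability of observing `counts` in n = sum(counts) independent trials."""
    n = sum(counts)
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)  # exact integer division at every step
    return coefficient * prod(p ** c for p, c in zip(probs, counts))


# Hypothetical example: 4 trials, 3 categories.
probability = multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2])

# With k = 2 the same function gives a binomial probability.
binomial_case = multinomial_pmf([1, 1], [0.5, 0.5])
print(probability, binomial_case)
```

Here the coefficient is 4! / (2! 1! 1!) = 12, so the probability is about 12 × 0.015 = 0.18.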

2. Independence and association test between the categories of multiple variables

The cross-classified categories of two or more categorical variables form a multi-way multinomial, and its frequency distribution table is called a contingency table (cross-tabulation), as opposed to the one-way table above.

A contingency table is mainly used to judge whether categorical variables are independent or associated, and the test is carried out by constructing a chi-square statistic.
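A minimal sketch of the chi-square test of independence follows, assuming the usual expected frequency (row total × column total / n) for each cell. The table values and the name `chi_square_independence` are hypothetical.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected frequencies) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


# Hypothetical 2x2 table of counts (rows: variable X, columns: variable Y).
observed = [[30, 20], [20, 30]]
stat, df, expected = chi_square_independence(observed)

# 3.841 is the 0.05 critical value of the chi-square distribution with 1 df.
reject_independence = stat > 3.841
print(stat, df, reject_independence)
```

Every expected count in this example is 25, the statistic is 4.0 on 1 degree of freedom, and independence would be rejected at the 0.05 level.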

There are a few issues to be aware of when applying the chi-square test to a contingency table:

1. Expected frequency in each cell

No cell in the table should have an expected frequency below 1, and there should not be many cells with an expected frequency below 5. If more than 20% of the cells in a contingency table have an expected frequency below 5, the chi-square test is generally not advisable.

2. Sample size

The chi-square value grows with the sample size, so the test is strongly affected by it: the same two variables can lead to different conclusions at different sample sizes. For example, if the count in every cell of a contingency table is multiplied by 10, the chi-square value is also multiplied by 10. The degrees of freedom and the significance level are unchanged, so the critical value stays the same, and the null hypothesis becomes more likely to be rejected. The Pearson chi-square value therefore needs to be adjusted to remove the effect of sample size, for example with the contingency coefficient or the phi coefficient.

3. How the variable values are grouped into categories

Different groupings of the variable values change the chi-square value and can lead to different conclusions. The grouping therefore should not be arbitrary; it needs a theoretical or statistical basis. In particular, interval or ordinal variables have to be grouped before chi-square analysis can be applied, and different grouping methods may yield different conclusions. Chi-square analysis of interval or ordinal variables also does not make full use of their quantitative information.
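The first two cautions above can be checked numerically. The sketch below applies the expected-frequency rule (no expected count below 1, at most 20% of cells below 5) and shows that multiplying every cell by 10 multiplies the chi-square value by 10. All function names and table values are hypothetical.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def chi_square_is_advisable(table):
    """Apply the rule of thumb on expected frequencies stated in the text."""
    cells = [e for row in expected_frequencies(table) for e in row]
    if min(cells) < 1:
        return False
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 <= 0.20


# Hypothetical tables: one with adequate counts, one that is too sparse.
base = [[12, 8], [9, 11]]
sparse = [[2, 3], [4, 40]]
print(chi_square_is_advisable(base), chi_square_is_advisable(sparse))

# Same proportions, ten times the sample: the statistic grows tenfold.
scaled = [[10 * x for x in row] for row in base]
ratio = pearson_chi_square(scaled) / pearson_chi_square(base)
print(ratio)
```

Because the ratio is 10 (up to floating-point rounding) while the degrees of freedom stay at 1, a table that is not significant at one sample size can become significant at a larger one without any change in the underlying proportions.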

The most commonly used contingency table is the two-way table. One variable is the row variable, with r categories, and the other is the column variable, with c categories. A table with r rows and c columns is called an r×c contingency table.

3. Calculating the strength of association between the categories of multiple variables

The categorical variables in a contingency table may be ordered or unordered, and the association coefficient is computed differently in each case. We distinguish three cases: 1. unordered-unordered 2. ordered-ordered 3. unordered-ordered

First, the unordered-unordered case:

φ (phi) coefficient:

One of the most common coefficients for describing the degree of association in 2×2 table data. For a 2×2 table the φ coefficient is guaranteed to lie between 0 and 1, which makes it intuitive and easy to compare: the larger the value, the stronger the association. The chi-square statistic itself is not used as a measure of association because it depends too heavily on the sample size. Dividing the chi-square by the sample size n and taking the square root gives the φ coefficient: φ = sqrt(χ² / n).

When the table is larger than 2×2, φ no longer has an upper bound of 1, so the coefficients cannot be compared. This is why the φ coefficient is used only for 2×2 tables.

For a specific 2×2 contingency table:

      X1   X2
Y1    a    b
Y2    c    d
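For the 2×2 table with cells a, b, c, d, the φ coefficient is commonly written as (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)), whose absolute value equals sqrt(χ² / n). The sketch below checks that agreement on hypothetical counts; the function names are illustrative, and the signed formula is the standard textbook form rather than something stated in the article.

```python
from math import sqrt


def phi_2x2(a, b, c, d):
    """Signed phi computed directly from the four cells of a 2x2 table."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def pearson_chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )


# Hypothetical cell counts.
a, b, c, d = 30, 20, 20, 30
n = a + b + c + d

phi_direct = phi_2x2(a, b, c, d)
phi_from_chi_square = sqrt(pearson_chi_square([[a, b], [c, d]]) / n)
print(phi_direct, phi_from_chi_square)
```

Both routes give about 0.2 for these counts. The direct formula keeps a sign that shows the direction of the association, while sqrt(χ² / n) is always non-negative.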

C coefficient, also called the contingency coefficient

A coefficient used to describe the degree of association in tables larger than 2×2. Because the φ coefficient is not guaranteed to lie between 0 and 1 for such tables, Pearson's C coefficient, also known as the contingency coefficient, is used instead: C = sqrt(χ² / (χ² + n)).

The contingency coefficient lies between 0 and 1, and its maximum attainable value depends on the number of rows and columns of the table. The larger the value, the stronger the association. However, the C coefficient can never reach 1, which is a drawback: a correlation coefficient should equal 1 when two variables are perfectly associated.

Some authors recommend against using the C coefficient for tables smaller than 5×5.
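To illustrate why C cannot reach 1, the sketch below computes C = sqrt(χ² / (χ² + n)) for a hypothetical 2×2 table in which the two variables are perfectly associated. The table and the function names are assumptions for illustration.

```python
from math import sqrt


def pearson_chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def contingency_coefficient(table):
    """Pearson's C = sqrt(chi2 / (chi2 + n))."""
    n = sum(sum(row) for row in table)
    chi2 = pearson_chi_square(table)
    return sqrt(chi2 / (chi2 + n))


# Hypothetical table with perfect association: X fully determines Y.
perfect = [[50, 0], [0, 50]]
c_perfect = contingency_coefficient(perfect)
print(c_perfect)
```

Here χ² equals n = 100, so C is sqrt(1/2), roughly 0.707, even though the association is perfect. That ceiling depends on the table dimensions, which is why C values from tables of different sizes are hard to compare.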

Cramér's V coefficient

The V coefficient lies between 0 and 1. It corrects two shortcomings: the φ coefficient has no upper bound, and the C coefficient cannot reach 1. The larger the value, the stronger the association. When the variables X and Y are completely unrelated, V = 0; when the two variables are perfectly associated, V = 1.

For a 2×2 table, V = φ.
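A short sketch of Cramér's V, using the standard definition V = sqrt(χ² / (n × min(r - 1, c - 1))), which the article does not spell out. The 3×3 table below is hypothetical and chosen so that the association is perfect.

```python
from math import sqrt


def pearson_chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def cramers_v(table):
    """Cramér's V = sqrt(chi2 / (n * min(r - 1, c - 1)))."""
    n = sum(sum(row) for row in table)
    r, c = len(table), len(table[0])
    return sqrt(pearson_chi_square(table) / (n * min(r - 1, c - 1)))


# Hypothetical 3x3 table with perfect association.
perfect_3x3 = [[20, 0, 0], [0, 20, 0], [0, 0, 20]]
n = sum(sum(row) for row in perfect_3x3)

v = cramers_v(perfect_3x3)
phi = sqrt(pearson_chi_square(perfect_3x3) / n)  # exceeds 1 for this table
print(v, phi)
```

For this table V is 1 (up to rounding) while the uncorrected φ is about 1.41, which shows both the missing upper bound of φ on larger tables and how dividing by min(r - 1, c - 1) repairs it.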

The relationship between the φ, C, and V coefficients

1. For the same contingency table, the three coefficients generally take different values.

2. When comparing the strength of association across different contingency tables, use the same coefficient and make sure the tables have the same number of rows and columns.

The three coefficients above are all based on the chi-square statistic and lack an intuitive, appealing interpretation. Even though they range between 0 and 1, it is hard to say what kind of relationship a value such as 0.49 reflects. The relationship may be weak, but there is no operational standard for judging its size. Measures of this type were first developed as approximations to the usual correlation coefficient and have since been supplemented by coefficients that are easier to interpret.

To avoid the weaknesses of chi-square-based measures, statisticians have developed a variety of other methods. The most prevalent are the proportional-reduction-in-error measures, abbreviated PRE.

The PRE value is the proportion by which the prediction error is reduced when one phenomenon (such as variable X) is used to predict another (such as variable Y).
PRE = (E1 - E2) / E1
E1: the error made when estimating Y without knowing X (the total error)
E2: the error made when estimating Y with X known
E1 - E2: the error eliminated by knowing X
Both the lambda and tau-y coefficients are PRE measures.

Lambda (λ) coefficient

This measure of association is also called Guttman's coefficient of predictability. Its basic logic is to ask how much error is eliminated when the value of one nominal variable is used to predict the value of another nominal variable, with the mode (the majority category) as the prediction rule. The larger the eliminated error as a share of the total error, the stronger the association between the two variables.

The λ coefficient takes values between 0 and 1, and larger values indicate a stronger association.

It comes in two forms:

1. Symmetric form: used when the relationship between the two variables is symmetric, that is, there is no distinction between independent and dependent variable. It is written λ.

2. Asymmetric form: used when the relationship distinguishes an independent variable from a dependent variable. It is written λy (X is the independent variable and Y is the dependent variable).

Example: a cross-tabulation of gender and attitude toward smoking (number of people)

Applying the λ coefficient formula to this cross-tabulation, we can say that there is a moderate degree of association between gender and attitudes toward smoking.

The lambda measure predicts from the mode alone and ignores the rest of the distribution. If the modes of all the conditional distributions fall in the same row or column of the frequency table, the lambda coefficient equals 0, but this does not mean that X and Y are completely unrelated. It also shows that lambda is a fairly coarse measure of the association between X and Y. For this reason, Goodman and Kruskal's tau-y coefficient is sometimes used in sociological studies.
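The sketch below computes λy in PRE form, with E1 the number of errors made when predicting Y by its overall mode and E2 the number of errors made when predicting Y by the mode within each category of X, along with the symmetric λ. Rows are taken as categories of X and columns as categories of Y. Both tables are hypothetical and are not the gender and smoking data from the article; the second one shows the λ = 0 situation described above.

```python
def lambda_y(table):
    """Asymmetric lambda; rows are categories of X, columns are categories of Y."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    e1 = n - max(col_totals)                 # errors using the overall mode of Y
    e2 = n - sum(max(row) for row in table)  # errors using the mode of Y within each X
    return (e1 - e2) / e1


def lambda_symmetric(table):
    """Symmetric lambda: pools the predictions of Y from X and of X from Y."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    sum_row_max = sum(max(row) for row in table)
    sum_col_max = sum(max(col) for col in zip(*table))
    numerator = sum_row_max + sum_col_max - max(col_totals) - max(row_totals)
    denominator = 2 * n - max(col_totals) - max(row_totals)
    return numerator / denominator


# Hypothetical table where the modal Y category differs between the rows.
modes_differ = [[40, 10], [15, 35]]
lam_y = lambda_y(modes_differ)
lam_sym = lambda_symmetric(modes_differ)

# Hypothetical table where both rows share the same modal column.
same_mode = [[40, 10], [30, 20]]
lam_zero = lambda_y(same_mode)
print(lam_y, lam_sym, lam_zero)
```

In the first table λy is 20/45, about 0.44. In the second it is exactly 0 although the row proportions clearly differ (80% versus 60% in the first column), which is the coarseness the text refers to.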

Goodman and Kruskal's tau-y coefficient

This coefficient is more sensitive than the lambda coefficient, but it is suitable only for asymmetric relationships: of the two nominal variables, one must be the independent variable and the other the dependent variable. The tau-y coefficient lies between 0 and 1 and has a proportional-reduction-in-error interpretation. Its distinguishing feature is that all of the marginal frequencies and conditional frequencies enter the calculation.
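The article does not give the tau-y formula, so the sketch below assumes the usual textbook definition: E1 = Σ Fy (n - Fy) / n over the column (Y) totals, E2 = Σ f (Fx - f) / Fx over every cell f with its row (X) total Fx, and tau-y = (E1 - E2) / E1. Rows are categories of X, columns are categories of Y, and the tables are hypothetical.

```python
def tau_y(table):
    """Goodman and Kruskal's tau-y; rows are X (independent), columns are Y (dependent)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # E1: expected errors when Y is guessed from its marginal distribution alone.
    e1 = sum(f * (n - f) / n for f in col_totals)
    # E2: expected errors when Y is guessed from its distribution within each X category.
    e2 = sum(
        cell * (row_total - cell) / row_total
        for row, row_total in zip(table, row_totals)
        for cell in row
    )
    return (e1 - e2) / e1


# Hypothetical table in which both rows share the same modal column.
same_mode = [[40, 10], [30, 20]]
tau_small = tau_y(same_mode)

# Hypothetical table with perfect association.
tau_perfect = tau_y([[50, 0], [0, 50]])
print(tau_small, tau_perfect)
```

For the first table tau-y is 2/42, about 0.048. It is small but not zero, because every marginal and cell frequency contributes, which is the sense in which tau-y is more sensitive than a measure based on modes alone.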

τ = 0 when X is unrelated to Y, and τ = 1 when X is perfectly associated with Y. The tau value is asymmetric: it is defined with X as the independent variable used to predict Y, so it is also written τy.

"In a nominal-by-nominal relationship, if the relationship is asymmetric, the best choice is tau-y; if it is symmetric, it is best to use the lambda coefficient."

The above introduces several coefficients for measuring the strength of association between unordered-unordered variables; the ordered-ordered case comes next.
