Data analysis and modeling

Source: Internet
Author: User

Copyright Note: Content from the internet and books

Principal component analysis (PCA)
1. Basic Ideas

Principal component analysis (PCA) is a dimensionality reduction method for continuous variables. It explains as much of the variation in the data as possible, reduces the data from a high-dimensional space to a low-dimensional one, and ensures that the resulting dimensions are orthogonal to each other.

Principal component analysis works by finding the eigenvalues and eigenvectors of the covariance matrix or the correlation matrix of the variables. It can be shown that the eigenvector corresponding to the largest eigenvalue is the direction in which the data vary the most. The eigenvector corresponding to the second-largest eigenvalue is orthogonal to the first and is the direction that explains as much of the remaining variation as possible, and so on. Each eigenvalue measures the amount of variation along its direction. Principal component analysis therefore keeps the eigenvectors that correspond to the largest few eigenvalues and maps the data into the reference system defined by these eigenvectors, which reduces the dimension (the number of eigenvectors kept is smaller than the dimension of the original data).
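The procedure just described can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not a definitive implementation: the function name `pca`, the small data set, and the choice of the covariance matrix (rather than the correlation matrix) are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the leading eigenvectors of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)            # center each variable
    cov = np.cov(centered, rowvar=False)     # covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort from largest to smallest eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]   # keep the leading eigenvectors
    scores = centered @ components           # coordinates in the new reference system
    return scores, eigvals, components

# Made-up data: two strongly correlated variables.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
scores, eigvals, components = pca(X, n_components=1)
print(eigvals / eigvals.sum())   # share of the variation explained by each direction
```

The variance of the projected data along the first direction equals the largest eigenvalue, which is the sense in which each eigenvalue measures the variation along its eigenvector.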

When the variables in the analysis are measured in different units and their magnitudes differ greatly, the correlation matrix should be used for principal component analysis.

Principal component analysis is appropriate when the variables are correlated, so that the data form an ellipsoid-shaped cloud (for example, in three-dimensional space). A significant linear correlation among the variables indicates that principal component analysis is meaningful.

2. Calculation formula of principal component

3. Scatter plot

The original data can be represented by vectors in the original coordinate system. The eigenvectors of the covariance matrix are a and b. Because the variation along a is larger than the variation along b, all points are mapped onto a, and a is used as the reference system for describing the data. This ignores the variation of the data in the b direction.

4. Steps for principal component analysis

5. Selecting the number of principal components

The purpose of principal component analysis is to simplify the variables, so the number of principal components retained should be smaller than the number of original variables. How the number is chosen depends on the purpose of the analysis. Two rules guide how many components to retain (they can be applied together, or only one of them can be considered):

1. The variance explained by any single retained component should not be less than 1 (its eigenvalue is not less than 1).

2. The cumulative variance explained by the retained components should reach 80% to 90% (the sum of their eigenvalues is more than 80% of the sum of all eigenvalues).
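As a rough sketch of how the two rules might be applied, the following Python function counts the components each rule would keep. The function name, the thresholds as parameters, and the eigenvalues are made up for illustration; the eigenvalue rule is usually applied to eigenvalues of a correlation matrix, where each original variable contributes a variance of 1.

```python
def choose_n_components(eigenvalues, min_eigenvalue=1.0, min_cumulative=0.8):
    """Return the number of components suggested by each of the two rules."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    # Rule 1: keep components whose eigenvalue is at least min_eigenvalue.
    by_eigenvalue = sum(1 for v in ordered if v >= min_eigenvalue)
    # Rule 2: keep the fewest components whose cumulative share reaches min_cumulative.
    cumulative, by_cumulative = 0.0, 0
    for v in ordered:
        cumulative += v
        by_cumulative += 1
        if cumulative / total >= min_cumulative:
            break
    return by_eigenvalue, by_cumulative

# Hypothetical eigenvalues of a 5-variable correlation matrix.
eigenvalues = [2.2, 1.1, 0.9, 0.5, 0.3]
result = choose_n_components(eigenvalues)
print(result)  # (2, 3): the two rules can disagree
```

When the rules disagree, as they do here, the choice depends on the purpose of the analysis, as noted above.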

6. Application scenarios

The applications of principal component analysis fall broadly into three areas: 1. computing a composite score for the data; 2. reducing dimensionality in order to describe the data; 3. compressing variables as input to clustering or regression analysis. In each application, one should judge whether the principal component method is applicable and choose a suitable number of principal components according to the need.

Second, factor analysis

1. Basic ideas

In general, the principal components produced by principal component analysis cannot be given a business interpretation. The direction of a principal component does not usually happen to give large weights to some variables and small weights to the others. This also shows up in a scatter plot of the component weights, where the points lie away from the axes. If the axes of the principal components are rotated so that the absolute weights of some variables become as large as possible on one component and as small as possible on the others, the variables are effectively sorted into groups. The corresponding dimension analysis method is called factor analysis.

Factor analysis is a common method of dimensionality reduction and dimension analysis for continuous variables. It often uses the principal component method to estimate the factor loading matrix: the eigenvectors are weighted by the square roots of their eigenvalues, and a factor rotation then makes the weights of the variables more polarized across the factors. The rotation most often used is the maximum variance (varimax) method, which is an orthogonal rotation.

2. Orthogonal rotation factor model
3. General steps for factor analysis
4. Estimation of the factor loading matrix

The factor loading matrix is generally estimated with the principal component method. Selecting the appropriate number of factors relies on the results of principal component analysis. The criterion for the number of factors is looser than in principal component analysis; for example, a factor whose eigenvalue is greater than 0.7 can be considered for retention.
5. Factor rotation

Rotation is designed to polarize the factor loadings toward two extremes, either close to 0 or close to -1 or 1, which makes the factors easier to interpret. Rotations are divided into orthogonal rotation and oblique rotation. With orthogonal rotation, the information carried by the factors does not overlap. The most common is the maximum variance (varimax) rotation, an orthogonal rotation that maximizes the variance of the squared loadings.
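A widely used iterative scheme for the varimax rotation is based on repeated singular value decompositions. The sketch below assumes NumPy; the function name `varimax`, the iteration limits, and the example loading matrix are illustrative assumptions rather than anything specified in the text.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a loading matrix toward the varimax criterion."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)                         # accumulated rotation matrix
    previous = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Direction in which the variance of the squared loadings increases.
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ target)
        R = u @ vt                        # orthogonal matrix closest to that direction
        if s.sum() < previous * (1 + tol):
            break                         # no meaningful improvement any more
        previous = s.sum()
    return L @ R, R

# Hypothetical unrotated loadings of five variables on two factors.
loadings = np.array([[0.7, 0.5], [0.8, 0.4], [0.6, 0.6], [0.5, -0.6], [0.6, -0.5]])
rotated, R = varimax(loadings)
print(np.round(rotated, 2))
```

Because the rotation matrix is orthogonal, each variable's communality (the sum of its squared loadings) is unchanged by the rotation; only the way it is split between the factors changes.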
6. Application of factor analysis

Like principal component analysis, factor analysis is suitable when there are strong linear relationships among the variables, and it can combine several indicators that reflect a common underlying variable. The simplest check is to compute the correlation matrix of the variables: if most of the correlation coefficients are less than 0.3, factor analysis is not applicable. There are also formal tests, such as Bartlett's test of sphericity and the KMO test.

As a means of dimension analysis, factor analysis is a necessary step in building reasonable clustering models and robust classification models, and it reduces the model instability caused by collinearity among the explanatory variables.
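The simple correlation check can be sketched as follows. The function name and the simulated data (four variables driven by one shared factor) are assumptions for illustration; the 0.3 threshold is the one mentioned in the text.

```python
import numpy as np

def share_of_weak_correlations(X, threshold=0.3):
    """Fraction of distinct variable pairs whose absolute correlation is below threshold."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = corr[np.triu_indices_from(corr, k=1)]   # upper triangle, diagonal excluded
    return float(np.mean(np.abs(pairs) < threshold))

rng = np.random.default_rng(0)
common = rng.normal(size=200)                       # one shared underlying factor
X = np.column_stack([common + 0.1 * rng.normal(size=200) for _ in range(4)])
share = share_of_weak_correlations(X)
print(share)   # a share close to 1 would suggest factor analysis is not applicable
```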

Cluster analysis is a multivariate statistical method for classification. It classifies individuals or samples according to their characteristics, so that individuals within the same category are as homogeneous as possible, while different categories are as heterogeneous as possible.

1. Basic logic of cluster analysis

The basic logic of cluster analysis is to compute the distance or similarity between observations. Observations that are a small distance apart are highly similar, and they are grouped by similarity.

It can be divided into three steps:

1. Start from data with n observations and k attributes;

2. Calculate the pairwise distances between the n observations;

3. Cluster observations that are close together into one class and put observations that are far apart into different classes, so that in the end the distance between groups is maximized and the distance within groups is minimized.

2. Types of cluster analysis methods

System clustering (hierarchical clustering): this method can obtain the ideal classification, but it is difficult to apply to a large number of samples.

K-means clustering (non-hierarchical clustering, fast clustering): this method can handle a large number of samples, but it does not provide information on the similarity between classes, and the number of clusters cannot be determined interactively.

Two-step clustering (first using K-means clustering, followed by hierarchical clustering)
3. System Clustering

System clustering, that is, hierarchical clustering, produces a hierarchy diagram of class similarity, which makes it easy to judge the divisions between classes visually. The basic idea is to treat each of the n samples as its own class and compute the pairwise similarities; at this point the distance between classes equals the distance between samples. The two classes with the smallest distance are merged, the distances between classes are then recomputed according to the chosen clustering method, and classes are merged again by the minimum distance criterion. Each merge reduces the number of classes by one, and the process continues until all samples belong to a single class. This method can obtain the ideal classification, but it is difficult to apply to a large number of samples.

1. Basic steps

(1) Transform the data (optional; necessary when the magnitudes differ greatly or the indicator variables have different units);

(2) Construct n classes, each containing only one sample;

(3) Calculate the pairwise distances between the n samples;

(4) Merge the two closest classes into a new class;

(5) Calculate the distances between the new class and the current classes; if the number of classes equals 1, go to step (6), otherwise return to step (4);

(6) Draw the cluster diagram (dendrogram);

(7) Determine the number of classes, so as to obtain the classification results.
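Steps (2) to (5) can be sketched in plain Python. This is only an illustration under stated assumptions: it uses Euclidean distance and the mean pairwise distance between classes, it stops at a requested number of classes instead of drawing a dendrogram, and the function names and sample points are made up.

```python
import math

def mean_pairwise_distance(a, b, points):
    """Mean Euclidean distance between two classes given as lists of point indices."""
    return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points, n_classes):
    """Repeatedly merge the two closest classes until n_classes remain."""
    classes = [[i] for i in range(len(points))]     # step (2): one sample per class
    merges = []                                     # merge history, as a dendrogram would need
    while len(classes) > n_classes:
        # steps (3) to (5): find the closest pair of current classes
        distance, i, j = min(
            ((mean_pairwise_distance(classes[i], classes[j], points), i, j)
             for i in range(len(classes)) for j in range(i + 1, len(classes))),
            key=lambda t: t[0],
        )
        merges.append((classes[i], classes[j], distance))
        merged = classes[i] + classes[j]            # step (4): merge into a new class
        classes = [c for idx, c in enumerate(classes) if idx not in (i, j)] + [merged]
    return classes, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
classes, merges = agglomerate(points, n_classes=3)
print(classes)
```

Recomputing every pairwise class distance at each merge is what makes the method expensive for large samples.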

2. Data preprocessing

The data for different features often have different units and scales, and the variation in their values may differ greatly, which affects the classification results. Therefore, once the features of the objects to be classified have been determined, the continuous variables must be processed before cluster analysis.

In cluster analysis, the commonly used methods for processing the clustering features are:

① Z-score standardization

② Standard deviation standardization

③ Normalization of normal state

In the new data obtained by this normalization, the maximum value of each feature is 1, the minimum value is 0, and the remaining values lie between 0 and 1.
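The property just described (maximum 1, minimum 0) is what range (min-max) scaling produces, while standard deviation standardization gives each feature a mean of 0 and a standard deviation of 1. A minimal sketch of both follows; the function names and the sample values are illustrative, and the code assumes the feature is not constant.

```python
import statistics

def min_max_scale(values):
    """Rescale values to the interval [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

income = [1200.0, 3500.0, 800.0, 5600.0, 2900.0]   # made-up feature values
scaled = min_max_scale(income)
standardized = standardize(income)
print(scaled)
```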

To get a reasonable clustering result, we should not only standardize the data but also analyze the dimensions of the variables. Here, factor analysis is used for the dimension analysis: factors are selected, the data are converted accordingly, and cluster analysis is carried out on the saved factor results.

If a variable is skewed, a functional transformation such as a logarithmic transformation can be applied to the data to reduce the skewness.

3. Calculation of distance between observation points

An important issue in clustering is how to define the distance between samples. Euclidean distance or Minkowski distance is generally used, and the Minkowski distance formula is as follows:
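For two observations x = (x_1, ..., x_k) and y = (y_1, ..., y_k), the Minkowski distance of order p is commonly written as d(x, y) = (sum over i of |x_i - y_i|^p)^(1/p). Setting p = 2 gives the Euclidean distance and p = 1 gives the Manhattan distance. A minimal Python sketch (the function name `minkowski` is chosen here for illustration):

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

euclidean = minkowski((0, 0), (3, 4), p=2)   # 5.0
manhattan = minkowski((0, 0), (3, 4), p=1)   # 7.0
print(euclidean, manhattan)
```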

4. Calculation of distance between classes

One of the most important issues in cluster analysis is how to define the distance between two classes. Methods include the mean linkage method, the centroid method, and Ward's minimum variance method.

(1) The mean linkage method, also called the full join method, computes the pairwise distances between every observation in one class and every observation in the other class, and takes the average of all these distances as the distance between the classes:

(2) The centroid method computes the distance between the centroids of the two classes:

(3) Ward's minimum variance method: based on the theory of analysis of variance, if the classification is reasonable, the sum of squared deviations among samples in the same class should be small, and the sum of squared deviations between classes should be large. When merging classes, Ward's method always chooses the merge that gives the smallest increase in the within-class sum of squared deviations. The method is therefore seldom affected by outliers, works well in practice, and is widely applicable. However, it requires that the distance between samples be Euclidean distance.
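The three between-class distances can be written compactly in Python. This is a sketch under stated assumptions: classes are lists of coordinate tuples, distances are Euclidean, and the function names are made up. For Ward's method it uses the standard identity that merging classes A and B increases the within-class sum of squared deviations by n_A * n_B / (n_A + n_B) times the squared distance between their centroids.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the points in a class."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def mean_linkage_distance(a, b):
    """Average of all pairwise distances between the two classes."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_distance(a, b):
    """Distance between the centroids of the two classes."""
    return math.dist(centroid(a), centroid(b))

def ward_increase(a, b):
    """Increase in the within-class sum of squared deviations if a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * math.dist(centroid(a), centroid(b)) ** 2

def within_ss(cluster):
    """Sum of squared deviations of the points from their class centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) ** 2 for p in cluster)

a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
print(mean_linkage_distance(a, b), centroid_distance(a, b), ward_increase(a, b))
```

For these two classes, `ward_increase(a, b)` equals `within_ss(a + b) - within_ss(a) - within_ss(b)`, which is the quantity Ward's method minimizes at each merge.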

4. K-means clustering

K-means clustering is a fast clustering method suitable for data with a large sample size. The method can be summarized as follows: first, K points are selected at random as centers; the distance from every sample to each of the K centers is computed, and each sample is assigned to the class of its nearest center; the center of each class is then recomputed, the distances from each sample to the class centers are computed again, and the samples are reassigned according to the shortest-distance principle. This is repeated until the classes no longer change.

1. Basic steps

(1) Set the value of K to determine the number of clusters (the software randomly assigns the seeds required for the cluster centers).

(2) Calculate the distance from each record to each class center (Euclidean distance) and divide the records into K classes.

(3) Take the centers (means) of the K classes as the new centers and recalculate the distances.

(4) Iterate until the convergence criterion is met, then stop.
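The four steps can be sketched in plain Python. To keep the example reproducible, the initial centers are passed in explicitly instead of being drawn at random as in step (1); the function name, the sample points, and the stopping rule (assignments no longer change) are assumptions for illustration.

```python
import math

def k_means(points, initial_centers, max_iter=100):
    """Iterate assignment and center updates from the given initial centers."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        # step (2): assign each record to its nearest center (Euclidean distance)
        new_labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                      for p in points]
        if new_labels == labels:      # step (4): stop once the classes no longer change
            break
        labels = new_labels
        # step (3): use the mean of each class as its new center
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:               # an empty class keeps its previous center
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = k_means(points, initial_centers=[points[0], points[3]])
print(labels, centers)
```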

2. Advantages and disadvantages

The advantage of this method is that it is fast and can be used for large samples. The disadvantages are that the number of clusters K must be set manually and that different choices of initial points may produce different clustering results. For this reason, the initial centers are often selected several times, and a stable model is built by averaging the results of the multiple clusterings.

3. Application example

Detecting outliers: rule-violating behavior such as inflating credit ratings with fake transactions differs considerably from normal consumption behavior in purchase frequency, average purchase amount, and other respects. Clustering here amounts to locating the anomalies, so any transformation applied to the variables must not change their original distribution shape. Common standardization methods such as centering standardization and range standardization do not change the distribution shape, and standardization is often needed before clustering to eliminate the effect of the variables' units.

Correspondence analysis

Correspondence analysis is a data analysis technique that helps us study a cross-tabulation (contingency table) composed of qualitative variables in order to reveal the relationships between the variables. The information in the cross-tabulation is displayed graphically. It is mainly suited to categorical variables with many categories, and it can reveal the differences between the categories of the same variable as well as the correspondence between the categories of different variables. It applies to two or more categorical variables.

1. Types

Simple correspondence analysis: correspondence analysis of two categorical variables.
Multiple correspondence analysis: correspondence analysis of more than two categorical variables (optimal scaling).
Analysis of continuous variables together with categorical variables: a continuous variable can be binned before correspondence analysis.

2. Correspondence analysis and contingency table analysis

A contingency table is a common way to analyze the relationship between two categorical variables, but when the variables have many category levels, it is often hard to see the associations between levels directly. Correspondence analysis is used to deal with this problem.

Correspondence analysis is a low-dimensional graphical representation of the relationship between the rows and columns of a contingency table. It intuitively reveals the differences between the categories of the same categorical variable and the correspondence between the categories of different categorical variables. In correspondence analysis, each row of the table corresponds to a point in the plot (usually two-dimensional), and each column corresponds to a point in the same plot. Essentially, these points are projections of the rows and columns of the table into a two-dimensional Euclidean space, which preserves the relationships between rows or columns as far as possible.

3. Correspondence analysis and principal component analysis

Correspondence analysis is a technique developed on the basis of the principal component method. The contingency table is transformed so that the covariance matrices of the rows and of the columns have the same eigenvalues. The data in the principal component directions are then weighted by the square roots of the eigenvalues, so that rows and columns can be compared on the same scale.
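One common way to compute a simple correspondence analysis is through the singular value decomposition of the standardized residuals of the contingency table. The sketch below assumes NumPy and a made-up 3 x 3 table; the function name and the choice of principal coordinates for both rows and columns are assumptions, not something the text prescribes.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple correspondence analysis of a two-way contingency table."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                                  # correspondence matrix
    r = P.sum(axis=1)                                # row masses
    c = P.sum(axis=0)                                # column masses
    # Standardized residuals from the independence model r * c.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]      # rows, scaled by singular values
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]   # columns, on the same scale
    return row_coords, col_coords, sv**2             # sv**2 are the principal inertias

# Hypothetical cross-tabulation of two categorical variables.
table = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]])
row_coords, col_coords, inertia = correspondence_analysis(table)
print(np.round(row_coords[:, :2], 3))   # first two dimensions, as plotted in the map
```

The singular values are the square roots of the eigenvalues (principal inertias), and the total inertia equals the chi-square statistic of the table divided by the sample size.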

4. How to interpret the correspondence map
1-Overall observation
2-Observing adjacent areas
3-Vector analysis: preference ordering
4-Angle between vectors: law of cosines
5-Judging by the distance between positions
6-Axis definition and quadrant analysis
7-Product positioning: ideal point and anti-ideal point model
8-Market segmentation and positioning
5. Advantages and disadvantages of simple correspondence analysis
Advantages of simple correspondence analysis: The more categories a qualitative variable has, the more obvious the advantages of this method are. It reveals the relationship between the categories of the row variable and those of the column variable. It presents the relationships between categories visually in a two-dimensional plot (the correspondence map). It can turn a nominal or ordinal variable into an interval variable.
Disadvantages of simple correspondence analysis: It cannot be used for hypothesis testing of the relationships. The number of dimensions must be decided by the researcher. The map is sometimes difficult to interpret. It is sensitive to extreme values.

Multidimensional scale analysis (MDS) represents the objects under study in a low-dimensional (two- or three-dimensional) space according to the similarities or distances between them. It is a graphical method for clustering or dimension analysis. The spatial map produced by multidimensional scale analysis shows the relative relationships between the objects simply and clearly.

1. Similarity or distance measurement

Multidimensional scale analysis relies on a measure of the dissimilarity (distance) or similarity between samples. Because variables are of different types, the distance or similarity between samples often has to be measured in different ways, such as Minkowski distance, chi-square distance, or cosine similarity. One should be familiar with the principles and applicability of these measures and use them correctly.

2. Multidimensional scale analysis principle
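
One common formulation, classical (Torgerson) multidimensional scaling, recovers coordinates from a distance matrix by double-centering the squared distances and taking the leading eigenvectors. The sketch below assumes NumPy and Euclidean input distances; the function name and the sample points are made up, and other MDS variants (for example, non-metric MDS) work differently.

```python
import numpy as np

def classical_mds(D, n_dims=2):
    """Coordinates in n_dims dimensions whose distances approximate the matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    B = -0.5 * J @ (D**2) @ J                        # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_dims]       # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))   # guard against tiny negatives
    return eigvecs[:, order] * scale

# Build a distance matrix from known 2-D points, then recover a configuration.
points = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 3.0], [1.0, 5.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = classical_mds(D, n_dims=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.round(recovered, 3))
```

The recovered configuration matches the original only up to rotation, reflection, and translation, which is why only the relative positions in an MDS map are interpreted.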

3. Application of Multidimensional scale analysis

In market research, the focus is on consumer attitudes: measuring consumers' perceptions and preferences. The subjects involved are very broad, for example automobiles, shampoo, beverages, fast food, and cigarettes, as well as countries, corporate brands, and political party candidates. MDS analysis can provide market research with information about consumer perceptions and preferences.

In situations where there is a need to compare differences or similarities between samples, multidimensional scale analysis can be used, such as comparing different brand/product similarities to find potential competitors. The end result is often shown in two-dimensional perceptual graphs.

The basic purpose of factor analysis is to use a few factors to describe the relationships among many indicators or variables: closely related variables are grouped into the same class, and each class of variables becomes a factor (it is called a factor because it is not observable, that is, it is not a specific variable), so that a small number of factors reflect most of the information in the original data.

4. Difference between multidimensional scale analysis and correspondence analysis

Multidimensional scale analysis describes the relationships between row variables, whereas correspondence analysis describes the relationships between row variables and column variables.
