Data Analysis and Modeling: Data Analysis




I. Principal Component Analysis (PCA)
1. Basic Ideas

Principal component analysis (PCA) is a dimensionality-reduction method for continuous variables. It maps high-dimensional data to a lower-dimensional space while explaining as much of the variation in the data as possible and keeping the resulting dimensions orthogonal to one another.

Concretely, PCA computes the eigenvalues and eigenvectors of the covariance matrix (or correlation matrix) of the variables. The eigenvector with the largest eigenvalue points in the direction of greatest variation in the data; the eigenvector with the second-largest eigenvalue is perpendicular to the first and explains the largest share of the remaining variation, and so on. Each eigenvalue measures how much the data vary along the direction of its eigenvector. To perform PCA, therefore, one keeps the eigenvectors corresponding to the largest eigenvalues and projects the data onto the coordinate system they span; dimensionality is reduced because fewer eigenvectors are kept than the dimension of the original data.
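To make the procedure concrete, here is a minimal NumPy sketch of this eigendecomposition approach; the data matrix is synthetic and invented purely for illustration:

```python
import numpy as np

# Synthetic data: 100 observations of 3 correlated variables (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] += 0.8 * X[:, 0]                 # induce correlation

Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition (ascending order)
order = np.argsort(eigvals)[::-1]        # re-sort descending by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs[:, :2]             # project 3-D data onto the top 2 axes
print("variance explained by each direction:", eigvals / eigvals.sum())
```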


When the variables selected for the analysis have different units, and their scales differ substantially, the correlation matrix rather than the covariance matrix should be used for the principal component analysis.

PCA is appropriate when the variables are correlated: correlated data form an ellipsoidal (rather than spherical) cloud in space, and a significant, strong linear correlation among the variables indicates that principal component analysis is meaningful.


2. Calculation formula of principal component
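The formula itself is not reproduced in the source. In standard notation, the $i$-th principal component of a variable vector $x = (x_1, \dots, x_p)^\top$ is the linear combination

$$ y_i = e_i^\top x = e_{i1}x_1 + e_{i2}x_2 + \cdots + e_{ip}x_p, \qquad i = 1, \dots, p, $$

where $e_i$ is the unit eigenvector of the covariance (or correlation) matrix associated with the $i$-th largest eigenvalue $\lambda_i$, and $\operatorname{Var}(y_i) = \lambda_i$.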

3. Scatter diagram

The original data can be represented as vectors in the original coordinate system. Let a and b be the eigenvectors of the covariance matrix. Because the variation along direction a is much larger than along direction b, all points are projected onto a, and the a axis alone is used as the reference system to describe the data. This ignores the variation along b, but it reduces the two-dimensional data to one dimension.

4. The process of principal component analysis

(The source shows the process as a flowchart, not reproduced here. In outline: standardize the data, compute the covariance or correlation matrix, obtain its eigenvalues and eigenvectors, sort the eigenvectors by eigenvalue, select the leading components, and project the data onto them.)
5. Selecting the number of principal components

Since the purpose of principal component analysis is to simplify the variables, the number of components retained should be smaller than the number of original variables. The selection method differs with the purpose of the analysis, but two rules are generally followed (they can be applied together, or either one alone): (1) each retained principal component should explain a variance of no less than 1 (its eigenvalue should be no less than 1); (2) the retained principal components should cumulatively explain about 80% to 90% of the variation (the cumulative eigenvalues should exceed 80% of the total eigenvalue sum).
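As a sketch of how the two rules translate into code (the helper name and the correlation matrix are invented for illustration):

```python
import numpy as np

def select_components(R):
    """Apply the two retention rules to the eigenvalues of a
    correlation matrix R (illustrative helper, not a library API)."""
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]      # descending eigenvalues
    rule1 = int((eigvals >= 1).sum())                   # rule 1: eigenvalue >= 1
    cum = np.cumsum(eigvals) / eigvals.sum()
    rule2 = int(np.searchsorted(cum, 0.80) + 1)         # rule 2: cumulative >= 80%
    return rule1, rule2

R = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.4],
              [0.3, 0.4, 1.0]])
print(select_components(R))   # here (1, 2): rule 1 keeps 1 PC, rule 2 keeps 2
```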


6. Applications of the principal component method fall roughly into three areas: (1) building a composite score from the data; (2) reducing dimensionality to describe the data; (3) compressing variables for clustering or regression analysis. In applications one must judge whether the principal component method is appropriate and select a suitable number of components for the purpose at hand.



II. Factor Analysis

1. Basic ideas

In general, the principal components themselves cannot be given a meaningful business interpretation, because a component direction rarely assigns large weights to some variables and small weights to all the others; in a scatter plot of the component weights this shows up as points drifting away from the axes. If the principal component axes are rotated so that, on a given component, the weights of some variables reach the largest possible absolute values while their weights on the other components are minimal, the variables are effectively partitioned into groups. The dimension-analysis method that does this is called factor analysis. Factor analysis is a common technique for reducing and analyzing the dimensionality of continuous variables. The principal component method is often used to estimate the factor loading matrix: the eigenvector directions are weighted by the square roots of their eigenvalues, and the factors are then rotated so that each variable's weights become more polarized across the different factors. The rotation most often used is the varimax (maximum variance) method, which is an orthogonal rotation.
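A minimal sketch of factor analysis with varimax rotation, assuming scikit-learn 0.24 or later (which added the `rotation` argument to `FactorAnalysis`); the data are synthetic, built so that pairs of variables share a common factor:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data: variables 1-2 share one factor, variables 3-4 another
rng = np.random.default_rng(1)
f1 = rng.normal(size=(200, 1))
f2 = rng.normal(size=(200, 1))
noise = lambda: 0.3 * rng.normal(size=(200, 1))
X = np.hstack([f1 + noise(), f1 + noise(), f2 + noise(), f2 + noise()])

# Two factors with varimax (orthogonal) rotation, as described above
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print(np.round(fa.components_.T, 2))   # each variable loads mainly on one factor
```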


2. Orthogonal rotation factor model
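The model equations are not reproduced in the source; in standard notation, the orthogonal factor model for a centered variable vector $x$ is

$$ x = \Lambda f + \varepsilon, \qquad E(f) = 0,\; \operatorname{Cov}(f) = I,\; \operatorname{Cov}(f, \varepsilon) = 0, $$

where $\Lambda$ is the factor loading matrix, $f$ the common factors, and $\varepsilon$ the specific factors.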
3. General steps for factor analysis
4. Estimating the factor loading matrix generally uses the principal component method. Choosing an appropriate number of factors requires the results of a principal component analysis, but the criterion is looser than in PCA itself; for example, factors with eigenvalues greater than 0.7 may be considered for retention.
5. The goal of factor rotation is to polarize the factor loadings, pushing each toward 0 or toward -1 or 1, so that the factors become easy to interpret. Rotations divide into orthogonal and oblique rotations. With an orthogonal rotation, the information carried by the factors does not overlap. The most commonly used is varimax rotation, an orthogonal rotation that maximizes the variance of the squared loadings.
6. Applications of factor analysis. Factor analysis resembles principal component analysis and is applicable when strong linear relationships exist among the variables; it synthesizes a few indicators that reflect what the variables have in common. The simplest applicability check is to compute the correlation matrix of the variables: if most correlation coefficients are below 0.3, factor analysis is not appropriate. Formal tests also exist, such as Bartlett's test of sphericity and the KMO test. As a dimension-analysis tool, factor analysis is a necessary step in building reasonable clustering models and robust classification models, because it reduces the model instability caused by collinearity among the explanatory variables.
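The simple correlation-matrix screen described above can be coded directly; the helper below is hypothetical and the data invented (Bartlett's and KMO tests themselves are available in third-party packages such as `factor_analyzer`):

```python
import numpy as np

def factorability_check(X, threshold=0.3):
    """Fraction of off-diagonal correlations with |r| >= threshold
    (hypothetical helper implementing the crude screen above)."""
    R = np.corrcoef(X, rowvar=False)
    off = np.abs(R[np.triu_indices_from(R, k=1)])
    return (off >= threshold).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # uncorrelated noise: fraction should be near 0
print(factorability_check(X))
```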

III. Cluster Analysis

Cluster analysis is a multivariate statistical method. It classifies individuals or samples according to their characteristics, so that individuals within the same class are as homogeneous as possible while the classes themselves are as heterogeneous as possible.

1. The basic logic of cluster analysis

The basic logic of cluster analysis is to compute distances or similarities between observations: a small distance means high similarity, and observations are grouped according to their degree of similarity.

It can be divided into three steps:

1. Start from data consisting of n observations on k variables;

2. Compute the pairwise distances between the n observations;

3. Group observations that are close together into the same class and assign observations that are far apart to different classes, so that the between-class distance is maximized and the within-class distance is minimized.


2. Types of methods for cluster analysis

Hierarchical clustering (the systematic clustering method): can produce an ideal classification, but handles large samples with difficulty.

K-means clustering (non-hierarchical, fast clustering): can handle large samples, but provides no information about similarity between classes, and the number of clusters cannot be determined interactively; it must be specified in advance. Two-step clustering applies k-means clustering first and hierarchical clustering afterwards.
3. Hierarchical clustering

Hierarchical (systematic) clustering builds a tree of class similarity, which makes it easy to judge visually where to divide the classes. The basic idea: treat each of the n samples as its own class and compute the pairwise similarities; at this point the distance between classes equals the distance between samples. Merge the two closest classes, recompute the distances between classes according to the chosen clustering method, and again merge under the minimum-distance criterion. Each step reduces the number of classes by one, and the procedure continues until all samples belong to a single class. This method can produce an ideal classification but handles large samples with difficulty.


1. Basic steps

(1) Preprocess and transform the data (not always necessary, but required when the variables differ greatly in magnitude or have different units);

(2) Construct n classes, each containing exactly one sample;

(3) Compute the pairwise distances between the n samples;

(4) Merge the two nearest classes into a new class;

(5) Compute the distances between the new class and the existing classes; if only one class remains, go to step (6), otherwise return to step (4);

(6) Draw the cluster dendrogram;

(7) Decide on the number of classes and obtain the classification result.
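These steps map directly onto SciPy's hierarchical-clustering routines; a minimal sketch with invented two-group data (Ward linkage is assumed here as the clustering method):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Two invented groups of 2-D points, for illustration only
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(3, 0.5, size=(10, 2))])

d = pdist(X, metric="euclidean")   # step (3): pairwise distances
Z = linkage(d, method="ward")      # steps (4)-(5): repeatedly merge nearest classes
# scipy.cluster.hierarchy.dendrogram(Z) would draw the cluster tree of step (6)
labels = fcluster(Z, t=2, criterion="maxclust")   # step (7): cut into 2 classes
print(labels)
```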
2. Data preprocessing

Data on different elements often have different units and dimensions, and their numerical ranges can vary greatly, which distorts the classification result. Once the elements to be classified are determined, continuous variables should therefore be preprocessed before the cluster analysis.

In cluster analysis, the data-processing methods commonly applied to the clustering elements are as follows:

① z-score standardization

② Standard deviation standardization

③ Min-max (range) normalization

After this standardization, the maximum value of each element is 1, the minimum is 0, and all remaining values lie between 0 and 1.

To obtain a reasonable clustering result, it is not enough to standardize the data; the variables should also undergo dimension analysis. This is usually done with factor analysis: the observations are converted into factor scores according to the factors selected from the sample, and the cluster analysis is then run on the saved factor results.

If a variable is skewed, a functional transformation, such as taking logarithms, can be applied to correct the skewness.

3. Distances between observation points

An important problem in clustering is how to define the distance between samples. The Euclidean distance or, more generally, the Minkowski distance is typically used; the Minkowski formula is as follows:
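The formula is missing from the source; in its standard form, the Minkowski distance between observations $x$ and $y$ with $p$ variables is

$$ d(x, y) = \left( \sum_{i=1}^{p} \lvert x_i - y_i \rvert^{q} \right)^{1/q}, $$

which reduces to the Euclidean distance for $q = 2$ and the Manhattan distance for $q = 1$.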

4. Distances between classes

The other important question is how to define the distance between two classes. Common methods include the average linkage method, the centroid method, and Ward's minimum variance method.

(1) The average linkage method computes the pairwise distances between every observation in one class and every observation in the other, and takes the mean of all these distances as the between-class distance:
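The source omits the formula; in standard notation, for classes $G_1$ and $G_2$ containing $n_1$ and $n_2$ observations,

$$ D(G_1, G_2) = \frac{1}{n_1 n_2} \sum_{x_i \in G_1} \sum_{x_j \in G_2} d(x_i, x_j). $$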

(2) The centroid (center of gravity) method takes the distance between the centroids of the two classes:
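Again the formula is omitted in the source; with class centroids $\bar{x}_1$ and $\bar{x}_2$ (the vectors of class means), the standard form is

$$ D(G_1, G_2) = d(\bar{x}_1, \bar{x}_2). $$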


(3) Ward's minimum variance method is based on the idea of analysis of variance: if a classification is reasonable, the within-class sum of squared deviations should be small and the between-class sum of squares should be large. At each step, Ward's method merges the pair of classes whose merger yields the smallest increase in the within-class sum of squares. The method is little affected by outliers, classifies well in practice, and is widely applicable, but it requires that the distances between samples be Euclidean.


4. K-means clustering

K-means clustering is a fast clustering method suited to large samples. It can be summarized as follows: first choose K points at random as centers; compute every sample's distance to each of the K centers and assign each sample to the class of its nearest center; then recompute each class's center, recompute the distance from every sample to each class center, and reassign samples under the shortest-distance rule. Iterate until the classes no longer change.

1. Basic steps

(1) Set the value of K, i.e. the number of clusters (the software randomly chooses the seeds used as initial cluster centers).

(2) Compute the Euclidean distance from each record to each class center, and partition the records into K classes.

(3) Take the K class centers (means) as the new centers and recompute the distances.

(4) Iterate until the convergence criterion is met.
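A minimal scikit-learn sketch of these steps on invented three-group data; `n_init` restarts the algorithm from several random centers and keeps the best run, which anticipates the initial-point sensitivity discussed below:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three invented groups of points, for illustration only
rng = np.random.default_rng(3)
X = np.vstack([rng.normal((0, 0), 1, size=(500, 2)),
               rng.normal((6, 0), 1, size=(500, 2)),
               rng.normal((0, 6), 1, size=(500, 2))])

# K must be chosen by the analyst; n_init repeats the whole algorithm
# from several random initial centers and keeps the best run
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.inertia_)   # within-cluster sum of squares at convergence
```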


2. Advantages and disadvantages. The method's advantage is fast computation, so it can be used on large samples. Its disadvantages are that the number of clusters K must be set manually, and that different choices of initial points may produce different clusterings. In practice the initial centers are therefore chosen several times, and a stable model is built by combining the results of the repeated clusterings.
3. Application example: detecting anomalies. For instance, users who fraudulently inflate their credit ratings differ considerably from normal consumers in consumption frequency, average spend, and similar behavior, so locating them amounts to anomaly detection. Any variable transformation used here must not alter the variables' original distribution shapes; common standardizations such as centering and range standardization do not change the distribution, and standardization is usually needed before clustering to remove the variables' units.


IV. Correspondence Analysis

Correspondence analysis is a data-analysis technique that studies the relationships between variables by analyzing cross-tabulations (contingency tables) built from categorical variables, displaying the information in the table graphically. It is mainly suited to nominal variables with many categories: it can reveal differences between the categories of one variable as well as the correspondences between the categories of different variables. It applies to two or more categorical variables.

1. Simple correspondence analysis: correspondence analysis of two categorical variables. Multiple correspondence analysis: correspondence analysis of more than two categorical variables (optimal scaling).
Continuous variables can be binned before being used in a correspondence analysis.

2. Correspondence analysis and contingency-table analysis. For two categorical variables a contingency table is commonly used, but when the variables have many category levels it becomes hard to see the pattern of association directly in the table; correspondence analysis addresses this problem.

Correspondence analysis is a low-dimensional graphical representation of the relationships between the rows and columns of a contingency table: it intuitively reveals the differences between categories of the same variable and the correspondences between categories of different variables. In the analysis, each row of the contingency table corresponds to one point in a (usually two-dimensional) map, and each column corresponds to another point in the same map. In essence these points are projections of the rows and columns of the table into a low-dimensional Euclidean space, chosen to preserve the association between rows and columns as faithfully as possible.

3. Correspondence analysis and principal component analysis. Correspondence analysis was developed on the basis of the principal component method. By transforming the contingency table it makes the eigenvalues obtained from the row decomposition and the column decomposition equal, and by weighting the principal-component directions with the square roots of the eigenvalues it places rows and columns on the same scale, so that they can be compared in one map.
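The transformation just described can be sketched compactly with NumPy's SVD; the contingency table below is invented for illustration, and the scaling follows the usual formulation of simple correspondence analysis:

```python
import numpy as np

# Invented 3x4 contingency table of counts (rows x columns), for illustration
N = np.array([[20, 10,  5, 15],
              [ 5, 25, 20, 10],
              [15,  5, 10, 25]], dtype=float)

P = N / N.sum()                       # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses

# Standardized residuals; their SVD gives the principal axes
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates: rows and columns are weighted by the same square
# roots of the eigenvalues (singular values), so one map can show both
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
print(row_coords[:, :2])
print(col_coords[:, :2])
```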


4. Interpreting the correspondence map:
(1) General observation;
(2) Examine adjacent regions;
(3) Vector analysis: preference ordering;
(4) Angles between vectors: the cosine rule;
(5) Distances from the origin;
(6) Axis definition and quadrant analysis;
(7) Product positioning: ideal-point and anti-ideal-point models;
(8) Market segmentation and positioning.
Http://shenhaolaoshi.blog.sohu.com/133694659.html
5. Advantages and disadvantages of simple correspondence analysis

Advantages: the more categories the qualitative variables have, the more pronounced the advantage of the method. It reveals relationships between row-variable categories and column-variable categories, and it displays those relationships visually in a two-dimensional correspondence map. It can convert nominal or ordinal variables into interval-scaled variables.

Disadvantages: it cannot be used for hypothesis tests of association; the number of dimensions must be decided by the researcher; the map is sometimes difficult to interpret; and it is sensitive to extreme values.

V. Multidimensional Scaling (MDS)

Multidimensional scaling (MDS) represents objects in a low-dimensional (two- or three-dimensional) space based on the similarities or distances between them; it is a graphical method for clustering or dimension analysis. The spatial map produced by MDS can be used to interpret the relative relationships among the objects studied.


1. Similarity or distance measurement

Multidimensional scaling measures the degree of dissimilarity (distance) or similarity between samples. Because the variables differ in type, different measures are needed, such as the Minkowski distance, the chi-square distance, or cosine similarity; one should understand the principle and applicability of each distance or similarity measure and use it correctly.



2. The principle of multidimensional scaling



3. Applications of multidimensional scaling

In market research, MDS is mainly used to study consumer attitudes and to measure consumer perception and preference. The subjects studied are very diverse: automobiles, shampoo, beverages, fast food, cigarettes, as well as national and corporate brands, party candidates, and so on.

MDS can be used wherever samples need to be compared, for example to assess the similarity between different brands or products and thereby identify potential competitors. The result is usually presented as a two-dimensional perceptual map.
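A minimal sketch of such a perceptual map with scikit-learn's MDS; the brand dissimilarity matrix is invented for illustration:

```python
import numpy as np
from sklearn.manifold import MDS

# Invented dissimilarity matrix for 4 brands (symmetric, zero diagonal)
D = np.array([[0.0, 2.0, 5.0, 6.0],
              [2.0, 0.0, 4.5, 5.5],
              [5.0, 4.5, 0.0, 1.5],
              [6.0, 5.5, 1.5, 0.0]])

# Metric MDS into 2-D; 'precomputed' tells sklearn D already holds distances
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords)   # plotting these points yields the perceptual map
```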
4. The difference between MDS and correspondence analysis: MDS describes the relationships among the row objects only, whereas correspondence analysis describes the relationships between the row variables and the column variables.


VI. Predictive Data Analysis Methods

1. Simple linear regression
2. Multiple linear regression

1. The multiple regression equation
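The equation is not reproduced in the source; in standard form, with $k$ explanatory variables,

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon. $$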

2. The five assumptions of linear regression

The focus, and the difficulty, of linear regression is model tuning: the whole optimization process can be viewed as gradually adjusting the model to satisfy the five classical assumptions of linear regression, because the better the model satisfies these premises, the more reliable its predictions. The five assumptions (violations and their consequences in parentheses; a diagnostic sketch follows the list) are:

Assumption 1: a linear relationship exists between the explanatory variables and the explained variable;
Assumption 2: the explanatory variables are uncorrelated with the disturbance term (violation biases the regression coefficients);
Assumption 3: the explanatory variables are not strongly collinear (check the variance inflation factor; violation inflates the standard errors of the regression coefficients);
Assumption 4: the disturbances are independent and identically distributed (heteroscedasticity test, DW test; violation means the standard errors are misestimated and the t-tests fail);
Assumption 5: the disturbances follow a normal distribution (Q-Q plot; violation invalidates the t-tests).
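As promised above, a diagnostic sketch for assumptions 3 and 4 using statsmodels; the data are simulated purely for illustration:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# Simulated data satisfying the assumptions, for illustration only
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

Xc = sm.add_constant(X)
fit = sm.OLS(y, Xc).fit()

# Assumption 3: VIF > 10 suggests strong collinearity
vifs = [variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])]
# Assumption 4: DW near 2 suggests no autocorrelation in the residuals
dw = durbin_watson(fit.resid)
print(vifs, dw)
```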

3. Model variable selection. The methods for selecting model variables are: forward selection, backward elimination, and stepwise regression.
4. The steps of linear regression analysis

(1) Perform a basic analysis of the data, examining the potential explanatory variables and their underlying relationships with the variable to be explained;
(2) Construct candidate models based on the results of the preliminary analysis;
(3) Test the validity assumptions of the candidate models;
(4) Check the models for collinearity and influential points, and correct possible sources of bias;
(5) Revise the model according to the test results;
(6) Repeat the necessary validity tests and collinearity and influential-point checks on the revised model until no further revision is needed;
(7) Run predictive tests on the revised model. This establishes an effective modeling cycle that ensures the correctness, validity, and accuracy of the model.

5. Residual checks

The residuals need to satisfy the assumptions of independent, identical distribution and of normality. The linear regression assumptions can be examined through residual scatter plots and residual plots. The scatter plot mainly shows whether the residuals have a curved relationship with an explanatory variable and whether their dispersion depends on an explanatory variable; the residual plot mainly shows whether outliers are present. (1) The scatter plot of residuals against an explanatory variable is parabolic: this indicates a higher-order nonlinear relationship between the explanatory variable x and the explained variable y, and the fix is to add a higher-order term in x, such as x², to the model. (2) The residuals are heteroscedastic: the simplest fix is to take the logarithm of y. (3) The residuals are autocorrelated: a simple fix is to add the first-order lag of the explained variable y to the regression. Use the DW test to check residual autocorrelation.
Because the error term u_t cannot be observed, its behavior can only be judged from the residuals e_t. The source illustrates this with figures (not reproduced here): systematic patterns in a plot of u_t or e_t over time indicate autocorrelation, while a patternless plot indicates no autocorrelation.

DW ≈ 2 indicates no autocorrelation; DW near 0 indicates completely positive autocorrelation of the disturbances; DW near 4 indicates completely negative autocorrelation.

Whether the residuals are normally distributed can be checked with a Q-Q plot.

6. Outliers

Outliers can bias the fitted curve. Statistics are generally used to identify possible outliers.
Statistics: studentized residuals (RSTUDENT), Cook's D, DFBETAS, DFFITS.
Handling outliers: re-examine the data to confirm their validity. If the data are valid, analyze the results both with and without the outliers. To fit the data better, it may be necessary to add higher-order terms to the model.
7. Collinearity diagnostics

Tools for identifying collinearity among variables: the variance inflation factor, collinearity analysis (eigenvalues and the condition index), and collinearity analysis without an intercept.

A variance inflation factor (VIF) greater than 10 indicates strong collinearity.
3. Logistic regression

When the response variable is categorical, the model to build is a logistic regression.
1. Tests of association between categorical variables

Association between categorical variables is generally examined with contingency-table analysis or the chi-square test. 1. A contingency table is a cross-frequency table formed from the category levels of two categorical variables; by computing row or column percentages, the actual frequencies can be compared with the expected frequencies.
2. The chi-square test can be used to test the association between two categorical variables; the chi-square statistic is as follows:
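The statistic is missing from the source; the standard Pearson chi-square statistic for a contingency table with observed frequencies $O_{ij}$ and expected frequencies $E_{ij}$ is

$$ \chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n}. $$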

As can be seen, the statistic in fact measures the discrepancy between the observed frequencies and the expected frequencies.

2. The logistic regression equation
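The equation is not reproduced in the source; in standard form, with $p = P(y = 1 \mid x)$,

$$ \ln \frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k. $$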
3. Methods for evaluating model performance. (1) Concordance analysis: count the concordant pairs, discordant pairs, and tied pairs to evaluate how well the model predicts its own data; the larger the c statistic, the better the model performs. (2) The confusion matrix, and model evaluation with the ROC curve.
The sensitivity and specificity of a predictive model can be determined from the confusion matrix. Sensitivity is the probability that the model scores a "hit" (correctly predicts a positive); specificity is the probability of a "correct negation" (correctly predicts a negative). With the cells labeled a, b, c, d, the formulas are sensitivity = a/(a+b) and specificity = d/(c+d).
The ROC curve is drawn from the model's sensitivity and specificity. The area under the ROC curve is the area enclosed by the curve, the horizontal axis, and the right-hand vertical axis. Since sensitivity and specificity both lie in [0, 1], the closer this area is to 1, the stronger the model's predictive power.
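A minimal scikit-learn sketch of both evaluation tools; the labels and predicted probabilities are invented for illustration, and the source's cell letters map as a = tp, b = fn, c = fp, d = tn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Invented labels and predicted probabilities, purely for illustration
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
p_hat = np.array([0.9, 0.8, 0.3, 0.6, 0.4, 0.2, 0.7, 0.5, 0.65, 0.1])
y_pred = (p_hat >= 0.5).astype(int)     # classify at the 0.5 threshold

# sklearn's confusion matrix is [[tn, fp], [fn, tp]] for labels {0, 1}
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)    # probability of a "hit"            (a / (a + b))
specificity = tn / (tn + fp)    # probability of correct negation   (d / (c + d))
auc = roc_auc_score(y_true, p_hat)      # area under the ROC curve
print(sensitivity, specificity, auc)
```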



VII. Time Series Analysis

A time series is the sequence of numerical or statistical observations of a variable or indicator in a system, arranged in chronological order; it is also called dynamic data.
1. The trend decomposition method

1. Forms of time-series variation
The main components to consider in a time series are:

Long-term trend: the series may be fairly stable or may trend over time; trends are generally linear, quadratic, or exponential.

Seasonal variation: repetitive behavior that varies with time, usually related to dates or climate and typically following an annual cycle.

Cyclical variation: in contrast to seasonal variation, the series may undergo longer "periodic changes", usually driven by economic fluctuations.

Random effects: irregular, unpredictable fluctuation.


In a typical decomposition plot (the source figure is not reproduced here), the black curve represents the original values of the series, and the long-term trend can be read from the time trend of the original sequence. Many industries show seasonal variation, for example transportation, wind power, and the prices of fruit and vegetables. Cyclical variation, also called the periodic trend, arises for example from the economic cycle. Cyclical and seasonal trends are the more stable components of the original series, whereas the irregular random component fluctuates widely and is hard to predict. Time-series decomposition therefore usually splits the series into the relatively stable long-term, cyclical, and seasonal trends, disregarding the random component.

2. Time Series model
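The model equations are not shown in the source; the standard decomposition models express the series $Y_t$ in terms of trend $T_t$, seasonal $S_t$, cyclical $C_t$, and irregular $I_t$ components, either additively or multiplicatively:

$$ Y_t = T_t + S_t + C_t + I_t \qquad \text{or} \qquad Y_t = T_t \cdot S_t \cdot C_t \cdot I_t. $$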


2. Classification of time-series analysis and prediction methods

Smoothing Prediction Method

Both the moving average method and exponential smoothing treat the values of the time series as random variables and use arithmetic or weighted averages of past values to predict the future trend. The resulting trend line is smoother than the line connecting the actual data points, hence the name smoothing prediction methods.
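A minimal pandas sketch of both smoothing methods on an invented monthly series (the window width and smoothing constant are assumptions):

```python
import pandas as pd

# Invented monthly series, for illustration only
s = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118])

ma = s.rolling(window=3).mean()   # simple moving average over 3 periods
exp = s.ewm(alpha=0.3).mean()     # exponential smoothing with assumed alpha
print(pd.DataFrame({"raw": s, "ma3": ma, "exp": exp}))
```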

Trend Extrapolation Prediction Method

Based on the historical statistical data of the object being forecast, a time function of predetermined form is fitted and used to describe the development trend of the forecast target.

Stationary Time Series Prediction method

Because the stochastic characteristics of a stationary time series do not change over time, the parameters of a time-series model can be estimated from past data and then used to predict the future.
3. The ARMA model for stationary time series
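A minimal statsmodels sketch: an ARMA(p, q) model is fitted as an ARIMA model with differencing order d = 0; the series is simulated AR(1) data, invented for illustration:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate a stationary AR(1) series for illustration
rng = np.random.default_rng(5)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.6 * y[t - 1] + rng.normal()

# ARMA(1, 1) fit: ARIMA order (p, d, q) with d = 0
model = ARIMA(y, order=(1, 0, 1)).fit()
print(model.params)             # constant, AR and MA coefficients, variance
print(model.forecast(steps=5))  # predict the next five values
```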


