The seven most praised dimensionality reduction methods in data analysis



Thanks to Wang Murong for his contribution of this article to the community.

Recently, the rapid growth in both the number of data records and the number of attributes has driven the development of big data processing platforms and parallel data analysis algorithms. At the same time, it has pushed dimensionality reduction into wider use. In fact, more data is sometimes overkill: in a data analysis application, a very large amount of data can actually lead to worse performance.

A recent example is the use of the KDD Challenge large data set for customer churn prediction, which has up to 15,000 data columns. Most data mining algorithms operate column by column, which makes them slower and slower as the number of data columns grows. The core of the project was therefore to reduce the number of data columns while losing as little information as possible.

Taking this project as an example, let us look at the dimensionality reduction methods that are most widely praised and adopted by data analysts today.

1. Missing Values Ratio

This method is based on the observation that a data column containing too many missing values is unlikely to carry much useful information. Data columns whose ratio of missing values exceeds a given threshold can therefore be removed; the lower that threshold, the more aggressive the reduction.
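
As an illustration, here is a minimal pandas sketch of this filter; the 0.4 threshold is an arbitrary value chosen for the example, not a cutoff taken from the original project.

    import pandas as pd

    def drop_high_missing(df: pd.DataFrame, threshold: float = 0.4) -> pd.DataFrame:
        """Remove columns whose fraction of missing values exceeds `threshold`."""
        missing_ratio = df.isna().mean()                  # per-column fraction of NaNs
        keep = missing_ratio[missing_ratio <= threshold].index
        return df[keep]

    # Example: column "a" is 50% missing and gets dropped with threshold=0.4.
    # df = pd.DataFrame({"a": [1, None, None, 4], "b": [1, 2, 3, 4]})
    # reduced = drop_high_missing(df, threshold=0.4)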

2. Low Variance Filter

Similar to the previous method, this one assumes that a data column whose values change very little carries little information. All data columns whose variance falls below a given threshold are therefore removed. Note that the variance depends on the range of the column, so the data needs to be normalized before this method is applied.
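
A minimal pandas sketch of this filter, assuming purely numeric columns and min-max normalization; the 0.01 variance threshold is again only illustrative.

    import pandas as pd

    def drop_low_variance(df: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
        """Drop numeric columns whose variance, after min-max normalization, is below `threshold`."""
        numeric = df.select_dtypes(include="number")
        # Normalize every column to [0, 1] so the variances are comparable across ranges.
        normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())
        # Constant columns have zero range and turn into NaN above; treat them as zero variance.
        variances = normalized.var().fillna(0.0)
        low_var = variances[variances < threshold].index
        return df.drop(columns=low_var)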

3. High Correlation Filter

High correlation filtering is based on the idea that two data columns whose values follow very similar trends are likely to carry very similar information, so feeding only one of them to the machine learning model is enough. The similarity between numerical columns is measured with the correlation coefficient, while the similarity between nominal columns can be measured with the Pearson chi-square value. Of any two columns whose correlation exceeds a given threshold, only one is retained. Note again that the correlation coefficient is sensitive to the data range, so the data must be normalized before the calculation.
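
For the numerical case, a possible sketch based on the Pearson correlation matrix is shown below; the nominal (chi-square) case would need a separate test and is omitted here. The 0.9 threshold is an illustrative value.

    import numpy as np
    import pandas as pd

    def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
        """For every pair of numeric columns with |correlation| above `threshold`, keep only one."""
        numeric = df.select_dtypes(include="number")
        corr = numeric.corr().abs()
        # Keep only the upper triangle so every pair is inspected exactly once.
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
        return df.drop(columns=to_drop)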

4. Random Forests / Ensemble Trees

Decision tree ensembles, usually called random forests, are useful both for feature selection and for building an effective classifier. One common dimensionality reduction approach is to generate a large set of trees against the target attribute and then use each attribute's usage statistics to find the most informative subset of features. For example, for a very large data set we can generate a large set of very shallow trees, with each tree trained on only a small fraction of the attributes. If an attribute is often selected as the best split, it is very likely an informative feature worth keeping. A score computed from the attribute usage statistics in the random forest tells us, relative to the other attributes, which attributes have the most predictive power.
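
One way to approximate this procedure is with scikit-learn's random forest and its impurity-based feature importances, used here as a stand-in for the split-frequency statistic described above; the tree depth, the number of trees, and the final 20% cutoff are all illustrative choices.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    def rank_features_by_forest(X: pd.DataFrame, y, n_estimators: int = 500) -> pd.Series:
        """Train many shallow trees, each on a random subset of features,
        and rank the features by their contribution to the splits."""
        forest = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=3,              # shallow trees, as described above
            max_features="sqrt",      # each tree only sees a fraction of the attributes
            n_jobs=-1,
            random_state=0,
        )
        forest.fit(X, y)
        importances = pd.Series(forest.feature_importances_, index=X.columns)
        return importances.sort_values(ascending=False)

    # Keeping, say, the top 20% of attributes would then be:
    # selected = rank_features_by_forest(X, y).head(int(0.2 * X.shape[1])).index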

5. Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical procedure that orthogonally transforms the original n-dimensional data set into a new set of coordinates called principal components. After the transformation, the first principal component has the largest possible variance, and each following component has the largest possible variance under the constraint that it is orthogonal to the preceding components. Keeping only the first m (m < n) components reduces the dimensionality while retaining as much of the variation in the data as possible. Note that the PCA transformation is sensitive to the relative scaling of the original variables, so the data must be normalized before it is applied. Note also that the new principal components are not real, physically meaningful variables of the system, so interpretability of the data is lost after a PCA transformation. If the ability to interpret the data is important to your analysis, PCA may not be the right transformation for you.
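
A minimal scikit-learn sketch, assuming standardization as the normalization step; keeping enough components to explain 95% of the variance is an arbitrary illustrative criterion.

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def pca_reduce(X: pd.DataFrame, variance_to_keep: float = 0.95) -> pd.DataFrame:
        """Standardize the data, then keep just enough principal components
        to explain `variance_to_keep` of the total variance."""
        X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
        pca = PCA(n_components=variance_to_keep)      # a float in (0, 1) selects components by explained variance
        components = pca.fit_transform(X_scaled)
        cols = [f"PC{i + 1}" for i in range(components.shape[1])]
        return pd.DataFrame(components, index=X.index, columns=cols)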

6. Backward Feature Elimination

In this technique, the selected classification algorithm is first trained on all n input features. Then, in each round of dimensionality reduction, we remove one input feature at a time and train the classifier n times on the remaining n-1 features, obtaining n new classifiers. The n-1 features used by the new classifier with the smallest increase in error rate become the reduced feature set, and the process is iterated: iteration k produces a classifier trained on n-k features. By choosing the maximum tolerable error rate, we obtain the smallest number of features needed to reach the required classification performance with the selected classifier.
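
The loop described above can be sketched as follows, using cross-validated accuracy as the error estimate; the 5-fold cross-validation and the 0.01 accuracy tolerance are illustrative choices, not values from the original project.

    from sklearn.base import clone
    from sklearn.model_selection import cross_val_score

    def backward_elimination(model, X, y, min_features: int = 1, max_accuracy_drop: float = 0.01):
        """Iteratively drop the feature whose removal hurts accuracy the least, stopping when
        accuracy falls more than `max_accuracy_drop` below the full-feature baseline."""
        features = list(X.columns)
        baseline = cross_val_score(clone(model), X[features], y, cv=5).mean()
        while len(features) > min_features:
            # Score every candidate set obtained by dropping exactly one feature.
            scores = {
                f: cross_val_score(clone(model), X[[c for c in features if c != f]], y, cv=5).mean()
                for f in features
            }
            best_feature, best_score = max(scores.items(), key=lambda kv: kv[1])
            if baseline - best_score > max_accuracy_drop:
                break                        # any further removal costs too much accuracy
            features.remove(best_feature)    # drop the least useful feature and iterate
        return features

    # Example (hypothetical): backward_elimination(DecisionTreeClassifier(), X, y)
    # where X is a pandas DataFrame of features and y the target column.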

7. Forward Feature Construction

Forward feature construction is the inverse of backward feature elimination. We start from a single feature and, at each step, add the one feature that produces the largest increase in classifier performance. Both forward feature construction and backward feature elimination are quite expensive in time and computation, so in practice they are only applied to data sets with a relatively small number of input dimensions.
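
scikit-learn's SequentialFeatureSelector implements this kind of greedy forward search directly; in the sketch below a naive Bayes classifier is wrapped purely as an example (the project compared decision tree, neural network, and naive Bayes models), and the target of 10 selected features is arbitrary.

    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.naive_bayes import GaussianNB

    def forward_construction(X, y, n_features: int = 10):
        """Greedily add, one at a time, the feature that most improves cross-validated accuracy."""
        selector = SequentialFeatureSelector(
            GaussianNB(),                    # any classifier works; naive Bayes keeps each round cheap
            n_features_to_select=n_features,
            direction="forward",
            cv=5,
            n_jobs=-1,
        )
        selector.fit(X, y)
        return X.columns[selector.get_support()]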

We used the KDD Challenge small data set to compare the dimensionality reduction ratio, the loss in accuracy, and the computation speed of these dimensionality reduction techniques. Of course, the final accuracy and its loss also depend on the model selected for the analysis, so the comparison of reduction ratio and accuracy was carried out across three models: decision tree, neural network, and naive Bayes.

By running an optimization loop, we determined, for each of the seven dimensionality reduction methods, the cutoff that gives the lowest dimensionality together with the highest accuracy, as well as the best classification model to pair with it. The performance of each final best model, measured as the area under the ROC curve, was compared with the baseline accuracy of the same model trained on all features. The full set of comparison results follows.

The comparison in the table above shows that dimensionality reduction can not only speed up algorithm execution but also improve model performance. Indeed, the area under the ROC curve (AUC) on the test data set even increased slightly when missing values ratio, low variance filter, high correlation filter, or random forest based reduction was applied.

Indeed, in the era of big data, "the more data, the better" seems to have become an axiom. Yet this project shows once again that when a data set is too noisy, an algorithm's performance can fall short of expectations. Removing uninformative, or even misleading, data columns can help us build a more scalable and more general data model, one that may well perform better on new data.

Recently, we asked a data analytics group on LinkedIn which dimensionality reduction methods they use most often in their work. Besides the ones covered in this post, their answers included: random projections, non-negative matrix factorization, auto-encoders, chi-square test and information gain, multidimensional scaling, correspondence analysis, factor analysis, clustering, and Bayesian models. Thanks to Asterios Stergioudis, Raoul Savos, and Michael Will for their suggestions on the LinkedIn group.

The workflows described in this blog post are available in the "003_preprocessing/003005_dimensionality_reduction" directory on the KNIME EXAMPLES server. The KDD Challenge small data set: Download.

This post is only a brief summary of the whole project; if you want to learn more about the details, you can read the related whitepaper. Whitepaper: link

This post originally appeared at: dataminingreporting.com
