**Note: I wrote this report in December of the first year of my doctoral program. I recently came across it while tidying up my computer; it summarizes what I learned from reading KDD papers at the time, and I am sharing it here.**

**I wrote it as a beginner, and my views on some points have since changed. Feel free to skim it, and please point out any mistakes.**

**Key point: the core of the report covers two papers, discussed in Chapter 3 and Chapter 4. Both are listed in the references and were fairly recent when this report was written.**

When reposting, please credit the source: http://www.cnblogs.com/xbinworld

**1 Introduction**

In computer vision, pattern recognition, and data mining, we often encounter high-dimensional data, which can cause many problems, such as degrading the running time and accuracy of algorithms. The purpose of feature selection is to find a useful subset of the original feature dimensions and then apply an effective algorithm to that subset for tasks such as clustering, classification, and retrieval.

The goal of feature selection is to select the most important subset of features under a specific evaluation criterion. This is essentially a combinatorial optimization problem with a high computational cost. Traditional feature selection methods compute a score for each feature independently and then select the top k features by score. The score generally measures a feature's ability to distinguish different clusters. This approach works well for two-class problems but is likely to fail for multi-class problems.

Depending on whether label information is available, feature selection methods can be divided into supervised and unsupervised methods. Supervised methods usually evaluate the importance of a feature through its correlation with the labels. However, labels are expensive to obtain, which makes these methods hard to apply to large datasets, so unsupervised feature selection is particularly important. Unsupervised methods use only the information in the data itself and cannot use labels, so it is harder for them to achieve good results.

Feature selection is an active research area. Many related works have been proposed in recent years [2] [3] [4], and the topic is attracting more and more attention. Research on spectral analysis of data and on L1-regularized models has also inspired new work on feature selection. Moreover, with the development of computers and networks, more and more attention is being paid to processing large-scale data, so that research can be put to practical use. Unsupervised feature selection is therefore all the more important. In this report, we focus on unsupervised feature selection methods.

**2 Feature Selection**

Feature selection methods can also be divided into **wrapper** methods and **filter** methods. Wrapper methods are commonly built around clustering: many algorithms perform feature selection and clustering jointly, searching for features that improve clustering performance. However, wrapper algorithms often have a high computational cost, which makes them hard to use in large-scale data mining and analysis.

Filter methods are more common and scale more easily. The maximum variance method is perhaps the simplest, yet it is quite effective. It essentially projects the data onto the original dimensions with the largest variance. PCA [6] uses the same idea, but it works with transformed features rather than a subset of the original features.
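As a rough illustration of the maximum variance criterion, here is a minimal NumPy sketch. The function name `select_by_variance` and the toy data are made up for this example, and rows are assumed to be samples and columns features.

```python
import numpy as np

def select_by_variance(X, k):
    """Return the indices of the k columns (features) of X with the largest variance."""
    variances = X.var(axis=0)
    # argsort is ascending, so reverse it and keep the first k indices
    return np.argsort(variances)[::-1][:k]

# Toy data: 5 samples, 3 features. Feature 1 varies most, feature 2 is constant.
X = np.array([[1.0, 10.0, 3.0],
              [2.0, -5.0, 3.0],
              [1.5, 20.0, 3.0],
              [1.0, -8.0, 3.0],
              [2.0,  0.0, 3.0]])
top = select_by_variance(X, 2)
print(top)
```

Note that the score ignores any cluster or label structure, which is exactly the limitation discussed for this criterion.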

Although the maximum variance criterion finds features that represent the data well, it does not necessarily discriminate the data well. The Laplacian score algorithm extracts features that reflect the underlying manifold structure of the data. The Fisher score algorithm discriminates the data effectively: it gives the highest score to the features that best separate the data points (points of different classes are pushed as far apart as possible, while points of the same class are kept as close together as possible).
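To make the Fisher score idea concrete, the sketch below uses one common per-feature form: between-class scatter divided by within-class scatter. The exact formula is not given in this report, so treat this form, the function name `fisher_score`, and the toy data as assumptions.

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher score (assumed form): between-class / within-class scatter.

    X: N x M array (rows are samples), y: length-N array of class labels.
    """
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    return between / (within + 1e-12)   # small constant avoids division by zero

# Feature 0 separates the two classes, feature 1 does not.
X = np.array([[0.0, 1.0], [0.2, 5.0], [0.1, 3.0],
              [5.0, 2.0], [5.2, 4.0], [5.1, 0.0]])
y = np.array([0, 0, 0, 1, 1, 1])
scores = fisher_score(X, y)
print(scores)
```

Unlike the variance criterion, this score needs labels, so it belongs to the supervised family.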

**2.1 Dimensionality Reduction Methods**

Feature selection algorithms are closely related to dimensionality reduction algorithms, and many are derived from classic dimensionality reduction methods. The following describes several common dimensionality reduction algorithms (feature selection is itself a form of dimensionality reduction).

**Principal Component Analysis** [6] (PCA) is the most common linear dimensionality reduction method. Its goal is to map high-dimensional data to a low-dimensional space through a linear projection such that the variance of the data along the projected dimensions is maximized, thereby using fewer dimensions while retaining as much of the structure of the original data as possible. The steps are as follows:

Let $X$ denote the $P \times N$ data matrix, where $P$ is the dimension and $N$ is the number of data points. Let $Y$ denote the $D \times N$ matrix after dimensionality reduction, where $D$ is the reduced dimension.

Step 1: Center the data: $x_i \leftarrow x_i - \frac{1}{N}\sum_{j=1}^{N} x_j$

Step 2: Take the eigenvectors corresponding to the $D$ largest eigenvalues of the covariance matrix $\frac{1}{N} X X^T$ as the projection directions $W$ (a $P \times D$ matrix)

Step 3: Compute $Y = W^T X$; the data is thereby reduced from $P$ dimensions to $D$.
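The three steps can be sketched in NumPy as follows. This assumes the data is stored as a P x N array with one column per data point; the function name `pca` and the synthetic data are illustrative only.

```python
import numpy as np

def pca(X, d):
    """PCA by eigendecomposition of the covariance matrix.

    X: P x N array (one column per data point). Returns (W, Y) with
    W of shape P x d and Y of shape d x N.
    """
    Xc = X - X.mean(axis=1, keepdims=True)      # Step 1: center each feature
    cov = Xc @ Xc.T / X.shape[1]                # P x P covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :d]                 # Step 2: top-d eigenvectors
    Y = W.T @ Xc                                # Step 3: project to d dimensions
    return W, Y

rng = np.random.default_rng(0)
# 200 points in 3-D that vary almost only along the direction (1, 1, 0)
t = rng.normal(size=200)
X = np.vstack([t, t, np.zeros(200)]) + 0.01 * rng.normal(size=(3, 200))
W, Y = pca(X, 1)
print(W[:, 0])   # close to +/- (1, 1, 0) / sqrt(2)
```

For large P, working with the SVD of the centered data is usually preferred over forming the covariance matrix explicitly.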

The kernelized version of PCA, kernel PCA (KPCA) [10], can also be used for nonlinear dimensionality reduction.

**Laplacian eigenmaps** [8] The intuition is that points that are related to each other (for example, points connected in a graph) should be as close as possible in the space after dimensionality reduction. Laplacian eigenmaps can reflect the intrinsic manifold structure of the data. The algorithm proceeds as follows:

Step 1: Build a graph

Construct a graph over all data points using some rule. For example, use the k-nearest-neighbor (KNN) rule and connect each vertex to its k nearest vertices, where k is a preset value.

Step 2: Determine the weights

Determine the weight of each edge. For example, with the heat kernel, if vertex $i$ is connected to vertex $j$, the weight is set to:

$$W_{ij} = e^{-\frac{\|x_i - x_j\|^2}{t}} \tag{1}$$

A simpler alternative is to set $W_{ij} = 1$ if vertices $i$ and $j$ are connected and $W_{ij} = 0$ otherwise.

Step 3: Eigenmaps

Compute the eigenvectors and eigenvalues of the Laplacian matrix $L$ from the generalized eigenproblem:

$$L y = \lambda D y \tag{2}$$

where $D$ is the diagonal matrix with $D_{ii} = \sum_j W_{ji}$, and $L = D - W$.

Use the eigenvectors corresponding to the $m$ smallest non-zero eigenvalues as the output after dimensionality reduction.
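A minimal NumPy sketch of these three steps is given below. It assumes heat-kernel weights $e^{-\|x_i - x_j\|^2 / t}$, a symmetrized KNN graph, and a connected graph (so that exactly one eigenvalue is zero). The generalized problem $Ly = \lambda Dy$ with $L = D - W$ is solved through the symmetric normalized Laplacian, which is a choice made here to stay within NumPy. The function name and test data are illustrative.

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors=5, t=1.0, m=2):
    """X: N x P array (rows are samples). Returns (eigenvalues, N x m embedding)."""
    N = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)

    W = np.zeros((N, N))
    for i in range(N):
        # Step 1: k nearest neighbours (index 0 is assumed to be the point itself)
        nbrs = np.argsort(dist2[i])[1:n_neighbors + 1]
        # Step 2: heat-kernel weights on the selected edges
        W[i, nbrs] = np.exp(-dist2[i, nbrs] / t)
    W = np.maximum(W, W.T)                      # make the graph undirected

    # Step 3: solve L y = lambda D y via L_sym = D^{-1/2} L D^{-1/2}
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.diag(deg) - W
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)          # ascending eigenvalues
    Y = d_inv_sqrt[:, None] * vecs              # back-transform: y = D^{-1/2} v
    return vals[1:m + 1], Y[:, 1:m + 1]         # drop the trivial zero eigenvalue

# Points on an arc: the first embedding coordinate should follow the arc parameter.
theta = np.linspace(0, 3, 30)
X = np.c_[np.cos(theta), np.sin(theta)]
vals, Y = laplacian_eigenmaps(X, n_neighbors=5, t=1.0, m=2)
print(vals)
```

If the graph has several connected components, more than one eigenvalue is zero and the slicing above would need to be adapted.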

**Locally Linear Embedding** [7] (LLE) is a nonlinear dimensionality reduction algorithm that keeps the original manifold structure of the data largely intact after dimensionality reduction.

For example, after three-dimensional data is mapped to two dimensions with LLE, the mapped data still preserves the original manifold, which shows that LLE effectively maintains the manifold structure of the data.

However, LLE is not applicable in some cases. If the data is distributed over an entire closed sphere, LLE cannot map it to two-dimensional space while preserving the original manifold. When processing data, we therefore first assume that it is not distributed on a closed sphere or ellipsoid.

LLE assumes that each data point can be reconstructed as a linear weighted combination of its neighbors. The algorithm has three main steps: (1) find the K nearest neighbors of each sample point; (2) compute the local reconstruction weight matrix from the neighbors of each sample point; (3) compute the output of each sample point from the local reconstruction weight matrix and its neighbors. The details are as follows:

Step 1:

Compute the K nearest neighbors of each sample point. For example, use KNN to take the K sample points closest to each sample point (usually in Euclidean distance) as its nearest neighbors, where K is a preset value.

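Step 2:

Compute the local reconstruction weight matrix $W$ of the sample points. First, define the reconstruction error:

$$\varepsilon(W) = \sum_{i=1}^{N} \Big\| x_i - \sum_{j} w_{ij}\, x_j \Big\|^2 \tag{3}$$

where $w_{ij} = 0$ if $x_j$ is not a neighbor of $x_i$, and the local covariance matrix $C$:

$$C_{jk} = (x - \eta_j)^T (x - \eta_k) \tag{4}$$

Here $x$ denotes a specific point and $\eta_1, \dots, \eta_K$ denote its $K$ neighboring points.

With the constraint $\sum_j w_j = 1$, the objective function to minimize for this point becomes:

$$\min_{w}\ \varepsilon = \sum_{j,k} w_j w_k C_{jk}, \quad \text{s.t. } \sum_j w_j = 1 \tag{5}$$

Solving it gives:

$$w_j = \frac{\sum_k C^{-1}_{jk}}{\sum_{l,s} C^{-1}_{ls}} \tag{6}$$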

Step 3:

Map all sample points to a low-dimensional space. The mapping must minimize the following cost:

$$\Phi(Y) = \sum_{i=1}^{N} \Big\| y_i - \sum_{j} w_{ij}\, y_j \Big\|^2 \tag{7}$$

subject to the constraints $\sum_i y_i = 0$ and $\frac{1}{N}\sum_i y_i y_i^T = I$. The above formula can be rewritten as:

$$\Phi(Y) = \sum_{i,j} M_{ij}\, y_i^T y_j \tag{8}$$

where $M = (I - W)^T (I - W)$.

To minimize the loss, $Y$ is taken to be the eigenvectors of $M$ corresponding to its $m$ smallest non-zero eigenvalues. In practice, the eigenvalues of $M$ are sorted in ascending order; the first eigenvalue is almost zero, so it is discarded, and the eigenvectors corresponding to the 2nd through $(m+1)$-th eigenvalues are used as the output.
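The three LLE steps can be sketched in NumPy as below. The regularization term added to the local covariance matrix is an assumption of this sketch (it is commonly needed when K exceeds the input dimension), and the function name `lle` and the synthetic data are illustrative.

```python
import numpy as np

def lle(X, n_neighbors=8, m=2, reg=1e-3):
    """X: N x P array (rows are samples). Returns (W, Y) with Y of shape N x m."""
    N = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    W = np.zeros((N, N))
    for i in range(N):
        # Step 1: K nearest neighbours (index 0 is assumed to be the point itself)
        nbrs = np.argsort(dist2[i])[1:n_neighbors + 1]
        # Step 2: local covariance C_jk = (x_i - eta_j)^T (x_i - eta_k)
        Z = X[i] - X[nbrs]
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(n_neighbors)   # assumed regularisation
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, nbrs] = w / w.sum()                       # weights sum to one
    # Step 3: bottom eigenvectors of M = (I - W)^T (I - W)
    I_W = np.eye(N) - W
    M = I_W.T @ I_W
    vals, vecs = np.linalg.eigh(M)                     # ascending eigenvalues
    return W, vecs[:, 1:m + 1]                         # skip the near-zero first one

rng = np.random.default_rng(0)
t = np.linspace(0, 3, 100)
X = np.c_[np.cos(t), np.sin(t), rng.uniform(0, 1, 100)]   # a curved 2-D sheet in 3-D
W, Y = lle(X, n_neighbors=8, m=2)
print(Y.shape)
```

Solving $Cw = \mathbf{1}$ and normalizing is equivalent to the closed form based on $C^{-1}$, but avoids forming the inverse explicitly.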

Next, we introduce two newer feature selection algorithms: unsupervised multi-cluster feature selection [5] (Chapter 3) and unsupervised feature selection for PCA [1] (Chapter 4).

**3 Unsupervised multi-cluster feature selection**

Feature selection methods generally do not take the structure of the data itself into account, even though many datasets have a multi-cluster structure. A good feature selection method should consider the following two points:

- The selected features should preserve the cluster structure of the data. Recent studies show that some human-generated data has an intrinsic manifold structure, which should be taken into account when clustering.

- The selected features should cover all possible clusters in the data. Because different features differ in their ability to distinguish different clusters, a selection that distinguishes only some of the clusters, but not all of them, is not suitable.

**3.1 Spectral embedding for cluster analysis**

We discussed Laplacian eigenmaps in Chapter 2. Let $Y = [y_1, \dots, y_K]$ be the eigenvectors obtained from formula (2). Each row of $Y$ is the low-dimensional representation of one data point. $K$ is the intrinsic dimensionality of the data, and each $y_k$ reflects the distribution of the data along that dimension (which can be understood as a topic or a concept). In cluster analysis, each $y_k$ reflects the distribution of the data over one cluster, so $K$ can be set to the number of clusters.

**3.2 Learning sparse coefficients**

After obtaining $Y$, we can measure the importance of each feature along each intrinsic dimension, that is, each column of $Y$, which reflects the ability of each feature to distinguish the corresponding data cluster.

Given a column $y_k$ of $Y$, we can find a subset of relevant features by minimizing the fitting error:

$$\min_{a_k}\ \|y_k - X a_k\|^2 + \beta\, |a_k| \tag{9}$$

Here $a_k$ is an $M$-dimensional vector ($X$ is an $N \times M$ matrix) and $|a_k| = \sum_{j=1}^{M} |a_{k,j}|$ denotes the L1-norm. $a_k$ contains the coefficients with which the individual features approximate $y_k$. Due to the nature of the L1-norm, when $\beta$ is large enough, some coefficients become exactly 0. We can therefore select the subset of most relevant features. Formula (9) is essentially a regression problem known as the lasso.
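To show how the L1 penalty produces sparse coefficients, here is a small lasso solver for the objective $\|y - Xa\|^2 + \beta |a|$. It uses cyclic coordinate descent rather than the LARS algorithm referred to in [5]; that substitution, the function name `lasso_cd`, and the synthetic data are choices made for this sketch.

```python
import numpy as np

def lasso_cd(X, y, beta, n_iter=200):
    """Minimise ||y - X a||^2 + beta * |a|_1 by cyclic coordinate descent."""
    n, m = X.shape
    a = np.zeros(m)
    col_sq = np.sum(X ** 2, axis=0)
    residual = y - X @ a
    for _ in range(n_iter):
        for j in range(m):
            if col_sq[j] == 0.0:
                continue
            residual += X[:, j] * a[j]          # remove feature j's contribution
            rho = X[:, j] @ residual
            # soft-thresholding at beta / 2, then rescale by the column norm
            a[j] = np.sign(rho) * max(abs(rho) - beta / 2.0, 0.0) / col_sq[j]
            residual -= X[:, j] * a[j]          # add the updated contribution back
    return a

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
# The target depends only on features 0 and 3.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=100)
a_small = lasso_cd(X, y, beta=1.0)    # mild penalty: recovers the two features
a_large = lasso_cd(X, y, beta=1e4)    # very strong penalty: every coefficient is 0
print(a_small, a_large)
```

In the MCFS setting, `y` would be one column $y_k$ of the spectral embedding and the solver would be run once per cluster.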

**3.3 Feature selection**

We need to select $d$ features from the $M$ original features. For data containing $K$ clusters, the method above yields $K$ sparse coefficient vectors $a_1, \dots, a_K$, each with $d$ non-zero elements (corresponding to $d$ features). Clearly, if every feature with a non-zero coefficient in any of these vectors were selected, there could be more than $d$ features. The following simple strategy is therefore used to select exactly $d$ features.

Define the MCFS score of each feature $j$ as:

$$\mathrm{MCFS}(j) = \max_k |a_{k,j}| \tag{10}$$

where $a_{k,j}$ is the $j$-th element of $a_k$. Sort all features in descending order of MCFS score and select the first $d$.
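The selection step is short enough to sketch directly. The snippet assumes the score of a feature is the largest absolute coefficient it receives across the K sparse vectors, as in [5]; the function name `mcfs_select` and the coefficient values are invented for the example.

```python
import numpy as np

def mcfs_select(A, d):
    """A: K x M array whose k-th row is the sparse coefficient vector a_k.

    Returns (scores, indices of the d features with the largest score).
    """
    scores = np.max(np.abs(A), axis=0)           # assumed: MCFS(j) = max_k |a_{k,j}|
    order = np.argsort(-scores, kind="stable")   # descending by score
    return scores, order[:d]

# Two clusters (K = 2), four features (M = 4)
A = np.array([[0.0, 0.9, 0.0, -0.1],
              [0.5, 0.0, 0.0, -0.7]])
scores, selected = mcfs_select(A, 2)
print(scores, selected)
```

Taking the maximum over clusters means a feature is kept if it matters strongly for at least one cluster, which matches the requirement that the selection cover all clusters.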

**3.4 Computational complexity analysis**

The computational cost of MCFS breaks down as follows:

- Constructing the p-nearest-neighbor graph requires computing the pairwise distances between data points and then finding the p nearest neighbors of each vertex.

- Given the p-nearest-neighbor graph, the first K eigenvectors of formula (2) are computed with the Lanczos algorithm.

- Formula (9) is solved with LARS under a limit on the number of non-zero coefficients; this is repeated once for each of the K clusters.

- Finally, the features are sorted by score to select the first d.

With p fixed to the constant 5, the overall complexity of MCFS is the sum of the costs of these steps; the exact expressions are given in [5].

**4 Unsupervised Feature Selection for PCA**

**PCA** is an important linear dimensionality reduction algorithm, widely used on social, economic, biological, and other kinds of data. PCA was briefly discussed in Chapter 2; here we describe it from another perspective.

**4.1 Subspace selection**

Given a data matrix $A \in \mathbb{R}^{m \times n}$, $m$ is the number of data points and $n$ is the data dimension. Let $k$ be the dimension (number of features) after dimensionality reduction, and assume that the columns of $A$ are centered. PCA returns the first $k$ left singular vectors of $A$ (an $m \times k$ matrix $U_k$) and projects the data onto the space spanned by the columns of $U_k$.

Let $P_{U_k} = U_k U_k^T$ be the projection matrix onto this subspace. It is the optimal projection in the sense that it minimizes the residual over all possible $k$-dimensional subspaces:

$$\|A - P_{U_k} A\|_\xi = \min_{\Phi \in \mathbb{R}^{m \times k},\ \Phi^T \Phi = I_k} \|A - \Phi \Phi^T A\|_\xi, \quad \xi \in \{2, F\} \tag{11}$$

We would like an efficient unsupervised feature selection algorithm (polynomial time in $m$ and $n$) that picks exactly $k$ features such that PCA applied only to these $k$ features gives results close to PCA applied to all features. To define closeness, let $C$ be the $m \times k$ matrix containing only the $k$ features (columns) selected from $A$. The following residual measures the quality of the selection:

$$\|A - P_C A\|_\xi = \|A - C C^{+} A\|_\xi \tag{12}$$

Here $P_C = C C^{+}$ is the projection matrix onto the $k$-dimensional space spanned by the columns of $C$, and $C^{+}$ denotes the pseudo-inverse of $C$. This is equivalent to the column subset selection problem (CSSP).
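The two residuals can be compared numerically. The sketch below assumes the Frobenius norm and uses an arbitrary, hand-picked column subset; the function names and the random matrix are illustrative only.

```python
import numpy as np

def pca_residual(A, k):
    """||A - U_k U_k^T A||_F, the best achievable rank-k residual."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]
    return np.linalg.norm(A - Uk @ (Uk.T @ A), "fro")

def column_subset_residual(A, cols):
    """||A - C C^+ A||_F for C = A[:, cols]."""
    C = A[:, cols]
    return np.linalg.norm(A - C @ (np.linalg.pinv(C) @ A), "fro")

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))
A -= A.mean(axis=0)                 # center the columns, as assumed in the text
k = 3
best = pca_residual(A, k)
subset = column_subset_residual(A, [0, 2, 5])
print(best, subset)                 # the subset residual can never beat the PCA one
```

A good column subset is one whose residual is within a small factor of `best`, which is what the CSSP algorithm below aims for.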

In modern statistical data analysis, selecting original features from high-dimensional data (feature selection) is often more advantageous than constructing transformed features (feature extraction).

**4.2 Two-stage CSSP**

This section describes a two-stage CSSP algorithm. The procedure is as follows:

Algorithm 1:

Input: an $m \times n$ matrix $A$, an integer $k$

Output: an $m \times k$ matrix $C$ containing $k$ columns of $A$

1. Initial setup

- Compute the first $k$ right singular vectors of $A$, denoted $V_k$

- Compute the sampling probability $p_j$ for each $j = 1, \dots, n$

- Let $c = \Theta(k \log k)$

2. Randomized stage

- For $j = 1, \dots, n$, keep column $j$ with probability $\min\{1, c p_j\}$; its scaling factor is $1/\sqrt{\min\{1, c p_j\}}$

- Generate the sampling matrix $S_1$ and the scaling matrix $D_1$

3. Deterministic stage

- Select exactly $k$ columns of the matrix $V_k^T S_1 D_1$, which defines the sampling matrix $S_2$

- Return the corresponding $k$ columns of $A$, that is, return $C = A S_1 S_2$

4. Repeat steps 2 and 3 40 times, and return the columns that give the smallest residual

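Specifically, Algorithm 1 first computes a probability $p_j$ for each column of $A$. The probability distribution depends on the first $k$ right singular vectors of $A$, $V_k$, and is written:

$$p_j = \frac{1}{2}\,\frac{\|(V_k)_{(j)}\|_2^2}{k} + \frac{1}{2}\,\frac{\|A^{(j)}\|_2^2 - \|(A V_k V_k^T)^{(j)}\|_2^2}{\|A\|_F^2 - \|A V_k V_k^T\|_F^2} \tag{13}$$

where $(V_k)_{(j)}$ is the $j$-th row of $V_k$ and $A^{(j)}$ is the $j$-th column of $A$. From this formula, $p_j$ can be computed as soon as $V_k$ is available, so the running time of this step is dominated by the computation of $V_k$.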

In the randomized stage, Algorithm 1 randomly selects columns of $A$ as the input to the next stage. For $j = 1, \dots, n$, column $j$ is kept with probability $\min\{1, c p_j\}$. If column $j$ is kept, its scaling factor equals $1/\sqrt{\min\{1, c p_j\}}$. At the end of this stage we therefore have $c_1$ columns and their corresponding scaling factors. Because the sampling is random, $c_1$ is generally not equal to $c$; with high probability, however, it is not much larger than $c$. To represent the selected columns and scaling factors, we use the following notation:

First, define an $n \times c_1$ sampling matrix $S_1$. It is initially empty; when column $j$ is selected, the standard basis vector $e_j$ is appended to it as a new column. Then define the $c_1 \times c_1$ diagonal scaling matrix $D_1$: when column $j$ is selected, the corresponding diagonal element is $1/\sqrt{\min\{1, c p_j\}}$. The output of the randomized stage is therefore $V_k^T S_1 D_1$.

In the deterministic stage, exactly $k$ columns are selected from the columns kept in the previous stage. In effect, this defines a $c_1 \times k$ sampling matrix $S_2$. After this stage, the matrix $C = A S_1 S_2$ is obtained as the final result.
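The whole two-stage procedure can be sketched in NumPy as follows. Several parts are assumptions of this sketch rather than the method of [1]: the sampling probabilities follow the form $p_j = \frac{1}{2}\|(V_k)_{(j)}\|^2/k + \frac{1}{2}(\|A^{(j)}\|^2 - \|(AV_kV_k^T)^{(j)}\|^2)/(\|A\|_F^2 - \|AV_kV_k^T\|_F^2)$, the constant in `c` is arbitrary, and the deterministic stage uses simple greedy column pivoting as a stand-in for whatever deterministic column selection routine is used in the paper. All function names are hypothetical.

```python
import numpy as np

def sampling_probabilities(A, k):
    """Return (V_k, p): top-k right singular vectors and column probabilities."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k].T                                   # n x k
    leverage = np.sum(Vk ** 2, axis=1) / k          # sums to 1
    Ak = A @ Vk @ Vk.T
    resid = np.maximum(np.sum(A ** 2, axis=0) - np.sum(Ak ** 2, axis=0), 0.0)
    total = resid.sum()
    if total <= 1e-12:                              # A has rank <= k
        return Vk, leverage
    return Vk, 0.5 * leverage + 0.5 * resid / total

def randomized_stage(Vk, p, c, rng):
    """Keep column j with probability min(1, c p_j); return indices and V_k^T S_1 D_1."""
    keep_prob = np.minimum(1.0, c * p)
    kept = np.flatnonzero(rng.random(p.size) < keep_prob)
    scale = 1.0 / np.sqrt(keep_prob[kept])
    return kept, Vk[kept].T * scale                 # k x c1 matrix

def deterministic_stage(B, k):
    """Greedy column pivoting on B (assumed stand-in for the paper's routine)."""
    R = B.copy()
    chosen = []
    for _ in range(k):
        norms = np.sum(R ** 2, axis=0)
        j = int(np.argmax(norms))
        if norms[j] <= 1e-12:                       # no independent column left
            break
        chosen.append(j)
        q = R[:, j] / np.sqrt(norms[j])
        R -= np.outer(q, q @ R)                     # project out the chosen direction
    return chosen

def two_stage_cssp(A, k, c=None, n_trials=40, seed=0):
    rng = np.random.default_rng(seed)
    if c is None:
        c = int(np.ceil(4 * k * np.log(k + 1)))     # assumed constant for Theta(k log k)
    Vk, p = sampling_probabilities(A, k)
    best_cols, best_err = None, np.inf
    for _ in range(n_trials):
        kept, B = randomized_stage(Vk, p, c, rng)
        if kept.size < k:
            continue                                # too few columns sampled, retry
        local = deterministic_stage(B, k)
        if len(local) < k:
            continue
        cols = kept[local]
        C = A[:, cols]
        err = np.linalg.norm(A - C @ (np.linalg.pinv(C) @ A), "fro")
        if err < best_err:
            best_cols, best_err = np.sort(cols), err
    return best_cols, best_err

rng = np.random.default_rng(1)
A = rng.normal(size=(60, 20))
A -= A.mean(axis=0)
cols, err = two_stage_cssp(A, k=3)
print(cols, err)
```

The repeated trials matter because the randomized stage can occasionally discard an important column; keeping the best of 40 runs makes that unlikely to affect the final choice.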

**6 References**

[1] Boutsidis, C., Mahoney, M. W., Drineas, P. Unsupervised feature selection for principal components analysis. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, 61-69.

[2] Yu, L., Ding, C., Loscalzo, S. Stable feature selection via dense feature groups. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, 803-811.

[3] Forman, G., Scholz, M., Rajaram, S. Feature shaping for linear SVM classifiers. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, 299-308.

[4] Loscalzo, S., Yu, L., Ding, C. Consensus group stable feature selection. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, 567-576.

[5] Cai, D., Zhang, C., He, X. Unsupervised feature selection for multi-cluster data. To appear in SIGKDD 2010.

[6] Smith, L. I. A tutorial on principal components analysis. Cornell University, USA, 2002.

[7] Roweis, S. T., Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000, 290(5500): 2323.

[8] Belkin, M., Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 2002, 1: 585-592.

[9] Tenenbaum, J. B., Silva, V., Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 2000, 290(5500): 2319.

[10] Schölkopf, B., Smola, A. J., Müller, K.-R. Kernel principal component analysis. Lecture Notes in Computer Science, 1997, 1327: 583-588.