**Deep Learning (II): Sparse Filtering**

Zouxy09@qq.com

http://blog.csdn.net/zouxy09

I read papers from time to time, but I always feel that I slowly forget them afterwards, as if I had never read them at all. So I want to summarize the useful knowledge points of the papers I read. On the one hand, this deepens my own understanding; on the other hand, it makes future review easier. I also share them on my blog so that everyone can discuss them together. Because my background is limited, some of my understanding of the papers may be wrong; I hope everyone will not hesitate to point out my mistakes. Thank you.

The paper discussed in this article:

Sparse Filtering. J. Ngiam, P. Koh, Z. Chen, S. Bhaskar, A. Y. Ng. NIPS 2011. The corresponding Matlab code is included in the paper's supplementary material, and the code is very brief, though I have not read it carefully yet.

The following is my understanding of some of the knowledge points in this paper:

**Sparse Filtering**

This paper presents an unsupervised feature learning algorithm. Most general-purpose unsupervised algorithms require tuning many hyperparameters, so this paper proposes a simple algorithm, sparse filtering, that has only one hyperparameter to tune (the number of features to learn), yet is still effective. Unlike other feature learning methods, sparse filtering does not explicitly construct a model of the input data's distribution. It only optimizes a simple cost function (the L1 sparsity of L2-normalized features), and the optimization can be implemented in a few lines of Matlab code. In addition, sparse filtering easily and effectively handles high-dimensional input and extends to multi-layer stacking.

The core idea of sparse filtering is to avoid explicitly modeling the data distribution and instead directly optimize the sparsity of the feature distribution, so as to obtain a better feature representation.

**I. Unsupervised feature learning**

Most feature learning methods attempt to model the true distribution of the given training data; in other words, feature learning learns a model that approximates the data's true distribution. Such methods include denoising autoencoders, restricted Boltzmann machines (RBMs), independent component analysis (ICA), and sparse coding.

These methods work well, but the annoying part is that they all require tuning many parameters, such as learning rates, momentum (needed in RBMs), sparsity penalties, and weight decay, and the final values must be chosen via cross-validation. Training such a model takes a long time, and with so many parameters to cross-validate, we spend a lot of effort finding one good set of parameters, only to have to tune a different set for the next task, which costs developers too much time. ICA needs to tune only one parameter, but it scales poorly to high-dimensional inputs or large feature sets.

In this paper, our goal is a simple and effective feature learning algorithm that requires minimal parameter tuning. Although learning a model of the data distribution is desirable and works well, it often complicates the learning algorithm: for example, RBMs need to approximate the gradient of the log-partition function in order to optimize the data likelihood, and sparse coding must find the active basis coefficients at every iteration, which is time-consuming, with the sparsity factor being yet another parameter to tune. Our method bypasses the estimation of the data distribution and directly analyzes and optimizes the distribution of the features. So what kind of feature distribution is optimal? Here we focus on three main properties of features: population sparsity, lifetime sparsity, and high dispersal. These characterize which features are good for classification or other tasks, and our learning algorithm should learn to extract such features.

**II. Feature distribution**

The feature learning algorithms discussed above can all be viewed as generating particular feature distributions. For example, sparse coding describes each sample with only a few non-zero coefficients (features). A feature-distribution-oriented method can instead directly optimize certain properties of the feature distribution so that it describes the samples better.

We introduce the feature distribution matrix: each row of the matrix is a feature, each column is a sample, and each element f_j^(i) denotes the activation value of the j-th feature on the i-th sample. As the analysis above shows, this defines a mapping (feature extraction) function from inputs to features.
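As a concrete illustration of this convention (the shapes and random data here are hypothetical, not from the paper), consider a linear feature map F = WX, with one row per feature and one column per sample:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # 4 learned features over 3-dimensional inputs
X = rng.standard_normal((3, 5))   # 5 samples, stored as columns

F = W @ X                         # feature distribution matrix: F[j, i] is feature j on sample i
print(F.shape)                    # rows index features, columns index samples
```

The three properties below are all statements about the rows and columns of this matrix.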

Next we will discuss what kind of feature distribution is good:

**1) The features of each sample should be sparse (population sparsity)**

Each sample should be described by only a few active (non-zero) features. Specifically, each column f^(i) of the feature matrix (one sample) should have few non-zero elements, with all the others being 0. For example, an image can be described by the objects it contains; out of the many objects that could possibly appear in an image, only a few are present at any one time. We call this population sparsity.

**2) Features should be sparse across samples (lifetime sparsity)**

Good features should be discriminative so that samples can be distinguished. For example, to distinguish a human face from a human hand, skin color is clearly not a discriminative feature, because both faces and hands have skin color; eyes, however, easily distinguish faces from hands, so eyes are a discriminative feature. Therefore, to distinguish samples we must select features specific to some samples, rather than features shared by all of them. Stated slightly more formally, each feature should be active on only a small number of samples; that is, each row of the feature matrix (one feature) should have only a few non-zero elements. This property of a feature is called lifetime sparsity.

**3) The feature distribution should be uniform (high dispersal)**

The distribution of each row (one feature taking different values across samples) should be similar to that of the other rows; in other words, every feature should have similar statistical properties. Specifically, for each row of the matrix, we take the mean of the squares of its elements as a description of its statistics. Every row then has such a mean value, and these means should be roughly equal, so that all features can be considered to have similar distributions. This property is called high dispersal. It is not strictly necessary for a good feature representation, but it prevents feature degeneracy, i.e., it prevents the same feature from being extracted repeatedly (identical features are redundant and add no information; ideally the extracted features are orthogonal). High dispersal can be understood as having few permanently inactive features. For example, PCA codes generally do not satisfy high dispersal, because the codes corresponding to the largest eigenvalues are almost always active.

Many feature learning methods actually include these constraints. For example, the sparse RBM constrains a feature's mean activation to be close to a target value (lifetime sparsity). ICA normalizes each feature and optimizes its lifetime sparsity. Sparse autoencoders also explicitly optimize lifetime sparsity.

In addition, clustering-based algorithms such as K-means are an extreme form of the population sparsity constraint: one cluster center corresponds to one feature, and for each sample only one feature is active (one value is 1 and all the others are 0). The triangle activation function also guarantees population sparsity. Sparse coding can likewise be regarded as satisfying population sparsity.

In this paper, we derive a simple feature learning algorithm from the perspective of feature distributions. It only needs to optimize high dispersal and population sparsity. In our experiments, we found that enforcing these two properties is sufficient for learning a good feature representation. Later we explain why the combination of these two properties in fact implies lifetime sparsity.

**III. Sparse filtering**

The following describes how sparse filtering captures the properties discussed above. We first consider computing linear features for each sample. Specifically, let f_j^(i) denote the j-th feature value (row j of the feature matrix) of the i-th sample (column i of the feature matrix). Since the features are linear, f_j^(i) = w_j^T x^(i). The first step is to normalize the rows of the feature matrix, then normalize its columns, and finally sum the absolute values of all the elements of the matrix.

Specifically, we first normalize each feature so that all features have equal overall activation, dividing each feature by its L2 norm across all samples: f̃_j = f_j / ||f_j||_2. We then normalize the features of each sample so that they lie on the L2 unit sphere (L2-ball): f̂^(i) = f̃^(i) / ||f̃^(i)||_2. At this point we can optimize the normalized features, using an L1 penalty to enforce sparsity. For a dataset of M samples, the sparse filtering objective is

minimize Σ_{i=1}^{M} ||f̂^(i)||_1 = Σ_{i=1}^{M} || f̃^(i) / ||f̃^(i)||_2 ||_1.
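The row normalization, column normalization, and L1 penalty above can be sketched in a few lines of NumPy. This is an illustrative re-implementation with random data, not the authors' Matlab code:

```python
import numpy as np

def sparse_filtering_objective(W, X):
    """Sparse filtering cost for weights W (features x inputs) and data X (inputs x samples)."""
    F = np.abs(W @ X)                                     # absolute values of the linear features
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True))  # rows: each feature to unit L2 norm across samples
    F = F / np.sqrt((F ** 2).sum(axis=0, keepdims=True))  # columns: each sample onto the L2 unit sphere
    return F.sum()                                        # L1 penalty (all entries are non-negative)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10))
W = rng.standard_normal((5, 3))
print(sparse_filtering_objective(W, X))
```

One consequence of the row normalization is that uniformly rescaling the data leaves the objective unchanged, e.g. `sparse_filtering_objective(W, 3 * X)` gives the same value.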

**3.1. Optimizing for population sparsity**

This objective measures the population sparsity of the features of the i-th sample, i.e., it pushes each sample to have only a few non-zero values. Because the normalized features are constrained to lie on the L2 unit sphere, the objective is minimized when the features are sparse, that is, when the sample lies close to a feature axis. Conversely, if every feature of a sample takes a similar value, the penalty is very high. This may be a little hard to grasp, so consider the figure from the paper:

Left figure: suppose the feature dimension is two (f1, f2) and we have two samples, green and brown. Each sample is first projected onto the L2 unit sphere (here, the two-dimensional unit circle) and then optimized for sparsity. We can see that a sample lying on a coordinate axis has the greatest sparsity: for example, a sample on the f2 axis is represented as (0, 1), with one feature value equal to 1 and the other equal to 0, which is obviously the sparsest case. Right figure: because of the normalization, the features compete with each other. One sample is increased only along the f1 feature. Although it grows only in the f1 direction (the green triangle moves to the blue triangle), after column normalization (projection onto the unit circle) the second feature f2 decreases (the green circle moves to the blue circle). In other words, the features compete: if one grows, the others must shrink.

One property of feature normalization is that it implicitly introduces competition between features. If only one component of f^(i) increases, normalization causes the values of all the other components to decrease; likewise, if only one component of f^(i) decreases, all the other components increase. Minimizing the objective therefore drives the normalized features to be sparse and mostly close to 0, i.e., a few feature values are relatively large while the rest are very small (close to 0). The objective thus optimizes the population sparsity of the features.
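This competition is easy to verify numerically. A toy sketch with a single hypothetical two-feature sample:

```python
import numpy as np

f = np.array([3.0, 4.0])           # a sample's two feature values before normalization
g = f / np.linalg.norm(f)          # project onto the L2 unit circle -> (0.6, 0.8)

f2 = np.array([6.0, 4.0])          # grow only the first feature; the second is untouched
g2 = f2 / np.linalg.norm(f2)       # re-project onto the unit circle

print(g, g2)                       # g2[1] < g[1]: feature 2 shrank although only feature 1 changed
```

Both normalized vectors have unit norm, so any gain in one component is paid for by the others.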

The per-sample term above is very similar to the Treves-Rolls measure of population/lifetime sparsity, which in its common form is

sparseness = ( Σ_j f_j / F )² / ( Σ_j f_j² / F ),

where F is the number of features. This measure is often used to gauge the sparsity of neuronal activation in the brain. Our formulation can be regarded as the square root of this measure, multiplied by a scale factor.

**3.2. Optimizing for high dispersal**

As mentioned above, the high dispersal property requires every feature to be roughly equally active, with no feature permanently inactive. Here, we crudely enforce that the mean squared activation of every feature is equal. In the sparse filtering formula above, we have already normalized each feature by dividing it by its L2 norm across all samples, f̃_j = f_j / ||f_j||_2, which has the same effect as equalizing each feature's expected squared value. High dispersal is therefore already implicitly optimized.

**3.3. Optimizing for lifetime sparsity**

We find that optimizing for population sparsity and high dispersal already implicitly optimizes the features' lifetime sparsity. Why is that? First, a population-sparse feature matrix has many inactive (zero) elements. Moreover, because high dispersal is satisfied, these zero elements are distributed roughly evenly across all the features. Therefore every feature must contain a certain number of zero elements, which guarantees lifetime sparsity. In short, optimizing population sparsity and high dispersal is enough to obtain a good feature representation.
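A toy check of this argument (the matrix below is constructed by hand, not learned): a feature matrix that is population-sparse in every column and highly dispersed across rows necessarily has sparse rows as well.

```python
import numpy as np

# 4 features x 8 samples: one active feature per column (population sparsity),
# with the activity spread evenly over the rows (high dispersal).
F = np.zeros((4, 8))
for i in range(8):
    F[i % 4, i] = 1.0

col_nonzeros = (F != 0).sum(axis=0)   # 1 active feature per sample
row_nonzeros = (F != 0).sum(axis=1)   # each feature is active on only 2 of the 8 samples
print(col_nonzeros, row_nonzeros)
```

Each row ends up mostly zero, i.e., lifetime-sparse, exactly as the argument predicts.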

**IV. Deep sparse filtering**

Because the sparse filtering objective is agnostic to the method used to compute the features, we are free to choose the feed-forward network that computes them. In general we can use complex non-linear functions, such as the soft absolute value function

f_j^(i) = sqrt(ε + (w_j^T x^(i))²),

or multi-layer networks, to compute the features. Sparse filtering is thus also a natural framework for training deep networks.

Deep networks with sparse filtering can be trained with the standard greedy layer-wise strategy: first train a single layer of normalized features with sparse filtering, then feed those features as input to train a second layer, and so on. In our experiments we find that this learns meaningful feature representations.

**V. Experiments and analysis**

In our experiments, we use the soft absolute value function as the activation function,

f_j^(i) = sqrt(ε + (w_j^T x^(i))²) with ε = 10^-8,

and then optimize the sparse filtering objective with an off-the-shelf L-BFGS solver until convergence.
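A minimal end-to-end sketch of this training setup, assuming SciPy's L-BFGS-B solver with numerically approximated gradients in place of the authors' Matlab toolchain (random data; the analytic gradient used in the paper's code is omitted for brevity):

```python
import numpy as np
from scipy.optimize import minimize

def sparse_filtering_cost(w_flat, X, n_features, eps=1e-8):
    """Sparse filtering objective over flattened weights; gradient is left to the optimizer."""
    W = w_flat.reshape(n_features, X.shape[0])
    F = np.sqrt(eps + (W @ X) ** 2)                        # soft absolute value activation
    F = F / np.sqrt((F ** 2).sum(axis=1, keepdims=True))   # normalize each feature (row)
    F = F / np.sqrt((F ** 2).sum(axis=0, keepdims=True))   # normalize each sample (column)
    return F.sum()                                         # L1 penalty on the normalized features

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 50))      # 4-dimensional inputs, 50 samples
n_features = 6

w0 = rng.standard_normal(n_features * X.shape[0])
res = minimize(sparse_filtering_cost, w0, args=(X, n_features),
               method="L-BFGS-B", options={"maxiter": 50})
print(res.fun)                         # cost after optimization (lower than at w0)
```

The rows of the optimized `res.x.reshape(n_features, 4)` are the learned linear filters.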

For other experimental results, please see the original paper.

**VI. Summary and discussion**

**6.1. Connection to divisive normalization**

The population sparsity term of sparse filtering is closely related to divisive normalization, a process in early visual processing in which a neuron's response is divided by the sum (or weighted sum) of the responses of all neurons in its neighborhood. Divisive normalization is very effective in multi-stage object recognition systems; however, it is usually introduced as a preprocessing stage rather than as part of unsupervised learning (pre-training). Sparse filtering, in effect, builds divisive normalization into feature learning, making the features compete so as to learn representations that satisfy population sparsity.

**6.2. Connection to ICA and sparse coding**

The sparse filtering objective can be seen as a normalized version of the ICA objective. In ICA, the objective is to minimize the response of a linear filter bank, e.g., |**WX**|_1, subject to the constraint that the filters are orthogonal to each other. The orthogonality constraint ensures that the learned features are diverse. In sparse filtering, we replace this constraint with a normalized sparsity penalty, dividing each filter's response by the L2 norm of all the filter responses: |**WX**|_1 / |**WX**|_2. This introduces competition between the filters without requiring orthogonality.

Similarly, we can apply the normalization idea to the sparse coding framework. Sparse coding commonly uses an L1/L2-style sparsity penalty; here, we replace the usual L1 penalty |**S**|_1 with the normalized penalty |**S**|_1 / |**S**|_2. The normalized penalty is scale-invariant and is therefore more robust to variations in the data.
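A quick numerical check of the scale-invariance claim (the vector is a toy example, not from the paper):

```python
import numpy as np

def normalized_penalty(s):
    """The normalized sparsity penalty ||s||_1 / ||s||_2."""
    return np.abs(s).sum() / np.linalg.norm(s)

s = np.array([1.0, -2.0, 0.5])
print(normalized_penalty(s), normalized_penalty(10 * s))  # identical values
```

Rescaling the code vector leaves the normalized penalty unchanged, whereas a plain L1 penalty |**S**|_1 grows with the scale of the data.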