Data Preprocessing

Contents
1 Overview
2 Data Normalization
2.1 Simple Rescaling
2.2 Per-example Mean Subtraction
2.3 Feature Standardization
3 PCA/ZCA Whitening
3.1 Reconstruction-based Models
3.2 ICA-based Models (with Orthogonalization)
4 Large Images
5 Standard Pipelines
5.1 Natural Grayscale Images
5.2 Color Images
5.3 Audio (MFCC/Spectrograms)
5.4 MNIST Handwritten Digits
Overview
Data preprocessing plays an important role in many deep learning algorithms. In practice, many algorithms work best after the data has been normalized and whitened. However, unless you have extensive experience with these algorithms, the exact preprocessing parameters are not obvious. On this page, we hope to demystify these preprocessing methods and provide practical guidance (and a standard pipeline) for preprocessing data.
Tip: When starting to work with data, the first thing to do is look at the data and learn its characteristics. This section introduces some common techniques; in practice, you should choose the preprocessing appropriate to your specific data. For example, one standard preprocessing step is to subtract the mean from each data point (also known as removing the DC component, local mean subtraction, or subtractive normalization). This works well for data such as natural images, but not for data that is not stationary.
Data Normalization
In data preprocessing, the standard first step is data normalization. Although there are a number of possible approaches, this step is usually chosen explicitly according to the specifics of the data. Commonly used feature normalization methods include: simple rescaling, per-example mean subtraction (also called removing the DC component), and feature standardization (making each feature in the dataset have zero mean and unit variance).
Simple Rescaling
In simple rescaling, our goal is to rescale the values along each dimension of the data (the dimensions may be independent of one another) so that the final data vectors lie in the range [0,1] or [−1,1] (depending on the data). This matters for later processing, because many default parameters (such as epsilon in PCA whitening) assume the data has already been scaled to a reasonable range.
Example: When working with natural images, the pixel values we obtain lie in the interval [0,255]; the usual processing is to divide the pixel values by 255 so that they lie in [0,1].
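A minimal sketch of this rescaling, assuming a hypothetical NumPy array with one flattened image per row:

```python
import numpy as np

# Hypothetical input: one flattened image per row, 8-bit pixel values in [0, 255].
images = np.random.randint(0, 256, size=(100, 28 * 28)).astype(np.float64)

# Simple rescaling: map pixel values from [0, 255] to [0, 1].
images_rescaled = images / 255.0
```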
Per-example Mean Subtraction
If your data is stationary (that is, the statistics of every dimension of the data follow the same distribution), you can consider subtracting from each example its own mean (computed per example).
Example: For images, this normalization removes the mean brightness (intensity) of each image. In many cases we are not interested in the illumination of an image and care more about its content, so removing the mean pixel value from each data point is meaningful. Note: although this method is widely used for images, care is needed when working with color images, because the pixels across different color channels do not all exhibit stationary statistics.
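A minimal sketch of per-example mean subtraction, again assuming a hypothetical array with one flattened grayscale image per row:

```python
import numpy as np

# Hypothetical input: one flattened grayscale image per row.
X = np.random.rand(100, 28 * 28)

# Per-example mean subtraction: remove each image's own mean (its DC component).
X_centered = X - X.mean(axis=1, keepdims=True)
```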
Feature Standardization
Feature standardization means (independently) making each dimension of the data have zero mean and unit variance. This is the most common normalization method and is widely used (for example, when using support vector machines (SVMs), feature standardization is often recommended as part of preprocessing). In practice, feature standardization is computed by first finding the mean of each dimension (over the whole dataset) and subtracting it from that dimension, and then dividing each dimension of the data by its standard deviation.
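A minimal sketch of feature standardization, assuming a hypothetical data matrix with one example per row and one feature per column; the small constant added to the standard deviation is only there to avoid division by zero:

```python
import numpy as np

# Hypothetical input: rows are examples, columns are features.
X = np.random.rand(100, 13)

# Feature standardization: zero mean and unit variance per feature (column),
# with the statistics computed over the whole dataset.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_standardized = (X - mu) / (sigma + 1e-8)  # small constant avoids division by zero
```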
Example: When working with audio data, the data are often represented by Mel-frequency cepstral coefficients (MFCCs). However, the first MFCC component (the DC component) is often so large that it masks the other components. In this case, to balance the influence of the different components, each component of the feature vector is usually standardized independently.
PCA/ZCA Whitening
After simple normalization, whitening is often used as the next preprocessing step; it makes our algorithms work better. In fact, many deep learning algorithms rely on whitening to learn good features.
In PCA/ZCA whitening, it is necessary to first zero-mean the features, which ensures that (1/m) Σ_i x^(i) = 0. In particular, this step needs to be done before the covariance matrix is computed. (The only exception is when per-example mean subtraction has already been performed and the data is stationary across all dimensions or pixels.)
Next, in PCA/ZCA whitening we need to choose an appropriate epsilon (recall that this is a regularization term which has a low-pass filtering effect on the data). Choosing a suitable epsilon value plays a major role in feature learning; below we discuss how to choose epsilon in two different situations:
Reconstruction-based Models
In reconstruction-based models (including autoencoders, sparse coding, restricted Boltzmann machines (RBMs), and k-means), it is often preferable to choose an epsilon that makes whitening act as a low-pass filter. The high-frequency components of the data are generally regarded as noise, and the role of low-pass filtering is to suppress this noise as much as possible while preserving useful information. In PCA and related methods, the assumption is that the informative part of the data lies in the directions of higher variance, while the directions of lower variance correspond to noise (i.e. high-frequency components); the choice of epsilon is therefore tied to the eigenvalues discussed below. One way to test whether epsilon is appropriate is to ZCA-whiten the data with that value and then visualize the data before and after whitening. If epsilon is too small, the whitened data will look noisy; conversely, if epsilon is too large, the whitened data will look blurred compared to the original. An intuitive way to gauge the size of epsilon is to plot the eigenvalues of the data, as in the sketch below; you will see a "long tail" corresponding to the high-frequency noise in the data. Choose epsilon so that it largely filters out this "long tail"; that is, the chosen epsilon should be larger than most of the small eigenvalues that reflect the noise in the data.
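A minimal sketch of such an eigenvalue plot, assuming hypothetical zero-mean data with one example per row (matplotlib is used only for the visualization):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical zero-mean data: rows are examples, columns are dimensions.
X = np.random.randn(1000, 64)
X = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix, sorted in decreasing order.
cov = np.dot(X.T, X) / X.shape[0]
eigvals = np.linalg.eigvalsh(cov)[::-1]

# The "long tail" of small eigenvalues corresponds to high-frequency noise;
# epsilon should be chosen larger than most of these small eigenvalues.
plt.semilogy(eigvals)
plt.xlabel("component index")
plt.ylabel("eigenvalue (log scale)")
plt.title("Eigenvalue spectrum")
plt.show()
```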
In reconstruction-based models, the loss function penalizes reconstructions that differ significantly from the original input (for example, an autoencoder is required to encode and decode the input so as to recover it as closely as possible). If epsilon is too small, the whitened data will contain a lot of noise, and the model will have to fit this noise in order to achieve good reconstructions. For reconstruction-based models it is therefore important to low-pass filter the original data.
Tip: If the data has already been rescaled to a reasonable range (such as [0,1]), you can start tuning epsilon from epsilon = 0.01 or epsilon = 0.1.
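A minimal sketch of ZCA whitening with the epsilon regularization term, assuming hypothetical data already rescaled to [0,1] with one example per row; epsilon = 0.1 is just the starting value suggested in the tip above:

```python
import numpy as np

# Hypothetical input: rows are examples, columns are dimensions, rescaled to [0, 1].
X = np.random.rand(1000, 64)

# Zero-mean the features before computing the covariance matrix.
mu = X.mean(axis=0)
Xc = X - mu

# Eigendecomposition of the covariance matrix.
cov = np.dot(Xc.T, Xc) / Xc.shape[0]
eigvals, U = np.linalg.eigh(cov)

# ZCA whitening: epsilon acts as a regularizer with a low-pass filtering effect.
epsilon = 0.1
W_zca = U.dot(np.diag(1.0 / np.sqrt(eigvals + epsilon))).dot(U.T)
X_zca = Xc.dot(W_zca)
```

With a much smaller epsilon, the directions with the smallest eigenvalues would be strongly amplified, which is exactly the noise-amplifying behavior described above.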
ICA-based Models (with Orthogonalization)
For ICA-based models with orthogonalization, it is very important that the input data be whitened as completely as possible (i.e. that the covariance matrix be the identity matrix). This is because models of this type need to orthogonalize the learned features in order to remove dependencies between different dimensions (see the ICA section for more details). In this case, epsilon should therefore be very small (e.g. epsilon = 1e−6).
Tip: We can also reduce the dimensionality of the data during PCA whitening. This is often a good idea, because it can greatly speed up the algorithm (by reducing the amount of computation and the number of parameters). A useful rule of thumb for choosing how many principal components to keep is to retain enough components to account for more than 99% of the total variance of the data. (For more details, see PCA.)
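A minimal sketch of this rule of thumb, assuming the covariance eigenvalues have already been computed (here they are just hypothetical values sorted in decreasing order):

```python
import numpy as np

# Hypothetical covariance eigenvalues, sorted in decreasing order.
eigvals = np.sort(np.random.rand(64))[::-1]

# Keep the smallest number of components whose cumulative variance exceeds 99%.
retained = np.cumsum(eigvals) / np.sum(eigvals)
k = int(np.searchsorted(retained, 0.99) + 1)
```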
Note: When working in a classification framework, the PCA/ZCA whitening matrix should be computed from the training set only. The following two parameters should then be saved and applied to the test set: (a) the mean vector used to zero-mean the data, and (b) the whitening matrix. The test set must be preprocessed with these same stored parameters.
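A minimal sketch of this train/test discipline, using hypothetical fit_zca and apply_zca helpers: the mean vector and whitening matrix are estimated from the training set only and then reused unchanged on the test set.

```python
import numpy as np

def fit_zca(X, epsilon=0.1):
    """Estimate the mean vector and ZCA whitening matrix from training data only."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = np.dot(Xc.T, Xc) / Xc.shape[0]
    eigvals, U = np.linalg.eigh(cov)
    W = U.dot(np.diag(1.0 / np.sqrt(eigvals + epsilon))).dot(U.T)
    return mu, W

def apply_zca(X, mu, W):
    """Apply a previously estimated mean vector and whitening matrix."""
    return (X - mu).dot(W)

# Hypothetical training and test sets.
X_train = np.random.rand(1000, 64)
X_test = np.random.rand(200, 64)

mu, W = fit_zca(X_train)                   # (a) mean vector, (b) whitening matrix
X_train_white = apply_zca(X_train, mu, W)
X_test_white = apply_zca(X_test, mu, W)    # test set reuses the same stored parameters
```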
Large Images
For large images, PCA/ZCA-based whitening is impractical because the covariance matrix is too large. In these cases we fall back on 1/f whitening (more on this later).
Standard Pipelines
In this section, we describe several standard preprocessing pipelines that work well on some datasets.
Natural Grayscale Images
Since grayscale images have the stationarity property, we usually first perform per-example mean subtraction (i.e. remove the DC component from each image), and then apply PCA/ZCA whitening with an epsilon large enough to achieve a low-pass filtering effect.
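A minimal sketch of this grayscale pipeline, assuming hypothetical flattened image patches with one example per row and an illustrative, relatively large epsilon of 0.1:

```python
import numpy as np

# Hypothetical input: flattened grayscale image patches, one per row.
X = np.random.rand(1000, 12 * 12)

# Step 1: per-example mean subtraction (remove each patch's DC component).
X = X - X.mean(axis=1, keepdims=True)

# Step 2: PCA/ZCA whitening with a relatively large epsilon (low-pass filtering).
mu = X.mean(axis=0)
Xc = X - mu
cov = np.dot(Xc.T, Xc) / Xc.shape[0]
eigvals, U = np.linalg.eigh(cov)
epsilon = 0.1
X_white = Xc.dot(U).dot(np.diag(1.0 / np.sqrt(eigvals + epsilon))).dot(U.T)
```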
Color Images
For color images, stationarity does not hold across the color channels. So we typically first rescale the data (so that pixel values lie in [0,1]) and then apply PCA/ZCA whitening with a sufficiently large epsilon. The components must be zero-meaned before the PCA transform.
Audio (MFCC/Spectrograms)
For audio data (MFCCs and spectrograms), the range of values (variance) of each dimension is different. For example, the first MFCC component is the DC component and is usually much larger than the other components, especially when the features include temporal derivatives (a common technique in audio processing). Preprocessing of this kind of data therefore usually starts with per-dimension standardization (making each dimension of the data zero-mean and unit-variance), followed by PCA/ZCA whitening (with an appropriate epsilon).
MNIST Handwritten Digits
The pixel values of the MNIST dataset lie in the interval [0,255]. We first rescale them to [0,1]. In practice, per-example mean subtraction also helps feature learning. Note: PCA/ZCA whitening can optionally be applied to MNIST, but this is not common in practice.
From: http://ufldl.stanford.edu/wiki/index.php/%E6%95%B0%E6%8D%AE%E9%A2%84%E5%A4%84%E7%90%86