9 Anomaly Detection
9.1 Density Estimation
9.1.1 Problem Motivation
Anomaly detection (via density estimation) is a common application of machine learning. It is mainly an unsupervised learning technique, but in some respects it resembles supervised learning.
Its most common applications are fraud detection and monitoring in industrial production.
Consider the aircraft-engine manufacturing example: the hypothesis uses only 2 features. Plot the training data, fit a probability model p(x), and choose a threshold ε. For a new test case x_test, if p(x_test) < ε, the example is predicted to be anomalous; otherwise it is predicted to be normal.
Quiz answer: B
9.1.2 Gaussian Distribution
Review the Gaussian distribution (normal distribution) and how to estimate the parameters in its density formula.
x ~ N(μ, σ²) means x follows a Gaussian distribution; σ is the standard deviation and σ² is the variance. In the Gaussian distribution, μ controls the center of the curve and σ controls its width. p(x; μ, σ²) denotes the probability density function of the Gaussian, governed by the two parameters μ and σ²:

p(x; μ, σ²) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))

Note that the total area under the Gaussian curve integrates to 1.
The following compares Gaussian curves under different values of μ and σ:
Parameter estimation: given a data set, estimate the values of μ and σ², which are the maximum likelihood estimates:

μ = (1/m) ∑_{i=1}^{m} x^{(i)}

σ² = (1/m) ∑_{i=1}^{m} (x^{(i)} − μ)²

Statistics texts write 1/(m−1) in place of 1/m in the variance formula; machine learning conventionally uses 1/m, and in practice the difference does not affect the result. Note that μ, σ², and x^{(i)} above can be vectorized over the features.
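As a sketch, the vectorized parameter estimates above can be written with NumPy (the function name is my own):

```python
import numpy as np

def estimate_gaussian(X):
    """Estimate per-feature Gaussian parameters.

    X: (m, n) array of m examples with n features.
    Returns (mu, sigma2), each of shape (n,), using the 1/m
    convention from the notes rather than the 1/(m-1) of statistics.
    """
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)
    return mu, sigma2
```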
Quiz answer:
9.1.3 Algorithm
The anomaly detection algorithm is built on the Gaussian distribution.
Given an unlabeled training set of m examples, model the probability p(x) that a feature vector occurs; normal examples should get high probability. p(x) is the product of a separate Gaussian for each feature x_1 through x_n (x_j is the j-th component of the feature vector):

p(x) = p(x_1; μ_1, σ_1²) · p(x_2; μ_2, σ_2²) · … · p(x_n; μ_n, σ_n²)

This assumes the features are independent; in practice, even when they are not, the algorithm still works well. Estimating p(x) is the density estimation problem.
Anomaly detection algorithm:
1. Choose features x_j that you think might indicate anomalous examples.
2. Use the training set and the formulas above to fit the parameters μ_1, …, μ_n and σ_1², …, σ_n²; each feature gets its own parameters and its own Gaussian distribution.
3. For a new example x, compute p(x) from the Gaussian formula and the fitted parameters; if p(x) < the threshold ε, the example is considered an anomaly.
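The three steps can be sketched as a minimal NumPy version (illustrative names, assuming μ and σ² have already been fitted per feature as above):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x; mu, sigma^2), vectorized over features."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def predict_anomaly(x, mu, sigma2, epsilon):
    """p(x) = product over features of p(x_j; mu_j, sigma2_j).

    Returns (is_anomaly, p): flag the example when p(x) < epsilon.
    """
    p = np.prod(gaussian_pdf(x, mu, sigma2))
    return p < epsilon, p
```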
Quiz answer: D
9.2 Building an Anomaly Detection System
9.2.1 Developing and Evaluating an Anomaly Detection System
Using a concrete example, this section shows step by step how to evaluate an anomaly detection algorithm with a single real-number metric.
1. Decide the data split: 60/20/20 into training, cross-validation, and test sets. The data is labeled in advance but treated as unlabeled during training; the labels on the cross-validation set are used to select features and the threshold ε, and the test-set labels measure the final algorithm.
The training, cross-validation, and test sets must be disjoint; a split that reuses the same examples across sets, such as the following, is wrong:
2. Choose an evaluation metric. Classification accuracy is unsuitable because the classes are highly skewed; use precision/recall or the F1 score instead.
In this example, one way to select ε is to loop over candidate values of ε on the cross-validation set and pick the ε with the highest F1 score. The cross-validation set can also assist other decisions, such as which value of ε works well and which features should be included for the algorithm's best results.
The implementation of the specific algorithm is as follows:
3. Finally, a test set is used to evaluate the algorithm.
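The ε-selection loop described in step 2 can be sketched like this (an illustrative NumPy version; the candidate grid and function name are my own choices):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds on the cross-validation set and
    return the epsilon with the best F1 score.

    p_cv: p(x) for each CV example; y_cv: 1 = anomaly, 0 = normal.
    """
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)      # predicted anomalies
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue                          # F1 undefined without true positives
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_f1, best_eps = f1, eps
    return best_eps, best_f1
```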
Quiz answer: C (because of the skewed-classes issue).
9.2.2 Anomaly Detection vs. Supervised Learning
This section discusses when to use anomaly detection and when to use supervised learning.
Anomaly detection and supervised learning differ in their sample distributions. Anomaly detection: the positive examples (y=1) are very few compared with the negatives (y=0), new kinds of positive examples may appear that look nothing like any seen before, and the current positive samples do not cover all possible anomalies. Supervised learning: the numbers of positive and negative examples are comparable, and the training set roughly covers the space of both classes.
The two also differ in their usage scenarios: as the numbers of positive and negative samples change, an anomaly detection problem can be converted into a supervised learning problem.
Quiz answer: B, D
9.2.3 Choosing What Features to Use
One factor that affects the efficiency of an anomaly detection algorithm is which feature vectors are fed into it. The following discusses how to design or select feature variables for an anomaly detection algorithm. The two steps below can be performed alternately.
1. Transform the data so that the distribution of each feature dimension obeys a normal (Gaussian) distribution: plot each dimension of the feature vector and check whether it looks Gaussian. If some dimension's distribution does not look Gaussian, apply a logarithmic or root/power transformation to that dimension so that the transformed data looks Gaussian, and then use the transformed data in the algorithm. In MATLAB you can use the hist() function to draw the histogram.
2. Select feature variables through error analysis, similar to the error-analysis step for supervised learning algorithms: first finish training the algorithm, then run it on the cross-validation set and find the examples it predicts incorrectly; then look for additional feature variables that would help the algorithm behave better on those misjudged cross-validation examples.
If the existing features are not sufficient to identify an anomalous instance, combine feature variables to catch it. For example, in the network/data-center monitoring example: when something goes wrong on a particular machine, its CPU load is high while other measurements, including network traffic, remain normal. A new combined feature x5 = (CPU load) / (network traffic) captures this pattern of normal network traffic together with high CPU load.
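Both feature-engineering steps above can be illustrated with a small NumPy sketch (the data, names, and the moment-based skewness check are all my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: make a skewed feature look Gaussian with a log transform.
def skewness(x):
    """Moment-based skewness; roughly 0 for Gaussian-looking data."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

x1 = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # heavily right-skewed
x1_log = np.log(x1)  # after the transform the histogram looks Gaussian

# Step 2: combine existing features into a new one that exposes an anomaly.
# Hypothetical monitoring data: the last machine has high CPU but low traffic.
cpu_load = np.array([0.5, 0.6, 0.55, 0.9])
net_traffic = np.array([1.0, 1.2, 1.1, 0.1])
x5 = cpu_load / net_traffic  # spikes only for the anomalous machine
```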
Quiz answer: B
9.3 Multivariate Gaussian Distribution (Optional)
9.3.1 Multivariate Gaussian Distribution
This section introduces the multivariate Gaussian distribution: an improvement to anomaly detection that can catch anomalies the previous algorithm misses.
Recall the example from the previous section: when something goes wrong on a particular machine, its CPU load is high while network traffic stays normal, and the combined feature x5 = (CPU load) / (network traffic) had to be added by hand to capture that pattern.
The multivariate Gaussian method no longer constructs a separate Gaussian for each feature variable; instead it constructs p(x) in one step. The parameters are the mean vector μ and the covariance matrix Σ, which describes the correlations between the variables:

p(x; μ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) · exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
Consider the effect of different values of μ and Σ on the multivariate Gaussian density plots:
9.3.2 Anomaly Detection Using the Multivariate Gaussian Distribution
Applying the multivariate Gaussian distribution to anomaly detection:
1. Fit μ and Σ on the training set:
μ = (1/m) ∑_{i=1}^{m} x^{(i)},  Σ = (1/m) ∑_{i=1}^{m} (x^{(i)} − μ)(x^{(i)} − μ)ᵀ
2. For a new instance x, compute p(x), compare it with ε, and predict an anomaly if p(x) < ε.
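The two steps above can be sketched in NumPy (illustrative names; a real implementation would vectorize p(x) over many examples):

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """mu = mean vector; Sigma = (1/m) * sum of outer products of (x - mu)."""
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = D.T @ D / X.shape[0]
    return mu, Sigma

def multivariate_gaussian_pdf(x, mu, Sigma):
    """p(x; mu, Sigma) for a single example x of dimension n."""
    n = len(mu)
    d = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm)
```

When Σ is diagonal this reproduces the original per-feature model, consistent with point 1 of the comparison below.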
Comparison with the original model:
1. When the covariance matrix Σ is a diagonal matrix, the multivariate Gaussian model and the original per-feature model give exactly the same detection formula.
2.1 The original model requires manually creating new combined feature variables to catch outliers; the multivariate model automatically captures the correlations between different features.
2.2 The original model is computationally cheap and suits a large number of features (large n); the multivariate model is computationally expensive (it must invert the n×n matrix Σ).
2.3 The original model can still be used when m is small; the multivariate model requires the number of examples m to exceed the number of features n so that Σ is invertible. In practice, use the multivariate Gaussian only when m is much larger than n, roughly m ≥ 10n.
In practice the original model is used more often, and people typically add extra combined variables by hand.
If Σ turns out to be non-invertible in practice, there are 2 likely causes:
1. The condition m > n is not satisfied.
2. There are redundant variables (e.g., two identical features x_i = x_j, or a linear combination x_k = x_i + x_j); that is, the feature variables are linearly dependent.
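A small sketch of cause 2 (tiny hypothetical data; note that m = 4 > n = 3 holds, yet Σ is still singular because the features are linearly dependent):

```python
import numpy as np

# The third feature is an exact linear combination of the first two: x3 = x1 + x2.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 3.0])
X = np.column_stack([a, b, a + b])

mu = X.mean(axis=0)
D = X - mu
Sigma = D.T @ D / X.shape[0]  # covariance matrix, 1/m convention

rank = np.linalg.matrix_rank(Sigma)  # rank 2 < n = 3: Sigma is singular
```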
Quiz answer: A, C, D
Practice:
I got the following 2 questions wrong and do not know the correct answers:
Coursera Machine Learning, Chapter 9 (Part 1): Anomaly Detection study notes