Evaluating an anomaly detection algorithm with a single real-number metric
Real-number evaluation matters. When you develop a specific machine learning application, you often have to make many decisions, such as which features to use. If you can evaluate the algorithm in a way that returns a single real number telling you how good or bad it is, those decisions become much easier. Suppose you are considering a new feature: run your algorithm with the feature, run it again without it, and the returned number tells you directly whether the feature makes the algorithm better or worse, so you have a simple procedure for deciding whether to add it.
So, to develop an anomaly detection system quickly, it is best to find some way to evaluate the system as you go.
To evaluate an anomaly detection system, assume we have some labeled data containing both normal and anomalous samples (normal samples y = 0, anomalous samples y = 1).
For the training set, we still treat the samples as unlabeled and assume they are all normal (in practice a few anomalous samples may slip into the training set, and that is acceptable).
Define a cross-validation set and a test set, and use these two sets to evaluate the anomaly detection algorithm. We assume that the cross-validation set and the test set each contain some known anomalous samples, i.e. samples labeled y = 1 (representing anomalies).
A concrete example
Suppose there are 10,000 normal aircraft engines and 20 problematic ones; in past experience, no matter how long the engine plant has been running, it ends up with roughly 20 problematic engines. This is typical of anomaly detection applications: the number of anomalous samples is usually small, on the order of 20 to 50, while the number of normal samples is much larger.
We divide the data into a training set, a cross-validation set, and a test set. Typically, 6,000 of the 10,000 good engine samples go into the training set as unlabeled data (they are in fact normal samples), 2,000 of the remaining normal samples go into the cross-validation set, and 2,000 into the test set (a 6:2:2 split of the normal samples). The 20 anomalous samples are split evenly: 10 into the cross-validation set and 10 into the test set.
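A minimal sketch of this 6:2:2 split in Python (NumPy); the arrays `X_normal` and `X_anomalous` are hypothetical placeholders standing in for the real engine measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10,000 normal engines, 20 anomalous ones.
X_normal = rng.normal(size=(10000, 2))      # placeholder features
X_anomalous = rng.normal(size=(20, 2)) + 5  # placeholder features

# Shuffle the normal samples, then split them 6,000 / 2,000 / 2,000.
X_normal = rng.permutation(X_normal)
X_train = X_normal[:6000]                   # unlabeled, assumed normal
X_cv_good, X_test_good = X_normal[6000:8000], X_normal[8000:]

# Split the 20 anomalies 10 / 10 between the CV and test sets.
X_anomalous = rng.permutation(X_anomalous)
X_cv = np.vstack([X_cv_good, X_anomalous[:10]])
y_cv = np.r_[np.zeros(2000), np.ones(10)]
X_test = np.vstack([X_test_good, X_anomalous[10:]])
y_test = np.r_[np.zeros(2000), np.ones(10)]
```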
Another common allocation (not recommended) is to let the cross-validation and test sets overlap: the remaining 4,000 good samples are used as both the CV set and the test set.
The procedure for deriving and evaluating the anomaly detection algorithm is as follows. First, fit the model p(x) on the training samples (unlabeled, but in fact normal), i.e. estimate the parameter values μ and σ².
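A sketch of this fitting step, assuming the per-feature Gaussian model: estimate μ_j and σ_j² for each feature on the training set, then evaluate p(x) as the product of the per-feature densities. The function names are illustrative; `X_train` is from the split above:

```python
import numpy as np

def fit_gaussian(X):
    """Estimate mu and sigma^2 for each feature from the training set."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)
    return mu, sigma2

def p(X, mu, sigma2):
    """p(x) = product over features j of N(x_j; mu_j, sigma2_j)."""
    densities = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

mu, sigma2 = fit_gaussian(X_train)
```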
For the data in the CV and test sets, we use the algorithm to predict y (predict y = 1 when p(x) < ε, and y = 0 otherwise) and then evaluate the predictions. But how should they be measured?
Because the data is very skewed (many normal samples, few anomalies), classification accuracy is not a good metric. Instead, compute precision, recall, and the F1 score, and use these to evaluate the anomaly detection algorithm's performance on the CV and test sets.
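Continuing the sketch above, precision, recall, and F1 can be computed from true/false positive counts; the threshold value here is arbitrary, just for illustration:

```python
def f1_score(y_true, y_pred):
    """F1 from counts of true positives, false positives, false negatives."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

epsilon = 1e-3                                        # arbitrary example value
y_pred = (p(X_cv, mu, sigma2) < epsilon).astype(int)  # y=1 when p(x) < epsilon
print(f1_score(y_cv, y_pred))
```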
How do we determine the value of ε? On the cross-validation set: try several different values of ε and choose the one that maximizes the F1 score, i.e. the value that performs best on the cross-validation set (a sketch of this scan follows below). More generally, whenever we need to make a decision (which features to use, which ε to pick), we can keep using the cross-validation set to evaluate how good the algorithm is under each choice, and then decide.
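One way to run this scan, continuing the same sketch: sweep candidate thresholds over the range of p(x) on the cross-validation set and keep the one with the highest F1:

```python
p_cv = p(X_cv, mu, sigma2)

best_epsilon, best_f1 = 0.0, 0.0
for epsilon in np.linspace(p_cv.min(), p_cv.max(), 1000):
    f1 = f1_score(y_cv, (p_cv < epsilon).astype(int))
    if f1 > best_f1:
        best_epsilon, best_f1 = epsilon, f1
```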
Once the value of ε is chosen, our anomaly detection algorithm is fully determined, and the test set is then used for the final evaluation of its performance.
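With ε fixed by cross-validation, the final report is simply the same metric computed once on the test set (continuing the sketch above):

```python
y_test_pred = (p(X_test, mu, sigma2) < best_epsilon).astype(int)
print("test F1:", f1_score(y_test, y_test_pred))
```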
Summary
To develop an anomaly detection system quickly, first work out how to evaluate the algorithm: label a small set of anomalies, split the data into training, cross-validation, and test sets, and use a real-number metric such as F1 to guide choices of features and ε.