This article is reproduced from Cador's "Anomaly Detection Using the R Language".
This article uses the R language to demonstrate anomaly detection. The main contents are as follows:
(1) Anomaly detection of single variables
(2) Anomaly detection using LOF (local outlier factor)
(3) Anomaly detection by clustering
(4) Anomaly detection of time series
1. Single-Variable Anomaly Detection
This section shows an example of univariate anomaly detection and demonstrates how to apply the method to multivariate data. In this example, univariate anomaly detection is implemented with the boxplot.stats() function, which returns the statistics used to generate a box plot. In the returned result, the component out lists the outliers; more specifically, it lists the data points lying beyond the extremes of the whiskers. The parameter coef controls how far the whiskers extend beyond the box. Run ?boxplot.stats in R for more detailed information.
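The approach described above can be sketched as follows; this is a minimal example on simulated data (the random seed and sample size are illustrative assumptions, not from the original):

```r
# Univariate outlier detection with boxplot.stats().
# The data x is simulated for illustration.
set.seed(3147)
x <- rnorm(100)
summary(x)

# The $out component holds the values lying beyond the whisker extremes.
outliers <- boxplot.stats(x)$out
print(outliers)

# The outliers appear as circles beyond the whiskers of the box plot.
boxplot(x)
```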
The resulting box plot displays four circles, which are the outliers.
The univariate anomaly detection above can be used to discover outliers in multivariate data through a simple combination approach. In the following example, we first generate a data frame df with two columns, x and y. Outliers are then detected from x and y separately, and the data points that are outliers in both columns are taken as the anomalies.
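A sketch of this combination approach, again on simulated data (the seed and sample size are assumptions for illustration):

```r
# Two-column data frame with simulated values.
set.seed(3147)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# Row indices of the univariate outliers in each column.
a <- which(df$x %in% boxplot.stats(df$x)$out)
b <- which(df$y %in% boxplot.stats(df$y)$out)

# Points that are outliers in BOTH columns.
both <- intersect(a, b)
plot(df)
points(df[both, ], col = "red", pch = "+", cex = 2.5)

# Alternatively, points that are outliers in EITHER column.
either <- union(a, b)
points(df[either, ], col = "blue", pch = "x", cex = 2)
```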
In the resulting plot, the outliers are marked with a red "+".
Similarly, we can also treat a point that is an outlier in either x or y as an outlier. In that plot, the outliers are marked with a blue "x".
When there are three or more variables, the final outliers can be determined by a majority vote over the results of the single-variable detections. Choosing the best combination approach in real-world applications requires domain knowledge.
2. Anomaly Detection Using LOF (Local Outlier Factor)
LOF (local outlier factor) is a density-based algorithm for identifying local outliers. With LOF, the local density of a point is compared with that of its neighbors. If the former is significantly lower than the latter (the LOF value is greater than 1), the point lies in a sparser region than its neighbors, which suggests it is an outlier. A disadvantage of LOF is that it only works on numeric data.
The lofactor() function computes local outlier factors using the LOF algorithm; it is available in both the DMwR and dprep packages. Below is an example of using LOF for anomaly detection, where k is the number of neighbors used to calculate the local outlier factors. A density plot of the resulting outlier scores is then produced.
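A minimal sketch using lofactor() from the DMwR package on the numeric columns of the iris data (the choice of k = 5 and of selecting the top 5 scores is an illustrative assumption):

```r
library(DMwR)

# Use only the four numeric columns of iris.
iris2 <- iris[, 1:4]

# k is the number of neighbors used in the LOF calculation.
outlier.scores <- lofactor(iris2, k = 5)

# Density plot of the outlier scores.
plot(density(outlier.scores))

# Take the five points with the largest LOF scores as outliers.
outliers <- order(outlier.scores, decreasing = TRUE)[1:5]
print(outliers)
```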
Next, we display the outliers in a biplot of the first two principal components.
In the code above, prcomp() performs a principal component analysis and biplot() plots the data using the first two principal components. In the resulting biplot, the x and y axes represent the first and second principal components respectively, the arrows represent the variables, and the 5 outliers are labeled with their row numbers.
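A self-contained sketch of the biplot step (the LOF scores are recomputed here so the snippet stands alone; k = 5 and the top-5 cutoff are assumptions as before):

```r
library(DMwR)

iris2 <- iris[, 1:4]
outlier.scores <- lofactor(iris2, k = 5)
outliers <- order(outlier.scores, decreasing = TRUE)[1:5]

# Label only the outliers with their row numbers; all other points get ".".
n <- nrow(iris2)
labels <- 1:n
labels[-outliers] <- "."

# Biplot of the first two principal components.
biplot(prcomp(iris2), cex = 0.8, xlabs = labels)
```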
We can also display the outliers in a pairs plot, where they are marked with a red "+".
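This can be sketched with pairs(), highlighting the same top-5 LOF outliers (again an illustrative assumption):

```r
library(DMwR)

iris2 <- iris[, 1:4]
outliers <- order(lofactor(iris2, k = 5), decreasing = TRUE)[1:5]

# Mark the outliers with a red "+" and everything else with a black ".".
pch <- rep(".", nrow(iris2))
pch[outliers] <- "+"
col <- rep("black", nrow(iris2))
col[outliers] <- "red"
pairs(iris2, pch = pch, col = col)
```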
The Rlof package provides a parallel implementation of the LOF algorithm. Its usage is similar to lofactor(), but lof() has two additional features: it supports multiple values of k and several choices of distance measure. Below is an example of lof(). After the outlier scores are computed, outliers can be detected by selecting the top few. Note that the current version of the Rlof package works on MacOS X and Linux but not on Windows, because it relies on the multicore package for parallel computing.
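A sketch of lof() from the Rlof package; the particular k values are illustrative assumptions (each value of k yields one column of scores):

```r
library(Rlof)

iris2 <- iris[, 1:4]

# A single k, as with lofactor().
outlier.scores <- lof(iris2, k = 5)

# Multiple values of k at once; the result has one column per k.
outlier.scores <- lof(iris2, k = c(5:10))
```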
3. Anomaly Detection by Clustering
Another method of anomaly detection is clustering. By clustering the data into groups, the data points that do not belong to any cluster are treated as outliers. For example, with density-based clustering such as DBSCAN, objects that are tightly connected in dense regions are grouped into clusters, so objects not assigned to any cluster are outliers.
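A sketch of this idea with the dbscan() function from the fpc package (the eps and MinPts values are illustrative assumptions, not tuned parameters from the original):

```r
library(fpc)

iris2 <- iris[, 1:4]

# Density-based clustering; eps and MinPts are assumed values.
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)

# Cluster 0 holds the points not assigned to any cluster: the outliers.
noise <- which(ds$cluster == 0)
print(noise)
```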
We can also use the k-means algorithm to detect outliers. With k-means, the data are partitioned into k groups by assigning each point to its nearest cluster center. We can then compute the distance (or dissimilarity) between each object and its cluster center, and take the objects with the largest distances as outliers.
The following implements anomaly detection on the iris data with the k-means algorithm.
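A sketch of k-means-based detection on iris; the choices of 3 centers and the top-5 cutoff are illustrative assumptions:

```r
iris2 <- iris[, 1:4]
kmeans.result <- kmeans(iris2, centers = 3)

# Center coordinates of the cluster each point belongs to.
centers <- kmeans.result$centers[kmeans.result$cluster, ]

# Euclidean distance from each point to its cluster center.
distances <- sqrt(rowSums((iris2 - centers)^2))

# Take the five points farthest from their centers as outliers.
outliers <- order(distances, decreasing = TRUE)[1:5]
print(outliers)

# Plot: cluster centers as asterisks, outliers as "+".
plot(iris2[, c("Sepal.Length", "Sepal.Width")],
     pch = "o", col = kmeans.result$cluster, cex = 0.5)
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 1.5)
points(iris2[outliers, c("Sepal.Length", "Sepal.Width")],
       pch = "+", col = 4, cex = 1.5)
```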
In the resulting plot, the cluster centers are marked with asterisks and the outliers with "+".
4. Anomaly Detection of Time Series
This section presents an example of anomaly detection on time series data. In this example, the time series is first decomposed with stl() using robust fitting, and the outliers are then identified. For an introduction to STL, see http://cs.wellesley.edu/~cs315/Papers/stl%20statistical%20model.pdf.
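A sketch of this step using the AirPassengers series bundled with R (the use of this particular series and the weight threshold are illustrative assumptions):

```r
# Robust STL decomposition of a monthly time series.
f <- stl(AirPassengers, s.window = "periodic", robust = TRUE)

# With robust fitting, observations that are downweighted to (near) zero
# by the fit can be treated as outliers.
outliers <- which(f$weights < 1e-8)
print(outliers)

plot(f)
```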
In the resulting plot, the outliers are marked with a red "x".
5. Discussion
The LOF algorithm is good at detecting local outliers, but it only works on numeric data. The Rlof package relies on the multicore package, which is unavailable on Windows. For categorical data, a fast and stable anomaly detection strategy is the AVF (Attribute Value Frequency) algorithm.
Some of the R packages for anomaly detection include:
extremevalues package: univariate outlier detection
mvoutlier package: multivariate outlier detection based on robust methods
outliers package: tests for outliers
Source: http://blog.163.com/shen_960124/blog/static/60730984201472603638628/