"Outlier detection" for K-Means optimization implemented in C#

Source: Internet
Author: User

Preface

In the previous article, we used the most traditional form of K-Means clustering. There, we only listed some optimization methods without discussing how to apply them, so in this blog post I will share one of them: a pre-processing step performed before clustering, outlier detection.

Outlier Detection Method

Outlier detection is an important part of data mining. It also addresses a weakness of the K-Means procedure from the previous post, which can fluctuate badly when anomalies are present. Outlier detection methods are diverse; besides the density-based method shared in this article, LOF, there are many others. Here is a summary of the other main types.

1. Statistical Distribution-based outlier detection

First assume that the dataset conforms to a statistical model (such as a Poisson or Gaussian distribution), then screen out the points that deviate seriously from that model.

2. Distance-based outlier detection method

This is the earliest method used for outlier detection. The principle is to specify two parameters, a count (or fraction) p and a threshold distance d: a point is regarded as an outlier if at least p of the other points in the dataset lie at a distance greater than d from it.
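To make the rule concrete, here is a minimal Python sketch of this distance-based criterion; the function name `db_outliers`, the fraction parameter `p_frac`, and the toy dataset are illustrative, not from the article.

```python
import math

def db_outliers(points, p_frac, d):
    """DB(p, d) outliers: flag a point when at least a fraction p_frac of
    the other points lie farther away than the threshold distance d."""
    outliers = []
    for i, a in enumerate(points):
        far = sum(1 for j, b in enumerate(points)
                  if i != j and math.dist(a, b) > d)
        if far >= p_frac * (len(points) - 1):
            outliers.append(i)
    return outliers

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(pts, 0.9, 5))   # [4] -- only the distant point is flagged
```

Note that both p_frac and d must be chosen by hand, which is exactly the weakness that density-based methods such as LOF try to avoid.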

3. Deviation-based outlier detection

A point that lies far from the concentrated region of the data is taken as an outlier.

4. Deep-based outlier detection

Organize the data into layers by depth; points lying in the shallow outer layers are identified as outliers.

5. Clustering-based outlier detection method

After some improvement, current clustering algorithms can define data that does not strictly belong to any cluster as outliers.

Outlier Detection Method Based on the LOF algorithm

As mentioned above, K-Means clustering uses the Euclidean distance, so even a small number of outliers in the sample set can greatly distort the final clustering result. However, outliers often have interesting features that deserve analysis in their own right, so we usually remove all of them before clustering and analyze them separately. Outliers may be visually obvious; for example, in a scatter plot they appear as red dots far away from the main mass of points.

The pre-processing before K-Means screens out these red points. The specific processing uses the LOF algorithm, so next let's take a look at it.

LOF algorithm

The full name of the LOF algorithm is the Local Outlier Factor algorithm. Most anomaly-detection methods judge by distance; distance-based discrimination is applicable to identifying abnormal points in two-dimensional or higher-dimensional coordinate systems, and LOF also determines anomalies based on distance. Some methods judge by distribution instead, which we introduce briefly here. For example:


With the distribution-based discriminant method, we would generally regard the points along the diagonal on the right as anomalies. Returning to LOF, what we want to achieve looks like this:


We can see that the cluster in the lower left has a significantly higher density than the one in the upper right. What the LOF algorithm achieves is identifying the outliers of two clusters under such different density conditions.

LOF algorithm Definition

1. k-distance(p), dk(p) for short: the k-distance of point p

2. d(p, x): the distance between the two points p and x

3. Nk(p): the k-nearest neighborhood of p

4. reach-distk(p, x): the reachability distance

5. lrdk(p): the local reachability density

6. LOFk(p): the local outlier factor

Let's look at the relationships between these quantities and how each contributes to the outlier factor we want to compute.

1. k-distance(p), dk(p) for short: the k-distance of point p

As shown in the figure, this is simply the distance from p to its k-th nearest point.

The black spot in the figure is the 6th-nearest point to p (the center of the red circle), and the length indicated by the arrow is 6-distance(p), or d6(p).

2. d(p, x): the simplest quantity, the Euclidean distance between the two points p and x.

3. Nk(p): the k-nearest neighborhood of point p, that is, all points within the k-distance of p, including points exactly at the k-distance. Because several points may lie at the same distance from the center, the size of the neighborhood can exceed k, i.e. |Nk(p)| >= k. The strict-inequality case is shown in the figure:

Here three extra points lie exactly on the circle of d6(p), so (|Nk(p)| = 8) > 6.
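A small Python sketch (function name and data points are illustrative) showing how dk(p) and Nk(p) are obtained, including the tie case in which |Nk(p)| exceeds k:

```python
import math

def k_distance_and_neighborhood(points, p_idx, k):
    """dk(p): distance to the k-th nearest point.
    Nk(p): every point whose distance to p is <= dk(p);
    ties at exactly dk(p) can make |Nk(p)| > k."""
    dists = sorted(
        (math.dist(points[p_idx], q), i)
        for i, q in enumerate(points) if i != p_idx
    )
    dk = dists[k - 1][0]
    neighborhood = [i for dist, i in dists if dist <= dk]
    return dk, neighborhood

# p at the origin; two points tie at distance 2, so |N2(p)| = 3 > 2
pts = [(0, 0), (1, 0), (2, 0), (0, 2), (5, 5)]
dk, nk = k_distance_and_neighborhood(pts, 0, k=2)
print(dk, nk)   # 2.0 [1, 2, 3]
```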

4. reach-distk(p, x): the reachability distance

reach-distk(p, x) = max{dk(p), d(p, x)}

Let's explain the above formula in combination with the figure.

In the figure, O is used in place of the point p. From it we obtain that the k-reachability distance from O to X is the larger of O's k-distance and the actual distance d(O, X). For the k points nearest to O, the reachability distance from them to O is taken to be dk(O), because their actual distance is smaller than dk(O).

We can see that the 6th reachability distance from O1 to X is

reach-dist6(O1, X) = max{d6(O1), d(O1, X)} = d6(O1).

The 6th reachability distance from O0 to X is reach-dist6(O0, X) = max{d6(O0), d(O0, X)} = d6(O0).
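Following the article's convention reach-distk(o, x) = max{dk(o), d(o, x)}, here is a minimal Python sketch; the data points are made up for illustration:

```python
import math

def k_dist(points, o, k):
    """dk(o): distance from o to its k-th nearest point."""
    ds = sorted(math.dist(points[o], q)
                for i, q in enumerate(points) if i != o)
    return ds[k - 1]

def reach_dist(points, o, x, k):
    """reach-distk(o, x) = max{dk(o), d(o, x)}."""
    return max(k_dist(points, o, k), math.dist(points[o], points[x]))

pts = [(0, 0), (0.5, 0), (1, 0), (10, 0)]
print(reach_dist(pts, 0, 1, 2))   # 1.0 -- d(0,1)=0.5 is pulled up to d2(0)=1.0
print(reach_dist(pts, 0, 3, 2))   # 10.0 -- the raw distance dominates
```

The max acts as a smoothing device: points inside the k-distance all get the same reachability distance, which stabilizes the density estimate that follows.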

5. lrdk(p): local reachability density

lrdk(p) = |Nk(p)| / Σ x∈Nk(p) reach-distk(p, x)

This formula is the reciprocal of the average reachability distance between point p and the points in its k-nearest neighborhood Nk(p).

lrdk(p) represents a density: the higher the density, the more likely p belongs to the same cluster as its neighbors; the lower the density, the more likely p is an outlier. If p and its neighboring points lie in the same cluster, the reachability distances are more likely to be the smaller dk(p), so their sum is small and, since the local reachability density is the reciprocal of the average reach-dist, the density value of p is high. If p is far from its neighboring points, the reachability distances are more likely to be the larger actual distances d(p, x); the average reach-dist grows, the local reachability density shrinks, and p is highly likely to be an outlier.
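As an illustration, a Python sketch of the local reachability density; it uses the textbook LOF convention, where the reachability distance from p to a neighbor x takes the neighbor's k-distance, max{dk(x), d(p, x)} (data and names are illustrative):

```python
import math

def neighborhood(points, p, k):
    """Return dk(p) and Nk(p) (ties included)."""
    dists = sorted((math.dist(points[p], q), i)
                   for i, q in enumerate(points) if i != p)
    dk = dists[k - 1][0]
    return dk, [i for d, i in dists if d <= dk]

def lrd(points, p, k):
    """lrdk(p) = |Nk(p)| / sum of max{dk(x), d(p, x)} over x in Nk(p)."""
    _, nk = neighborhood(points, p, k)
    total = 0.0
    for x in nk:
        dk_x, _ = neighborhood(points, x, k)
        total += max(dk_x, math.dist(points[p], points[x]))
    return len(nk) / total

# A tight unit square plus one far-away point: the cluster members get a
# high density, the isolated point a much lower one
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(lrd(pts, 0, 2))   # 1.0 -- dense cluster member
print(lrd(pts, 4, 2))   # ~0.096 -- isolated point
```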

6. LOFk (p): local outlier factor

The formula for calculating the local outlier factor is as follows:

LOFk(p) = ( Σ x∈Nk(p) lrdk(x) / lrdk(p) ) / |Nk(p)|

The local outlier factor indicates how anomalous a data point is: the larger the factor, the more isolated the point is within its data neighborhood.

This formula represents the mean, over the neighbors in Nk(X), of the ratio of each neighbor's local reachability density to the local reachability density of the point X itself.

If the ratio is close to 1, the density of X is similar to that of its neighbors, and X probably belongs to the same cluster as them. If the ratio is smaller than 1, the density of X is higher than that of its neighbors, so X is a dense interior point. If the ratio is greater than 1, the density of X is lower than that of its neighbors, and the larger the value, the more likely X is an anomalous outlier.
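Putting the pieces together, a self-contained Python sketch of the factor (illustrative data; lrd again uses the textbook convention with the neighbor's k-distance): a cluster member scores about 1, while the isolated point scores far above 1.

```python
import math

def neighborhood(points, p, k):
    dists = sorted((math.dist(points[p], q), i)
                   for i, q in enumerate(points) if i != p)
    dk = dists[k - 1][0]
    return dk, [i for d, i in dists if d <= dk]

def lrd(points, p, k):
    _, nk = neighborhood(points, p, k)
    total = sum(max(neighborhood(points, x, k)[0],
                    math.dist(points[p], points[x])) for x in nk)
    return len(nk) / total

def lof(points, p, k):
    """LOFk(p): mean of lrdk(x) / lrdk(p) over the neighbors x in Nk(p)."""
    _, nk = neighborhood(points, p, k)
    return sum(lrd(points, x, k) for x in nk) / (len(nk) * lrd(points, p, k))

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(lof(pts, 0, 2))   # 1.0 -- same density as its neighbors
print(lof(pts, 4, 2))   # about 10.4 -- clearly an outlier
```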

Advantages and disadvantages of the LOF algorithm

Advantages

Only the value of k needs to be set, so the algorithm is easy to apply. Because it is density-based, analyzing X against only the k points around it, regions of extremely high or extremely low density have less impact on the analysis of the overall dataset, which makes the results more accurate.

Disadvantages

For a fixed dataset, which points are judged outliers depends on the choice of k: different k values yield different outliers. We therefore need to tune this parameter until the results match expectations.

LOF algorithm implementation steps

The algorithm implementation process is as follows:

(1) Select an appropriate k.

(2) Obtain the neighborhood of each data object in the dataset and, using the Euclidean distance, calculate the distance between each data object and all other data objects.

(3) Based on the results of (2), traverse all data objects and derive the local outlier factor of each one using the formula above.

(4) Given a defined threshold, find the data points whose local outlier factor exceeds it and treat them as outliers.

Below is the code for implementing the LOF algorithm in C#.
private void LOFDist()
{
    LOFk = 20;                       // k (neighborhood size)
    maxLof = 3;                      // outlier threshold
    NkLOF = new int[RowCount, LOFk];
    lrdk = new double[RowCount];
    LOFK = new double[RowCount];
    double tmpLofSum = 0;

    // Determine the neighborhood Nk(p) of every point and compute lrdk
    for (int i = 0; i < RowCount; i++)
    {
        Nk(i);
    }

    richTextBox1.Text += "\nLOF computing";

    // Compute LOFK: the mean ratio of the neighbors' lrd to the point's own lrd
    for (int i = 0; i < RowCount; i++)
    {
        tmpLofSum = 0;
        for (int j = 0; j < LOFk; j++)
        {
            tmpLofSum += lrdk[NkLOF[i, j]] / lrdk[i];
        }
        LOFK[i] = tmpLofSum / LOFk;
    }

    richTextBox1.Text += "\nAnalyze the outliers:";

    // Report points above the threshold and show LOFK in the ListView
    for (int i = 0; i < RowCount; i++)
    {
        if (LOFK[i] > maxLof)
            richTextBox1.Text += "\nNo. " + (i + 1).ToString() + " data";
        XlsDataSh.Items[i].SubItems[8].Text = LOFK[i].ToString();
    }
}

private void Nk(int DataPtS)
{
    double[] TmpDis = new double[RowCount];
    double tmpdis = 0, tmpNpKMin = 0, tmpSum = 0;

    // Euclidean distance from point DataPtS to every other point
    // (columns 3..6 of the ListView hold the feature values)
    for (int i = 0; i < RowCount; i++)
    {
        tmpdis = 0;
        for (int j = 3; j < 7; j++)
        {
            tmpdis += Math.Pow(
                Convert.ToDouble(XlsDataSh.Items[i].SubItems[j].Text) -
                Convert.ToDouble(XlsDataSh.Items[DataPtS].SubItems[j].Text), 2);
        }
        TmpDis[i] = Math.Sqrt(tmpdis);
    }

    double[] copy = new double[TmpDis.Length];
    TmpDis.CopyTo(copy, 0);
    copy[DataPtS] = double.MaxValue;            // exclude the point itself

    // Determine the neighborhood Nk(p): the k nearest points
    for (int i = 0; i < LOFk; i++)
    {
        tmpNpKMin = double.MaxValue;
        for (int j = 0; j < RowCount; j++)
        {
            if (copy[j] < tmpNpKMin)
            {
                tmpNpKMin = copy[j];
                NkLOF[DataPtS, i] = j;
            }
        }
        copy[NkLOF[DataPtS, i]] = double.MaxValue;   // mark as taken
    }

    // lrdk(p): k divided by the summed distances to the k nearest neighbors
    // (the actual distances are used here as a simplification of the
    //  reachability distances)
    for (int i = 0; i < LOFk; i++)
    {
        tmpSum += TmpDis[NkLOF[DataPtS, i]];
    }
    lrdk[DataPtS] = LOFk / tmpSum;
}
Code comments:

1. As in the previous article, the dataset comes from Excel: the data is imported from the Xls file into a ListView control in C#. The control is named XlsDataSh, as you can see in the program code.

2. All attributes of data point No. 143 were set to 0 so that it serves as a monitored object for testing the LOF algorithm.

3. Here the k value (the neighborhood size) is set to 20. In the previous post, the number of groups I set was 5, so in a class of over one hundred people I think 20 is a suitable k value for grouping.

4. I set the threshold to 3. A normal data point's factor should be in the vicinity of 1; to highlight the degree of anomaly, I used 3.

Run

Figure 1 confirms the earlier conclusion: normal data points have factors in the vicinity of 1. From the anomalous point in Figure 2, data point No. 143, we can see that its outlier factor is about 5, much greater than 1 and also greater than the threshold we set, so we can conclude that it must be an outlier.
