Detection of local outliers based on density

Last Update:2015-10-22 Source: Internet

Author: User

Today we introduce density-based anomaly detection, using the LOF (Local Outlier Factor) algorithm as an example. First, look at the following figure.

C1 and C2 are two clusters. The density inside each cluster is high, but the typical distance between points differs from one cluster to the other. Point b is a local outlier relative to C1, while point a is a global outlier. Outlier detection here is therefore not based on distance or nearest neighbors alone, but on the difference between the density at a point and the density at the points around it: if the density at a point is significantly lower than the density at its surrounding points, the point is likely to be an outlier.
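To see why a single distance threshold cannot catch a local outlier like b, here is a minimal sketch on made-up one-dimensional data. The cluster spacings and the position of b are assumptions chosen for illustration, not values from the figure.

```python
import numpy as np

# Hypothetical data: a dense cluster (spacing 0.1), a sparse cluster
# (spacing 2.0), and a point b sitting 1.0 away from the dense cluster.
dense = np.arange(10) * 0.1            # 0.0, 0.1, ..., 0.9
sparse = 20.0 + np.arange(10) * 2.0    # 20, 22, ..., 38
b = np.array([1.9])
x = np.concatenate([dense, sparse, b])

# Nearest-neighbor distance of every point (self-distance excluded).
D = np.abs(x[:, None] - x[None, :])
np.fill_diagonal(D, np.inf)
nn_dist = D.min(axis=1)

dense_max = nn_dist[:10].max()     # largest spacing inside the dense cluster
sparse_min = nn_dist[10:20].min()  # smallest spacing inside the sparse cluster
b_dist = nn_dist[-1]               # distance from b to its nearest neighbor

print(dense_max, b_dist, sparse_min)
```

In this toy setup b is about ten times farther from its nearest neighbor than any point of the dense cluster, yet still closer than the ordinary points of the sparse cluster are to each other. Any global distance threshold low enough to flag b would also flag the whole sparse cluster, which is the motivation for comparing densities instead.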

The question is how to define density. To get there, we first introduce the reachability distance, which is defined as follows:

$$\text{reach-dist}_k(A, B) = \max\{\text{k-distance}(B),\ d(A, B)\}$$

This distance is directional, from A to B: it is the larger of B's k-distance and the actual distance between A and B. So reachability distance(A, B) and reachability distance(B, A) are likely to be unequal, because the k-distances of A and B are likely to differ. The k-distance used here is, for each point, the largest of the distances to its k nearest neighbors, that is, the distance to its k-th nearest neighbor. Let's implement this function first:

import numpy as np
import scipy.spatial.distance as ssd
from sklearn.neighbors import NearestNeighbors

def kdistances(X, k=5):
    n = X.shape[0]
    k = min(k, n - 1)
    nbhrinfos = [None] * n
    knn = NearestNeighbors(metric=ssd.euclidean).fit(X)
    kdists = np.zeros(n, dtype=float)
    for i in range(n):
        # k-distance: the largest of the k nearest neighbor distances
        # (k + 1 because the query point itself is returned at distance 0)
        dist, _ = knn.kneighbors(X[i:i + 1], n_neighbors=k + 1)
        kdists[i] = dist[0][-1]
        # the set of k nearest neighbors Nk(A) and their distances
        dists, neighbors = knn.radius_neighbors(X[i:i + 1], radius=kdists[i])
        mask = neighbors[0] != i  # exclude self from the k nearest neighbors
        nbhrinfos[i] = [neighbors[0][mask], dists[0][mask]]
    return kdists, nbhrinfos

Here we compute not only the k-distance of each point, but also the points that lie within that k-distance radius and their distances (excluding the point itself). This prepares for the reachability distance calculation that follows. Here is the code to calculate the reachability distance:

def reachabilitydistance(A, B, kdistB, dist=None):
    # dist is the precomputed distance between A and B, if available
    if dist is None:
        dist = ssd.euclidean(A, B)
    return max(kdistB, dist)
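The directional nature of the reachability distance can be checked on a small example. The sketch below is self-contained and uses plain NumPy; the helper names `k_distance` and `reach_dist` and the four sample points are illustrative assumptions, not part of the article's code.

```python
import numpy as np

def k_distance(points, idx, k):
    """Distance from points[idx] to its k-th nearest neighbor (self excluded)."""
    d = np.linalg.norm(points - points[idx], axis=1)
    d = np.delete(d, idx)              # drop the zero distance to itself
    return np.sort(d)[k - 1]

def reach_dist(points, a, b, k):
    """Reachability distance from a to b: max(k-distance(b), d(a, b))."""
    d_ab = np.linalg.norm(points[a] - points[b])
    return max(k_distance(points, b, k), d_ab)

# Hypothetical points on a line at x = 0, 1, 2, 4.
points = np.array([[0.0], [1.0], [2.0], [4.0]])
k = 2

# d(2, 3) = 2 in both directions, but the k-distances differ:
# k-distance(point 3) = 3 and k-distance(point 2) = 2.
forward = reach_dist(points, 2, 3, k)   # max(3, 2) = 3.0
backward = reach_dist(points, 3, 2, k)  # max(2, 2) = 2.0
print(forward, backward)
```

The plain distance between the two points is the same either way; only the k-distance of the target point changes the result.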

Let's use a figure to explain what reachability distance means:

For point A with k = 3, the k-distance is the distance from A to B, that is, the radius of a circle centered at A that passes through B. What is the reachability distance(D, A) from D to A? Note that this is from D to A, so it is directional. Because the distance between A and D is greater than A's k-distance (k = 3), reachability distance(D, A) is simply the distance between A and D.

By this definition, let's look at another picture:

The reachability distances from point A to its 3 nearest neighbors are marked in red. The large dashed arc is A's k-distance, and the small dashed arcs are the k-distances of A's nearest neighbors. It is clear here that if a point is an outlier, its reachability distances to its nearest neighbors will be far larger than the reachability distances of those neighbors to their own nearest neighbors. This is a bit of a detour, but reachability distance is an important measure, and once it is implemented it becomes easier to see how LOF uses it.

The author sums the reachability distances from a point to all of the neighbors within its k-distance, takes the reciprocal, and multiplies by the number of those neighbors. The author calls the result the local reachability density (lrd). It can be read as the number of neighbors contained per unit of distance: the more neighbors, the higher the density. The formula is as follows:

$$\text{lrd}_k(A) = \frac{|N_k(A)|}{\sum_{B \in N_k(A)} \text{reach-dist}_k(A, B)}$$

The code is as follows. Here the function is defined for a single sample point, which is closer to the definition in the literature.

def lrd(i, X, kdists, nbhrinfos):
    A = X[i]
    kneighborsofA, knndistsofA = nbhrinfos[i]
    sumofreachabilitydistances = 0.
    # reachability distances from A to each of its neighbors
    for j in range(len(kneighborsofA)):
        B = X[kneighborsofA[j]]
        kdistB = kdists[kneighborsofA[j]]
        distAB = knndistsofA[j]
        reachdistance = reachabilitydistance(A, B, kdistB, distAB)
        sumofreachabilitydistances += reachdistance
    # A's density
    LRD = len(kneighborsofA) / sumofreachabilitydistances
    return LRD
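As a quick sanity check of the density definition, the following self-contained sketch computes the local reachability density for a tiny made-up data set with plain NumPy. The function name `local_reachability_density` and the five sample points are assumptions for illustration, and the sketch assumes there are no duplicate points (which would make a sum of reachability distances zero).

```python
import numpy as np

def local_reachability_density(X, k):
    """lrd of every row of X; neighbors are all points within the k-distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(D, np.inf)              # a point is not its own neighbor
    kdist = np.sort(D, axis=1)[:, k - 1]     # distance to the k-th nearest neighbor
    lrd = np.empty(len(X))
    for a in range(len(X)):
        nbrs = np.where(D[a] <= kdist[a])[0]           # N_k(a), ties included
        reach = np.maximum(kdist[nbrs], D[a, nbrs])    # reach-dist(a, b) for each b
        lrd[a] = len(nbrs) / reach.sum()
    return lrd

# Hypothetical data: four evenly spaced points and one isolated point.
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
densities = local_reachability_density(X, k=2)
print(densities)  # approximately [0.667, 0.667, 0.667, 0.667, 0.133]
```

The four clustered points all get a density of 2/3, while the isolated point at x = 10 gets 2/15, five times lower than its neighbors.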

The lrd of each point can be calculated using the neighbor information returned by the earlier kdistances function. The lrd of an outlier is generally significantly smaller than the lrd of its nearest neighbors. Using lrd, the author constructs an outlier score, LOF. The formula is as follows:

$$\text{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \text{lrd}_k(B)}{|N_k(A)| \cdot \text{lrd}_k(A)}$$

The formula reflects the author's intention: the average lrd of a point's nearest neighbors is compared with the point's own lrd, lrd(A). If A is an outlier, its lrd is very small, which results in a significantly higher score. The author also proves that for points inside a dense region the LOF score is close to 1, and a score below 1 means the point lies in a denser spot than its neighbors. Of course, the LOF threshold still has to be chosen empirically, and this is LOF's flaw: scores are not on a uniform scale, so they are not easy to interpret. Here's the code:

def callof(i, X, kdists, nbhrinfos, lrdcache=None):
    A = X[i]
    kneighborsofA, _ = nbhrinfos[i]
    if lrdcache is None:
        lrdcache = {}
    if i not in lrdcache:
        lrdcache[i] = lrd(i, X, kdists, nbhrinfos)
    lrdA = lrdcache[i]
    # A's neighbors' densities
    sumofneighborslrd = 0.
    for j in kneighborsofA:
        if j not in lrdcache:
            lrdcache[j] = lrd(j, X, kdists, nbhrinfos)
        lrdB = lrdcache[j]
        sumofneighborslrd += lrdB
    # average density of A's neighbors divided by A's density ---> LOF
    LOF = sumofneighborslrd / (len(kneighborsofA) * lrdA)
    return LOF
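To make the behavior of the score concrete, here is a compact, self-contained NumPy sketch that goes from pairwise distances to LOF scores on a toy data set. It is a simplified illustration under stated assumptions, not the article's implementation: the name `lof_scores` and the sample points are made up, and duplicate points are assumed absent.

```python
import numpy as np

def lof_scores(X, k):
    """LOF of every row of X; neighbors are all points within the k-distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                       # exclude self
    kdist = np.sort(D, axis=1)[:, k - 1]              # k-distance per point
    nbrs = [np.where(D[a] <= kdist[a])[0] for a in range(len(X))]
    # lrd(a) = |N_k(a)| / sum of reach-dist(a, b) over b in N_k(a)
    lrd = np.array([len(nb) / np.maximum(kdist[nb], D[a, nb]).sum()
                    for a, nb in enumerate(nbrs)])
    # LOF(a) = mean lrd of a's neighbors divided by lrd(a)
    return np.array([lrd[nb].mean() / lrd[a] for a, nb in enumerate(nbrs)])

# Hypothetical data: four evenly spaced points and one isolated point.
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
scores = lof_scores(X, k=2)
print(scores)  # approximately [1, 1, 1, 1, 5]
```

The evenly spaced points score 1 because each has the same density as its neighbors, while the isolated point scores about 5 because its neighbors are five times denser than it is.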

Finally we integrate the code into one function. It takes two parameters: minpts is the number of neighbors LOF looks at, and lofspec is the LOF score at or above which a point is considered an outlier.

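def detectoutliers(X, minpts=5, lofspec=3):
    n = X.shape[0]
    # k-distances and neighbor sets for every point
    kdists, nbhrinfos = kdistances(X, minpts)
    # cache of lrd values shared across all LOF computations
    lrdcache = {}
    lofs = np.array([callof(i, X, kdists, nbhrinfos, lrdcache) for i in range(n)])
    # label 1 marks an outlier, 0 an inlier
    labels = np.array(lofs >= lofspec, dtype=int)
    return labels, lofs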

Let's look at the results of the test:

The LOF scores of these three points were 4.09888044, 11.33202877, and 15.17396308.
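For comparison, scikit-learn ships its own LOF implementation, `sklearn.neighbors.LocalOutlierFactor`. The sketch below applies it to synthetic data and thresholds the scores at 3, mirroring the minpts = 5 and lofspec = 3 defaults above. The data, the random seed, and the two planted outliers are assumptions made up for this example, and scikit-learn's scores need not match the code above exactly, since it uses exactly k neighbors rather than every point within the k-distance radius.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: one dense cluster, one sparse cluster, two planted outliers.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.2, size=(50, 2))
sparse = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
planted = np.array([[1.5, 1.5], [10.0, -5.0]])
X = np.vstack([dense, sparse, planted])

lof = LocalOutlierFactor(n_neighbors=5)
lof.fit(X)

# scikit-learn stores the negated LOF, so flip the sign to get the usual score.
scores = -lof.negative_outlier_factor_
labels = (scores >= 3).astype(int)   # 1 marks an outlier, as in detectoutliers

print("planted outlier scores:", scores[-2:])
print("number flagged:", labels.sum())
```

Most points should score close to 1, while a point planted far from both clusters should score well above the threshold.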

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
