Detection of local outliers based on density

Source: Internet
Author: User

Today we introduce density-based anomaly detection, taking the LOF (Local Outlier Factor) algorithm as an example. Let's start with a picture.

C1 and C2 are two clusters. Points inside each cluster are packed closely, but the typical spacing between points differs from one cluster to the other. b is a local outlier relative to C1, while a is a global outlier with respect to the whole data set. Outlier detection here is therefore not based on distance or nearest neighbors alone, but on the difference between the density at a point and the density at the points around it. If the density at a point is significantly lower than the density at its surrounding points, the point is likely to be an outlier!
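
To see why a plain distance rule is not enough, here is a minimal sketch on made-up data. The two grids and the point b below are hypothetical and are not the layout of the picture. It compares each point's nearest-neighbor distance across a dense cluster, a sparse cluster, and a point sitting just outside the dense one.

```python
import math

# Hypothetical data: a dense 5x5 grid (spacing 0.1), a sparse 5x5 grid
# (spacing 1.0), and one point b placed 0.5 away from the dense grid.
dense = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(5)]
sparse = [(10.0 + x, 10.0 + y) for x in range(5) for y in range(5)]
b = (0.9, 0.2)
points = dense + sparse + [b]

def nn_distance(i, points):
    """Distance from points[i] to its nearest neighbor (itself excluded)."""
    return min(math.dist(points[i], q) for j, q in enumerate(points) if j != i)

nn = [nn_distance(i, points) for i in range(len(points))]
nn_dense, nn_sparse, nn_b = nn[:25], nn[25:50], nn[50]

print(f"largest nearest-neighbor distance in the dense cluster : {max(nn_dense):.2f}")
print(f"smallest nearest-neighbor distance in the sparse cluster: {min(nn_sparse):.2f}")
print(f"nearest-neighbor distance of b                          : {nn_b:.2f}")
```

In this sketch b is farther from its neighbors than any dense-cluster point, yet closer to them than every sparse-cluster point is to its own. Any single distance threshold that flags b would flag the whole sparse cluster too, which is the motivation for comparing local densities instead.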

The question is how to define density. To build up to it, we first introduce the reachability distance, which is defined as follows:

    reachability-distance_k(A, B) = max(k-distance(B), d(A, B))

This distance is directional (A --> B): it is the larger of the k-distance of B and the actual distance between A and B. So reachability-distance(A, B) and reachability-distance(B, A) are likely to be unequal, because the k-distances of A and B are likely to differ. The definition relies on another quantity, the k-distance: for each point, it is the distance to its k-th nearest neighbor, that is, the largest of the distances to its k nearest neighbors. Let's implement this first:

    import numpy as np
    import scipy.spatial.distance as ssd
    from sklearn.neighbors import NearestNeighbors

    def kdistances(X, k=5):
        n = X.shape[0]
        k = min(k, n - 1)
        nbhrinfos = [None] * n
        knn = NearestNeighbors(metric=ssd.euclidean).fit(X)
        kdists = np.zeros(n, dtype=float)
        for i in range(n):
            query = X[i].reshape(1, -1)  # sklearn expects a 2-D array
            # k-distance: distance to the k-th nearest neighbor
            # (k + 1 because the query point is returned as its own neighbor)
            dist, _ = knn.kneighbors(query, n_neighbors=k + 1)
            kdists[i] = dist[0][-1]
            # N_k(A): all points within the k-distance, with their distances
            dists, neighbors = knn.radius_neighbors(query, radius=kdists[i])
            mask = neighbors[0] != i  # exclude the point itself
            nbhrinfos[i] = [neighbors[0][mask], dists[0][mask]]
        return kdists, nbhrinfos

Here we compute not only the k-distance of each point but also, for each point, the points that lie within its k-distance radius and their distances (excluding the point itself). This prepares for the reachability distance calculation that follows. Here is the code that computes the reachability distance:

    def reachabilitydistance(A, B, kdistB, dist=None):
        # dist is the distance between A and B; compute it if not supplied
        if dist is None:
            dist = ssd.euclidean(A, B)
        return max(kdistB, dist)
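
The asymmetry is easy to check by hand. The following is a small standalone sketch on a hypothetical 1-D data set. The helper names k_distance and reach_dist are ours for illustration and are not part of the code above.

```python
# Hypothetical 1-D data set: four evenly spaced points and one far away.
X = [0.0, 1.0, 2.0, 3.0, 10.0]

def k_distance(X, i, k):
    """Distance from X[i] to its k-th nearest neighbor (X[i] itself excluded)."""
    dists = sorted(abs(X[i] - X[j]) for j in range(len(X)) if j != i)
    return dists[k - 1]

def reach_dist(X, a, b, k):
    """Reachability distance from X[a] to X[b]: max(k-distance(b), d(a, b))."""
    return max(k_distance(X, b, k), abs(X[a] - X[b]))

k = 2
far, near = 4, 3  # indices of the points 10.0 and 3.0

print(reach_dist(X, far, near, k))  # 7.0: the k-distance of 3.0 is 2, so d = 7 wins
print(reach_dist(X, near, far, k))  # 8.0: the k-distance of 10.0 is 8, which beats d = 7
```

The actual distance between the two points is 7 in both directions, but the isolated point has a large k-distance of its own, so the reachability distance toward it is inflated to 8.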

Let's use a picture to explain what the reachability distance means:

For point A with k = 3, the k-distance is the distance from A to B, so the k-distance neighborhood of A is a circle of radius AB. What is the reachability distance from D to A, reachability-distance(D, A)? Note that this is from D to A; the direction matters! Because the distance between A and D is greater than the k-distance (k = 3) of A, reachability-distance(D, A) is simply the distance between A and D.

With this definition in mind, let's look at another picture:

The reachability distances from point A to its 3 nearest neighbors are marked in red. The large dashed arc is A's k-distance, and the small dashed arcs are the k-distances of A's nearest neighbors. It is clear that if a point is an outlier, its reachability distances to its nearest neighbors will be far larger than the reachability distances among those neighbors themselves. This has been a bit of a detour, but reachability distance is a key measure, and we now have a clear picture of it.

With this measure implemented, we can see how LOF uses it. The authors sum the reachability distances from a point to all the neighbors in its k-distance neighborhood N_k, then take the reciprocal and multiply by the number of those neighbors. They call the result the local reachability density (lrd). It can be read as the number of neighbors per unit of distance: the more neighbors a unit of distance contains, the higher the density. The formula is as follows:

    lrd_k(A) = |N_k(A)| / (sum over B in N_k(A) of reachability-distance_k(A, B))

The code is as follows. Here the function is defined for a single sample point, which stays close to the definition in the literature.

    def lrd(i, X, kdists, nbhrinfos):
        A = X[i]
        kneighborsofA, knndistsofA = nbhrinfos[i]
        sumofreachabilitydistances = 0.
        # reachability distances from A to each of its neighbors
        for j in range(len(kneighborsofA)):
            B = X[kneighborsofA[j]]
            kdistB = kdists[kneighborsofA[j]]
            distAB = knndistsofA[j]
            reachdistance = reachabilitydistance(A, B, kdistB, distAB)
            sumofreachabilitydistances += reachdistance
        # A's density
        LRD = len(kneighborsofA) / sumofreachabilitydistances
        return LRD
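
To get a feel for the numbers, here is a brute-force sketch of the same quantity on a hypothetical 1-D data set. It uses no libraries, its helper names are ours, and it assumes there are no duplicate points (a duplicate could make the sum zero).

```python
# Hypothetical 1-D data set: four evenly spaced points and one far away.
X = [0.0, 1.0, 2.0, 3.0, 10.0]
k = 2

def dist(i, j):
    return abs(X[i] - X[j])

def k_distance(i):
    """Distance from X[i] to its k-th nearest neighbor (itself excluded)."""
    return sorted(dist(i, j) for j in range(len(X)) if j != i)[k - 1]

def neighbors(i):
    """All other points within the k-distance of X[i]."""
    r = k_distance(i)
    return [j for j in range(len(X)) if j != i and dist(i, j) <= r]

def lrd(i):
    """Number of neighbors divided by the sum of reachability distances to them."""
    nbrs = neighbors(i)
    total = sum(max(k_distance(j), dist(i, j)) for j in nbrs)
    return len(nbrs) / total

for i in range(len(X)):
    print(X[i], round(lrd(i), 4))
```

The four evenly spaced points all get an lrd of 2/3, while the isolated point at 10.0 gets 2/15, five times lower than the points it counts as its neighbors.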

The lrd of each point can be computed from the nearest-neighbor information returned by kdistances above. The lrd of an outlier is generally much smaller than the lrd of its nearest neighbors. Using lrd, the authors construct an outlier score, the LOF. The formula is as follows:

    LOF_k(A) = (sum over B in N_k(A) of lrd_k(B)) / (|N_k(A)| * lrd_k(A))

The formula reflects the authors' intention: compare the average lrd of a point's nearest neighbors with lrd(A), the lrd of the point itself. If A is an outlier, its lrd is very small, which yields a significantly higher score! The authors also show that for points inside a dense region the LOF score tends toward 1; in general, the smaller the score, the higher the point's density relative to its neighbors. Of course, the cut-off on the LOF score has to be chosen empirically, and this is a weakness of LOF: the scores are not on a uniform scale, so outliers are not always easy to identify. Here's the code:

    def callof(i, X, kdists, nbhrinfos, lrdcache=None):
        kneighborsofA, _ = nbhrinfos[i]
        if lrdcache is None:
            lrdcache = {}
        if i not in lrdcache:
            lrdcache[i] = lrd(i, X, kdists, nbhrinfos)
        lrdA = lrdcache[i]
        # densities of A's neighbors
        sumofneighborslrd = 0.
        for j in kneighborsofA:
            if j not in lrdcache:
                lrdcache[j] = lrd(j, X, kdists, nbhrinfos)
            lrdB = lrdcache[j]
            sumofneighborslrd += lrdB
        # average density of A's neighbors divided by A's density ---> LOF
        LOF = sumofneighborslrd / (len(kneighborsofA) * lrdA)
        return LOF
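
As a sanity check on the whole chain (k-distance, neighborhood, lrd, LOF), here is a compact brute-force sketch in plain Python. The function name lof_scores and the data are hypothetical. It assumes a small data set with no duplicate points, and it is meant for checking numbers by hand, not as a replacement for the code above.

```python
import math

def lof_scores(X, k):
    """Brute-force LOF for a small list of points given as tuples."""
    n = len(X)
    d = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # k-distance: distance to the k-th nearest neighbor, self excluded
    kdist = [sorted(d[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)]
    # k-distance neighborhood: every other point within the k-distance
    nbrs = [[j for j in range(n) if j != i and d[i][j] <= kdist[i]] for i in range(n)]
    # local reachability density
    lrd = [len(nbrs[i]) / sum(max(kdist[j], d[i][j]) for j in nbrs[i])
           for i in range(n)]
    # LOF: average lrd of the neighbors divided by the point's own lrd
    return [sum(lrd[j] for j in nbrs[i]) / (len(nbrs[i]) * lrd[i])
            for i in range(n)]

# Hypothetical data: four evenly spaced points on a line and one far away.
X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
scores = lof_scores(X, k=2)
print([round(s, 3) for s in scores])  # [1.0, 1.0, 1.0, 1.0, 5.0]
```

The evenly spaced points score 1, matching the claim that points in a uniformly dense region have a LOF near 1, while the isolated point scores 5 because its neighbors are five times denser than it is.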

Finally, we combine the code into a single function. It takes two parameters: minpts is the number of neighbors LOF looks at, and lofspec is the LOF score at or above which a point is labeled an outlier.

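    def detectoutliers(X, minpts=5, lofspec=3):
        n = X.shape[0]
        kdists, nbhrinfos = kdistances(X, minpts)
        lrdcache = {}  # shared across points so each lrd is computed once
        lofs = np.array([callof(i, X, kdists, nbhrinfos, lrdcache) for i in range(n)])
        labels = np.array(lofs >= lofspec, dtype=int)  # 1 = outlier, 0 = inlier
        return labels, lofs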

Let's look at the results of the test:

The LOF scores of these three points were 4.09888044, 11.33202877, and 15.17396308.
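
These three scores also show why the cut-off is a judgment call. The sketch below applies a few candidate values of lofspec to the three reported scores, mixed with some inlier scores near 1. The inlier scores and the helper name label are made up for illustration.

```python
# The three LOF scores reported above, plus hypothetical inlier scores near 1.
reported = [4.09888044, 11.33202877, 15.17396308]
hypothetical_inliers = [0.97, 1.02, 1.10]
lofs = hypothetical_inliers + reported

def label(lofs, lofspec):
    """1 for scores at or above the cut-off, 0 otherwise."""
    return [int(s >= lofspec) for s in lofs]

for lofspec in (3, 5, 12):
    print(lofspec, label(lofs, lofspec))
```

With lofspec = 3 all three reported points are labeled as outliers, with 5 only two are, and with 12 only one, so the threshold should be chosen with the data set in mind.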
