Introduction to Data Mining: 2, Density Clustering


For an introduction to density clustering, see this article: http://blog.csdn.net/uestcfrog/article/details/6876360

Definitions:

1. For an object in space, if the number of objects in its neighborhood of a given radius ε is at least the density threshold MinPts, the object is called a core object; otherwise it is called a border object.

2. If p is a core object and q lies in the ε-neighborhood of p, then q is directly density-reachable from p.

3. If there is a chain of objects p1, p2, …, pn with p1 = p and pn = q, such that each pi+1 is directly density-reachable from pi, then q is density-reachable from p.

4. If there exists an object o from which both p and q are density-reachable, then p and q are density-connected.

5. A cluster consists of a core object together with all objects that are density-reachable from it.

For example, suppose A is a core object and B is a border object. Then B is directly density-reachable from A, but A is not directly density-reachable from B, because B is not a core object. If A is directly density-reachable from C, and B is directly density-reachable from A, then B is density-reachable from C. C, however, is not density-reachable from B (again because B is not a core object), yet B and C are density-connected.
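The five definitions above can be illustrated with a small sketch (helper names and the ε and MinPts values are chosen only for this example):

```python
# Classify each point of a tiny 2-D data set as a core object, a border
# object, or noise, following definitions 1-5 above.

def neighbours(points, p, eps):
    """All points within radius eps of p (p itself included)."""
    return [q for q in points
            if ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 <= eps]

def classify(points, eps, min_pts):
    labels = {}
    for p in points:
        if len(neighbours(points, p, eps)) >= min_pts:
            labels[tuple(p)] = "core"        # definition 1
        elif any(len(neighbours(points, q, eps)) >= min_pts
                 for q in neighbours(points, p, eps)):
            labels[tuple(p)] = "border"      # lies in some core object's neighborhood
        else:
            labels[tuple(p)] = "noise"       # reachable from no core object
    return labels

pts = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]]
print(classify(pts, eps=1.5, min_pts=4))
# the four corner points are core objects; (5, 5) is noise
```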

Starting from an arbitrary object p, DBSCAN extracts, according to the parameters ε and MinPts, all objects density-reachable from p, thereby obtaining one cluster:

1. Start from an arbitrary object p.

a) If p is a core object, mark p and every object directly density-reachable from p as class i, then recurse on each of those directly density-reachable objects qi (i.e., return to step 1 with qi in place of p).

b) If p is a border object, mark p as noise for now.

2. i++ (advance to the next class label).

3. If any objects remain unmarked, choose one of them as p and return to step 1.

Each pass of this procedure yields one class; repeating it yields the remaining classes.
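The steps above can be sketched as a short pure-Python routine (names and parameter values are illustrative; the article's full listing appears further below):

```python
# Compact sketch of the expansion loop described in steps 1-3.
# labels[i] is 0 (unmarked), -1 (noise) or a positive cluster id.

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def dbscan(points, eps, min_pts):
    labels = [0] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != 0:                    # already marked
            continue
        seeds = [j for j in range(len(points))
                 if dist(points[i], points[j]) <= eps]
        if len(seeds) < min_pts:              # step 1b: noise for now
            labels[i] = -1
            continue
        cluster_id += 1                       # step 2: next class label
        for j in seeds:                       # step 1a: expansion, done iteratively
            if labels[j] <= 0:                # unmarked, or previously tagged noise
                labels[j] = cluster_id
                more = [k for k in range(len(points))
                        if dist(points[j], points[k]) <= eps]
                if len(more) >= min_pts:      # j is itself a core object
                    seeds.extend(k for k in more if labels[k] <= 0)
        # step 3: the outer loop moves on to the next unmarked object
    return labels

pts = [[1, 1], [1, 2], [2, 1], [2, 2],
       [8, 8], [8, 9], [9, 8], [9, 9], [5, 15]]
print(dbscan(pts, eps=1.5, min_pts=3))
# → [1, 1, 1, 1, 2, 2, 2, 2, -1]
```

Note that `seeds` is deliberately extended while it is being iterated, which is how the frontier of a cluster grows as new core objects are found.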

Advantages:

1. Robust to noise.

2. Can find clusters of arbitrary shape.

Disadvantages:

1. The clustering result depends heavily on the parameters.

2. DBSCAN uses fixed parameters to identify clusters, so when clusters differ in density, the same criteria may destroy their natural structure: sparser clusters may be split into several classes, while dense clusters that lie close together may be merged into one.
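The second point can be seen on a tiny example: with a fixed ε tuned to a dense cluster, no point of a sparser cluster passes the core-object test, so that cluster dissolves into noise (a sketch with illustrative values):

```python
# One dense cluster (spacing 0.5) and one sparse cluster (spacing 2.0).
# With eps = 1 and min_pts = 3, every dense point is a core object,
# but no sparse point is, so the sparse cluster would become noise.

def count_in_eps(points, p, eps):
    return sum(1 for q in points
               if ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 <= eps)

dense = [[x * 0.5, 0.0] for x in range(5)]          # 0.0, 0.5, ..., 2.0
sparse = [[10.0 + x * 2.0, 0.0] for x in range(5)]  # 10.0, 12.0, ..., 18.0
points = dense + sparse

eps, min_pts = 1.0, 3
dense_core = [p for p in dense if count_in_eps(points, p, eps) >= min_pts]
sparse_core = [p for p in sparse if count_in_eps(points, p, eps) >= min_pts]
print(len(dense_core), len(sparse_core))
# → 5 0   (all dense points are core objects, no sparse point is)
```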

This chapter focuses on density clustering.

Cluster: the largest set of points connected by density.

Advantages:

1. Able to classify regions of high density into clusters.

2. Can find clusters of arbitrary shape.

Basic concepts:

ε-neighborhood of an object: the area within radius ε of the given object.

Core object: an object whose ε-neighborhood contains at least a minimum number X of objects.

Directly density-reachable: given an object set D, if p lies in the ε-neighborhood of q and q is a core object, then the object p is directly density-reachable from the core object q.

Density-reachable (the transitive extension of the above): if there exists a chain of objects p1, p2, …, pn with p1 = q and pn = p, where each pi belongs to D (1 <= i <= n) and pi+1 is directly density-reachable from pi (each such pi being a core object) with respect to ε and X, then the object p is density-reachable from the object q with respect to ε and X.

Density-connected: if there exists an object o in D such that both p and q are density-reachable from o with respect to ε and X, then p and q are density-connected with respect to ε and X.

Noise: a density-based cluster is the maximal set of density-connected objects based on density-reachability; any object not contained in any cluster is noise.

The process of DBSCAN density clustering: examine the ε-neighborhood of every object in the data set to find clusters. If a point p has more than X objects in its neighborhood, a new cluster is created with p as a core object. DBSCAN then repeatedly gathers the objects directly density-reachable from these core objects, which may involve merging some density-reachable clusters. The process ends when no new point can be added to any cluster.

Advantages: discovers clusters of arbitrary shape; robust to noise.

Disadvantages: high complexity (a spatial index is needed to reduce the computation); the minimum number of points must be chosen; clustering quality declines when the density is uneven or distances differ widely.

Time complexity: O(n × the time required to find the points in a neighborhood); with n points the worst case is O(n²).

Space complexity: each core object must be expanded around itself, so memory use grows with the number of core objects; the space is O(n).
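The complexity note above mentions a spatial index. One minimal option is a uniform grid of cell size ε, so a radius-ε query only has to inspect the 3×3 surrounding cells (a sketch; function names are illustrative):

```python
# Bucket points into square cells of side eps. Candidates for a radius-eps
# query around p can only lie in p's cell or the 8 adjacent cells, which
# replaces the O(n) scan per query with work proportional to local density.
from collections import defaultdict
from math import floor

def build_grid(points, eps):
    grid = defaultdict(list)
    for p in points:
        grid[(floor(p[0] / eps), floor(p[1] / eps))].append(p)
    return grid

def grid_neighbours(grid, p, eps):
    cx, cy = floor(p[0] / eps), floor(p[1] / eps)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for q in grid.get((cx + dx, cy + dy), []):
                if ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 <= eps:
                    out.append(q)
    return out

pts = [[0.2, 0.2], [0.9, 0.2], [0.2, 0.9], [5.0, 5.0]]
grid = build_grid(pts, eps=1.0)
print(len(grid_neighbours(grid, [0.2, 0.2], 1.0)))
# → 3   (the three nearby points; [5.0, 5.0] is never even examined)
```

A k-d tree or R-tree achieves the same goal with better worst-case behavior, which is how the O(n²) brute-force scan is usually avoided in practice.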
For concrete implementations, refer to the following two articles.
An introduction to DBSCAN clustering (PPT):
http://wenku.baidu.com/link?url=Dhmmqbwfji54mmapgdkvlygrzotomdf7oti-wn493cikpcdr7scrymygwvg0vpeo6nzc8ykiucxgu2s4uadfivg2zexw_ujuh9k5lld2mb_
Density clustering without cluster merging:
http://blog.csdn.net/cang_sheng_ta_ge/article/details/50137667
The DBSCAN clustering implemented in this article supports merging of clusters via density connection; the code, written for Python 3.4, is attached below.
# -*- coding: utf-8 -*-
import random
from collections import defaultdict
from matplotlib import pyplot as plt


# Fixed sample data set (the coordinates of p3 were garbled in the source; [6, 1] is assumed)
def getData():
    return [[2, 1], [5, 1], [6, 1], [2, 2], [3, 2], [4, 2],
            [5, 2], [6, 2], [1, 3], [2, 3], [5, 3], [2, 4]]


# Randomly generate the specified number of distinct points
def getRandomData(minNum, maxNum, pointCount):
    if pointCount <= 0:
        pointCount = 10          # the original default value was garbled; 10 is assumed
    if minNum > maxNum:
        minNum, maxNum = maxNum, minNum
    allPoints = []
    while len(allPoints) < pointCount:
        # each "point" here is really a two-element list
        tmpPoint = [random.randint(minNum, maxNum), random.randint(minNum, maxNum)]
        if tmpPoint not in allPoints:
            allPoints.append(tmpPoint)
    return allPoints


# Distance between two points; ** denotes exponentiation
def distance(vec1, vec2):
    return ((vec1[0] - vec2[0]) ** 2 + (vec1[1] - vec2[1]) ** 2) ** 0.5


# All points of the data set within minDistance of the given point (the point itself included)
def getNeighbourPoints(points, minDistance, point):
    return [p for p in points if distance(point, p) <= minDistance]


# A point is a core point if its neighborhood contains at least minPointNum points
def isCorePoint(points, minDistance, minPointNum, point):
    neighbourPoints = getNeighbourPoints(points, minDistance, point)
    return len(neighbourPoints) >= minPointNum, neighbourPoints


'''
DBSCAN algorithm:
Input: data set D, parameters MinPts and ε.  Output: set of clusters.
(1)  Mark all objects in D as unvisited.
(2)  do
(3)      Randomly select an unvisited object p from D and mark p as visited.
(4)      If the ε-neighborhood of p contains at least MinPts objects,
(5)          create a new cluster C and add p to C;
(6)          let N be the set of objects in the ε-neighborhood of p;
(7)          for each point pi in N
(8)              if pi is unvisited, mark pi as visited;
(9)                  if the ε-neighborhood of pi contains at least MinPts objects, add them to N;
(10)             if pi does not yet belong to any cluster, add pi to C;
(11)         end for
(12)         output C
(13)     else mark p as noise
(14) until no object is marked unvisited
'''
def myDBSCAN(points, minDistance, minPointNum):
    # A visit label is appended to each point: 0 = unvisited, -2 = noise, >0 = cluster id
    for point in points:
        point.append(0)
    cluster = defaultdict(lambda: [[], []])   # label -> [x-coordinates, y-coordinates]
    label = 0
    for p in points:
        if p[-1] != 0:                        # already visited
            continue
        flag, n = isCorePoint(points, minDistance, minPointNum, p)
        if not flag:
            p[-1] = -2                        # mark p as noise (it may later prove to be a border point)
            continue
        label += 1                            # p is a core point: create cluster c
        p[-1] = label
        cluster[label][0].append(p[0])
        cluster[label][1].append(p[1])
        # Expand the neighborhood; n grows dynamically while it is traversed
        i = 0
        while i < len(n):
            pi = n[i]
            i += 1
            if pi[-1] == 0 or pi[-1] == -2:   # unvisited, or previously marked as noise
                wasUnvisited = (pi[-1] == 0)
                pi[-1] = label
                cluster[label][0].append(pi[0])
                cluster[label][1].append(pi[1])
                if wasUnvisited:
                    flag2, neighbourPoints2 = isCorePoint(points, minDistance, minPointNum, pi)
                    if flag2:                 # pi is itself a core point: extend the frontier
                        n.extend(neighbourPoints2)
    print(cluster)
    return cluster


# Plot the result of the density clustering, one marker style per cluster
def showResult(cluster):
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'Dr', '<r', 'pr']
    count = 0
    for index, points in cluster.items():
        color = mark[count % len(mark)]
        count += 1
        for x, y in zip(points[0], points[1]):
            plt.plot(x, y, color)
    print("Density clustering produced %d clusters" % count)
    plt.show()


if __name__ == "__main__":
    allPoints = getData()                     # or getRandomData(1, 10, 30) for random input
    minDistance = 1
    minPointNum = 4
    cluster = myDBSCAN(allPoints, minDistance, minPointNum)
    showResult(cluster)


A sample run is attached:

For ease of understanding, the original article also attaches the data set and an easy-to-observe clustering result, which match the output of the run above.

