DBSCAN method and application
1. Introduction to DBSCAN density clustering
The DBSCAN algorithm is a density-based clustering algorithm:
1. The number of clusters does not need to be specified in advance
2. The final number of clusters is not known beforehand
The DBSCAN algorithm divides data points into three categories:
1. Core point: a point whose neighborhood of radius eps contains at least MinPts points.
2. Boundary point: a point whose neighborhood of radius eps contains fewer than MinPts points, but which lies within the neighborhood of a core point.
3. Noise point: a point that is neither a core point nor a boundary point.
As shown in the figure: the yellow points are boundary points, because fewer than MinPts points lie within radius eps of them (MinPts is set to 5 here). The white point in the middle is a core point, because its neighborhood contains at least MinPts (5) points; the points in its neighborhood are those yellow points.
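The three categories can be illustrated with a minimal NumPy sketch. The function name classify_points and the toy data are hypothetical, and the sketch assumes that a point counts itself among its neighbors and that "at least min_pts" is the core threshold (the convention scikit-learn follows).

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise'.

    Assumption: a point counts itself as a neighbor, and a core point
    needs at least min_pts neighbors within distance eps.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dist <= eps                      # boolean neighborhood matrix
    is_core = neighbors.sum(axis=1) >= min_pts
    # A border point is not core but has at least one core point within eps.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Hypothetical toy data: a tight group, one point on its fringe, one outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [0.05, 0.05], [0.36, 0.05], [5.0, 5.0]])
kinds = classify_points(X, eps=0.3, min_pts=5)
print(kinds)
```

In this toy set the five tightly packed points come out as core points, the fringe point as a boundary point, and the far point as noise.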
2. Process of the DBSCAN algorithm
1. Mark every point as a core point, boundary point, or noise point;
2. Delete the noise points;
3. Connect with an edge every pair of core points whose distance is within eps;
4. Each group of connected core points forms a cluster;
5. Assign each boundary point to the cluster of a core point associated with it (that is, a core point within whose radius it lies).
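The five steps above can be sketched in a few lines of NumPy. This is an illustrative brute-force version, not the implementation scikit-learn uses; the name simple_dbscan, the toy data, and the choice to attach a boundary point to its nearest core point are assumptions made for the example.

```python
import numpy as np
from collections import deque

def simple_dbscan(X, eps, min_pts):
    """Return one label per point; -1 marks noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dist <= eps
    # Step 1: find the core points (a point counts itself as a neighbor).
    is_core = neighbors.sum(axis=1) >= min_pts
    # Step 2: every point starts as noise (-1); noise is never assigned.
    labels = np.full(n, -1)
    cluster = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Steps 3 and 4: walk the graph whose edges join core points within
        # eps of each other; each connected component becomes one cluster.
        labels[start] = cluster
        queue = deque([start])
        while queue:
            p = queue.popleft()
            for q in np.flatnonzero(neighbors[p] & is_core):
                if labels[q] == -1:
                    labels[q] = cluster
                    queue.append(q)
        cluster += 1
    # Step 5: attach each boundary point to the cluster of a core point
    # within eps (here the nearest one, an arbitrary but fixed choice).
    for p in np.flatnonzero(~is_core):
        core_nb = np.flatnonzero(neighbors[p] & is_core)
        if core_nb.size:
            labels[p] = labels[core_nb[np.argmin(dist[p, core_nb])]]
    return labels

# Hypothetical toy data: two dense groups, one fringe point, one outlier.
group = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X = np.vstack([group, [[0.36, 0.05]], group + 3.0, [[10.0, 10.0]]])
labels = simple_dbscan(X, eps=0.3, min_pts=5)
print(labels)
```

The pairwise distance matrix makes this quadratic in memory, so it is only suitable for small data sets; it is meant to mirror the steps, not to be efficient.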
3. Application examples
Data introduction
The data is an existing campus network log from a university: Internet usage records for 290 students. Each record includes the user ID, device MAC address, IP address, session start time, session end time, session duration, and campus network package. Using this data, we analyze the students' Internet usage patterns.
Experimental purpose
Using DBSCAN clustering, we analyze the patterns in when students go online (start time) and how long they stay online (duration).
Technical Route
Module used: sklearn.cluster.DBSCAN
An example of the data:
By clustering the start times and the durations separately, we want to obtain the distribution of when students go online and of how long they stay online.
1. Set up the project and import the sklearn-related packages
import numpy as np
from sklearn.cluster import DBSCAN
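Note: the main parameters of DBSCAN:
1. eps: the maximum distance between two samples for them to be considered neighbors
2. min_samples: the number of samples required in a point's neighborhood (including the point itself) for it to count as a core point
3. metric: the distance calculation method
Example: sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean')  # 'euclidean' means the distance between sample points is computed as the Euclidean distance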
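As a small illustration of these parameters, the sketch below fits DBSCAN to hypothetical one-dimensional values (not the campus network data). The attributes labels_ and core_sample_indices_ belong to the fitted scikit-learn estimator; everything else is made up for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1-D data, reshaped to (n_samples, 1) as scikit-learn expects.
X = np.array([1.0, 1.1, 1.2, 1.3, 1.4,
              8.0, 8.1, 8.2, 8.3, 8.4,
              20.0]).reshape(-1, 1)

db = DBSCAN(eps=0.5, min_samples=5, metric="euclidean").fit(X)

labels = db.labels_                  # cluster index per sample, -1 for noise
core_idx = db.core_sample_indices_   # indices of the core samples
print(labels)
print(core_idx)
```

Each group of five values lies within eps of one another, so every member is a core sample, while the isolated value 20.0 is labeled -1 (noise).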
3-1. Cluster the online start times: create a DBSCAN instance and train it to obtain the labels:
4. Output the labels and view the results
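A minimal sketch of steps 3-1 and 4, assuming the start time has already been reduced to the hour of the day. The data here is synthetic, and the eps and min_samples values are illustrative guesses rather than values taken from the experiment.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the log: the hour (0-23) at which each session starts.
rng = np.random.default_rng(0)
start_hour = np.concatenate([
    np.full(40, 8), np.full(60, 13), np.full(80, 22),   # three busy hours
    rng.integers(0, 24, size=10),                       # scattered sessions
])
X = start_hour.reshape(-1, 1).astype(float)

# With eps this small, only sessions starting in the same hour are neighbors.
db = DBSCAN(eps=0.01, min_samples=20).fit(X)
labels = db.labels_

# Noise is labeled -1 and is not counted as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_ratio = np.mean(labels == -1)

print("labels:", labels)
print("estimated number of clusters:", n_clusters)
print(f"noise ratio: {noise_ratio:.2%}")
for k in range(n_clusters):
    print(f"cluster {k}: hours {np.unique(X[labels == k])}")
```

Printing the distinct hours in each cluster makes it easy to read off when students typically go online.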
To present the results more clearly, we can draw them as a histogram, which is easier to analyze; we use the hist function from the Matplotlib library to display it:
5. Draw the histogram and analyze the experimental results:
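A hedged sketch of the histogram step, again on synthetic start hours rather than the real log. The output file name is hypothetical, and the Agg backend is selected only so that the example runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # render off-screen; remove for an interactive window
import matplotlib.pyplot as plt

# Synthetic stand-in for the start hours (not the article's data).
start_hour = np.concatenate([np.full(40, 8), np.full(60, 13), np.full(80, 22)])

# One bin per hour of the day; hist returns the bar heights and the bin edges.
counts, edges, _ = plt.hist(start_hour, bins=24, range=(0, 24))
plt.xlabel("hour at which the session starts")
plt.ylabel("number of sessions")
plt.savefig("start_hour_hist.png")   # hypothetical output file name
```

Peaks in the histogram should line up with the clusters that DBSCAN reports for the same data.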
6. Data distribution vs. clustering
A small machine learning tip: data distributed like that on the left of the figure is not well suited to cluster analysis. To cluster such data, we first apply a mathematical transformation to it, usually the logarithm; the transformed data is better suited to cluster analysis.
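The effect of the logarithm can be checked on a few hypothetical session lengths. The sketch uses log(1 + x), an assumption made here so that zero-length sessions stay finite, and compares the skewness before and after the transformation.

```python
import numpy as np

# Hypothetical session lengths in seconds, heavily right-skewed.
duration = np.array([30, 45, 60, 90, 120, 300, 600, 1800, 7200, 36000],
                    dtype=float)

# log(1 + x) compresses the long tail while keeping the order of the values.
log_duration = np.log1p(duration)

def skewness(a):
    """Sample skewness: the mean of the cubed standardized values."""
    z = (a - a.mean()) / a.std()
    return np.mean(z ** 3)

skew_raw = skewness(duration)
skew_log = skewness(log_duration)
print("raw range:", duration.min(), "to", duration.max())
print("log range:", log_duration.min(), "to", log_duration.max())
print("skewness raw vs log:", skew_raw, skew_log)
```

After the transformation the values span a far narrower range and the skewness drops, so a single eps is meaningful across both short and long sessions.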
3-2. Cluster the session durations: create a DBSCAN instance and train it to obtain the labels:
4-2. Output the labels and view the results
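A minimal sketch of steps 3-2 and 4-2, assuming the durations are clustered on a logarithmic scale. The durations are synthetic, and the eps and min_samples values are illustrative rather than taken from the experiment.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic session lengths in seconds (not the article's data): three
# typical lengths plus a few isolated values.
duration = np.concatenate([
    np.linspace(50, 70, 30),          # about a minute
    np.linspace(900, 1100, 30),       # about a quarter of an hour
    np.linspace(14000, 16000, 30),    # about four hours
    [5.0, 300.0, 100000.0],           # isolated sessions
])

# On the log scale, eps acts as a relative gap rather than an absolute one.
X = np.log1p(duration).reshape(-1, 1)
db = DBSCAN(eps=0.14, min_samples=10).fit(X)
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("number of clusters:", n_clusters)
print("noise points:", n_noise)
for k in range(n_clusters):
    members = duration[labels == k]
    print(f"cluster {k}: {members.size} sessions, "
          f"{members.min():.0f} to {members.max():.0f} seconds")
```

Reporting each cluster in the original units (seconds) keeps the result readable even though the clustering itself ran on log values.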
We can also see that the clustering of session durations is not as clear-cut as the clustering of start times.