Learning DBSCAN Clustering with scikit-learn


In the summary of the DBSCAN density clustering algorithm we covered the principle of DBSCAN. This article summarizes how to use scikit-learn to perform DBSCAN clustering, focusing on the meaning of the parameters and on which parameters need to be tuned.

1. The DBSCAN class in scikit-learn

In scikit-learn, the DBSCAN algorithm is implemented by the class sklearn.cluster.DBSCAN. To use this class well for clustering, you need a solid understanding of the principle of DBSCAN itself, as well as some understanding of the nearest-neighbor ideas it relies on. With both in hand, DBSCAN is easy to put to work.
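To get a sense of the interface, here is a minimal sketch of how the class is used, assuming a toy NumPy feature matrix X (the data and values below are arbitrary, for illustration only):

from sklearn.cluster import DBSCAN
import numpy as np

X = np.random.rand(100, 2)            # toy data, for illustration only
db = DBSCAN(eps=0.5, min_samples=5)   # these are the scikit-learn defaults
labels = db.fit_predict(X)            # one cluster label per sample; -1 marks noise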

2. Important parameters of the DBSCAN class

The important parameters of the DBSCAN class fall into two categories: the parameters of the DBSCAN algorithm itself, and the nearest-neighbor metric parameters. Below we summarize both.

1) eps: a DBSCAN algorithm parameter, the distance threshold of the $\epsilon$-neighborhood; sample points farther than $\epsilon$ from a given point are not in its $\epsilon$-neighborhood. The default value is 0.5. It is generally necessary to try several values and select a suitable threshold. If eps is too large, more points fall into the $\epsilon$-neighborhood of each core object, the number of clusters may shrink, and samples that should not belong to the same cluster get merged into one. Conversely, if eps is too small, the number of clusters may grow and samples that originally belong to one cluster get split apart.

2) min_samples: a DBSCAN algorithm parameter, the threshold on the number of samples that must lie in a point's $\epsilon$-neighborhood for that point to be a core object. The default value is 5. It is generally necessary to try several values and select a suitable threshold, and it is usually tuned together with eps. For a fixed eps, if min_samples is too large there will be too few core objects, samples that actually belong to a cluster may be marked as noise, and the number of clusters tends to grow. Conversely, if min_samples is too small, a large number of core objects will be produced, which may result in too few clusters.

3) metric: the nearest-neighbor distance metric parameter. Many distance metrics can be used; in general, DBSCAN with the default Euclidean distance (i.e. the Minkowski distance with p=2) meets our needs. The distance metrics that can be used are listed below, with a short usage sketch after the list:

a) Euclidean distance "euclidean": $ \sqrt{\sum\limits_{i=1}^{n}(x_i-y_i)^2} $

b) Manhattan distance "manhattan": $ \sum\limits_{i=1}^{n}|x_i-y_i| $

c) Chebyshev distance "chebyshev": $ \max\limits_{i=1,\dots,n}|x_i-y_i| $

d) Minkowski distance "minkowski": $ \sqrt[p]{\sum\limits_{i=1}^{n}|x_i-y_i|^p} $, where p=1 gives the Manhattan distance and p=2 gives the Euclidean distance.

e) Weighted Minkowski distance "wminkowski": $ \sqrt[p]{\sum\limits_{i=1}^{n}(w_i|x_i-y_i|)^p} $, where $w_i$ is the weight of the i-th feature.

f) Standardized Euclidean distance "seuclidean": the Euclidean distance computed after each feature dimension has been standardized, so that every feature dimension has mean 0 and variance 1.

g) Mahalanobis distance "mahalanobis": $ \sqrt{(x-y)^T S^{-1}(x-y)} $, where $S^{-1}$ is the inverse of the sample covariance matrix. When the sample features are independently distributed, S is the identity matrix, and the Mahalanobis distance reduces to the Euclidean distance.

There are also some other distance metrics for non-real-valued data; they are generally not used with DBSCAN and are not listed here.
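As a small illustration, here is a hedged sketch of switching from the default Euclidean metric to the Manhattan metric on a toy real-valued feature matrix (the eps and min_samples values are arbitrary):

from sklearn.cluster import DBSCAN
import numpy as np

X = np.random.rand(200, 2)                    # toy real-valued data, for illustration
db_euclid = DBSCAN(eps=0.3, min_samples=5)    # default metric='euclidean'
db_manhattan = DBSCAN(eps=0.3, min_samples=5, metric='manhattan')  # L1 distance instead
labels = db_manhattan.fit_predict(X)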

4) algorithm: the nearest-neighbor search algorithm parameter. There are three algorithms: brute force, the KD tree, and the ball tree. These three methods are described in the summary of the K-nearest neighbors (KNN) principle; if you are not familiar with them, you can go back and review it. For this parameter there are four possible inputs: 'brute' corresponds to the brute-force implementation, 'kd_tree' to the KD tree implementation, 'ball_tree' to the ball tree implementation, and 'auto' makes a trade-off among the three and picks the best-fitting algorithm. Note that if the input sample features are sparse, then whichever algorithm we choose, scikit-learn will in the end use the brute-force implementation 'brute'. In my experience the default 'auto' is usually enough. If the data set is large or has many features, building with 'auto' may take a very long time and be inefficient; in that case the KD tree implementation 'kd_tree' is recommended, and if 'kd_tree' turns out to be slow, or you already know that the sample distribution is not very uniform, you can try 'ball_tree'.

5) leaf_size: a nearest-neighbor search algorithm parameter; when the KD tree or ball tree is used, it is the threshold on the number of samples in a leaf node at which the tree stops splitting. The smaller this value, the larger the resulting KD tree or ball tree, the deeper its layers, and the longer it takes to build; conversely, the larger the value, the smaller the tree, the fewer its layers, and the shorter the build time. The default is 30. Because this value generally only affects the running speed of the algorithm and its memory usage, it can usually be left alone.

6) p: a nearest-neighbor distance metric parameter, used only with the Minkowski and weighted Minkowski distances to choose the value of p; p=1 is the Manhattan distance and p=2 is the Euclidean distance. If you use the default Euclidean distance, you do not need to worry about this parameter. A construction sketch covering these nearest-neighbor parameters is shown below.
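This hedged sketch simply puts the metric, p, algorithm, and leaf_size parameters together in one constructor call; the concrete values are only illustrative, not recommendations:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=5,
            metric='minkowski', p=2,   # p=2 makes the Minkowski distance Euclidean
            algorithm='kd_tree',       # force the KD tree nearest-neighbor search
            leaf_size=30)              # default value; mainly affects speed and memory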

That covers the main parameters of the DBSCAN class. In practice, the parameters that need tuning are eps and min_samples; the combination of these two values has a great impact on the final clustering result.
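Since eps and min_samples are the two parameters that really need tuning, one simple approach is to try a small grid of combinations and inspect each result. A minimal sketch, assuming X is the feature matrix you want to cluster (the toy data and grid values below are arbitrary):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(500, 2)   # placeholder toy data; use your own feature matrix here

for eps in (0.05, 0.1, 0.2, 0.5):
    for min_samples in (5, 10, 20):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # ignore the noise label
        n_noise = int(np.sum(labels == -1))
        print("eps=%s, min_samples=%s: %d clusters, %d noise points"
              % (eps, min_samples, n_clusters, n_noise))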

3. A scikit-learn DBSCAN clustering example

First, we generate a set of random data. To show DBSCAN's advantage on non-convex data, we generate three clusters of data, two of which are non-convex. The code is as follows:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
%matplotlib inline

X1, y1 = datasets.make_circles(n_samples=5000, factor=.6, noise=.05)
X2, y2 = datasets.make_blobs(n_samples=1000, n_features=2, centers=[[1.2, 1.2]],
                             cluster_std=[[.1]], random_state=9)

X = np.concatenate((X1, X2))
plt.scatter(X[:, 0], X[:, 1], marker='o')
plt.show()

The output lets us look directly at the distribution of our sample data:

First, let us look at how K-Means clusters this data. The code is as follows:

from sklearn.cluster import KMeans

y_pred = KMeans(n_clusters=3, random_state=9).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

K-Means performs poorly on non-convex data sets, as can be seen clearly from the clustering produced by the code above. The output figure is as follows:

So what happens if we use DBSCAN? Let us first run it with the default parameters, without any tuning, and look at the clustering result. The code is as follows:

from sklearn.cluster import DBSCAN

y_pred = DBSCAN().fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

The output is very unsatisfying: DBSCAN considers all of the data to be a single cluster! The output is as follows:

What can we do? It seems we need to tune the two key DBSCAN parameters, eps and min_samples! From the output we can see that the number of clusters is too small and needs to increase, so we can shrink the $\epsilon$-neighborhood: the default eps is 0.5, and we reduce it to 0.1 to see the effect. The code is as follows:

y_pred = DBSCAN(eps=0.1).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

The corresponding clustering result is as follows:

We can see that the clustering has improved; at least the cluster off to the side has been discovered. At this point we need to keep increasing the number of clusters, and there are two possible directions: continue to decrease eps, or increase min_samples. We now increase min_samples from the default of 5 to 10. The code is as follows:

y_pred = DBSCAN(eps=0.1, min_samples=10).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

The output is as follows:

We can see that the clustering result is now basically satisfactory.
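Besides eyeballing the plot, the result can also be inspected numerically. A small sketch, refitting with the same parameters as the final run above so the fitted attributes are available:

import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.1, min_samples=10).fit(X)   # same parameters as the final run above
labels = db.labels_                           # cluster label per sample; -1 is noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters)
print("core samples:", len(db.core_sample_indices_))   # indices of the core objects
print("noise points:", int(np.sum(labels == -1)))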

The example above is only meant to help you grasp the basic ideas behind tuning DBSCAN. In real applications you may have to consider many more issues and many more parameter combinations; I hope this example gives you some inspiration.

(Reprinting is welcome; please indicate the source. Comments and questions are welcome at: [email protected])
