Common clustering algorithms

Source: Internet
Author: User

1. K-means algorithm

K-means is a hard clustering algorithm and a typical representative of prototype-based, objective-function clustering methods. It takes a distance from the data points to the prototypes as the objective function to optimize, and it derives the update rules of its iterative procedure by seeking the extremum of that function. K-means uses Euclidean distance as the similarity measure: for a given set of initial cluster centers, it seeks the partition that minimizes the evaluation index J. The algorithm uses the sum of squared errors as its clustering criterion function. Writing C_i for the i-th cluster and \mu_i for its center, the standard form of this criterion is:

J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
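To make the criterion and the iteration described in this section concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function names `squared_distance` and `kmeans`, the `max_iter`, `tol`, and `seed` parameters, and the choice to leave the center of an empty cluster unchanged are all introduced here for the example. Points are assumed to be tuples of numbers.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, tol=1e-12, seed=0):
    """Hypothetical helper: cluster `points` into `k` groups.

    Returns (centroids, clusters, j), where j is the sum of squared
    errors J of the final assignment.
    """
    rng = random.Random(seed)
    # Step 1: randomly pick k objects as the initial cluster centers.
    centroids = rng.sample(points, k)
    prev_j = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its members.
        # Assumption: an empty cluster keeps its previous center.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
        # Evaluate J for the current assignment and updated centers.
        j = sum(squared_distance(p, centroids[i])
                for i, members in enumerate(clusters)
                for p in members)
        # Step 4: stop once J no longer changes between iterations.
        if prev_j is not None and abs(prev_j - j) <= tol:
            break
        prev_j = j
    return centroids, clusters, j


sample = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters, j = kmeans(sample, k=2)
print(centroids, j)
```

Because the initial centers are chosen at random, a different `seed` can give a different result on less well-separated data, which is the sensitivity to initialization that this section describes.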

The choice of the K initial cluster centers has a great influence on the clustering result, because the first step of the algorithm is to randomly select K objects as the initial cluster centers, each initially representing one cluster. In each iteration, the algorithm assigns every remaining object in the data set to the nearest cluster, according to its distance from each cluster center. Once all data objects have been examined, new cluster centers are computed, which completes one iteration. If the value of J does not change from one iteration to the next, the algorithm has converged.

The algorithm process is as follows:

Input: the number of clusters K, and a database containing N data objects.
Output: K clusters that meet the minimum variance criterion.

1) Randomly select K documents from the N documents as centroids.
2) For each remaining document, measure its distance to every centroid and assign it to the class of the nearest centroid.
3) Recalculate the centroid of each resulting class.
4) Repeat steps 2 and 3 until the new centroids equal the previous ones, or the change is smaller than a specified threshold; the algorithm then ends.

2. DBSCAN algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as the largest set of density-connected points. It can divide regions of sufficiently high density into clusters, and it can discover clusters of arbitrary shape in a spatial database that contains noise.
DBSCAN uses several definitions:

ε-neighborhood: the region within radius ε of a given object is called the ε-neighborhood of that object.
Core object: if the number of sample points in the ε-neighborhood of a given object is greater than or equal to MinPts, the object is called a core object.
Directly density-reachable: for a sample set D, if the sample point q is in the ε-neighborhood of p, and p is a core object, then q is directly density-reachable from p.
Density-reachable: for a sample set D, given a chain of sample points p1, p2, ..., pn with p = p1 and q = pn, if each pi is directly density-reachable from pi-1, then q is density-reachable from p.
Density-connected: if there is a point o in the sample set D such that both p and q are density-reachable from o, then p and q are density-connected.

Density-reachability is the transitive closure of direct density-reachability, and this relation is asymmetric. Density-connectedness is a symmetric relation. The purpose of DBSCAN is to find the largest sets of density-connected objects.

DBSCAN algorithm description:

Input: a database containing n objects, the radius ε, and the minimum number of points MinPts.
Output: all generated clusters that meet the density requirement.

(1) REPEAT
(2) Extract an unprocessed point from the database;
(3) IF the extracted point is a core point THEN find all objects that are density-reachable from that point to form a cluster;
(4) ELSE the extracted point is an edge point (non-core object); leave this iteration and look for the next point;
(5) UNTIL all points are processed.

DBSCAN is sensitive to its user-defined parameters: subtle differences can lead to very different results, and there is no general rule for choosing them, so they can only be determined by experience.

3. PCA algorithm

PCA (Principal Component Analysis) is used to find a subspace and then to determine the system's anomalies through the outliers in that subspace (not finished ...; to be continued).
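Returning to the DBSCAN description in section 2, the REPEAT/UNTIL loop and the definitions it relies on can be sketched in plain Python as follows. This is a hedged illustration, not the original author's code: the names `region_query`, `dbscan`, and `NOISE` are introduced here, the neighborhood search is a simple linear scan, and a point counts as part of its own ε-neighborhood. Points that are not density-reachable from any core object are labeled -1.

```python
NOISE = -1  # label for points that end up in no cluster (assumed convention)


def region_query(points, idx, eps):
    """Indices of all points in the eps-neighborhood of points[idx].

    The point itself is included, so a point with min_pts - 1 close
    neighbors is a core object.
    """
    p = points[idx]
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps ** 2]


def dbscan(points, eps, min_pts):
    """Hypothetical helper: return one cluster label per point."""
    labels = [None] * len(points)  # None means "not yet processed"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core object; it may still be claimed later as an
            # edge point of some cluster.
            labels[i] = NOISE
            continue
        # Core object: collect everything density-reachable from it.
        labels[i] = cluster_id
        stack = list(neighbors)
        while stack:
            j = stack.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # edge point joins the cluster
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # j is itself a core object, so keep expanding.
                stack.extend(j_neighbors)
        cluster_id += 1
    return labels


sample = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=3))
```

Changing `eps` or `min_pts` even slightly can merge or split clusters on real data, which is the parameter sensitivity noted in section 2.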
