K-means: A Preliminary Study Summary

Source: Internet
Author: User

I have been using the k-means algorithm for a long time without really understanding how it works. Here are a few good links.

http://coolshell.cn/articles/7779.html

http://blog.csdn.net/zouxy09/article/details/9982495

http://www.360doc.com/content/13/1122/14/10724725_331295214.shtml

A very basic program that uses the MATLAB function:

    % Load the iris data; the first four columns are the features.
    yangben = load('F:\iris.txt');
    s = size(yangben);
    hang = s(1);                 % number of rows (samples)
    lie = s(2);                  % number of columns
    X = yangben(:, 1:4);
    opts = statset('Display', 'final');
    k = 3;
    [idx, ctrs] = kmeans(X, k, 'Distance', 'cityblock', ...
                         'Replicates', 5, 'Options', opts);
    % Plot the first two features per cluster, then the centroids.
    plot(X(idx==1,1), X(idx==1,2), 'r.', ...
         X(idx==2,1), X(idx==2,2), 'b.', ...
         X(idx==3,1), X(idx==3,2), 'g.', ...
         ctrs(:,1), ctrs(:,2), 'kx');

Overall the result is decent, probably because this is a well-established benchmark dataset.

Below is the help text for kmeans, to study in more detail later. For now I will simply use the function as it is. With this out of the way, I can finally take a proper look at sparse coding for feature extraction.

help kmeans
 KMEANS K-means clustering.
    IDX = KMEANS(X, K) partitions the points in the N-by-P data matrix X
    into K clusters. This partition minimizes the sum, over all clusters, of
    the within-cluster sums of point-to-cluster-centroid distances. Rows of X
    correspond to points, columns correspond to variables. Note: when X is a
    vector, KMEANS treats it as an N-by-1 data matrix, regardless of its
    orientation. KMEANS returns an N-by-1 vector IDX containing the cluster
    indices of each point. By default, KMEANS uses squared Euclidean
    distances.
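
To make the "minimizes the sum of point-to-centroid distances" part concrete, here is a minimal sketch of the batch (Lloyd) iteration with squared Euclidean distance. It is my own illustration, not the code inside kmeans: the toy data, the seed choice, and all variable names are assumptions, and it does not handle empty clusters.

```matlab
% Toy data: two obvious groups (assumed for illustration).
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];
C = X([1 4], :);              % hypothetical seeds: rows 1 and 4
k = size(C, 1);
n = size(X, 1);
idx = zeros(n, 1);

for iter = 1:100
    % Assignment step: squared distance from every point to every centroid.
    D = zeros(n, k);
    for j = 1:k
        diffs = X - repmat(C(j, :), n, 1);
        D(:, j) = sum(diffs.^2, 2);
    end
    [~, newIdx] = min(D, [], 2);

    % Stop when no point changes cluster.
    if isequal(newIdx, idx)
        break;
    end
    idx = newIdx;

    % Update step: each centroid becomes the mean of its members.
    for j = 1:k
        C(j, :) = mean(X(idx == j, :), 1);
    end
end

% Objective: total of each point's distance to its own centroid.
total = sum(D(sub2ind(size(D), (1:n)', idx)));
```

On this toy data the loop stops on the second pass, with the first three points in cluster 1 and a total of 8/3.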

    KMEANS treats NaNs as missing data, and ignores any rows of X that
    contain NaNs.
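
Since rows with NaNs are silently left out of the clustering, it can be worth checking in advance how many rows are affected. A small sketch with made-up data (plain MATLAB, no toolbox call):

```matlab
% Made-up data with two incomplete rows.
X = [1 2; NaN 3; 4 5; 6 NaN; 7 8];

hasNaN = any(isnan(X), 2);    % true for rows with at least one NaN
Xclean = X(~hasNaN, :);       % rows that would actually be clustered

fprintf('%d of %d rows contain NaN\n', sum(hasNaN), size(X, 1));
```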

    [IDX, C] = KMEANS(X, K) returns the K cluster centroid locations in
    the K-by-P matrix C.

    [IDX, C, SUMD] = KMEANS(X, K) returns the within-cluster sums of
    point-to-centroid distances in the 1-by-K vector SUMD.

    [IDX, C, SUMD, D] = KMEANS(X, K) returns distances from each point
    to every centroid in the N-by-K matrix D.
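
The four outputs are related: SUMD can be recomputed from D and IDX. The sketch below checks this on toy data. It needs the Statistics Toolbox, the data and the explicit seed matrix are my own assumptions (chosen so the run is repeatable), and I use sumd(:) because the orientation of SUMD may differ between releases.

```matlab
% Toy data and explicit seeds (assumed for illustration).
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];
start = [0 0; 10 10];
[idx, C, sumd, D] = kmeans(X, 2, 'Start', start);

% sumd(j) should equal the sum of D(i, j) over the points i in cluster j.
check = zeros(2, 1);
for j = 1:2
    check(j) = sum(D(idx == j, j));
end
disp([sumd(:) check]);
```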

    [ ... ] = KMEANS(..., 'PARAM1',val1, 'PARAM2',val2, ...) specifies
    optional parameter name/value pairs to control the iterative algorithm
    used by KMEANS. Parameters are:

    'Distance' - Distance measure, in P-dimensional space, that KMEANS
       should minimize with respect to. Choices are:
           'sqEuclidean'  - Squared Euclidean distance (the default)
           'cityblock'    - Sum of absolute differences, a.k.a. L1 distance
           'cosine'       - One minus the cosine of the included angle
                            between points (treated as vectors)
           'correlation'  - One minus the sample correlation between points
                            (treated as sequences of values)
           'Hamming'      - Percentage of bits that differ (only suitable
                            for binary data)
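
To see what these five measures compute, here is a sketch that evaluates each one by hand for a single pair of vectors. The vectors and variable names are assumptions for illustration, and the formulas follow the descriptions above rather than the kmeans source.

```matlab
x = [1 2 3];
c = [2 4 6];

dSq   = sum((x - c).^2);                         % 'sqEuclidean'
dCity = sum(abs(x - c));                         % 'cityblock'
dCos  = 1 - (x * c') / (norm(x) * norm(c));      % 'cosine'

% 'correlation': center each vector first, then one minus the cosine.
xc = x - mean(x);
cc = c - mean(c);
dCorr = 1 - (xc * cc') / (norm(xc) * norm(cc));

% 'Hamming': fraction of positions that differ (binary data only).
a = [1 0 1 1];
b = [1 1 1 0];
dHam = mean(a ~= b);
```

Here c is a multiple of x, so the cosine and correlation distances are both (numerically) zero even though the squared Euclidean distance is 14 and the city-block distance is 6.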

    'Start' - Method used to choose initial cluster centroid positions,
       sometimes known as "seeds". Choices are:
           'sample'  - Select K observations from X at random (the default)
           'uniform' - Select K points uniformly at random from the range
                       of X. Not valid for Hamming distance.
           'cluster' - Perform preliminary clustering phase on random 10%
                       subsample of X. This preliminary phase is itself
                       initialized using 'sample'.
           matrix    - A K-by-P matrix of starting locations. In this case,
                       you can pass in [] for K, and KMEANS infers K from
                       the first dimension of the matrix. You can also
                       supply a 3D array, implying a value for 'Replicates'
                       from the array's third dimension.
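
The 'sample' and 'uniform' seeding methods are easy to imitate by hand, which also gives a seed matrix that can be passed in explicitly. A sketch under my own assumptions (random test data, variable names, and k = 3 are all illustrative):

```matlab
X = [randn(20, 2) + 1; randn(20, 2) - 1];
[n, p] = size(X);
k = 3;

% Like 'sample': k distinct observations picked at random.
perm = randperm(n);
seedsSample = X(perm(1:k), :);

% Like 'uniform': k points drawn uniformly from the bounding box of X.
lo = min(X, [], 1);
hi = max(X, [], 1);
seedsUniform = repmat(lo, k, 1) + rand(k, p) .* repmat(hi - lo, k, 1);

% Either matrix could then be used as an explicit start, with [] for K
% (Statistics Toolbox required):
% [idx, C] = kmeans(X, [], 'Start', seedsSample);
```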

    'Replicates' - Number of times to repeat the clustering, each with a
       new set of initial centroids. A positive integer, default is 1.
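
As far as I understand it, the point of 'Replicates' is to rerun the clustering from different seeds and keep the run with the lowest total sum of distances. The loop below sketches that idea by hand. It is not how kmeans implements it internally, it needs the Statistics Toolbox, and the data and variable names are assumptions.

```matlab
X = [randn(20, 2) + 3; randn(20, 2) - 3];
k = 2;
nRep = 5;                     % plays the role of 'Replicates'
totals = zeros(nRep, 1);
bestTotal = Inf;

for r = 1:nRep
    % Each call draws fresh random seeds.
    [idx, C, sumd] = kmeans(X, k, 'EmptyAction', 'singleton');
    totals(r) = sum(sumd);
    if totals(r) < bestTotal  % keep the lowest-cost solution so far
        bestTotal = totals(r);
        bestIdx = idx;
        bestC = C;
    end
end
```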

    'EmptyAction' - Action to take if a cluster loses all of its member
       observations. Choices are:
           'error'     - Treat an empty cluster as an error (the default)
           'drop'      - Remove any clusters that become empty, and set
                         the corresponding values in C and D to NaN.
           'singleton' - Create a new cluster consisting of the one
                         observation furthest from its centroid.

    'Options' - Options for the iterative algorithm used to minimize the
       fitting criterion, as created by STATSET. Choices of STATSET
       parameters are:

           'Display'  - Level of display output. Choices are 'off' (the
                        default), 'iter', and 'final'.
           'MaxIter'  - Maximum number of iterations allowed. Default is 100.

    'OnlinePhase' - Flag indicating whether KMEANS should perform an "on-line
       update" phase in addition to a "batch update" phase. The on-line phase
       can be time consuming for large data sets, but guarantees a solution
       that is a local minimum of the distance criterion, i.e., a partition of
       the data where moving any single point to a different cluster increases
       the total sum of distances. 'on' (the default) or 'off'.

    Example:

        X = [randn(20,2)+ones(20,2); randn(20,2)-ones(20,2)];
        opts = statset('Display','final');
        [cidx, ctrs] = kmeans(X, 2, 'Distance','city', ...
                              'Replicates',5, 'Options',opts);
        plot(X(cidx==1,1),X(cidx==1,2),'r.', ...
             X(cidx==2,1),X(cidx==2,2),'b.', ctrs(:,1),ctrs(:,2),'kx');
