K-means algorithm
K-means is a partition-based clustering algorithm. It is efficient and widely used for clustering large-scale data.
Basic idea: Divide the data set into K clusters so that the samples within each cluster are very similar and the samples in different clusters differ greatly.
K-means is an iterative algorithm. First, K objects are selected at random, and each one represents the initial center of a cluster. Each remaining object is assigned to the cluster whose center is nearest, and then the center of each cluster is recalculated. The assignment and recalculation steps repeat until the criterion function converges.
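The function whose convergence is monitored is not written out in these notes. A common choice, assumed here, is the within-cluster sum of squared errors, where C_j is the j-th cluster, \mu_j is its center, and x is a sample (these symbols are introduced for illustration only):

```latex
J = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

Under this assumption, neither the assignment step nor the center recalculation can increase J, which is why the iteration settles down.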
Algorithm:
1 Data preprocessing
- Continuous attributes: standardization.
- Discrete attributes: binary encoding. A regulation factor is introduced so that the influence of the discrete attributes does not differ much from that of the continuous attributes.
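The preprocessing step might be sketched as follows. The notes do not say which standardization formula or which regulation factor is used, so this sketch assumes z-score standardization and a plain multiplicative weight. The names standardize, binary_encode, and weight are hypothetical.

```python
from statistics import mean, pstdev


def standardize(column):
    """Z-score a list of numbers (assumed formula); a constant column maps to zeros."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def binary_encode(column, weight=1.0):
    """One indicator per category, scaled by `weight` (a hypothetical regulation factor)."""
    categories = sorted(set(column))
    return [[weight if value == c else 0.0 for c in categories] for value in column]


heights = [150.0, 160.0, 170.0]
colors = ["red", "blue", "red"]

scaled = standardize(heights)
encoded = binary_encode(colors, weight=0.5)

# Combine both kinds of attributes into one numeric row per sample.
rows = [[s] + e for s, e in zip(scaled, encoded)]
```

A weight below 1 shrinks the distance contributed by a category mismatch, which is one way to keep the discrete attributes from dominating the Euclidean distance.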
2 Determine the initial centroids (some implementations select them at random)
① Select the first sample as the first centroid.
② Among the other samples, select the one with the greatest Euclidean distance from the first centroid as the second centroid.
③ Repeat the steps above until K centroids have been determined.
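The selection steps above can be sketched as follows. The notes do not say how "farthest" is measured once more than one centroid exists; this sketch assumes it means the largest distance to the nearest centroid chosen so far. The function name farthest_first_centroids is hypothetical.

```python
import math


def farthest_first_centroids(samples, k):
    """Pick k initial centroids: the first sample, then repeatedly the farthest sample."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    centroids = [samples[0]]
    while len(centroids) < k:
        # Assumption: distance to a set of centroids = distance to the nearest one.
        next_centroid = max(
            samples,
            key=lambda s: min(math.dist(s, c) for c in centroids),
        )
        centroids.append(next_centroid)
    return centroids


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
initial = farthest_first_centroids(points, 3)
```

Spreading the initial centroids apart in this way is deterministic, unlike random selection, but it tends to pick outliers as centroids.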
3 Assign samples
Calculate the distance from each sample point to each of the K centroids, and assign the sample to the cluster of the nearest centroid.
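A minimal sketch of the assignment step, assuming Euclidean distance and samples stored as tuples of numbers. The function name assign_samples is hypothetical, and ties go to the centroid with the lowest index.

```python
import math


def assign_samples(samples, centroids):
    """Return, for each sample, the index of its nearest centroid."""
    labels = []
    for s in samples:
        distances = [math.dist(s, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
centers = [(0.5, 0.0), (9.5, 0.0)]
labels = assign_samples(points, centers)
```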
4 Update the centroids
Recalculate the center of each cluster from the samples now assigned to it.
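The update step might look like the sketch below, which takes the new center of a cluster to be the mean of its members. The notes do not cover clusters that end up empty, so keeping the previous centroid in that case is an assumption. The function name update_centroids is hypothetical.

```python
def update_centroids(samples, labels, old_centroids):
    """Move each centroid to the mean of the samples assigned to it."""
    new_centroids = []
    for j, old in enumerate(old_centroids):
        members = [s for s, label in zip(samples, labels) if label == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(tuple(old))
            continue
        dim = len(members[0])
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centroids


points = [(0.0, 0.0), (2.0, 0.0), (9.0, 1.0), (11.0, 3.0)]
labels = [0, 0, 1, 1]
updated = update_centroids(points, labels, [(0.0, 0.0), (9.0, 1.0), (50.0, 50.0)])
```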
5 Stopping criteria
Maximum number of iterations
Tolerance on the difference between successive iterations
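The two stopping criteria can be combined in one loop, as in the sketch below. The notes do not say which difference the tolerance applies to; this sketch assumes it is the largest distance any centroid moved in one iteration. The names kmeans, max_iter, and tol are hypothetical.

```python
import math


def kmeans(samples, centroids, max_iter=100, tol=1e-6):
    """Alternate assignment and update steps; return (centroids, labels, iterations)."""
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    centroids = [tuple(c) for c in centroids]
    labels = []
    for iteration in range(1, max_iter + 1):
        # Assign each sample to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(s, centroids[j]))
            for s in samples
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:
                dim = len(old)
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(old)
        # Tolerance criterion: stop when no centroid moved by more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels, iteration


points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
final_centroids, final_labels, n_iter = kmeans(points, [(0.0, 0.0), (1.0, 0.0)])
```

The iteration cap guarantees termination even if the tolerance is never met, while the tolerance avoids spending iterations once the centroids have stopped moving.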