K-means + Python: Implementation of K-means Clustering with scikit-learn (+ MiniBatchKMeans)


I used R before; now let's try implementing K-means in Python.
For the earlier R version, see the blog post: Notes on common clustering models and clustering quality assessment (clustering considerations and usage tips).

Clustering is extremely important in customer segmentation. Three of the most common clustering models are K-means clustering, hierarchical (system) clustering, and the expectation-maximization (EM) algorithm. A key problem in building a clustering model is how to evaluate the clustering results with suitable indexes.

I. Introduction of Kmeans in Scikit-learn

Scikit-learn is a Python-based machine learning module that provides implementations of many machine-learning algorithms, including K-means.

Official scikit-learn documentation: http://scikit-learn.org/stable/modules/clustering.html#k-means
Partly from: scikit-learn source-code interpretation of kmeans (a simple algorithm, explained at length)

Performance summary of K-means:

    • Advantages: the principle is simple, it is fast, and it scales well to large data sets.
    • Disadvantages: the number of clusters K must be specified in advance, and the algorithm is sensitive to outliers and to the initial center points.
1. Related theories

Reference: K-means algorithm and text clustering practice

    • (1) Selection of the center point

The K-means algorithm is guaranteed to converge, but not to the global optimum. When the initial center points are chosen badly, it only reaches a local optimum and the overall clustering quality suffers. The following strategies help (see the sketch after this list):

Select points that are as far away from each other as possible as the initial centers (the idea behind k-means++);
First run a hierarchical clustering that outputs K clusters, and use their centers as the initial centers for K-means;
Randomly select the center points several times, train K-means each time, and keep the best clustering result.
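The first and third strategies map directly onto scikit-learn's init and n_init parameters. A minimal sketch, with made-up toy data:

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(200, 2)  # toy data, for illustration only

# strategy 1: 'k-means++' spreads the initial centers far apart
km_pp = KMeans(n_clusters=4, init='k-means++', n_init=1).fit(data)

# strategy 3: several random initializations; the run with the lowest inertia wins
km_rand = KMeans(n_clusters=4, init='random', n_init=10).fit(data)

print(km_pp.inertia_, km_rand.inertia_)  # lower inertia means tighter clusters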

    • (2) Selection of K value

The error function of K-means has a big flaw: as the number of clusters increases, the error function approaches 0. In the most extreme case, every record is its own cluster and the error is 0, but such a clustering result is not what we want. We can introduce a structural risk term to penalize the complexity of the model, minimizing an objective of the form

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2 + \lambda K

where λ is the parameter that balances the training error against the number of clusters. The problem then becomes how to choose λ; one study [reference 1] suggests that when the data set follows a Gaussian distribution, λ = 2m, where m is the dimension of the vectors.

Another method is to try different K values in ascending order, plot the corresponding error values, and pick a good K by finding the inflection point (the elbow); see the text clustering example below for details, and the sketch right after this paragraph.
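A minimal sketch of this elbow approach, assuming matplotlib is available and using random toy data in place of a real corpus:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

data = np.random.rand(300, 3)  # toy data for illustration

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10).fit(data)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(list(ks), inertias, marker='o')
plt.xlabel('number of clusters K')
plt.ylabel('error (inertia)')
plt.show()  # look for the elbow where the curve flattens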

2. Main function Kmeans

Reference blog: Python's Sklearn learning notes
Take a look at the main function Kmeans:

sklearn.cluster.KMeans(
    n_clusters=8,
    init='k-means++',
    n_init=10,
    max_iter=300,
    tol=0.0001,
    precompute_distances='auto',
    verbose=0,
    random_state=None,
    copy_x=True,
    n_jobs=1,
    algorithm='auto'
)

The meaning of the parameters:

    • n_clusters: the number of clusters, i.e. how many categories you want to group the data into.
    • init: the method used to obtain the initial cluster centers.
    • n_init: the number of times the algorithm is run with different initial centroids. To compensate for sensitivity to initialization, the algorithm runs 10 times by default and returns the best result.
    • max_iter: the maximum number of iterations per run (the K-means algorithm is iterative).
    • tol: tolerance, i.e. the convergence criterion for K-means.
    • precompute_distances: whether to compute distances in advance; this parameter trades memory for speed. If True, the entire distance matrix is held in memory; 'auto' falls back to False when n_samples * n_clusters exceeds about 12 million; when False, distances are computed on the fly (the core is implemented in Cython).
    • verbose: verbosity of logging during fitting; usually left at the default of 0.
    • random_state: the random state used to generate the initial cluster centers.
    • copy_x: a flag indicating whether the input data may be modified. If True, the data is copied and the original is not modified. Many scikit-learn interfaces have this bool parameter, which controls whether the input data is copied so that the user's data is left untouched. Knowing Python's memory model makes this clearer.
    • n_jobs: the parallelism setting.
    • algorithm: the K-means implementation to use, one of 'auto', 'full', 'elkan', where 'full' is the classic EM-style implementation.

Although there are many parameters, they all have default values, so we generally do not need to pass them in explicitly; set them according to actual needs, as in the sketch below.
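As a small illustration of setting the most commonly tuned parameters (the data is synthetic and the chosen values are arbitrary):

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)

km = KMeans(
    n_clusters=3,      # K: how many clusters
    init='k-means++',  # smarter initial centers
    n_init=10,         # 10 restarts, best one kept
    max_iter=300,      # cap on iterations per run
    tol=1e-4,          # convergence tolerance
    random_state=42,   # reproducible initialization
)
km.fit(data)
print(km.n_iter_)  # iterations the best run actually needed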

3. Simple case one

Reference blog: Python's Sklearn learning notes
This case illustrates how the attributes of a fitted KMeans estimator are obtained and what they mean.

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)  # generate random data: 100 samples, 3 features

# construct a clusterer with 3 clusters
estimator = KMeans(n_clusters=3)
estimator.fit(data)                      # run the clustering
label_pred = estimator.labels_           # get the cluster labels
centroids = estimator.cluster_centers_   # get the cluster centers
inertia = estimator.inertia_             # get the clustering criterion (inertia)

estimator = KMeans(...) initializes the clusterer; estimator.fit fits it to the data.
estimator.labels_ gives the cluster labels; predict is another way (illustrated below). estimator.cluster_centers_ is the matrix of cluster-center mean vectors.
estimator.inertia_ is the sum of squared distances from each sample to its nearest cluster center.
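To show the predict alternative just mentioned, a small sketch (new_points is made up for illustration):

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)
estimator = KMeans(n_clusters=3).fit(data)

# assign previously unseen points to the nearest learned centers
new_points = np.random.rand(5, 3)
print(estimator.predict(new_points))  # e.g. [2 0 1 1 0]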

4. Case two

Case from: Using Scikit-learn for Kmeans text clustering

from sklearn.cluster import KMeans

num_clusters = 3
km_cluster = KMeans(n_clusters=num_clusters, max_iter=300, n_init=40,
                    init='k-means++', n_jobs=-1)

# returns the cluster index assigned to each text
result = km_cluster.fit_predict(tfidf_matrix)

print("Predicting result: ", result)

km_cluster is the KMeans initialization, using 'k-means++' as the algorithm for choosing the initial centers.
km_cluster.fit_predict is equivalent to merging two calls: km_cluster.fit(data) followed by km_cluster.predict(data). It returns the predicted cluster labels directly, skipping the intermediate step.

    • n_clusters: specifies the value of K.
    • max_iter: the maximum number of iterations for a single initialization.
    • n_init: the number of times the initial centers are re-drawn.
    • init: the algorithm used to choose the initial centers.
    • n_jobs: the number of processes; -1 means use all CPUs. Note that the computation for a single initialization always runs in a single process; parallelism applies only across different initializations. For example, with n_init=10 and n_jobs=40 on a server with 20 CPUs that could open 40 processes, only 10 processes are eventually started.

Note that the following two expressions:

km_cluster.labels_
km_cluster.predict(data)

both output the cluster labels and produce the same result. You need to call km_cluster.fit(data) before using either; a quick check is sketched below.
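A tiny check of that equivalence on synthetic data:

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)
km_cluster = KMeans(n_clusters=3).fit(data)

# labels_ (stored at fit time) matches predict on the same data
print(np.array_equal(km_cluster.labels_, km_cluster.predict(data)))  # True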

5. Case four: follow-up analysis after Kmeans

Some analysis steps after running the Kmeans algorithm; reference source: Implementing Document Clustering with Python.

from sklearn.cluster import KMeans

num_clusters = 5
km = KMeans(n_clusters=num_clusters)

%time km.fit(tfidf_matrix)  # IPython magic: report the running time of the fit

clusters = km.labels_.tolist()

The documents are divided into five categories; %time (an IPython magic) reports the running time, and the label array is converted to a Python list.

    • (1) Model saving and loading

from sklearn.externals import joblib

# the dump call stores your model to disk
joblib.dump(km, 'doc_cluster.pkl')
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()
    • (2) Cluster category statistics
import pandas as pd

# films is assumed to be a dict of lists built from the source corpus,
# e.g. {'rank': ranks, 'title': titles, 'cluster': clusters, 'genre': genres}
frame = pd.DataFrame(films, index=[clusters],
                     columns=['rank', 'title', 'cluster', 'genre'])
frame['cluster'].value_counts()  # number of documents in each cluster
    • (3) Within-group sum of squares from the centroid mean vectors

To find points closer to the centroid: km.cluster_centers_ is a matrix of shape (number of clusters × number of dimensions), i.e. the mean of each cluster along each dimension.
From this indicator we can learn:
which points within a category are closer to its centroid;
the within-group sum of squares for the whole category.

The within-group sum of squares follows this formula:

\mathrm{WSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2

It can be seen from the formula: subtract the centroid mean vector of a cluster (its row-wise means) from each of that cluster's points, then square the differences. Note the squaring. The sums run over all n samples, grouped into the K clusters (e.g. K = 5).
The sum of squares within the entire group can be obtained by:

km.inertia_
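A sketch verifying on toy data that the hand-computed sum from the formula above matches km.inertia_:

import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(100, 3)
km = KMeans(n_clusters=5).fit(data)

# within-group sum of squares, computed by hand from the formula above
wss = sum(
    np.sum((data[km.labels_ == k] - center) ** 2)
    for k, center in enumerate(km.cluster_centers_)
)
print(np.isclose(wss, km.inertia_))  # True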
