Python-kmeans Algorithm Learning Notes

One, What is clustering

Clustering simply means dividing a document collection into classes based on the similarity of the documents; how many classes result depends on the nature of the documents in the collection. As a simple example, a set of different documents might be aggregated into 3 classes. In addition, clustering is a typical form of unsupervised learning: so-called unsupervised learning requires no human intervention and no manually labeled documents.

Two, Clustering algorithm: from sklearn.cluster import KMeans

__init__(self, n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=1e-4, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1)
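As a minimal sketch of how this constructor is used (the toy 2-D data is invented for illustration):

from sklearn.cluster import KMeans
import numpy as np

# Six 2-D points forming two obvious groups (made-up example data)
X = np.array([[1, 1], [1, 2], [2, 1],
              [10, 10], [10, 11], [11, 10]])

model = KMeans(n_clusters=2)   # every other parameter keeps its default
model.fit(X)
print(model.labels_)           # cluster label assigned to each sample
print(model.cluster_centers_)  # coordinates of the two centroids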

(a) Input parameters:

(1) n_clusters: the number of clusters to form, which is also the number of centroids to generate

type: integer (int)

Default value: 8

n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.

(2) init: method of initializing the centroids

type: a string or an array ('k-means++', 'random', or an ndarray)

default: 'k-means++' (an algorithm that generates well-spread initial centroids)

k-means++: a smarter method of seed-point selection.

k-medoids (PAM, Partitioning Around Medoids):

It can solve the problem that k-means is sensitive to noise. When k-means updates a seed point it takes the average of all the samples in the class, so if there are obvious outliers in the class, the seed point deviates badly from where it is expected. For example, given points A(1,1), B(2,2), C(3,3) and D(1000,1000), point D will clearly pull the seed point toward it, so that in the next iteration a large number of sample points that should not belong to the class are incorrectly assigned to it.

To solve this problem, the k-medoids method adopts a different seed-point selection strategy: 1) the seed is chosen only from among the sample points themselves; 2) the selection criterion is whatever improves the clustering effect, such as minimizing the cost function J mentioned above, or some other custom cost function. However, k-medoids increases the complexity of clustering.
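A quick numeric sketch of why the mean is fragile (the median stands in here for a medoid-style choice; the numbers mirror the A, B, C, D example above):

import numpy as np

# One coordinate of the A(1,1), B(2,2), C(3,3), D(1000,1000) example
points = np.array([1.0, 2.0, 3.0, 1000.0])

print(points.mean())      # 251.5 -- the k-means centroid is dragged toward the outlier D
print(np.median(points))  # 2.5   -- a medoid-style statistic stays among the real samples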

init : {'k-means++', 'random', or an ndarray}
Method for initialization, defaults to 'k-means++':
'k-means++': selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See the Notes section in k_init for more details.
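If you already know roughly where the centers should be, init also accepts an ndarray of shape (n_clusters, n_features). A small sketch (the data and centers are made up; with an explicit array, n_init=1 avoids repeating the identical initialization):

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 1],
              [10, 10], [10, 11], [11, 10]])

initial_centers = np.array([[1.0, 1.0], [10.0, 10.0]])  # hand-picked starting centroids

model = KMeans(n_clusters=2, init=initial_centers, n_init=1).fit(X)
print(model.cluster_centers_)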

(3) n_init: the number of different centroid seeds to try; the default is 10. The best centroid result is returned (at the cost of extra computation).

type: integer (int)

Default value: 10

Objective: each time the algorithm runs, the centroid seeds are generated randomly, so the result can be better or worse. The algorithm is therefore run n_init times and the best run is kept.

n_init : int, default: 10
Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.

(4) max_iter: maximum number of iterations within a single run

type: integer (int)

Default value: 300

max_iter : int, default: 300
Maximum number of iterations of the k-means algorithm for a single run.

(5) tol: tolerance of the minimum error; when the error drops below tol the iteration exits (a suitable value depends on the data itself)

type: floating-point (float)

default value: 1e-4 (0.0001)

tol : float, default: 1e-4
Relative tolerance with regards to inertia to declare convergence.

(6) precompute_distances: this parameter trades memory for speed. If True, the entire distance matrix is precomputed and kept in memory; if False, distances are computed on the fly (the core of this code path is implemented in Cython); 'auto' precomputes only while n_samples * n_clusters does not exceed 12 million.

Type: {'auto', True, False}

default value: 'auto'

Precompute distances (faster but takes more memory).
'auto': do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

(7) verbose: whether to output detailed information

Type: boolean (True, False)

Default value: False

verbose : boolean, optional
Verbosity mode.

(8) random_state: seed of the random number generator, used for centroid initialization

type: integer or numpy.random.RandomState, optional

Default value: None

random_state : integer or numpy.RandomState, optional
The generator used to initialize the centers. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator.
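A small sketch of the effect (random data invented for illustration): fixing random_state makes the randomized initialization, and hence the resulting labels, reproducible across runs:

from sklearn.cluster import KMeans
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 2)  # 100 random 2-D points

labels_a = KMeans(n_clusters=3, random_state=42).fit(X).labels_
labels_b = KMeans(n_clusters=3, random_state=42).fit(X).labels_
print((labels_a == labels_b).all())  # True: the two runs are identical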

(9) copy_x: many interfaces in scikit-learn have this parameter, which controls whether a copy of the input data is made so that the user's original data is not modified. Understanding Python's memory model makes this clearer.

Type: boolean, optional

Default value: True

copy_x : boolean, default: True
When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True, the original data is not modified. If False, the original data is modified and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.

(10) n_jobs: the number of processes to use, related to the machine's CPUs

type: integer (int)

Default value: 1

n_jobs : int
The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.
If -1, all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used; thus for n_jobs = -2, all CPUs but one are used.

(b) Output attributes:

(1) labels_: the cluster label of each sample

Example: r1 = pd.Series(model.labels_).value_counts()  # count the number of samples in each category

(2) cluster_centers_: the cluster centers

return value: array, [n_clusters, n_features]

Example: r2 = pd.DataFrame(model.cluster_centers_)  # find the cluster centers

Example of use:

# -*- coding: utf-8 -*-
# Cluster consumption-behavior feature data with the k-means algorithm
import pandas as pd

# Parameter initialization
inputfile = '../data/consumption_data.xls'  # sales and other attribute data
outputfile = '../tmp/data_type.xls'         # file name of the saved result
k = 3                                       # number of clusters
iteration = 500                             # maximum number of iterations

data = pd.read_excel(inputfile, index_col='Id')    # read the data
data_zs = 1.0 * (data - data.mean()) / data.std()  # normalize the data

from sklearn.cluster import KMeans
model = KMeans(n_clusters=k, n_jobs=4, max_iter=iteration)  # k clusters, 4 parallel jobs
model.fit(data_zs)  # start clustering

# Simple print of the results
r1 = pd.Series(model.labels_).value_counts()  # count the number of samples in each category
r2 = pd.DataFrame(model.cluster_centers_)     # find the cluster centers
r = pd.concat([r2, r1], axis=1)  # horizontal concatenation (axis=0 would be vertical): centers plus the count per category
r.columns = list(data.columns) + [u'Number of categories']  # rename the table header
print(r)

# Detailed output of the raw data and its categories
r = pd.concat([data, pd.Series(model.labels_, index=data.index)], axis=1)  # the category of each sample
r.columns = list(data.columns) + [u'Cluster category']  # rename the table header
r.to_excel(outputfile)  # save the results

Three, Evaluation of clustering analysis algorithms

(i) The goal: objects within a group should be similar (related) to one another, while objects in different groups should be different (unrelated). The greater the similarity within a group and the greater the difference between groups, the better the clustering effect.

(ii) Methods of Evaluation:

Since clustering divides a collection of documents into several classes, the question arises of how to evaluate the result, for example whether a clustering algorithm should have divided a given collection into 3 classes rather than 2 or 5.

Suppose x represents one class of documents, o represents another, and boxes represent a third. A perfect clustering would obviously put each kind of symbol into its own class. In practice it is difficult to find a perfect clustering method; every method inevitably has some deviation, so we need to evaluate clustering algorithms to see whether the method we are using is a good one.

(1) Purity evaluation method:

The purity method is a very simple clustering evaluation method that just calculates the proportion of correctly clustered documents out of the total number of documents:

purity(Ω, C) = (1/N) · Σ_k max_j |ω_k ∩ c_j|

where Ω = {ω1, ω2, ..., ωK} is the set of clusters and ωk denotes the k-th cluster; C = {c1, c2, ..., cJ} is the set of document classes, and cj denotes the j-th class; N is the total number of documents.

Purity = (5 + 4 + 3)/17 ≈ 0.71

where the first cluster contains 5 correctly grouped documents, the second 4, the third 3, and the total number of documents is 17.

The advantage of the purity method is that it is easy to calculate, with values between 0 and 1: a completely wrong clustering scores 0 and a completely correct one scores 1. Its disadvantage is equally obvious: it cannot properly penalize a degenerate clustering. If the algorithm puts each document into its own class, every document counts as correctly classified and the purity is 1, which is obviously not the desired result.
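A minimal sketch of the purity computation; the label lists encode the 17-document example above, on the assumption that the three clusters contain 6, 6 and 5 documents as in the RI computation below:

from collections import Counter

def purity(clusters, classes):
    # For each cluster, count its most common true class; sum and divide by N
    total = 0
    for k in set(clusters):
        members = [c for c, w in zip(classes, clusters) if w == k]
        total += Counter(members).most_common(1)[0][1]
    return total / len(classes)

# Cluster assignments and true classes ('x', 'o', 'd' for the three symbols)
clusters = [1]*6 + [2]*6 + [3]*5
classes = ['x']*5 + ['o'] + ['x'] + ['o']*4 + ['d'] + ['x']*2 + ['d']*3
print(purity(clusters, classes))  # (5 + 4 + 3) / 17 ≈ 0.71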

(2) RI evaluation method:

This method evaluates a clustering using the principle of permutations and combinations. The formula is as follows:

RI = (TP + TN) / (TP + FP + FN + TN)

where TP counts pairs of documents of the same class correctly placed in the same cluster, TN counts pairs of different classes correctly placed in different clusters, FP counts pairs of different classes wrongly placed in the same cluster, and FN counts pairs of the same class wrongly separated into different clusters.

TP + FP = C(2,6) + C(2,6) + C(2,5) = 15 + 15 + 10 = 40

where C(n, m) denotes the number of ways of choosing n items out of m.

TP = C(2,5) + C(2,4) + C(2,3) + C(2,2) = 10 + 6 + 3 + 1 = 20

FP = 40 - 20 = 20

Similarly:

TN + FN = C(1,6)·C(1,6) + C(1,6)·C(1,5) + C(1,6)·C(1,5) = 36 + 30 + 30 = 96

FN = C(1,5)·C(1,1) + C(1,5)·C(1,2) + C(1,1)·C(1,2) + C(1,1)·C(1,4) + C(1,1)·C(1,3) = 5 + 10 + 2 + 4 + 3 = 24

TN = 96 - 24 = 72

So RI = (20 + 72)/(20 + 20 + 72 + 24) ≈ 0.68
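The same counts can be checked with a brute-force sketch that classifies every document pair (reusing the hypothetical label lists from the purity example):

from itertools import combinations

def pair_counts(clusters, classes):
    # Classify each pair of documents as TP, FP, FN or TN
    tp = fp = fn = tn = 0
    for (w1, c1), (w2, c2) in combinations(list(zip(clusters, classes)), 2):
        if w1 == w2 and c1 == c2:
            tp += 1
        elif w1 == w2:
            fp += 1
        elif c1 == c2:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

clusters = [1]*6 + [2]*6 + [3]*5
classes = ['x']*5 + ['o'] + ['x'] + ['o']*4 + ['d'] + ['x']*2 + ['d']*3
tp, fp, fn, tn = pair_counts(clusters, classes)
print(tp, fp, fn, tn)                   # 20 20 24 72
print((tp + tn) / (tp + fp + fn + tn))  # RI ≈ 0.68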

(3) F-value evaluation method:

This method is derived from the RI method described above:

F_β = ((β² + 1) · P · R) / (β² · P + R)

Note: P is the precision and R is the recall. When β > 1, recall has more influence; when β < 1, precision has more influence; when β = 1 the measure degenerates to the standard F1 (for details see p. 30 of the Machine Learning textbook).

A feature of the RI method is that it weights precision and recall equally, but in practice we sometimes want to emphasize one of them more, and that is where the F-value method is appropriate.

precision = TP/(TP + FP) = 20/40 = 0.5

recall = TP/(TP + FN) = 20/44 ≈ 0.455

F1 = 2 · precision · recall / (precision + recall) = (2 × 0.5 × 0.455)/(0.5 + 0.455) ≈ 0.48
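The general F_β follows directly from the same pair counts. A small sketch (the beta=5 call is just an arbitrary illustration of weighting recall more heavily):

def f_beta(tp, fp, fn, beta=1.0):
    # F_beta = ((beta^2 + 1) * P * R) / (beta^2 * P + R)
    p = tp / (tp + fp)  # precision
    r = tp / (tp + fn)  # recall
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

print(f_beta(20, 20, 24))          # F1 ≈ 0.48
print(f_beta(20, 20, 24, beta=5))  # beta > 1: recall counts for more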

Four, Visualizing clustering results after dimensionality reduction with t-SNE

We always like to display analysis results visually, and clustering is no exception. In general, however, the input features are high-dimensional (more than 3 dimensions), so it is difficult to display clustering results directly with the original features. t-SNE provides an effective way to reduce the dimensionality of the data, allowing us to display clustering results in 2-D or 3-D space.

Example code (continuing from the k-means example above):

# -*- coding: utf-8 -*-
# Continues from k_means.py above
from sklearn.manifold import TSNE

tsne = TSNE()
tsne.fit_transform(data_zs)  # reduce the dimensionality of the data
tsne = pd.DataFrame(tsne.embedding_, index=data_zs.index)  # convert the data format

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False    # display the minus sign correctly

# Draw each category with a different colour and marker
d = tsne[r[u'Cluster category'] == 0]
plt.plot(d[0], d[1], 'r.')
d = tsne[r[u'Cluster category'] == 1]
plt.plot(d[0], d[1], 'go')
d = tsne[r[u'Cluster category'] == 2]
plt.plot(d[0], d[1], 'b*')
plt.show()

Source: Python Machine Learning in Action (Python data mining)
