Unsupervised learning
Unlike supervised learning, the data in unsupervised learning is not labeled (categorized). Unsupervised learning requires the algorithm to discover the inherent structure of the data and group it accordingly. (For example, if unlabeled data can be seen to fall into three groups, discovering that grouping is an unsupervised learning process.)
Unsupervised learning does not have a training process.
Clustering algorithm
A clustering algorithm groups similar objects into the same cluster, a bit like automatic classification. The more similar the objects within a cluster, the better the clustering result.
The concept may sound lofty, but a quick look at the algorithm shows that, like KNN, the idea is very simple.
The original data set is shown below (each data point has two features, plotted as the horizontal and vertical coordinates), and it carries no label or category information:
From the figure the data set can roughly be judged to fall into three classes (labeled 0, 1, 2). To decide which class each point belongs to, the K-Means clustering algorithm computes three centroids and assigns each point to the class of the centroid it is closest to. The three computed centroids are shown as the red crosses in the figure.
K-Means Clustering algorithm
The flow of the algorithm is as follows:
1. Load the data set.
2. Initialization:
   2.1 Create k random centroids.
   2.2 Create the matrices/arrays that hold the results.
3. Iterate until the cluster assignment of every point stops changing:
   3.1 Compute each point's cluster assignment (nearest centroid).
   3.2 Recalculate the centroids from the 3.1 assignments (each new centroid is the mean of the points currently assigned to it).
4. Return the results.
The disadvantage of the algorithm:
The algorithm tends to converge to a local minimum rather than the global minimum. (A local minimum means the result is acceptable but not optimal; the global minimum is the best possible result.)
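A common way to mitigate this (a minimal sketch, assuming the kMeans function defined in the implementation section below; kMeansWithRestarts is a hypothetical helper, not part of the book's code) is to run K-Means several times with different random initializations and keep the run with the lowest total SSE:

# Hypothetical helper (an assumption, not from the book): restart K-Means several
# times with different random centroids and keep the lowest-SSE result.
def kMeansWithRestarts(dataSet, k, numRestarts=10):
    bestCentroids, bestAssment, bestSSE = None, None, float('inf')
    for _ in range(numRestarts):
        centroids, clusterAssment = kMeans(dataSet, k)   # kMeans is defined below
        sse = clusterAssment[:, 1].sum()                 # column 1 holds squared distances
        if sse < bestSSE:
            bestCentroids, bestAssment, bestSSE = centroids, clusterAssment, sse
    return bestCentroids, bestAssment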
Bisecting K-Means clustering algorithm
SSE (Sum of Squared Error): a metric for measuring clustering quality.
The smaller the SSE, the closer the data points are to their centroids and the better the clustering.
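In the kMeans implementation below, each point's squared distance to its assigned centroid is stored in the second column of clusterAssment, so under that assumption the total SSE of a clustering can be read off directly:

# Total SSE: sum of the squared distances that kMeans() (defined below)
# stores in column 1 of clusterAssment.
sse = clusterAssment[:, 1].sum()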
The flow of the algorithm is as follows:
1. Treat all points as one cluster.
2. While the number of clusters is less than k:
   2.1 For each cluster:
       2.1.1 Compute the total error.
       2.1.2 Run K-Means clustering (k=2) on the cluster.
       2.1.3 Compute the total error after splitting the cluster in two.
   2.2 Split the cluster whose split yields the smallest total error.
Python implementation
Data loading
from numpy import *   # the listings below use NumPy's mat, zeros, nonzero, mean, random, ...

def loadDataSet(fileName):                   # general function to parse tab-delimited floats
    dataMat = []                             # assume last column is target value
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))  # map all elements to float()
        dataMat.append(fltLine)
    return dataMat
The data has the following form. The most important difference from supervised-learning data is that the data carries no labels. Each data point is a two-dimensional vector.
3.275154    2.957587
-3.344465   2.603513
0.355083    -3.376585
1.852435    3.547351
-2.078973   2.552013
-0.993756   -0.884433
2.682252    4.007573
-3.087776   2.878713
-1.565978   -1.256985
2.441611    0.444826
-0.659487   3.111284
-0.459601   -2.618005
2.177680    2.387793
-2.920969   2.917485
-0.028814   -4.168078
3.625746    2.119041
-3.912363   1.325108
-0.551694   -2.814223
2.855808    3.483301
............
Vector Euclidean distance calculation function
def distEclud(vecA, vecB):
    # Euclidean distance between two vectors; la.norm(vecA - vecB) would also work
    return sqrt(sum(power(vecA - vecB, 2)))
Random generation of k centroids
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))   # create centroid mat
    for j in range(n):               # create random cluster centers, within the bounds of each dimension
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids
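A quick sanity check (a sketch; the file name testSet.txt is an assumption carried over from Machine Learning in Action) is to confirm that the random centroids fall within the bounding box of the data:

# Sanity check: random centroids should lie between the min and max of each feature.
datMat = mat(loadDataSet('testSet.txt'))      # 'testSet.txt' is an assumed file name
print(min(datMat[:, 0]), max(datMat[:, 0]))   # bounds of the first feature
print(min(datMat[:, 1]), max(datMat[:, 1]))   # bounds of the second feature
print(randCent(datMat, 2))                    # two random centroids inside those bounds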
K-Means Clustering algorithm
Disadvantage: the number of clusters k must be supplied to the algorithm in advance.
The function returns the centroid coordinates centroids, and clusterAssment, which records for each point its nearest centroid (i.e., the point's cluster) and the squared distance to it.
Note the termination condition of the iteration: clusterChanged marks whether any point's cluster assignment differs from the previous iteration. If the current iteration assigns every point to the same cluster as before, the iteration stops.
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]                # number of data points
    clusterAssment = mat(zeros((m, 2)))  # for each point: index of nearest centroid and squared distance to it
    centroids = createCent(dataSet, k)   # random initial centroids (refined gradually by the iterations)
    clusterChanged = True                # did any assignment change? termination condition of the loop
    while clusterChanged:
        clusterChanged = False
        for i in range(m):               # assign each point to the closest centroid
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        print(centroids)
        for cent in range(k):            # recalculate centroids
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]  # points assigned to this centroid
            centroids[cent, :] = mean(ptsInClust, axis=0)  # new centroid = mean of its points, hence "means" clustering
    return centroids, clusterAssment
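A minimal usage sketch (assuming the functions above live in one module with the from numpy import * line; the file name testSet.txt and k=3, matching the three-cluster example earlier, are assumptions):

# Run K-Means with k=3 on the sample data and inspect the result.
datMat = mat(loadDataSet('testSet.txt'))     # 'testSet.txt' is an assumed file name
myCentroids, clustAssing = kMeans(datMat, 3)
print(myCentroids)                           # the 3 centroid coordinates
print(clustAssing[:5, :])                    # cluster index and squared distance of the first 5 points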
Bisecting K-Means algorithm
The input and output of the algorithm are the same as those of K-Means, but its internal implementation is more complex.
def biKmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m, 2)))
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]                       # create a list with one centroid
    for j in range(m):                           # calc initial error
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :])**2
    while (len(centList) < k):
        lowestSSE = inf
        for i in range(len(centList)):
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]  # data points currently in cluster i
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])  # compare the SSE to the current minimum
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)  # change 1 to 3, 4, or whatever
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]  # replace a centroid with the two new centroids
        centList.append(bestNewCents[1, :].tolist()[0])
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss  # reassign new clusters, and SSE
    return mat(centList), clusterAssment
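Putting the pieces together, here is a sketch of a driver (the file name testSet.txt, k=3, and the use of matplotlib are assumptions, not part of the book's listing) that runs bisecting K-Means and plots each cluster, with the centroids drawn as red crosses as in the figure above:

import matplotlib.pyplot as plt

# Assumed driver: load the unlabeled data, run bisecting K-Means with k=3,
# and plot each cluster with its own marker; centroids are drawn as red crosses.
datMat = mat(loadDataSet('testSet.txt'))        # 'testSet.txt' is an assumed file name
centroids, clusterAssment = biKmeans(datMat, 3)

markers = ['o', 's', '^']
for i in range(3):
    ptsInClust = datMat[nonzero(clusterAssment[:, 0].A == i)[0], :]   # points in cluster i
    plt.scatter(ptsInClust[:, 0].A1, ptsInClust[:, 1].A1,
                marker=markers[i], label='cluster %d' % i)
plt.scatter(centroids[:, 0].A1, centroids[:, 1].A1,
            marker='+', s=200, c='red', label='centroids')
plt.legend()
plt.show()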
Other machine learning algorithms:
Supervised learning--stochastic gradient descent (SGD) and batch gradient descent (BGD)
Supervised learning--decision tree theory and practice (i): classification decision tree
Supervised learning--decision tree theory and practice (ii): regression decision tree (CART)
Supervised learning--k-nearest neighbors algorithm and digit recognition practice
Supervised learning--the theory and practice of naive Bayes classification
Supervised learning--logistic regression binary classification (Python)
Supervised learning--AdaBoost meta-algorithm for improving classification performance
Reference:
“Machine Learning in Action” (Peter Harrington)