Unsupervised learning: focuses on discovering the distributional characteristics of the data itself. Because no labeled data is required, it saves a great deal of manual annotation effort, and the usable data scale is essentially unlimited.
1. Data clustering: discovers communities (groups) within the data, and can also be used to find outlier samples.
2. Feature dimensionality reduction: preserves the data using a small set of discriminative low-dimensional features.
Both are very useful techniques for processing massive data.
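As a quick illustration of the second technique, the sketch below uses scikit-learn's PCA to project data down to two components. The data here is random and purely hypothetical; it only demonstrates the API shape, not a meaningful reduction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical example: 200 samples, each with 64 features
rng = np.random.RandomState(0)
X = rng.rand(200, 64)

# Keep only 2 low-dimensional components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (200, 2)
```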
Data clustering
K-Means algorithm (the number of clusters K is preset; the cluster centers are updated iteratively until the sum of squared distances from all data points to their cluster centers stabilizes)
Process
① First, randomly place K points in the feature space as the initial cluster centers.
② Then, for each data point's feature vector, find the nearest of the K cluster centers and mark the data point as belonging to that cluster center.
③ After all the data points have been labeled, recompute each cluster center based on the points newly allocated to its cluster.
④ If, after a full pass over the data, no point's cluster assignment changed from the previous round, the iteration can stop; otherwise return to ② and continue the loop.
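The four steps above can be sketched directly in NumPy. This is a minimal illustrative implementation, not the scikit-learn one used later; the function name and the synthetic demo data are my own assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means following steps ①-④ (illustrative sketch)."""
    rng = np.random.RandomState(seed)
    # ① choose K random data points as the initial cluster centers
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(n_iters):
        # ② assign each point to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ③ recompute each center as the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # ④ stop once the centers (and hence the assignments) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Demo on two well-separated synthetic groups
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10.0])
centers, labels = kmeans(X, k=2)
print(centers.shape)  # (2, 2)
```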
Example: using the K-Means algorithm on handwritten digit image data
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Use pandas to read the training and test datasets
digits_train = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra', header=None)
digits_test = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes', header=None)

# Separate the 64-dimensional pixel features from the 1-dimensional digit target
X_train = digits_train[np.arange(64)]
y_train = digits_train[64]
X_test = digits_test[np.arange(64)]
y_test = digits_test[64]

# Initialize the KMeans model and set the number of cluster centers to 10
kmeans = KMeans(n_clusters=10)
kmeans.fit(X_train)
y_predict = kmeans.predict(X_test)

# Evaluate the K-Means clustering performance with ARI
from sklearn import metrics
print(metrics.adjusted_rand_score(y_test, y_predict))
Performance evaluation:
① If the data being evaluated carries correct category labels, use the Adjusted Rand Index (ARI). ARI is similar in spirit to computing accuracy in a classification problem, while also accounting for the fact that clusters cannot be matched one-to-one with the classification labels.
② If the data being evaluated has no category labels, we customarily use the silhouette coefficient to measure the quality of the clustering results. The silhouette coefficient takes into account both the cohesion and the separation of the clusters. It ranges over [-1, 1]; the larger the silhouette coefficient, the better the clustering effect.
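A small sketch of the label-free case: using scikit-learn's `metrics.silhouette_score` to compare different cluster counts on synthetic data (the data and the choice of candidate k values here are my own assumptions for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics

# Synthetic unlabeled data: two well-separated groups
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 8.0])

# Compare silhouette coefficients for several cluster counts;
# the true structure (2 groups) should score highest
scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = metrics.silhouette_score(X, labels)  # in [-1, 1], larger is better
    print(k, round(scores[k], 3))
```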
Notes from Python Machine Learning and Practice: coding classical unsupervised learning models — data clustering and feature dimensionality reduction