Clustering: July Algorithm April Machine Learning Workshop, Lecture 10 course notes

2016/5/23 Monday 11:00
Core business of each company: e-commerce companies mainly do recommendation and search, where CTR is the main objective; companies whose business centers on images mainly apply DL.
Unsupervised methods: PCA, SVD, clustering, GMM.
What a Gaussian mixture model is: 1. an unsupervised clustering method, and a soft clustering, i.e., each data point is given a probability of belonging to each class; 2. a way of fitting an arbitrary probability density function; the "viewpoint" is to use (mixtures of) Gaussian distributions to fit arbitrary distributions.
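A minimal sketch of the soft-clustering behaviour described above, using scikit-learn's GaussianMixture; the synthetic data and the choice of 3 components are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# three well-separated blobs, purely illustrative
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(100, 2)) for m in (0, 5, 10)])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
hard_labels = gmm.predict(X)        # hard assignment: one class per point
soft_probs = gmm.predict_proba(X)   # soft assignment: probability of each class
print(soft_probs[0])                # e.g. something like [0.99, 0.005, 0.005]
```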
Applications of clustering: clustering is generally not used as a standalone task, because its result is not deterministic; instead, clustering can produce features, e.g., the cluster a user_id falls into can be used as a feature in related tasks. Typical applications: 1. image segmentation, e.g., selecting regions of the same color in PS and other image-processing tools; 2. news/message collation; 3. clustering of user purchase tracks, of addresses, etc.
Fisher value: an indicator (between-class distance ÷ within-class distance) that characterizes how cohesive and well separated the clusters are; a rough computation is sketched below.
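One possible way to compute such a ratio for two clusters; the data and the exact definition of "within-class distance" (mean distance to the class center) are assumptions made for illustration.

```python
import numpy as np

def fisher_value(a, b):
    # between-class distance: distance between the two class centers
    between = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    # within-class distance: average distance of points to their own center
    within = (np.linalg.norm(a - a.mean(axis=0), axis=1).mean() +
              np.linalg.norm(b - b.mean(axis=0), axis=1).mean()) / 2
    return between / within   # larger value -> tighter, better-separated clusters

rng = np.random.RandomState(0)
a = rng.normal(0, 1, size=(50, 2))
b = rng.normal(5, 1, size=(50, 2))
print(fisher_value(a, b))
```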
Feature mapping in clustering: clusters are not necessarily convex; for such data a feature map can be used, e.g., constructing squared terms as additional features, after which the data can be clustered just as it could be classified (see the sketch below).
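A sketch of the feature-mapping idea on a non-convex data set: two concentric circles cannot be separated by plain K-means on (x, y), but adding the squared term x² + y² as an extra feature makes the two rings separable. The data set and parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans

X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# construct the squared term as an extra feature, as the note suggests
r2 = (X ** 2).sum(axis=1, keepdims=True)
X_mapped = np.hstack([X, r2])

labels_plain = KMeans(n_clusters=2, random_state=0).fit_predict(X)          # mixes the rings
labels_mapped = KMeans(n_clusters=2, random_state=0).fit_predict(X_mapped)  # separates them
```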
Clustering generally does not use cosine distance: most clustering libraries use Euclidean distance rather than cosine distance, because with cosine distance convergence cannot be mathematically guaranteed, whereas under Euclidean distance the optimal center (the mean) can be computed directly, which guarantees convergence.
Methods for fitting a probability density function: 1. the maximum entropy model, which gives a generalized representation of an arbitrary probability density satisfying certain constraints; 2. the Gaussian mixture model, which can fit arbitrary distributions.
K-means improvement: how to handle initialization sensitivity. 1. Use the k-means++ idea: randomly pick one point as the first center, then select the next center as the point farthest from the previous center; the third center is the point whose distance to the first two is largest, i.e., choose centers as spread out as possible (a sketch follows below). 2. Initialize several times and keep the run with the smallest loss. Initialization itself generally takes two forms: 1. randomly initialize the center point (not necessarily a data point); 2. select some points from the data set as initial centers.
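A sketch of the "next center as far as possible from the previous centers" rule described in item 1. Note that standard k-means++ actually samples the next center with probability proportional to the squared distance rather than deterministically taking the farthest point; the deterministic version below follows the note's wording. X and k are illustrative.

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    rng = np.random.RandomState(seed)
    centers = [X[rng.randint(len(X))]]          # first center: a random data point
    for _ in range(k - 1):
        # distance of every point to its nearest already-chosen center
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])         # next center: the farthest point
    return np.array(centers)
```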
K-means improvement: how to select the value of K.
1. Elbow point method: plot the loss function for each K and look for the elbow (sketched after this list). Before the elbow, K is small and the loss is large (in the limit of one point per cluster the loss is 0, so the loss keeps decreasing as K grows); after the elbow the loss decreases only slowly while the number of clusters keeps growing, so those K are not chosen.
2. Progressive culling method (empirical): first cluster with some K; some clusters will contain very little data (e.g., only 2 or 3 points), so discard those m classes and re-cluster with K' = K − m. For example, cluster into 2000 classes, subtract the tiny ones, then cluster again with K − m.
3. Cluster with the help of other features: for commodity clustering, do not feed the raw image pixels directly; first cluster on the product text (segment the text, build one-hot features), find that the text clusters into about 200 classes, and use that as the K. (In interviews this question is often answered with exhaustive search or the elbow point method.)
4. Use K as a penalty factor: since the intent is to keep the number of clusters small (the limit of one data point per class is meaningless, but too small a K, i.e., everything in one class, is also meaningless), add λK to the loss. λ has to be set by experience rather than cross-validation, because K should not sit at either extreme, and the λK term only constrains K from being too large.
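A sketch of the elbow-point method from item 1: compute the K-means loss (inertia) for a range of K and look for the bend in the curve. The data set and the range of K are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 11)
losses = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, losses, marker="o")
plt.xlabel("k")
plt.ylabel("loss (inertia)")
plt.show()   # choose the K at the elbow of the curve
```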
K-means improvement: when to terminate. Everyone says "until convergence", but that is a vague statement; the concrete criteria are: 1. the sum of the distances of all points to their own cluster centers (a single scalar) no longer changes; 2. the K center points no longer change. # The second criterion requires comparing all K center vectors, so it is not as simple as the first.
K-means improvement: K-means is sensitive to anomalous (outlier) points. Use the k-median method, in which each center is a point from the original data; this effectively mitigates the outlier problem.
K-means improvement summary: 1. When to "terminate": the sum of distances from the data to the centers no longer changes (# this has the smallest computational cost). 2. How to "initialize" the initial centers (because of initialization sensitivity): try several runs and pick the smallest loss, or pick each new center as far as possible from the previous ones. 3. How to "determine K": the elbow point method. 4. How to handle "anomalous points": k-median.
The loss function in K-means: the r_nk here forms a matrix, like the latent-variable matrix Z in the EM algorithm; it is part of the EM output and can be used to read off the model's cluster assignments.
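For reference, the standard K-means objective in which this r_nk appears (written here in the common PRML-style notation, which is an assumption about the course's notation):

```latex
J = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\,\lVert x_n - \mu_k \rVert^2,
\qquad r_{nk}\in\{0,1\},\quad \sum_{k} r_{nk}=1,
```

where r_nk = 1 exactly when point x_n is assigned to cluster k.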
A parameter in sklearn: KMeans in sklearn by default runs the clustering 10 times and keeps the result with the smallest loss, because it cannot be sure that any single initialization is good.
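A minimal usage example of the behaviour mentioned above: n_init controls how many times K-means is run with different initializations, and the run with the lowest inertia is kept (10 was the scikit-learn default around the time of these notes). The data set is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, init="k-means++", n_init=10).fit(X)
print(km.inertia_)   # loss of the best of the 10 runs
```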
The difference between K-means and k-means++: K-means is randomly initialized; k-means++ initializes each of the K centers as far as possible from the previously chosen ones.
The difference between K-means and k-median: in K-means, each cluster center is a computed mean and may not be an actual data point; in k-median clustering, each center is a point from the original data, which effectively mitigates the outlier problem.
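A rough sketch of the idea that the center must be an actual data point (often called k-medoids); this is a single simplified assignment-plus-update step under that assumption, not a production implementation, and the names are illustrative.

```python
import numpy as np

def kmedoids_step(X, medoid_idx):
    """One assignment + medoid-update step; medoid_idx holds indices of current medoids."""
    medoids = X[medoid_idx]
    # assign each point to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    new_idx = []
    for k in range(len(medoid_idx)):
        members = np.where(labels == k)[0]
        # new medoid: the member point minimizing total distance to the other members
        within = np.linalg.norm(X[members][:, None, :] - X[members][None, :, :], axis=2)
        new_idx.append(members[within.sum(axis=1).argmin()])
    return np.array(new_idx), labels
```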
Common problems of clustering methods: 1. sensitivity to the initial points (try several runs and pick the smallest loss; k-means++); 2. sensitivity to anomalous points (the improved k-median version); 3. the number of clusters must be specified manually (hierarchical clustering, the elbow point method, or the progressive culling method); 4. powerless on non-convex data sets (a feature mapping may be required).
Hierarchical clustering: there are two approaches, top-down and bottom-up; each step merges (or splits) only two classes, so either at the end or in the initial situation every data point is its own class. Genetic engineering uses hierarchical clustering; because it is time-consuming, other projects generally do not use this method. The 800,000-product example above cannot use hierarchical clustering, precisely because hierarchical clustering starts (or ends) with every data point as its own class.
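A brief sketch of the bottom-up (agglomerative) variant using scikit-learn's AgglomerativeClustering; the data set and number of clusters are illustrative. As the note says, this is expensive, so it is rarely applied to very large data sets.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # bottom-up merging
```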
General clustering methods: K-means and GMM; the rest are hierarchical clustering and spectral clustering.
Visualization requires dimensionality reduction: reduce the data to two dimensions in order to visualize it.
Flat clustering: clustering that is not tree-like, as opposed to hierarchical clustering.
An inspiration: because the number of clusters should be as small as possible (the limit of one data point per class is meaningless), add k² (weighted by λ) to the sum of distances to the respective centers as a regularization term and select the best K, since the unregularized loss keeps decreasing as K grows. But even if an appropriate λ is chosen so that the penalized loss is minimized, are the resulting clustering and K necessarily the best? Criticism: K too large (one data point per class) is meaningless, and K too small (all data in one class) is also meaningless, so K should be moderate; the λ-weighted penalty only constrains K from being too large and does not constrain K from being too small, so this method is not suitable. One could instead construct a convex f(k) with f(k) → +∞ as k → 0 and f(k) → +∞ as k → n, and try that.
Jieba word segmentation: it previously used maximum forward/backward matching; it now also uses an HMM, i.e., tagging each character with labels such as S, B, M, E for segmentation.
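A minimal usage example of jieba with the HMM model enabled for out-of-vocabulary words; the sample sentence is arbitrary.

```python
import jieba

# HMM=True enables the HMM-based segmentation of unknown words
words = jieba.lcut("今天天气不错", HMM=True)
print(words)
```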
Classification of clustering methods: 1. soft clustering vs. hard clustering: K-means is hard clustering, i.e., a point can belong to only one class; GMM is soft clustering, a point may belong to several classes (with probabilities); 2. flat clustering vs. tree-like clustering: K-means clusters into a given number of classes; hierarchical clustering produces a tree, which can then be cut into several classes.
The EM algorithm in GMM.
M step: take derivatives of the objective function, which gives
  w_i = N_i / N                                          # weight of each Gaussian distribution
  μ_i = Σ_{x_j ∈ class i} x_j / N_i                      # mean
  Σ_i = Σ_{x_j ∈ class i} (x_j − μ_i)(x_j − μ_i)^T / N_i  # covariance
E step: given θ = (w, μ, Σ), compute the probability that each data point belongs to each cluster, p(cluster = i | x_j, θ), i.e., the Q function, which is the latent-variable matrix Z in EM.
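A compact sketch of one such iteration, following the hard-assignment form written in the note (each x_j counted only in its own class); a full GMM EM uses soft responsibilities instead. Data shapes and names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step_hard(X, labels, K):
    N = len(X)
    w, mu, sigma = [], [], []
    for i in range(K):                       # M step
        Xi = X[labels == i]
        Ni = len(Xi)
        w.append(Ni / N)                     # w_i = N_i / N
        mu.append(Xi.mean(axis=0))           # mu_i = sum_j x_j / N_i
        diff = Xi - mu[-1]
        sigma.append(diff.T @ diff / Ni)     # Sigma_i = sum_j (x_j - mu_i)(x_j - mu_i)^T / N_i
    # E step: p(cluster = i | x_j, theta) is proportional to w_i * N(x_j; mu_i, Sigma_i)
    dens = np.array([w[i] * multivariate_normal.pdf(X, mu[i], sigma[i])
                     for i in range(K)]).T
    return dens / dens.sum(axis=1, keepdims=True)   # the latent matrix Z
```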
