Clustering and classification are easy to confuse when the two are mentioned together, so this article explains the two concepts separately.
1 Clustering:
The process of dividing a collection of physical or abstract objects into multiple classes consisting of similar objects is called clustering.
The general approach of cluster analysis is to choose a clustering statistic and then use it to group the samples or the variables. Clustering n samples is called Q-type clustering, and the commonly used statistic is a "distance"; clustering m variables is called R-type clustering, and the commonly used statistic is a "similarity coefficient". The table below compares commonly used clustering methods:

| Method name | Parameters | Scalability | Use case | Geometry (metric used) |
|---|---|---|---|---|
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
| Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
| Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points |
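To make the first row of the table concrete, here is a minimal K-means sketch assuming scikit-learn; the toy data and the choice of n_clusters=2 are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two loose groups of points (illustrative values only)
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 8.0], [7.8, 8.7]])

# K-means needs the number of clusters up front (see the "Parameters" column)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index assigned to each sample
print(kmeans.cluster_centers_)  # coordinates of the two cluster centers
```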
2 Classification:
Classification assigns new data to existing, predefined classes according to established criteria.
Common classification algorithms:
Naive Bayes (NB)
Extremely simple; it essentially just does some counting. If the conditional independence assumption holds, NB converges faster than discriminative models (such as logistic regression), so you only need a small amount of training data. Even when the conditional independence assumption does not hold, NB still performs surprisingly well in practice. NB is worth trying if you want to do something like semi-supervised learning, or if you want a model that is simple yet performs well.
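A minimal sketch of training a Gaussian Naive Bayes classifier, assuming scikit-learn; the toy features and labels are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: two features per sample, binary labels (illustrative values only)
X = np.array([[1.0, 2.1], [1.3, 1.9], [7.9, 8.2], [8.1, 7.8]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB()
nb.fit(X, y)

print(nb.predict([[1.1, 2.0]]))        # predicted class for a new sample
print(nb.predict_proba([[1.1, 2.0]]))  # estimated class probabilities
```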
Logistic Regression (LR)
There are many ways to regularize an LR model, and unlike NB with its conditional independence assumption, LR does not need to worry about whether the features are correlated. Unlike decision trees and support vector machines (SVM), LR has a good probabilistic interpretation, and it is easy to update the model with new training data (using online gradient descent). LR is worth using if you want probability information (for example, to adjust classification thresholds conveniently, to obtain the uncertainty of a classification, or to get confidence intervals), or if you expect more data in the future and want to be able to update the model easily.
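A minimal sketch, assuming scikit-learn: LogisticRegression for the probability outputs mentioned above, and SGDClassifier with logistic loss as one way to get online (incremental) updates; the toy data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

X = np.array([[0.5, 1.0], [1.0, 1.2], [3.0, 3.5], [3.2, 3.8]])
y = np.array([0, 0, 1, 1])

# Batch-trained LR with L2 regularization (the default); exposes probabilities
lr = LogisticRegression(C=1.0)
lr.fit(X, y)
print(lr.predict_proba([[2.0, 2.0]]))  # useful for tuning the decision threshold

# Online-updatable variant: logistic loss trained by stochastic gradient descent
sgd_lr = SGDClassifier(loss="log_loss")  # "log" in older scikit-learn versions
sgd_lr.partial_fit(X, y, classes=np.array([0, 1]))          # initial fit
sgd_lr.partial_fit(np.array([[3.1, 3.6]]), np.array([1]))   # update with new data
```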
Decision Trees (DT)
DT is easy to understand and explain (at least to some people; I am not sure I am one of them). DT is non-parametric, so you do not have to worry about outliers or about whether the data are linearly separable (for example, DT easily handles the case where class A samples have very small or very large values of feature x, while class B samples have values of feature x in the middle range). The main disadvantage of DT is that it overfits easily, which is why ensemble methods such as Random Forest (RF) and boosted trees were proposed. Moreover, RF is often the winner on many classification problems (I personally believe it is generally better than SVM), it is fast and scalable, and it does not require tuning a large number of parameters the way SVM does, so RF has recently become a very popular algorithm.
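A minimal sketch, assuming scikit-learn, comparing a single decision tree with a random forest on the same data; the synthetic dataset and parameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

tree = DecisionTreeClassifier(random_state=0)                       # prone to overfitting
forest = RandomForestClassifier(n_estimators=100, random_state=0)   # ensemble of trees

print("DT:", cross_val_score(tree, X, y, cv=5).mean())
print("RF:", cross_val_score(forest, X, y, cv=5).mean())
```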
Support Vector Machines (SVM)
High classification accuracy, good theoretical guarantees against overfitting, and, with an appropriate kernel function, good performance even when the data are not linearly separable. SVM is very popular in text classification, where the dimensionality is usually very high. However, because of its large memory requirements and cumbersome tuning, I think RF has begun to threaten its position.
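A minimal sketch, assuming scikit-learn: an SVM with an RBF kernel handling data that are not linearly separable; the dataset and hyperparameters are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original feature space
X, y = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel lets the SVM separate the classes; C and gamma usually need tuning
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))
```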
Coming back to LR versus DT (I am more inclined to frame it as LR versus RF), a simple summary: both approaches are fast and scalable. In terms of accuracy, RF is better than LR, but LR can be updated online and provides useful probability information. Given that you are at Square and may be working on fraud detection: if you want to adjust the threshold quickly to trade off the false positive rate against the false negative rate, having probability information in the classification results will be helpful. Whatever algorithm you choose, if your classes are imbalanced (which often happens in fraud detection), you need to resample the data or adjust your error metric to make the classes more balanced (one common way to do this is sketched below).
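A minimal sketch, assuming scikit-learn, of one common way to handle the class imbalance just mentioned: weighting the classes inversely to their frequency. The skewed toy data and the lowered threshold are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Heavily imbalanced data: roughly 95% negatives, 5% positives (illustrative)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# class_weight="balanced" reweights errors so the rare class is not ignored
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)

# Probabilities let you move the decision threshold to trade FP rate vs. FN rate
proba = clf.predict_proba(X)[:, 1]
print((proba > 0.3).sum(), "samples flagged at a lowered threshold of 0.3")
```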
3 Examples
Suppose we have a set of people's ages, and we roughly know that among them there is a group of children, a group of young people, and a group of old people.
Clustering automatically discovers these three groups and gathers similar data into the same group. For this example, if you want to cluster into 3 groups, the input is the pile of age data. Note that at this point the ages carry no class labels; all we know is that there are roughly three kinds of people, but not who belongs to which group. The output is the cluster label of each data point, and once clustering is finished we know who belongs together with whom (a minimal sketch follows below).
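A minimal sketch, assuming scikit-learn: clustering unlabeled ages into 3 groups with K-means; the specific ages are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled ages: we only suspect there are three groups (values are illustrative)
ages = np.array([[6], [8], [10], [24], [27], [30], [65], [70], [75]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
groups = kmeans.fit_predict(ages)

print(groups)                   # which group each age was assigned to
print(kmeans.cluster_centers_)  # the "typical" age of each group
```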
Classification is different: I tell you in advance what ages count as children, young people, and old people. Now, given a new age, the output is its class label, i.e. whether it belongs to the children, the young people, or the old people. In general a classifier needs to be trained; that is, you tell the algorithm the characteristics of each class so that it can recognize new data, as in the sketch below.
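A minimal sketch, assuming scikit-learn: training a classifier on ages that already carry labels, then predicting the class of new ages; the data, labels, and choice of a decision tree are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Labeled ages: 0 = child, 1 = young person, 2 = old person (illustrative)
ages = np.array([[6], [8], [10], [24], [27], [30], [65], [70], [75]])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

clf = DecisionTreeClassifier()
clf.fit(ages, labels)            # "training": learn what characterizes each class

print(clf.predict([[12], [28], [68]]))  # class labels for new, unseen ages
```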
What we have just described is a very simple example to aid understanding. Let me now give some practical examples.
For clustering: some search engines have a "view similar web pages" feature. This can be implemented with clustering by clustering the web pages; in the clustering result, the pages that fall into the same cluster are treated as similar to each other.
For classification: handwriting recognition can be viewed as a classification problem. For example, if I write the character "I" ten times and extract the features of those ten samples, I can tell the algorithm what characterizes the character "I". When a new "I" arrives, even though its strokes are not exactly the same as the previous ten, its features are highly similar, so the handwritten character is classified as "I" and thus recognized.
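A minimal sketch of the same idea, assuming scikit-learn's bundled handwritten digits dataset in place of handwritten characters; the choice of an SVM model is illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 8x8 images of handwritten digits 0-9, already labeled
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=0)

clf = SVC(gamma="scale")
clf.fit(X_train, y_train)                   # learn the features of each digit

print(clf.score(X_test, y_test))            # accuracy on unseen handwriting
print(clf.predict(X_test[:5]), y_test[:5])  # predicted vs. true labels
```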
References: [1] Baidu Encyclopedia; [2] http://www.zhihu.com/question/24169940/answer/26952728