A Basic Introduction to Cluster Analysis (I)

Source: Internet
Author: User

Cluster analysis is one of the most commonly used techniques among intelligent algorithms. It mirrors the way people learn. From childhood onward we keep encountering new things, and even when very young we already have some ability to cluster. After you have eaten apples and seen oranges, you may watch a playmate eating something and feel envious. You may not know what it is, but you know it can be eaten. In your mind you can already tell things to eat from things to play with. This early clustering is not very accurate, but as experience accumulates over the years you learn the kinds of fruit, and your grouping of different things becomes more and more accurate.

Note that clustering is different from classification: clustering does not require predefined rules. To take a simple example, you know that people can be divided into men and women; that is a classification. Now suppose there is a person who can become pregnant. You will assign this person to the group of women without hesitation, because by the classification you know she is a woman. Clustering, by contrast, does not know in advance that people can be divided into men and women. It only discovers from the data that some people can become pregnant and some cannot, and it cannot name those groups "men" or "women". That is the simple difference between classification and clustering. Generally speaking, clustering is an unsupervised technique and classification is a supervised one.

In short, clustering divides objects into clusters so that objects in the same cluster are as similar as possible and objects in different clusters are as dissimilar as possible. How are similarity and dissimilarity measured? Usually with a similarity measure, and before similarity can be computed the data has to be prepared in several steps.

First, extract features from the data. For example, when comparing the prices of residential properties, the total area and the total price of a home are not the directly relevant factors. The average price per unit of area, a feature derived from them, should be used instead.

Second, filter out noise points, that is, unreasonable data, as far as possible, because otherwise they will distort the results. In the K-Means clustering algorithm (explained later) the effect is relatively large.

Third, standardize the features. Average housing prices may be quoted in different units and need to be brought to one standard. Property fees are another example: some communities charge a monthly fee per square meter, others a monthly fee per household. Discrete data also needs to be handled and normalized. If the developers of different communities are assigned a grade a, b, c or given a score, the scores need further normalization. Why? Because different people judge by different standards. When you buy things on Taobao, for example, your rating of a product will differ more or less from other people's, so the ratings need further processing.

Finally, select features effectively. Many factors can describe the price of a community, such as traffic, education and the surrounding ecological environment. Some of them have little influence and can be left out. Otherwise too many factors make the data high-dimensional and lead to the "curse of dimensionality".
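The feature extraction and normalization steps above can be sketched in a few lines of Python. This is only an illustration: the listing values are made up, the function names are hypothetical, and min-max scaling is just one common normalization choice, not necessarily the one the author has in mind.

```python
# Hypothetical listings: (total price, total area in square meters).
# The numbers are invented purely for illustration.
listings = {
    "A": (3_000_000.0, 100.0),
    "B": (2_400_000.0, 120.0),
    "C": (4_500_000.0, 90.0),
}


def unit_price(total_price, total_area):
    """Derived feature: average price per square meter."""
    return total_price / total_area


def fee_per_square_meter(fee_per_household, area):
    """Convert a per-household monthly fee to a per-square-meter fee."""
    return fee_per_household / area


def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; constant input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


names = list(listings)
prices = [unit_price(*listings[name]) for name in names]
normalized = dict(zip(names, min_max_normalize(prices)))

for name in names:
    print(name, round(normalized[name], 3))
```

After this step every community is described by a unit price on the same 0 to 1 scale, so no single feature dominates a distance calculation simply because of its units.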

Similarity can be computed by feature projection or by edit distance. Feature projection maps the data into a feature space, and the distance between objects in that space stands for their similarity: the smaller the distance, the more similar the objects. If the building types in a community include high-rise, mid-rise and so on, building type can serve as one coordinate dimension. The traffic index and education index also differ between communities, so each of them can serve as a dimension as well. Once the different communities are mapped into the feature space, the distance between them can be measured in n-dimensional space, for example with the Euclidean distance. Edit distance, as the name implies, starts from one object and counts the cost of the edits needed to convert it into another object. For example, if converting community A into community B requires changing the developer and changing the traffic index while all other factors stay the same, the edit distance is D(A, B) = 2.
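Both measures can be illustrated with a short Python sketch. The attribute names and values below are hypothetical, and the edit distance assumes that every attribute change costs exactly 1, which matches the D(A, B) = 2 example but is only one possible cost model.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points of the same dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def attribute_edit_distance(a, b):
    """Count the attributes that must change to turn a into b (unit cost)."""
    keys = set(a) | set(b)
    return sum(1 for key in keys if a.get(key) != b.get(key))


# Two hypothetical communities that differ in developer and traffic index.
community_a = {
    "developer": "Developer X",
    "building_type": "high-rise",
    "traffic_index": 3,
    "education_index": 4,
}
community_b = {
    "developer": "Developer Y",
    "building_type": "high-rise",
    "traffic_index": 5,
    "education_index": 4,
}

# Edit distance over all attributes.
d_edit = attribute_edit_distance(community_a, community_b)

# Feature projection: use the two numeric indices as coordinates.
point_a = (community_a["traffic_index"], community_a["education_index"])
point_b = (community_b["traffic_index"], community_b["education_index"])
d_euclid = euclidean_distance(point_a, point_b)

print(d_edit, d_euclid)
```

Here the edit distance treats all attributes alike, while the Euclidean distance uses only the numeric dimensions and is sensitive to how large the difference in each one is.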

In practice, however, computing similarity is not so simple. The factors that are considered affect the result, and so does the way they are handled. Factors can be discrete, continuous or binary, and some are also affected by the measurement scale, such as ordinal, interval and ratio scales.

With the basic concepts of clustering in place, the next section will focus on how to calculate the distance that measures the similarity of objects in clustering.


--------------------------------------------------------------------------

Author: James Yan

Date: 2011-9-15

From: http://blog.csdn.net/zhouyan8603

Note: all references should be cited

--------------------------------------------------------------------------
