Clustering Algorithm Learning Notes (I): Basics

Last Update:2018-12-07 Source: Internet

Author: User

1. Definition of Clustering

"Clustering divides similar objects into different groups, or more subsets, by means of static classification, so that the member objects in the same subset all share some similar attributes." -- Wikipedia

"Cluster analysis refers to the analytical process of grouping a set of physical or abstract objects into multiple classes composed of similar objects. It is an important human behavior. Clustering is the process of classifying data into different classes or clusters, so objects in the same cluster are highly similar, while objects in different clusters are highly dissimilar." -- Baidu Encyclopedia

Put plainly, clustering is a process that can be understood literally: gathering identical, similar, close, or related object instances into one class. If a data set contains N instances and, according to some criterion, these N instances are divided into M classes such that the instances within each class are related while instances in different classes are unrelated, then this process is called clustering.

2. Clustering Process

Once we know what clustering is, the next question is how to perform it. The textbook explains this in detail; here is my own understanding:

1) Feature selection: As with other classification tasks, features are the basis of everything that follows. It is important to select features that express as much of the relevant information as possible. How expressive the features are directly affects the clustering result, as later experiments will show.

2) Proximity measure: Once the feature representation of an instance vector is chosen, how do we determine how similar two instance vectors are? This question is critical and decisive for the whole clustering process, because the essence of clustering is to distinguish similar from dissimilar, and the proximity measure is the definition of that similarity.

3) Clustering criterion: Defining similarity is not enough. Combined with the proximity measure, the key is how to judge that objects are similar enough. Intuitively, the clustering criterion is the condition that decides when instances are grouped together and when they are not. When we compute with a clustering algorithm, deciding whether to group is the algorithm's concern, and that decision needs a standard. The clustering criterion is that standard. (Calling it a "standard" sounds rather daunting ^_^)

4) Clustering algorithm: No need to go into detail here. This is the focus of the whole learning process, so I will leave the core material for later. In short, the algorithm uses the proximity measure and the clustering criterion to carry out the clustering.

5) Validation of the results: The author of PR puts this step into the clustering task process as well. I find that a bit redundant, because verifying the correctness of an algorithm belongs at the algorithm level, so steps 4) and 5) could be merged. Whether an algorithm is right or wrong, and how well it validates, is a property of the algorithm itself. (Whoever designs an algorithm should be the one to prove it.)

6) Interpretation of the results: The Chinese edition of PR translates this as "determination of the results", but I think the literal meaning is "interpretation of the results". (Clustering eventually divides the data set into several classes. There must be principles before acting and an explanation afterwards, and this is that explanation. Perhaps it is nicer to have a self-contained loop ^_^)
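
Steps 2) to 4) can be made concrete with a small sketch. The notes do not name a particular measure, criterion, or algorithm at this point, so everything below is an assumption chosen for illustration: Euclidean distance as the proximity measure, a distance threshold as the clustering criterion, and a simple sequential scheme as the algorithm. The names euclidean, sequential_clustering, and threshold are hypothetical.

```python
import math


def euclidean(a, b):
    """Proximity measure (assumed): Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sequential_clustering(points, threshold):
    """Assign each point to the nearest existing cluster if it is close enough.

    Clustering criterion (assumed): a point joins a cluster when its distance
    to that cluster's representative (the first member) is <= threshold;
    otherwise it starts a new cluster.
    """
    clusters = []
    for p in points:
        best, best_d = None, None
        for c in clusters:
            d = euclidean(p, c[0])  # compare against the representative
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            best.append(p)
        else:
            clusters.append([p])
    return clusters


# Illustrative 2-D points, not taken from the article.
points = [(1.0, 1.0), (1.5, 1.0), (8.0, 8.0), (8.5, 7.5), (1.0, 1.5)]
clusters = sequential_clustering(points, threshold=2.0)
for i, members in enumerate(clusters):
    print(i, members)
```

Changing the threshold or swapping the distance function changes the grouping, which is why the proximity measure and the criterion have to be fixed before the algorithm can run.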

The details of the whole clustering task will be covered later. Here I will elaborate on the clustering criterion (although I feel I have already covered it in detail). For example, suppose there is a data set X containing the basic information and math scores of four students.

Name	Grade	Class	Math score
Zhang San	1	2	99
Li Si	2	2	95
Zhang Fei	3	1	59
Zhao Yun	2	1	90

The clustering criterion is a standard for classification. How should the data set in this example be clustered? There are of course many possibilities. For example, if we cluster by whether the grade is 1, then X splits into two classes: {Zhang San}, {Li Si, Zhang Fei, Zhao Yun}. If we cluster by class, there are two classes: {Zhang San, Li Si}, {Zhang Fei, Zhao Yun}. If we cluster by whether the score is a pass (taking 60 points as the pass mark), there are two classes: {Zhang San, Li Si, Zhao Yun}, {Zhang Fei}. In practice, the design of a clustering criterion is often more complicated, depending on how you want to divide the data.

From the geometric point of view of classification, the data set corresponds to the sample space, the number of features of a data instance (4 in this example: [name, grade, class, math score]) corresponds to the dimension of the space, and each instance vector corresponds to a point in that space. The clustering criterion then corresponds to those magical hyperplanes (mathematically, functions; I personally think these functions are equivalent to the clustering criterion) that split the data perfectly.
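
The three groupings of the student table can be reproduced with a few lines of Python. This is only a sketch: the helper cluster_by and the dictionary keys are hypothetical names, and each "criterion" is simply a function that maps a record to a group label.

```python
from collections import defaultdict

# The four students from the table above.
students = [
    {"name": "Zhang San", "grade": 1, "class": 2, "math": 99},
    {"name": "Li Si", "grade": 2, "class": 2, "math": 95},
    {"name": "Zhang Fei", "grade": 3, "class": 1, "math": 59},
    {"name": "Zhao Yun", "grade": 2, "class": 1, "math": 90},
]


def cluster_by(records, criterion):
    """Group record names by the label that the criterion assigns to each record."""
    groups = defaultdict(list)
    for record in records:
        groups[criterion(record)].append(record["name"])
    return dict(groups)


by_grade_one = cluster_by(students, lambda r: r["grade"] == 1)  # grade 1 or not
by_class = cluster_by(students, lambda r: r["class"])           # class number
by_pass = cluster_by(students, lambda r: r["math"] >= 60)       # pass mark of 60

print(by_grade_one)
print(by_class)
print(by_pass)
```

The same data yields three different partitions, one per criterion, which is the point of the example: the criterion, not the data alone, determines the clusters.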

3. Types of Clustering Features

How do we distinguish the features used in clustering, and what types are there? By the domain of their values, clustering features are divided into continuous and discrete features. The domain of a continuous feature is (a region of) the real space R, while a discrete feature takes its values from a discrete subset. In addition, a discrete feature that has only two possible values is also called a binary feature.

According to the relative meaning of their values, features can be divided into four types: nominal, ordinal, interval-scaled, and ratio-scaled.

A nominal feature encodes the possible states of some attribute, such as a person's gender (male, female) or the weather (overcast, clear, rainy).

An ordinal feature is similar to a nominal one in that it is also a set of state codes, but with one extra constraint: the order of the codes is meaningful. For example, the taste of a dish might be rated with values such as {very bad, bad, average, good, delicious}, and these states are ordered. I think of this type as a special subset of nominal features, that is, a nominal feature with an ordering constraint.

An interval-scaled feature is one where the difference between values is meaningful but their ratio is not. The typical example is temperature: location A (20 ℃) is 5 degrees warmer than location B (15 ℃), and that difference is meaningful, but it makes no sense to say that A is 1/3 hotter than B.

A ratio-scaled feature is the opposite: the ratio is meaningful. The typical example is weight: if C weighs 100 kg and D weighs 50 kg, then C is 2 times as heavy as D, and this is meaningful. (Of course, saying that C is 50 kg heavier than D also works, so the properties of the interval scale can be regarded as a proper subset of those of the ratio scale.)
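
One way to see why ratios are meaningless on an interval scale is to change units. The sketch below reuses the temperature and weight figures from the text; the weights are assumed to be in kilograms, and the conversion helpers c_to_f and KG_TO_LB are introduced here purely for illustration.

```python
import math


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


KG_TO_LB = 2.20462  # approximate conversion factor, kilograms to pounds

# Interval scale: temperatures of locations A and B.
a_c, b_c = 20.0, 15.0
diff_c = a_c - b_c                    # 5 degrees Celsius
diff_f = c_to_f(a_c) - c_to_f(b_c)    # the same gap in Fahrenheit degrees
ratio_c = a_c / b_c                   # "A is 1/3 hotter" only holds in Celsius
ratio_f = c_to_f(a_c) / c_to_f(b_c)   # a different ratio after a unit change

# Ratio scale: weights of C and D (assumed to be in kilograms).
c_kg, d_kg = 100.0, 50.0
ratio_kg = c_kg / d_kg
ratio_lb = (c_kg * KG_TO_LB) / (d_kg * KG_TO_LB)

print(diff_c, diff_f)      # differences agree up to the unit factor 9/5
print(ratio_c, ratio_f)    # ratios disagree, so the ratio carries no meaning
print(ratio_kg, ratio_lb)  # ratios agree, so "twice as heavy" is meaningful
```

The temperature ratio depends on where the unit places its zero, while the weight ratio survives any change of unit because zero weight means the same thing in every unit.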

In everyday applications, including programming implementations, the feature types we mostly care about are Nominal and Numeric: a Nominal feature can be represented as a String, and a Numeric feature as a Number. (This is how Attribute is defined in WEKA.)
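
As a rough illustration of the Nominal/Numeric split, the sketch below guesses a feature's kind from the Python type of its values, using the columns of the student table from section 2. It does not use or imitate the WEKA API; feature_kind and is_binary are hypothetical helpers, and type-based inference is a crude assumption (an integer-coded class number, for instance, is arguably nominal rather than numeric).

```python
from numbers import Number


def feature_kind(values):
    """Return 'numeric' if every value is a number, otherwise 'nominal'."""
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "nominal"


def is_binary(values):
    """A discrete feature with exactly two distinct values is a binary feature."""
    return len(set(values)) == 2


# Columns of the student table from section 2.
columns = {
    "name": ["Zhang San", "Li Si", "Zhang Fei", "Zhao Yun"],
    "grade": [1, 2, 3, 2],
    "class": [2, 2, 1, 1],
    "math": [99, 95, 59, 90],
}

kinds = {name: feature_kind(values) for name, values in columns.items()}
binary = {name: is_binary(values) for name, values in columns.items()}

print(kinds)
print(binary)
```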

4. Applications of Cluster Analysis

After so many basic concepts, the most practical topic is application: a little advertising for clustering, so to speak, on where it can be used. As in the stories I mentioned in the introduction, classification is a basic activity by which humans identify things, and it has probably existed for as long as human consciousness. One could even say that classification is one of the essential activities of human understanding. Researchers further divide classification into supervised and unsupervised methods, and clustering is the most commonly used, and the most representative, method of unsupervised classification. Imagine a computer automatically sorting a batch of data or a pile of information into several categories; for assisting human intelligence this is clearly necessary and meaningful. A core application area of clustering is therefore data mining and pattern recognition. Beyond that, wherever a classification task turns up in any scientific field, clustering comes to mind. (The first time I formally encountered clustering was in Teaching Building 23, in an information-related course given by a professor who, I believe, was from the automation department.) The authoritative classification by scholars divides the applications of clustering into four basic directions:

1) Data reduction: removing redundant information from massive data.

2) Hypothesis generation: performing cluster analysis on the data in order to infer certain properties of the data.

3) Hypothesis testing: using cluster analysis to verify how reliable a given hypothesis or decision is.

4) Prediction based on groups: as with all prediction tasks, once the existing data has been clustered, new data can be identified and predicted according to the same rules.

Clustering is used very widely; listing applications subject by subject would take too long, so I will not enumerate them. Once you know its principles and goals, the application areas follow naturally.
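
Direction 4), prediction based on groups, can be sketched as follows. The text does not say which rule is used to place new data, so a nearest-centroid rule is assumed here; the cluster labels, the sample points, and the helpers centroid and predict are all hypothetical.

```python
import math


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def predict(clusters, new_point):
    """Return the label of the cluster whose centroid is closest to new_point."""
    centroids = {label: centroid(pts) for label, pts in clusters.items()}
    return min(centroids, key=lambda label: math.dist(new_point, centroids[label]))


# Two clusters assumed to have been found already (illustrative data).
clusters = {
    "low": [(1.0, 1.0), (1.5, 1.0), (1.0, 1.5)],
    "high": [(8.0, 8.0), (8.5, 7.5)],
}

print(predict(clusters, (7.0, 7.0)))
print(predict(clusters, (2.0, 2.0)))
```

Note that math.dist requires Python 3.8 or later.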

5. Summary

That covers the basic concepts of clustering. Clustering has been studied for decades, and fortunately we can stand on the shoulders of many giants. How to improve on it, innovate, and extend its applications is our goal going forward. "To do a good job, one must first sharpen one's tools." Here, clustering is our tool.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
