Analysis of "original" clustering thought

Last Update:2015-07-23 Source: Internet

Author: User


Clustering algorithms are widely used in data mining, and the underlying idea is simple and direct.

I have implemented several clustering algorithms in my own system, and optimizing them for a specific purpose was not difficult either.

Because the approach is so simple, I did not think about it in depth at first.

But if you want the data to speak for itself, you cannot do without clustering.

So I studied a number of clustering algorithms and drew some conclusions.

--------------------------------------------------------------------------------------------------------

Clustering algorithms are broadly divided into four categories:

1. Hierarchical clustering (top-down, bottom-up, and their improvements)

2. Partitional clustering (Kmeans and its improvements)

3. Density clustering (DBSCAN, SNN, and their improvements)

4. Fuzzy clustering (FCM and its improvements)

Of course, these categories are not cleanly separated; they overlap with each other. Exactly how to divide them is also a matter of opinion. The classification here is based on the core idea of each kind of clustering.

As I see it, the relationship between them is as follows:

Because clustering is unsupervised, it has no awareness of what we are looking for. Still, we want to improve the clustering result by using as much prior knowledge as possible.

That is, we still want to give the clustering algorithm a direction, and the usual way is to tell it how many classes the data set actually contains (counting the noise points as a class).

This is the point I want to make: partitional clustering and fuzzy clustering can be classified as a priori clustering.

But in reality, with a data set that large, who knows how many classes there are?

Right, no one knows. But we want the data to tell us by itself.

Without any prior, what can the data tell us?

Here are two examples:

1) Image

After compressing the colors of a picture to 0/1 black and white, the result is as follows.

Shown only this picture, how many classes would you divide it into?

You might say 1 class, 2 classes, or 3 classes; all are possible.

Right: if you don't know, how would the data know?

2) Text

Text clustering is a common task. Let's assume a simple bag-of-words model.

Each text document is then a vector (which can be imagined as n-dimensional).

In the database, all you actually see is a pile of vector values. How would you divide them then?

Perhaps the document set looks like this:

1. I like the Apple phone.

2. I like Samsung phones.

The document set is 100 reviews of mobile phones, of which 50 are about Apple and 50 are about Samsung.

So if the user wants to know about each of the two phones, that is two classes. If the user wants to know about mobile phones in general, that is one class.

So how should your data be divided?

The data does not know.
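To make the phone review example concrete, here is a minimal sketch using only the Python standard library. It builds bag-of-words vectors for the two review sentences and groups documents whose cosine similarity reaches a threshold. The helper names and the two threshold values (0.4 and 0.9) are my own assumptions for illustration; they are not part of any particular algorithm discussed here.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, strip periods, and count each word."""
    return Counter(text.lower().replace(".", "").split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def count_clusters(vectors, threshold):
    """Single-link grouping: documents whose similarity reaches the
    threshold end up in the same cluster. Returns the cluster count."""
    labels = list(range(len(vectors)))  # every document starts alone
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    return len(set(labels))


# 100 reviews: 50 about Apple, 50 about Samsung (as in the example above).
docs = ["I like the Apple phone."] * 50 + ["I like Samsung phones."] * 50
vectors = [bag_of_words(d) for d in docs]

print(count_clusters(vectors, threshold=0.4))  # loose threshold (assumed value)
print(count_clusters(vectors, threshold=0.9))  # strict threshold (assumed value)
```

The vectors are the same in both calls; only the threshold changes, and with it the number of classes. The threshold is exactly the kind of prior that has to come from the application, not from the data.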

Therefore, for different application scenarios, a prior has to be added by hand to improve the clustering result.

Kmeans and FCM take a fixed prior k, which gives a priori clustering an explicit optimization objective. So even though Kmeans has many problems, its clustering results are still good.

Optimization objectives:
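The formula itself is not available in this copy. Assuming it was the usual Kmeans objective (the within-class sum of squared distances), it can be written as below. The symbols are assumed notation: k is the given number of classes, C_j is the j-th class, and \mu_j is its center point.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\frac{\partial J}{\partial \mu_j} = -2 \sum_{x_i \in C_j} (x_i - \mu_j) = 0
\;\Longrightarrow\;
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

With the class assignments held fixed, J is minimized when each center equals the mean of the points in its class.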

This is why each Kmeans iteration updates the center point to the mean of the points within the class.
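The update step can be sketched in plain Python as follows. This is a minimal illustration, not an optimized implementation: the sample points, the iteration count, and the random seed are assumptions chosen for the demo, and initial centers are simply sampled from the data.

```python
import random


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Kmeans: assign each point to its nearest center, then move
    each center to the mean of the points assigned to it."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k is the prior we supply
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old center.
        centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers, clusters


def objective(centers, clusters):
    """Sum of squared distances from every point to its cluster center."""
    return sum(sq_dist(p, centers[j]) for j, c in enumerate(clusters) for p in c)


# Two visibly separate groups of 2-D points (made-up demo data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
print(objective(centers, clusters))
```

Note that k=2 is passed in by the caller. The algorithm never discovers the number of classes; it only optimizes the objective for the k it was given.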

I have written all this not to claim that a priori clustering is better than spontaneous clustering. Spontaneous clustering in fact also relies on a certain amount of prior knowledge (all kinds of thresholds are pure priors).

The important thing is that different problems need a targeted approach. It is important enough to say twice.

The lack of universality is caused by the limitations of the clustering algorithms themselves.

Since it may look as if I have been favoring a priori clustering, here is an example of an application scenario for spontaneous clustering.

Image compression

Each pixel is an RGB triple such as (255, 255, 255). When that takes up too much space, 8bit*3 can be compressed into 4bit*3 (in the simplest case, with a hash mapping).

Before compression:

After compression:
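As a rough sketch of the 8bit*3 to 4bit*3 mapping, the snippet below keeps only the top 4 bits of each channel. This is one assumed form of the "hash mapping" mentioned above, not necessarily the one used for the pictures. The function names and sample pixels are made up for illustration.

```python
def compress_pixel(rgb):
    """Keep only the top 4 bits of each 8-bit channel (256 -> 16 levels)."""
    r, g, b = rgb
    return (r >> 4, g >> 4, b >> 4)


def restore_pixel(rgb4):
    """Expand each 4-bit channel back to 8 bits for display.
    Multiplying by 17 maps 0..15 onto 0..255 (15 * 17 == 255)."""
    r, g, b = rgb4
    return (r * 17, g * 17, b * 17)


pixels = [(255, 255, 255), (250, 128, 3), (16, 31, 32)]
compressed = [compress_pixel(p) for p in pixels]
restored = [restore_pixel(p) for p in compressed]

print(compressed)
print(restored)
```

Every 8-bit value that shares the same top 4 bits falls into the same bucket, so each bucket acts as one class of similar colors. No number of classes is learned from the picture; the bucket width plays the role of the threshold.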

Finally, the unsupervised nature of clustering determines how universal it can be. Therefore, it is very important to use the various clustering algorithms flexibly according to the scenario.

Later, I will write a comparison of the advantages and disadvantages of the various clustering algorithms.

This article may not be reproduced without the blogger's permission.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
