[Original] An Analysis of the Ideas Behind Clustering

Source: Internet
Author: User

Clustering algorithms are widely used in data mining, and the underlying idea is simple and direct.

I have implemented several clustering algorithms in my own systems, and optimizing them for a specific purpose was not difficult either.

Because the approach is so simple, I did not think about it in depth at first.

But if you want the data to speak for itself, you cannot do without clustering.

So I studied a number of clustering algorithms and drew some conclusions.

--------------------------------------------------------------------------------------------------------

Clustering algorithms fall broadly into four categories:

1. Hierarchical clustering (top-down, bottom-up, and their improved variants)

2. Partitioning clustering (K-means and its improved variants)

3. Density clustering (DBSCAN, SNN, and their improved variants)

4. Fuzzy clustering (FCM and its improved variants)

Of course, these categories are not cleanly separated; they overlap with one another, and exactly how to divide them is a matter of opinion. This classification is based on the core idea of each kind of clustering.

As I see it, the relationship between them is as follows:

Because clustering is unsupervised, it has no awareness of what we are looking for. Even so, we still want to improve the clustering result by using as much prior knowledge as possible.

In other words, we still want to give the clustering algorithm a direction. The usual way to do that is to tell it how many classes the data set actually contains (counting noise points as a class of their own).

That is the point I am getting at: partitioning clustering and fuzzy clustering can both be classified as a priori clustering.

In reality, though, with data sets this large, who knows how many classes there are?

Indeed, no one knows. But we want the data to tell us by itself.

Without any prior, what can the data tell us?

Here I give two examples:

1) Image

Suppose the colors of a picture have been compressed to 0/1 black and white.

If you were shown only this picture, how many classes would you divide it into?

You might say 1 class, 2 classes, or 3 classes; all are possible.

Exactly. If you do not know, how would the data know?
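The ambiguity can be shown on a toy image. The sketch below is only an illustration under stated assumptions: the pixel values, the threshold of 128, and the helper names `binarize` and `count_blobs` are made up for this example. It reduces a grayscale grid to 0/1 and then counts the white regions twice, once treating diagonal pixels as connected and once not.

```python
def binarize(gray, threshold=128):
    """Map each grayscale value (0-255) to 1 if it is >= threshold, else 0."""
    return [[1 if px >= threshold else 0 for px in row] for row in gray]


def count_blobs(bw, diagonal=False):
    """Count groups of connected 1-pixels using a flood fill."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonal:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    rows, cols = len(bw), len(bw[0])
    seen = set()
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if bw[r][c] != 1 or (r, c) in seen:
                continue
            blobs += 1  # found a pixel that belongs to a new region
            stack = [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                for dy, dx in steps:
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and bw[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return blobs


# A made-up 3x4 grayscale image, for illustration only.
image = [
    [12, 200, 220, 15],
    [10, 210, 90, 20],
    [240, 30, 25, 250],
]

bw = binarize(image)
four_connected = count_blobs(bw)                  # 3 regions
eight_connected = count_blobs(bw, diagonal=True)  # 2 regions
print(bw, four_connected, eight_connected)
```

The same 0/1 data gives three regions or two depending on a rule chosen by the person, not by the data.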

2) Text

Text clustering is a common task. Assume a simple bag-of-words model.

Each document is then a vector (think of it as n-dimensional).

What you actually see in the database is just a pile of vector values. How would you divide them?

Perhaps the document set looks like this:

1. I like the Apple phone.

2. I like Samsung phones.

The document set consists of 100 mobile phone reviews, 50 about Apple and 50 about Samsung.
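As a rough sketch of what those vectors look like, the two sample sentences can be turned into bag-of-words count vectors with the standard library alone. The tokenizer here is a deliberately naive assumption (lowercase, split on whitespace, strip basic punctuation), and the names `tokenize` and `vectorize` are illustrative.

```python
from collections import Counter

docs = ["I like the Apple phone.", "I like Samsung phones."]


def tokenize(text):
    """Naive tokenizer: split on whitespace, strip punctuation, lowercase."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


# The vocabulary fixes the meaning of each vector dimension.
vocab = sorted({w for d in docs for w in tokenize(d)})


def vectorize(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]


vectors = [vectorize(d) for d in docs]
print(vocab)
for v in vectors:
    print(v)
```

Nothing in the resulting numbers says whether "apple" and "samsung" should separate the documents or whether the shared "like" should keep them together.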

So if the user wants to know about each of the two phones, there are two classes. If the user wants to know about mobile phones in general, there is one class.

So how should the data be divided?

The data itself does not know.

Therefore, for each application scenario, a prior has to be added by hand to improve the clustering result.

K-means and FCM take a fixed prior k, which gives the clustering a well-defined optimization objective. So even though K-means has many problems, its clustering results are still good.

Optimization objectives:
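For K-means, assuming the objective meant here is the usual within-cluster sum of squared distances, it can be written as below. The symbols are introduced for illustration and are not taken from the original: k clusters C_1, ..., C_k with centers \mu_j and data points x_i. The second line holds the assignments fixed and minimizes over a single center.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

\frac{\partial J}{\partial \mu_j} = -2 \sum_{x_i \in C_j} (x_i - \mu_j) = 0
\quad\Longrightarrow\quad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```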

This is why each K-means iteration updates the center point to the mean of the points in its class.
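A minimal sketch of that loop in plain Python, with no external libraries. The function names (`kmeans`, `assign`, `sse`), the sample points, and the fixed random seed are illustrative assumptions, not the author's implementation. The number of classes `k` is the prior that the caller has to supply.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centers):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and mean update until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct data points as initial centers
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # Update step: each center becomes the mean of its cluster.
        # An empty cluster keeps its previous center.
        new_centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def sse(centers, clusters):
    """Within-cluster sum of squared distances, the quantity K-means reduces."""
    return sum(sq_dist(p, c) for c, cl in zip(centers, clusters) for p in cl)


# Made-up 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centers, clusters = kmeans(points, k=2)
print(sorted(centers))
print(sse(centers, clusters))
```

Calling `kmeans(points, k=1)` on the same data returns a single class, which mirrors the one-kind versus two-kinds question above: the answer comes from `k`, not from the data.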

I have written all this not to claim that a priori clustering is better than spontaneous clustering. Spontaneous clustering in fact also relies on a certain amount of prior knowledge (its various thresholds are pure priors).
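To make the point about thresholds concrete, here is a minimal DBSCAN-style sketch in plain Python. It is a simplified illustration rather than a reference implementation; the parameter names `eps` and `min_pts` and the sample points are assumptions chosen for the example. No class count is supplied, but the two thresholds play the role of the prior.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    noise = -1

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if sq_dist(points[i], points[j]) <= eps * eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = noise  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(j_nbrs)
    return labels


# Made-up points: two tight groups and one far-away outlier.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
          (5.0, 20.0)]

tight = dbscan(points, eps=1.5, min_pts=2)
loose = dbscan(points, eps=10.0, min_pts=2)
print(tight)
print(loose)
```

With `eps=1.5` the sketch finds two clusters plus one noise point; with `eps=10.0` the two groups merge into one. The number of classes was never given, yet the choice of threshold still decides it.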

The important thing is that different problems call for a targeted approach. This is important enough to say twice.

The lack of universality comes from the limitations of the clustering algorithms themselves.

It may look as if I have been favoring a priori clustering, so here is an example of an application scenario for spontaneous clustering.

Image compression

Each pixel is an RGB triple such as (255, 255, 255). When that much storage is not available, the 8 bits × 3 channels can be compressed to 4 bits × 3 (in the simplest case, by a hash mapping).

Before compression:

After compression:
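The simple mapping mentioned above can be sketched as follows. This shows only the fixed bucket mapping, under the assumption that "hash mapping" means keeping the top 4 bits of each channel; the function names are illustrative. A clustering-based approach would instead choose its representative colors from the image data itself.

```python
def compress_channel(value):
    """Keep the top 4 bits of an 8-bit channel value (0-255 becomes 0-15)."""
    return value >> 4


def compress_rgb(rgb):
    """Compress an 8-bit-per-channel RGB triple to 4 bits per channel."""
    return tuple(compress_channel(c) for c in rgb)


def expand_rgb(rgb4):
    """Map 4-bit channels back to 8 bits; 0-15 becomes 0-255 in steps of 17."""
    return tuple(c * 17 for c in rgb4)


white = compress_rgb((255, 255, 255))
sample = compress_rgb((200, 100, 50))
restored = expand_rgb(sample)

colors_before = 256 ** 3  # distinct colors at 8 bits x 3
colors_after = 16 ** 3    # distinct colors at 4 bits x 3

print(white, sample, restored, colors_before, colors_after)
```

Each 4-bit value stands for a bucket of 16 neighbouring 8-bit values, so the mapping is lossy: (200, 100, 50) comes back as (204, 102, 51).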

Finally, the unsupervised nature of clustering determines how universal it can be. Therefore, it is very important to apply the various clustering algorithms flexibly according to the scenario.

A later post will compare the advantages and disadvantages of the various clustering algorithms.

This article may not be reproduced without the blogger's permission.
