What is clustering?
My understanding: clustering takes a large number of unlabeled records and partitions them into clusters according to their features, so that records in the same cluster are as similar as possible while records in different clusters are as dissimilar as possible.
Clustering methods fall into several categories; before comparing them, we first need a way to measure the distance between samples.
First, how do we calculate the distance between samples?
Sample attributes can be of several types: numeric, nominal, Boolean, and so on. When computing the distance between two samples, attributes of different types must be handled separately, and the per-attribute results are then combined into a single distance. The calculation for each attribute type is described below.
For continuous numeric attributes, attributes whose values span very different ranges should first be normalized, transforming the data so that all attributes fall into a common, smaller interval.
Common normalization approaches:
1. Maximum-Minimum Normalization
$$v'_i = \frac{v_i - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
where $v_i$ is the value of attribute A in the i-th record, $\min_A$ and $\max_A$ are the minimum and maximum values of that attribute, and $new\_min_A$ and $new\_max_A$ are the left and right boundaries of the interval we want to map into.
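As a sketch, min-max normalization can be implemented like this (the function name and interface are my own, not from the notes):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map each value into the interval [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]
```

For example, `min_max_normalize([10, 20, 30])` maps the values onto [0, 1], and passing `new_min=-1.0, new_max=1.0` maps them onto [-1, 1] instead.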
2. Z-score Normalization
$$v'_i = \frac{v_i - \bar{A}}{\sigma_A}$$
where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of attribute A, respectively.
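A minimal sketch of z-score normalization (using the population standard deviation; the function name is illustrative):

```python
def z_score_normalize(values):
    """Center each value on the mean and scale by the standard deviation."""
    mean = sum(values) / len(values)
    # Population standard deviation of the attribute.
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
```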
3. Decimal Scaling Normalization
Normalize by moving the decimal point of the values of attribute A:
$$v'_i = \frac{v_i}{10^j}$$
where j is the smallest integer such that $\max(|v'_i|) < 1$. The number of decimal places moved thus depends on the maximum absolute value of A.
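A sketch of decimal scaling (the function name is my own):

```python
def decimal_scale(values):
    """Divide all values by the smallest power of 10 that brings
    the maximum absolute value below 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

For instance, with values -986 and 127 the maximum absolute value is 986, so j = 3 and every value is divided by 1000.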
After normalization, the distance between two samples can be calculated; a common choice is the Euclidean distance:
$$d(x, y) = \sqrt{\sum_{k=1}^{p} (x_k - y_k)^2}$$
If each attribute has a different weight $w_k$, the formula becomes the weighted Euclidean distance:
$$d(x, y) = \sqrt{\sum_{k=1}^{p} w_k (x_k - y_k)^2}$$
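A sketch of the (optionally weighted) Euclidean distance; treating a missing `weights` argument as equal weights is my own convention:

```python
def euclidean(x, y, weights=None):
    """Weighted Euclidean distance; equal weights when none are given."""
    if weights is None:
        weights = [1.0] * len(x)
    return sum(w * (a - b) ** 2
               for w, a, b in zip(weights, x, y)) ** 0.5
```

For example, `euclidean([0.0, 0.0], [3.0, 4.0])` gives the familiar 3-4-5 result, and setting an attribute's weight to 0 removes it from the distance entirely.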
For samples whose attributes are all Boolean, the distance is computed from a 2x2 contingency table. For two samples i and j, let q be the number of attributes that are 1 in both samples, r the number that are 1 in i but 0 in j, s the number that are 0 in i but 1 in j, and t the number that are 0 in both. Their distance is then:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$
The meaning of this formula is the ratio of the number of attributes on which the two samples differ to the total number of attributes. (For asymmetric Boolean attributes, where two 0s carry no information, t is often dropped from the denominator.)
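A sketch of this Boolean distance in Python; the function name and the symmetric/asymmetric switch are my own:

```python
def binary_distance(x, y, asymmetric=False):
    """Mismatch ratio over binary attribute vectors.
    q: both 1, r: x=1/y=0, s: x=0/y=1, t: both 0."""
    q = r = s = t = 0
    for a, b in zip(x, y):
        if a == 1 and b == 1:
            q += 1
        elif a == 1 and b == 0:
            r += 1
        elif a == 0 and b == 1:
            s += 1
        else:
            t += 1
    # Asymmetric attributes ignore the 0/0 matches in the denominator.
    denom = (q + r + s) if asymmetric else (q + r + s + t)
    return (r + s) / denom
```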
For nominal variables, a simple distance formula is:
$$d(i, j) = \frac{p - m}{p}$$
where p is the total number of attributes and m is the number of attributes on which the two samples take the same value.
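A minimal sketch of the nominal distance (illustrative name):

```python
def nominal_distance(x, y):
    """Fraction of nominal attributes on which the two samples disagree."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p
```

For example, two records that agree on 2 of 3 nominal attributes are at distance 1/3.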
If the sample set contains attributes of mixed types, the following (Gower-style) formula can be used:
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
where $d_{ij}^{(f)}$ is the per-attribute distance for attribute f (computed with the appropriate formula above), and the indicator $\delta_{ij}^{(f)}$ in the denominator acts as the weight of the attribute: it is 0 when the attribute cannot be compared (e.g. a value is missing) and 1 otherwise.
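A sketch of a mixed-type distance, under my own assumptions: numeric attributes contribute their absolute difference scaled by the attribute's known range, nominal attributes contribute 0/1, and a missing value (`None`) drops the attribute from both numerator and denominator:

```python
def mixed_distance(x, y, types, ranges):
    """Gower-style distance over mixed attributes.
    types[f] is 'num' or 'nom'; ranges maps numeric index f -> (min, max)."""
    num, den = 0.0, 0.0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:
            continue  # indicator delta = 0: attribute not comparable
        if types[f] == 'num':
            lo, hi = ranges[f]
            d = abs(a - b) / (hi - lo)  # range-scaled numeric difference
        else:
            d = 0.0 if a == b else 1.0  # nominal: simple mismatch
        num += d
        den += 1.0
    return num / den
```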
Partitional Clustering
Main idea: first decide to divide the data set into K clusters, then assign the samples to clusters so that similarity within each cluster is as large as possible and similarity between clusters is as small as possible.
1. K-means Clustering
Algorithm process: randomly choose a center point for each of the K clusters, then compute the distance between each sample in the data set and the K centers, assigning each sample to the cluster whose center is nearest. After one pass, recompute each cluster's center from the samples currently in it, then scan the sample set again and reassign each sample to its nearest (new) center, updating the K clusters. After several iterations, if the recomputed centers are identical to the previous centers, the iteration stops.
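The process above can be sketched as follows; the interface, the random initialization by sampling K data points, and the fallback for an empty cluster are my own choices:

```python
import random

def k_means(points, k, max_iters=100, seed=0):
    """Lloyd-style K-means over points given as equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda l: sum((a - b) ** 2
                                        for a, b in zip(p, centers[l])))
            clusters[idx].append(p)
        # Update step: recompute each center as its cluster's mean
        # (keeping the old center if a cluster went empty).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[l]
            for l, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers unchanged
            break
        centers = new_centers
    return centers, clusters
```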
Principle: the algorithm minimizes the objective function
$$P = \sum_{i=1}^{n} \sum_{l=1}^{K} w(i, l)\, d(x_i, c_l)$$
where l indexes the clusters, $w(i, l) \in \{0, 1\}$ indicates whether sample i belongs to cluster l, and $d(x_i, c_l)$ is the distance between sample i and the center of cluster l.
Our goal is to find the assignment $w(i, l)$ that minimizes the value of P. Obviously, the time complexity of exhaustively enumerating all assignments is far too high to be practical.
Instead, we use a gradient-descent-style iterative method: each step moves in a direction that decreases P, converging to a locally optimal solution.
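To make the objective concrete, here is a sketch that evaluates P for a given assignment (using squared Euclidean distance, a common choice for K-means; the names are illustrative):

```python
def objective(points, centers, assignment):
    """P = sum over samples of d(x_i, c_l) for the assigned cluster l,
    with d taken as squared Euclidean distance."""
    total = 0.0
    for i, p in enumerate(points):
        c = centers[assignment[i]]
        total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total
```

Each K-means iteration (reassigning samples, then recomputing centers) can only keep P the same or reduce it, which is why the iteration terminates at a local minimum.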
[Data Mining Course notes] unsupervised learning-clustering (clustering)