[Data Mining Course Notes] Unsupervised Learning: Clustering

Source: Internet
Author: User

What is clustering?

Personal understanding: clustering takes a large number of unlabeled records and partitions them into clusters according to their features, so that in the final result the similarity between records within the same cluster is as large as possible, while the similarity between different clusters is as small as possible.

Clustering methods fall into several categories; partitional clustering is discussed below.

First, how to calculate the distance between samples?

A sample's attributes may be of several types: numeric, nominal, Boolean, and so on. When computing the distance between two samples, attributes of different types must be handled separately, and the partial results are finally combined into a single distance. The calculation for each attribute type is described below.

For continuous numeric attributes, any attribute whose values span a very different range from the others should first be normalized, transforming the data so that it falls into a smaller common interval.

Common normalization approaches:

1. Min-max normalization

    v'_i = (v_i − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

where v_i is the value of attribute A in the i-th record, min_A and max_A are the minimum and maximum values of the attribute, and new_min_A and new_max_A are the left and right boundaries of the interval we want to map to.
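A minimal sketch of min-max normalization in Python (the function name and NumPy usage are my own, not from the notes):

```python
import numpy as np

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map the values of one attribute linearly onto [new_min, new_max]."""
    v = np.asarray(values, dtype=float)
    v_min, v_max = v.min(), v.max()
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

# hypothetical attribute with a large value range, e.g. incomes
incomes = [12000, 73600, 98000, 54000]
print(min_max_normalize(incomes))  # all values now lie in [0, 1]
```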

2. Z-score normalization

    v'_i = (v_i − μ_A) / σ_A

where the two parameters μ_A and σ_A are the mean and standard deviation of attribute A, respectively.
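A minimal sketch of z-score normalization (function name is my own):

```python
import numpy as np

def z_score_normalize(values):
    """Standardize one attribute to zero mean and unit standard deviation."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()

scores = [54.0, 60.0, 66.0]          # hypothetical attribute values
print(z_score_normalize(scores))     # roughly [-1.22, 0.0, 1.22]
```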

3. Decimal scaling normalization

    v'_i = v_i / 10^j

Normalization is done by moving the decimal point of attribute A's values; j is the smallest integer such that max(|v'_i|) < 1, so the number of decimal places to move depends on the maximum absolute value of A.
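A minimal sketch of decimal scaling (function name is my own):

```python
import math

def decimal_scaling_normalize(values):
    """Divide by 10^j, where j is the smallest integer making all |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = math.floor(math.log10(max_abs)) + 1 if max_abs > 0 else 0
    return [v / 10 ** j for v in values]

print(decimal_scaling_normalize([917, -545, 32]))  # [0.917, -0.545, 0.032]
```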

After normalization, the distance between two samples i and j can be calculated; a common choice is the Euclidean distance:

    d(i, j) = sqrt( Σ_f (x_if − x_jf)² )

If each attribute f has a different weight w_f, the formula is modified to the weighted form:

    d(i, j) = sqrt( Σ_f w_f · (x_if − x_jf)² )
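A minimal sketch of the weighted Euclidean distance, assuming the numeric attributes have already been normalized (function name is my own):

```python
import numpy as np

def weighted_euclidean(x, y, weights=None):
    """Euclidean distance between two numeric samples; weights=None
    gives every attribute equal weight (the plain Euclidean distance)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

print(weighted_euclidean([0, 0], [3, 4]))            # 5.0
print(weighted_euclidean([0, 0], [3, 4], [1, 0.5]))  # sqrt(9 + 8) ≈ 4.123
```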

For samples whose attributes are all Boolean (binary), a 2×2 contingency table is first built:

                 sample j = 1    sample j = 0
    sample i = 1      q               r
    sample i = 0      s               t

Here q is the number of attributes equal to 1 in both samples, t the number equal to 0 in both, and r and s the numbers of attributes on which the two samples differ (1/0 and 0/1, respectively). The distance is then computed as:

    d(i, j) = (r + s) / (q + r + s + t)

The meaning of this formula is the ratio of the number of attributes on which the two samples differ to the total number of attributes.
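A minimal sketch of this symmetric binary dissimilarity (function name is my own):

```python
def binary_distance(x, y):
    """(r + s) / (q + r + s + t) for two equal-length 0/1 attribute vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x, 0 in y
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 0 in x, 1 in y
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both 0
    return (r + s) / (q + r + s + t)

print(binary_distance([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # 2/5 = 0.4
```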

For nominal variables, a simple distance formula is:

    d(i, j) = (p − m) / p

where p is the total number of attributes and m is the number of attributes on which the two samples take the same value.
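A minimal sketch of this nominal mismatch ratio (function name and the example attribute values are my own):

```python
def nominal_distance(x, y):
    """(p - m) / p: the fraction of nominal attributes on which
    two samples take different values."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)  # number of matches
    return (p - m) / p

# hypothetical records with color / shape / size attributes
print(nominal_distance(["red", "circle", "small"],
                       ["red", "square", "small"]))  # 1/3
```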


If the sample set contains attributes of mixed types, the following formula can be used to calculate the distance:

    d(i, j) = Σ_f δ_ij^(f) · d_ij^(f)  /  Σ_f δ_ij^(f)

where d_ij^(f) is the distance contributed by attribute f, computed according to its type as above, and the indicator δ_ij^(f) in the denominator acts as the weight of the attribute: it is 0 if attribute f is missing in either sample, and 1 otherwise.
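A minimal sketch of a Gower-style mixed-type distance under these assumptions (numeric attributes pre-normalized to [0, 1], None marking a missing value; the function name and type labels are my own):

```python
def mixed_distance(x, y, types):
    """Combine per-attribute distances for mixed attribute types.
    types[f] is 'numeric' (values assumed normalized to [0, 1]),
    'binary', or 'nominal'."""
    num, den = 0.0, 0.0
    for a, b, t in zip(x, y, types):
        if a is None or b is None:   # indicator delta = 0: skip attribute
            continue
        if t == "numeric":
            d = abs(a - b)           # |x_if - x_jf| on the [0, 1] scale
        else:                        # binary or nominal: simple mismatch
            d = 0.0 if a == b else 1.0
        num += d
        den += 1.0
    return num / den if den else 0.0

x = [0.2, 1, "red"]
y = [0.5, 0, "red"]
print(mixed_distance(x, y, ["numeric", "binary", "nominal"]))  # (0.3 + 1 + 0) / 3
```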

Partitional Clustering

Main idea: first decide to divide the data set into K clusters, then assign the samples to clusters so that the similarity within each cluster is as large as possible and the similarity between clusters is as small as possible.

1. K-means Clustering

Algorithm process: each of the K clusters is first given a randomly chosen center point. The distance between every sample and each of the K centers is computed, and each sample is assigned to the cluster whose center is nearest. After one pass over the data, the center of each cluster is recomputed from the samples currently in it; the sample set is then scanned again and every sample is reassigned to its nearest new center, updating the K clusters. This repeats for several iterations, and when the recomputed centers are the same as the previous ones, the iteration stops.

Principle: K-means minimizes the objective

    P(W, C) = Σ_{l=1}^{K} Σ_{i=1}^{n} w_{i,l} · d(x_i, c_l)

where l indexes the clusters, w_{i,l} ∈ {0, 1} indicates whether sample i belongs to cluster l, and d(x_i, c_l) is the distance between sample i and the center c_l of cluster l.

Our goal is to find the assignment W(i, l) that minimizes the value of P. Obviously, the time complexity of exhaustive search is far too high to be practical.

Instead, K-means alternately optimizes the assignments and the centers, descending on the objective step by step (a coordinate-descent-style scheme sometimes loosely described as gradient descent); each step decreases P, so the algorithm converges to a local optimum rather than the global one.
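The iterative process above can be sketched as follows (a minimal NumPy implementation; the function name, the random initialization from data points, and the toy data are my own):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain K-means: alternate between assigning each sample to its
    nearest center and recomputing each center as its cluster's mean.
    Stops when the centers no longer change."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(max_iter):
        # assignment step: nearest center by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each center to the mean of its cluster
        new_centers = np.array([
            X[labels == l].mean(axis=0) if np.any(labels == l) else centers[l]
            for l in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers unchanged: converged
            break
        centers = new_centers
    return labels, centers

# two well-separated toy blobs
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = k_means(X, k=2)
print(labels)  # first three samples share one label, last three the other
```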

