R Language and Data Analysis 4: Clustering Algorithms (1)


The classification algorithms shared previously belong to supervised learning; today we continue with an unsupervised learning method: clustering. Clustering algorithms also carry a strong flavour of big data mining.

Clustering algorithms are essentially built on geometric distance as their criterion, so they are best suited to data whose clusters are roughly spherical. Let us first list the commonly used distances:

Absolute distance (also known as city block or Manhattan distance)


Euclidean distance (the most widely used distance)


Minkowski distance: the Euclidean distance (q = 2), the absolute distance (q = 1) and the Chebyshev distance (q = infinity) are all special cases of the Minkowski distance.


Chebyshev distance



Mahalanobis distance (introduced in an earlier chapter)


Lance and Williams distance (the Canberra distance in R)


Qualitative variable distance (binary distance):



where M1 is the number of variables on which both samples take the value 1, and M2 is the number of variables on which the two samples do not match (one is 1 and the other is 0); the distance is M2 / (M1 + M2).
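
For reference, here are the usual textbook forms of these distances, written in LaTeX notation for two p-dimensional samples $x = (x_1, \ldots, x_p)$ and $y = (y_1, \ldots, y_p)$ (normalizing constants vary slightly between texts):

Absolute distance: $d(x, y) = \sum_{k=1}^{p} |x_k - y_k|$

Euclidean distance: $d(x, y) = \sqrt{\sum_{k=1}^{p} (x_k - y_k)^2}$

Minkowski distance: $d(x, y) = \left( \sum_{k=1}^{p} |x_k - y_k|^q \right)^{1/q}$

Chebyshev distance: $d(x, y) = \max_{1 \le k \le p} |x_k - y_k|$

Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^{\top} \Sigma^{-1} (x - y)}$, with $\Sigma$ the covariance matrix

Lance and Williams distance: $d(x, y) = \sum_{k=1}^{p} \frac{|x_k - y_k|}{x_k + y_k}$ (some texts include a factor $1/p$)

Binary distance: $d(x, y) = \frac{M_2}{M_1 + M_2}$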

R provides a ready-made function, dist(), for most of the distances above; you only need to specify the desired distance through the method parameter, as follows:

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)  # Euclidean distance
dist(x, method = "maximum",   diag = FALSE, upper = FALSE, p = 2)  # Chebyshev distance
dist(x, method = "manhattan", diag = FALSE, upper = FALSE, p = 2)  # absolute distance
dist(x, method = "canberra",  diag = FALSE, upper = FALSE, p = 2)  # Lance distance
dist(x, method = "binary",    diag = FALSE, upper = FALSE, p = 2)  # qualitative (binary) variable distance
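
As a quick illustrative sketch (the three 2-dimensional points below are made up for the example), the same data gives different distance matrices under different metrics:

x <- matrix(c(0, 0,
              3, 4,
              6, 0), ncol = 2, byrow = TRUE)
dist(x, method = "euclidean")   # distance between rows 1 and 2 is sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")   # the same pair under absolute distance is 3 + 4 = 7
dist(x, method = "minkowski", p = 3)   # general Minkowski distance; p is only used here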

Before computing distances, the raw data usually needs to be preprocessed; otherwise differences in the scale of the variables will be interpreted by the algorithm as differences in importance and distort the final clustering. We therefore centre and standardize the data. Centring shifts every variable so that its mean is 0; a common standardization is to divide each variable by its standard deviation. After centring and standardizing, the variables are on comparable scales, so each of them can contribute to the distance on an equal footing.

The corresponding function is provided in R:

scale(x, center = TRUE, scale = TRUE)
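
A minimal sketch of the idea, assuming a made-up data frame whose two variables live on very different scales:

raw <- data.frame(income = c(50000, 62000, 48000),
                  age    = c(25, 40, 33))
dist(raw)          # income dominates purely because of its units
std <- scale(raw, center = TRUE, scale = TRUE)   # each column now has mean 0 and sd 1
dist(std)          # both variables now contribute on an equal footing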

Next, we introduce the oldest of the clustering algorithms: hierarchical clustering.

The basic idea:

STEP 1: Treat each sample as a class of its own;

STEP 2: Calculate the distance between every pair of classes;

STEP 3: Merge the two closest classes into a new class;

STEP 4: Repeat steps 2-3, merging the two closest classes each time so that the number of classes drops by one, until all samples have been merged into a single class.
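
Before looking at how the distance between classes is measured, here is a minimal sketch (made-up one-dimensional data, the same five points used further below) of what the first merge step looks like when done by hand:

x <- c(1, 2, 6, 8, 11)
d <- as.matrix(dist(x))                      # pairwise distance matrix between the 5 samples
diag(d) <- Inf                               # ignore the zero self-distances
which(d == min(d), arr.ind = TRUE)[1, ]      # samples 1 and 2 are closest, so they merge first

Repeating this merge, after redefining the distance between the new class and the remaining ones, is exactly what hclust() automates below.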

Next, we introduce the common ways of measuring the distance between two classes:

Shortest distance method (single linkage): the minimum of all pairwise distances between points of the two classes;

Longest distance method (complete linkage): the maximum of all pairwise distances between points of the two classes;

Median distance method: a value taken between the minimum and the maximum distance;

Class average method: the average of all pairwise distances between points of the two classes;

Centroid method: the distance between the centroids (mean vectors) of the two classes;

Sum-of-squared-deviations method (Ward's method): based on the sum of squared deviations of the samples from their class mean.
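
For reference (names as in current R; this mapping is not spelled out in the original), these between-class distances correspond to the method argument of the hclust() function used next:

# shortest distance   -> method = "single"
# longest distance    -> method = "complete"
# median distance     -> method = "median"
# class average       -> method = "average"
# centroid            -> method = "centroid"
# sum of squared deviations (Ward) -> method = "ward.D" or "ward.D2" ("ward" in older R)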

R provides hierarchical clustering through the hclust() function; here is a simple example to illustrate its usage:

x <- c(1, 2, 6, 8, 11)
dim(x) <- c(5, 1)
d <- dist(x)
hc1 <- hclust(d, "single")    # shortest distance method
hc2 <- hclust(d, "complete")  # longest distance method
hc3 <- hclust(d, "median")    # median distance method
hc4 <- hclust(d, "ward.D")    # sum-of-squared-deviations (Ward) method; spelled "ward" in older R
cpar <- par(mfrow = c(2, 2))
plot(hc1, hang = -1)
rect.hclust(hc1, k = 2)       # manually tell R how many classes to divide into
plot(hc2, hang = -1)
plot(hc3, hang = -1)
plot(hc4, hang = -1)
par(cpar)
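
As a side note (not part of the original example), cutree() gives a non-graphical way to read off the same classification:

cutree(hc1, k = 2)    # class labels per sample; with this data, samples 1-2 form one class and 3-5 the other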

The results are shown below: once we tell R how many classes to divide the samples into, rect.hclust() draws a red partition on the first dendrogram, marking the classification automatically:



That is all for today; we will continue in the next chapter.



