R Language and Data Analysis 4: Clustering Algorithms (1)


The classification algorithms shared previously belong to supervised learning; today we continue with an unsupervised learning method: clustering. Clustering algorithms also carry a strong flavour of big data mining.

Clustering algorithms are essentially built on geometric distance as their criterion, so they are best suited to data whose clusters are roughly spherical. Let us first list the commonly used distances:

Absolute distance (also known as city block or Manhattan distance)


Euclidean distance (the most widely used distance)


Minkowski distance: the Euclidean distance (q = 2), the absolute distance (q = 1) and the Chebyshev distance (q = infinity) are all special cases of the Minkowski distance.


Chebyshev distance



Mahalanobis distance (introduced in an earlier chapter)


Lance and Williams distance (the Canberra distance in R)


Qualitative variable distance (binary distance):



where M1 is the number of variables on which both samples take the value 1, and M2 is the number of variables on which the two samples do not match (one is 1 and the other is 0); the distance is M2 / (M1 + M2).
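
For reference, here are the usual textbook forms of these distances, written in LaTeX notation for two p-dimensional samples $x = (x_1, \ldots, x_p)$ and $y = (y_1, \ldots, y_p)$ (normalizing constants vary slightly between texts):

Absolute distance: $d(x, y) = \sum_{k=1}^{p} |x_k - y_k|$

Euclidean distance: $d(x, y) = \sqrt{\sum_{k=1}^{p} (x_k - y_k)^2}$

Minkowski distance: $d(x, y) = \left( \sum_{k=1}^{p} |x_k - y_k|^q \right)^{1/q}$

Chebyshev distance: $d(x, y) = \max_{1 \le k \le p} |x_k - y_k|$

Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^{\top} \Sigma^{-1} (x - y)}$, with $\Sigma$ the covariance matrix

Lance and Williams distance: $d(x, y) = \sum_{k=1}^{p} \frac{|x_k - y_k|}{x_k + y_k}$ (some texts include a factor $1/p$)

Binary distance: $d(x, y) = \frac{M_2}{M_1 + M_2}$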

R provides a ready-made function, dist(), for most of the distances above; you only need to specify the desired distance through the method parameter, as follows:

dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)  # Euclidean distance
dist(x, method = "maximum",   diag = FALSE, upper = FALSE, p = 2)  # Chebyshev distance
dist(x, method = "manhattan", diag = FALSE, upper = FALSE, p = 2)  # absolute distance
dist(x, method = "canberra",  diag = FALSE, upper = FALSE, p = 2)  # Lance distance
dist(x, method = "binary",    diag = FALSE, upper = FALSE, p = 2)  # qualitative (binary) variable distance
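
As a quick illustrative sketch (the three 2-dimensional points below are made up for the example), the same data gives different distance matrices under different metrics:

x <- matrix(c(0, 0,
              3, 4,
              6, 0), ncol = 2, byrow = TRUE)
dist(x, method = "euclidean")   # distance between rows 1 and 2 is sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")   # the same pair under absolute distance is 3 + 4 = 7
dist(x, method = "minkowski", p = 3)   # general Minkowski distance; p is only used here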

Before computing distances, the raw data usually needs to be preprocessed; otherwise differences in the scale of the variables will be interpreted by the algorithm as differences in importance and distort the final clustering. We therefore centre and standardize the data. Centring shifts every variable so that its mean is 0; a common standardization is to divide each variable by its standard deviation. After centring and standardizing, the variables are on comparable scales, so each of them can contribute to the distance on an equal footing.

The corresponding function is provided in R:

scale(x, center = TRUE, scale = TRUE)
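
A minimal sketch of the idea, assuming a made-up data frame whose two variables live on very different scales:

raw <- data.frame(income = c(50000, 62000, 48000),
                  age    = c(25, 40, 33))
dist(raw)          # income dominates purely because of its units
std <- scale(raw, center = TRUE, scale = TRUE)   # each column now has mean 0 and sd 1
dist(std)          # both variables now contribute on an equal footing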

Next, we introduce the oldest of the clustering algorithms: hierarchical clustering.

The basic idea:

STEP 1: Treat each sample as a class of its own;

STEP 2: Calculate the distance between every pair of classes;

STEP 3: Merge the two closest classes into a new class;

STEP 4: Repeat steps 2-3, merging the two closest classes each time so that the number of classes drops by one, until all samples have been merged into a single class.
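
Before looking at how the distance between classes is measured, here is a minimal sketch (made-up one-dimensional data, the same five points used further below) of what the first merge step looks like when done by hand:

x <- c(1, 2, 6, 8, 11)
d <- as.matrix(dist(x))                      # pairwise distance matrix between the 5 samples
diag(d) <- Inf                               # ignore the zero self-distances
which(d == min(d), arr.ind = TRUE)[1, ]      # samples 1 and 2 are closest, so they merge first

Repeating this merge, after redefining the distance between the new class and the remaining ones, is exactly what hclust() automates below.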

Next, we introduce the common ways of measuring the distance between two classes:

Shortest distance method (single linkage): the minimum of all pairwise distances between points of the two classes;

Longest distance method (complete linkage): the maximum of all pairwise distances between points of the two classes;

Median distance method: a value taken between the minimum and the maximum distance;

Class average method: the average of all pairwise distances between points of the two classes;

Centroid method: the distance between the centroids (mean vectors) of the two classes;

Sum-of-squared-deviations method (Ward's method): based on the sum of squared deviations of the samples from their class mean.
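
For reference (names as in current R; this mapping is not spelled out in the original), these between-class distances correspond to the method argument of the hclust() function used next:

# shortest distance   -> method = "single"
# longest distance    -> method = "complete"
# median distance     -> method = "median"
# class average       -> method = "average"
# centroid            -> method = "centroid"
# sum of squared deviations (Ward) -> method = "ward.D" or "ward.D2" ("ward" in older R)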

R provides hierarchical clustering through the hclust() function; here is a simple example to illustrate its usage:

x <- c(1, 2, 6, 8, 11)
dim(x) <- c(5, 1)
d <- dist(x)
hc1 <- hclust(d, "single")    # shortest distance method
hc2 <- hclust(d, "complete")  # longest distance method
hc3 <- hclust(d, "median")    # median distance method
hc4 <- hclust(d, "ward.D")    # sum-of-squared-deviations (Ward) method; spelled "ward" in older R
cpar <- par(mfrow = c(2, 2))
plot(hc1, hang = -1)
rect.hclust(hc1, k = 2)       # manually tell R how many classes to divide into
plot(hc2, hang = -1)
plot(hc3, hang = -1)
plot(hc4, hang = -1)
par(cpar)
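
As a side note (not part of the original example), cutree() gives a non-graphical way to read off the same classification:

cutree(hc1, k = 2)    # class labels per sample; with this data, samples 1-2 form one class and 3-5 the other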

The results are shown below: once we tell R how many classes to divide the samples into, rect.hclust() draws a red partition on the first dendrogram, marking the classification automatically:



That is all for today; we will continue in the next chapter.



