Clustering mixed data in R: a case study

Last Update:2018-07-26 Source: Internet

Author: User


Cluster analysis makes it easy to see how the samples in a data set are distributed. Articles on cluster analysis usually describe only how to handle continuous variables and say little about mixed data (data that contains continuous, nominal and ordinal variables). This article uses Gower distance, the PAM (partitioning around medoids) algorithm and the silhouette coefficient to show how to cluster mixed data.

This article is divided into three parts: distance calculation, choosing a clustering algorithm, and selecting the number of clusters.

For ease of presentation, this article uses the College data set from the ISLR package. The data set covers 777 American universities in 1995 and mainly includes the following variables:

Continuous variables: acceptance rate, tuition, number of new students enrolled
Categorical variables: whether the institution is public or private; whether it is an elite institution, that is, whether more than 50% of its new students graduated in the top 10% of their high school class

The R packages covered in this article are:

In [3]:

set.seed(1680) # set a random seed so that the results in this article are reproducible

library(dplyr)
library(ISLR)
library(cluster)
library(Rtsne)
library(ggplot2)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Before building a clustering model, we need to do some data cleaning: The acceptance rate equals the number of accepted applicants divided by the total number of applicants. A school is labeled elite if more than 50% of its new students graduated in the top 10% of their high school class.

In [5]:

college_clean <- College %>%
  mutate(name = row.names(.),
         accept_rate = Accept / Apps,
         # Top10perc is a percentage; more than 50 means "Elite"
         isElite = cut(Top10perc,
                       breaks = c(0, 50, 100),
                       labels = c("Not Elite", "Elite"),
                       include.lowest = TRUE)) %>%
  mutate(isElite = factor(isElite)) %>%
  select(name, accept_rate, Outstate, Enroll,
         Grad.rate, Private, isElite)

glimpse(college_clean)

Observations: 777
Variables: 7
$ name        (chr) "Abilene Christian University", "Adelphi University", "...
$ accept_rate (dbl) 0.7421687, 0.8801464, 0.7682073, 0.8369305, 0.7564767, ...
$ Outstate    (dbl) 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1...
$ Enroll      (dbl) 721, 512, 336, 137, 55, 158, 103, 489, 227, 172, 472, 4...
$ Grad.rate   (dbl) ...
$ Private     (fctr) Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ isElite     (fctr) Not Elite, Not Elite, Not Elite, Elite, Not Elite, Not ...
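
As a small illustration of how cut() assigns the two labels, the sketch below applies the same kind of call to a few made-up percentages. The values in top10 are hypothetical, and the break points 0, 50 and 100 are an assumption based on the 50% threshold described above.

```r
# Hypothetical Top10perc-style values (percentages), not taken from the College data
top10 <- c(0, 23, 50, 51, 96)

# Bins are [0, 50] and (50, 100]; include.lowest = TRUE keeps a value of exactly 0
is_elite <- cut(top10,
                breaks = c(0, 50, 100),
                labels = c("Not Elite", "Elite"),
                include.lowest = TRUE)

print(data.frame(top10, is_elite))
```

Because the intervals are closed on the right, a school at exactly 50% lands in the "Not Elite" bin, which agrees with the "greater than 50%" rule.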

Distance Calculation

The first step in clustering is to define a measure of distance between samples. The most commonly used measure is Euclidean distance. However, Euclidean distance applies only to continuous variables, so this article uses another measure, Gower distance.

Gower Distance

The idea behind Gower distance is simple. For each variable type, a distance measure suited to that type is used and scaled to fall between 0 and 1. The final distance matrix is then computed as a weighted linear combination of the per-variable distances. The measures used for each type are:

Continuous variables: range-normalized Manhattan distance
Ordinal variables: the variable is first ranked, then a specially adjusted Manhattan distance is used
Nominal variables: a variable with k categories is first converted to k binary (0-1) columns, then the Dice coefficient is used

Advantages: easy to understand and easy to compute
Disadvantages: very sensitive to outliers and non-normality in continuous variables, so transforming the data beforehand is essential; the method also requires a large amount of memory, because the full pairwise distance matrix must be held
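
The calculation can be sketched by hand in base R. This is a minimal sketch with equal weights, not the implementation used by any package: the toy data frame and the helper gower_pair() are hypothetical. For a single nominal variable encoded as 0-1 columns, the Dice dissimilarity is 0 when two samples share a category and 1 otherwise, so the sketch uses a simple mismatch test for that case.

```r
# Toy data: two continuous variables and one nominal variable (all values hypothetical)
toy <- data.frame(
  tuition = c(7000, 12000, 18000),
  rate    = c(0.90, 0.75, 0.15),
  private = factor(c("No", "Yes", "Yes"))
)

# Gower dissimilarity between rows i and j of df, with equal weights
gower_pair <- function(df, i, j) {
  parts <- vapply(df, function(col) {
    if (is.numeric(col)) {
      # continuous: Manhattan distance divided by the range of the variable
      abs(col[i] - col[j]) / diff(range(col))
    } else {
      # nominal: 0 if the categories match, 1 otherwise
      as.numeric(col[i] != col[j])
    }
  }, numeric(1))
  mean(parts)  # every per-variable term lies in [0, 1], so the mean does too
}

gower_pair(toy, 1, 2)  # rows that differ moderately
gower_pair(toy, 1, 3)  # rows at opposite extremes on every variable
```

Rows 1 and 3 sit at opposite ends of both continuous ranges and differ on the nominal variable, so their dissimilarity is the maximum possible value of 1.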

With the daisy function, the Gower distance can be calculated in a single line of code. Note that because the number of new students enrolled is right-skewed, it needs a log transformation. The daisy function has a log transformation built in (through its type argument); see the help page (?daisy) for a description of the other parameters.

In [6]:

# Remove college name before clustering

gower_dist <- daisy(college_clean[, -1],
                    metric = "gower",
                    type = list(logratio = 3))

# Check attributes to ensure the correct methods are being used
# (I = interval, N = nominal)
# Note that despite logratio being called,
# the type remains coded as "I"

summary(gower_dist)

Out[6]:

301476 dissimilarities, summarized :
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0018601 0.1034400 0.2358700 0.2314500 0.3271400 0.7773500 
Metric :  mixed ;  Types = I, I, I, I, N, N 
Number of objects : 777
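
To see what the logratio option does, the sketch below compares it with log-transforming the column by hand before calling daisy. The toy data frame is hypothetical; the only assumption about daisy is the documented behavior that logratio variables are log-transformed and then treated as interval-scaled.

```r
library(cluster)

# Hypothetical toy data; enroll is strictly positive and right-skewed
toy <- data.frame(
  rate    = c(0.90, 0.75, 0.15, 0.60),
  enroll  = c(50, 400, 1600, 6400),
  private = factor(c("No", "Yes", "Yes", "No"))
)

# Built-in transformation: column 2 (enroll) is declared as a log-ratio variable
d_builtin <- daisy(toy, metric = "gower", type = list(logratio = 2))

# Manual transformation: take the log first, then compute Gower distance as usual
toy_log  <- transform(toy, enroll = log(enroll))
d_manual <- daisy(toy_log, metric = "gower")

# Compare the two sets of pairwise dissimilarities
all.equal(as.numeric(d_builtin), as.numeric(d_manual))
```

The two results should agree, since Gower distance divides each continuous variable by its range and the base of the logarithm therefore cancels out.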

In addition, we can check that the distance measure is reasonable by looking at the most similar and the most dissimilar samples. In this case, the University of St. Thomas (MN) and John Carroll University are the most similar pair, while the University of Science and Arts of Oklahoma and Harvard University are among the most dissimilar.

In [7]:

gower_mat <- as.matrix(gower_dist)

# Output most similar pair

college_clean[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]),
        arr.ind = TRUE)[1, ], ]

Out[7]:

	name	accept_rate	Outstate	Enroll	Grad.rate	Private	isElite
682	University of St. Thomas MN	0.8784638	11712	828	89	Yes	Not Elite
284	John Carroll University	0.8711276	11700	820	89	Yes	Not Elite
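
The indexing idiom used here can be traced on a small matrix. This is a sketch with a hypothetical 3 x 3 dissimilarity matrix m: the diagonal zeros are filtered out first, and which(..., arr.ind = TRUE) then returns the row and column positions of the smallest remaining entry.

```r
# Hypothetical symmetric dissimilarity matrix with zeros on the diagonal
m <- matrix(c(0.0, 0.2, 0.7,
              0.2, 0.0, 0.5,
              0.7, 0.5, 0.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Smallest dissimilarity after dropping the zeros (each object compared with itself)
smallest <- min(m[m != min(m)])

# Each match appears twice in a symmetric matrix; [1, ] keeps the first (row, col) pair
pair <- which(m == smallest, arr.ind = TRUE)[1, ]

rownames(m)[pair]  # names of the two most similar objects
```

Using the pair of positions as row indices, as the article does with college_clean, returns both members of the pair at once.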

In [8]:

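# Output most dissimilar pair
# Note: as written, the inner filter drops the overall maximum first,
# so this selects the pair at the largest remaining dissimilarity

college_clean[
  which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
        arr.ind = TRUE)[1, ], ]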

Out[8]:

	name	accept_rate	Outstate	Enroll	Grad.rate	Private	isElite
673	University of Sci. and Arts of Oklahoma	0.9824561	3687	208	[...]	No	Not Elite
251	Harvard University	0.1561486	18485	1606	[...]	Yes	Elite

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
