Cluster analysis makes it easy to see how the samples in a data set are distributed. Articles on cluster analysis have usually covered only continuous variables and said little about mixed data (data that combines continuous, nominal and ordinal variables). This article uses Gower distance, the PAM (partitioning around medoids) algorithm and silhouette width to show how to cluster mixed data.
This article is divided into three parts: distance calculation, choice of clustering algorithm, and selection of the number of clusters.
For ease of presentation, this article uses the College dataset from the ISLR package. The dataset contains statistics for 777 American universities from 1995, with the following main variables:
Continuous variables: acceptance rate, tuition, number of new students enrolled
Categorical variables: whether the institution is public or private; whether it is an elite institution, that is, whether more than 50% of its new students graduated in the top 10% of their high school class
The R packages used in this article are:
In [3]:
set.seed(1680)   # Set a random seed so the results in this article are reproducible
library(dplyr)   # data cleaning
library(ISLR)    # College dataset
library(cluster) # Gower distance and PAM
library(Rtsne)   # t-SNE
library(ggplot2) # plotting
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
    filter, lag
The following objects are masked from 'package:base':
    intersect, setdiff, setequal, union
Before building a clustering model, we need to clean the data: the acceptance rate equals the number of accepted applicants divided by the total number of applicants, and a school counts as an elite institution when more than 50% of its new students graduated in the top 10% of their high school class.
In [5]:
college_clean <- College %>%
  mutate(name = row.names(.),
         accept_rate = Accept / Apps,
         # Top10perc is a percentage; more than 50 means "Elite"
         isElite = cut(Top10perc,
                       breaks = c(0, 50, 100),
                       labels = c("Not Elite", "Elite"),
                       include.lowest = TRUE)) %>%
  mutate(isElite = factor(isElite)) %>%
  select(name, accept_rate, Outstate, Enroll,
         Grad.Rate, Private, isElite)
glimpse(college_clean)
Observations: 777
Variables: 7
$ name        (chr) "Abilene Christian University", "Adelphi University", "...
$ accept_rate (dbl) 0.7421687, 0.8801464, 0.7682073, 0.8369305, 0.7564767, ...
$ Outstate    (dbl) 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1...
$ Enroll      (dbl) 721, ..., 336, 137, ..., 158, 103, 489, 227, 172, 472, 4...
$ Grad.Rate   (dbl) ...
$ Private     (fctr) Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ isElite     (fctr) Not Elite, Not Elite, Not Elite, Elite, Not Elite, Not ...
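The elite flag depends on how cut() treats values that sit exactly on a break. The following is a minimal sketch on made-up percentages, assuming breaks at 0, 50 and 100 (which matches the 50% threshold described above); the vector `top10` is hypothetical and not taken from the College data.

```r
# Hypothetical percentages, chosen to probe the boundary at 50
top10 <- c(0, 23, 50, 51, 100)

is_elite <- cut(top10,
                breaks = c(0, 50, 100),
                labels = c("Not Elite", "Elite"),
                include.lowest = TRUE)

# The intervals are [0, 50] and (50, 100], so a school with exactly 50%
# is labelled "Not Elite" and only values above 50 become "Elite"
data.frame(top10, is_elite)
```

Without include.lowest = TRUE, a value of exactly 0 would fall outside the first interval and become NA.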
Distance Calculation
The first step in clustering is to define a measure of distance between samples. The most commonly used measure is Euclidean distance. However, Euclidean distance only applies to continuous variables, so this article uses a different measure: Gower distance.
Gower Distance
Gower distance is simple to define. First, each variable type gets its own distance measure, scaled to fall within [0,1]. Then a weighted linear combination of these per-variable distances gives the final distance matrix. The measure used for each variable type is:
Continuous variables: range-normalized Manhattan distance
Ordinal variables: the variable is first ranked, then a specially adjusted Manhattan distance is used
Nominal variables: a variable with k categories is first converted into k 0-1 variables, then the Dice coefficient is used
Advantages: easy to understand and easy to calculate
Disadvantages: it is sensitive to non-normality and outliers in continuous variables, so transforming the data beforehand is essential; it also requires a large amount of memory, because a distance is stored for every pair of samples
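To make the definition concrete, here is a minimal sketch that computes Gower distance by hand on a toy data frame and compares it with daisy() from the cluster package. The data frame `toy` and its values are made up, and equal weights are assumed for the two variables. For a single nominal variable, the Dice coefficient on its 0-1 columns reduces to 0 when the categories match and 1 when they differ, which is what the sketch uses.

```r
library(cluster)  # provides daisy()

# Hypothetical toy data: one continuous and one nominal variable
toy <- data.frame(
  tuition = c(5000, 10000, 20000),
  private = factor(c("Yes", "No", "Yes"))
)

# Continuous part: absolute difference divided by the variable's range
num_part <- as.matrix(dist(toy$tuition, method = "manhattan")) /
  diff(range(toy$tuition))

# Nominal part: 0 when the categories match, 1 when they differ
p <- as.character(toy$private)
nom_part <- outer(p, p, FUN = "!=") * 1

# Equal-weight combination of the two per-variable distances
manual <- (num_part + nom_part) / 2

# Reference value from the cluster package
reference <- as.matrix(daisy(toy, metric = "gower"))

all.equal(manual, reference, check.attributes = FALSE)
```

Every entry of `manual` stays within [0,1] because each per-variable distance does.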
With the daisy function, a single line of code computes the Gower distance. Note that because the number of new students enrolled is right-skewed, it needs a log transformation. daisy has a built-in option for log transformation; see its help documentation (?daisy) for the other arguments.
In [6]:
# Remove college name before clustering
gower_dist <- daisy(college_clean[, -1],
                    metric = "gower",
                    type = list(logratio = 3))
# Check attributes to ensure the correct methods are being used
# (I = interval, N = nominal)
# Note that despite logratio being called,
# the type remains coded as "I"
summary(gower_dist)
OUT[6]:
301476 dissimilarities, summarized:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0018601 0.1034400 0.2358700 0.2314500 0.3271400 0.7773500
Metric : mixed ; Types = I, I, I, I, N, N
Number of objects : 777
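As a rough check of what the logratio option does, the sketch below compares it with log-transforming the column by hand before calling daisy(). The data frame `toy` is made up. The sketch assumes that daisy() log-transforms the listed column and then treats it as an ordinary interval variable, which is consistent with the type staying coded as "I" in the summary above.

```r
library(cluster)  # provides daisy()

# Hypothetical toy data with a right-skewed column (enroll)
toy <- data.frame(
  accept_rate = c(0.90, 0.75, 0.40, 0.15),
  enroll      = c(50, 200, 800, 6400)
)

# Route 1: let daisy() log-transform column 2
d_builtin <- daisy(toy, metric = "gower", type = list(logratio = 2))

# Route 2: log-transform by hand, then compute Gower distance
toy_logged <- transform(toy, enroll = log10(enroll))
d_manual <- daisy(toy_logged, metric = "gower")

# Compare the two sets of pairwise distances
all.equal(as.numeric(d_builtin), as.numeric(d_manual))
```

Because Gower distance divides by the range of each continuous variable, the base of the logarithm does not change the result.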
We can also check that the distance measure is reasonable by looking at the most similar and the most dissimilar pairs of samples. Here, the University of St. Thomas is most similar to John Carroll University, while the University of Science and Arts of Oklahoma and Harvard University differ the most.
In [7]:
gower_mat <- as.matrix(gower_dist)
# Output most similar pair
# (the smallest value other than the zeros on the diagonal)
college_clean[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]),
        arr.ind = TRUE)[1, ], ]
OUT[7]:
    name                         accept_rate  Outstate  Enroll  Grad.Rate  Private  isElite
682 University of St. Thomas MN  0.8784638    11712     828     89         Yes      Not Elite
284 John Carroll University      0.8711276    11700     820     89         Yes      Not Elite
In [8]:
# Output most dissimilar pair
college_clean[
  which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
        arr.ind = TRUE)[1, ], ]
OUT[8]:
    name                                     accept_rate  Outstate  Enroll  Grad.Rate  Private  isElite
673 University of Sci. and Arts of Oklahoma  0.9824561    3687      208     ...        No       Not Elite
251 Harvard University                       0.1561486    18485     1606    ...        Yes      Elite
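The lookup of extreme pairs can also be wrapped in a small function. The sketch below is an alternative, not the code used above: `extreme_pair` is a hypothetical helper, and the points in `pts` are made up. It masks the diagonal and the lower triangle directly, so each pair is considered once and the zero self-distances are ignored.

```r
# Hypothetical helper: return the row/column indices of the closest or
# farthest pair in a symmetric distance matrix, ignoring the diagonal
extreme_pair <- function(dist_mat, fun = min) {
  upper <- dist_mat
  upper[!upper.tri(upper)] <- NA       # keep each pair once, drop the diagonal
  target <- fun(upper, na.rm = TRUE)   # smallest or largest remaining distance
  which(upper == target, arr.ind = TRUE)[1, ]
}

# Toy example: four made-up points on a line
pts <- c(a = 0, b = 1, c = 1.5, d = 10)
m <- as.matrix(dist(pts))

closest  <- extreme_pair(m, min)   # row/column indices of b and c
farthest <- extreme_pair(m, max)   # row/column indices of a and d
closest
farthest
```

Applied to the matrix from this article, the returned indices could be passed to college_clean[idx, ] to show the corresponding rows. With fun = max the helper returns the pair at the overall maximum distance.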