Clustering mixed-type data in R: a case study

Source: Internet
Author: User

Cluster analysis makes it easy to see how the samples in a data set are distributed. Most earlier articles on cluster analysis describe only how to handle continuous variables and say little about mixed data (data that contain continuous, nominal and ordinal variables). This article uses Gower distance, the PAM (partitioning around medoids) algorithm and the silhouette coefficient to show how to cluster mixed data.

This article is divided into three parts: distance calculation, choice of clustering algorithm, and selection of the number of clusters.

For ease of presentation, this article uses the College dataset from the ISLR package directly. The dataset contains data from 1995 on 777 American universities, mainly the following variables:

Continuous variables: acceptance rate, tuition, number of new students enrolled
Categorical variables: whether the institution is public or private; whether it is an elite institution, meaning that more than 50% of its new students graduated in the top 10% of their high school class

The R packages used in this article are:

In [3]:

set.seed(1680) # Set a random seed so that the results in this article are reproducible

library(dplyr)   # data cleaning (mutate, select, glimpse)
library(ISLR)    # the College dataset
library(cluster) # daisy() for Gower distance
library(Rtsne)
library(ggplot2)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Before building a clustering model, we need to do some data cleaning: The acceptance rate equals the number of accepted applicants divided by the total number of applicants. A school is classed as an elite institution if more than 50% of its new students come from the top 10% of their high school class.
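
The elite rule above can be sketched on its own with base R's cut function. The percentages below are made up, and the breakpoints at 0, 50 and 100 are an assumption that follows from the "greater than 50%" rule in the text:

```r
# Hypothetical percentages of new students from the top 10% of their class
top10 <- c(0, 23, 50, 51, 96)

# Intervals are right-closed: [0, 50] is "Not Elite", (50, 100] is "Elite".
# include.lowest = TRUE keeps a value of exactly 0 in the first interval.
elite <- cut(top10,
             breaks = c(0, 50, 100),
             labels = c("Not Elite", "Elite"),
             include.lowest = TRUE)

print(data.frame(top10, elite))
```

A school at exactly 50% falls in the "Not Elite" group, which matches the strict "greater than 50%" wording.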

In [5]:

college_clean <- College %>%
  mutate(name = row.names(.),
         accept_rate = Accept / Apps,
         # Elite if more than 50% of new students are from the top 10% of their class
         isElite = cut(Top10perc,
                       breaks = c(0, 50, 100),
                       labels = c("Not Elite", "Elite"),
                       include.lowest = TRUE)) %>%
  mutate(isElite = factor(isElite)) %>%
  select(name, accept_rate, Outstate, Enroll,
         Grad.rate, Private, isElite)

glimpse(college_clean)

Observations: 777
Variables: 7
$ name        (chr) "Abilene Christian University", "Adelphi University", "...
$ accept_rate (dbl) 0.7421687, 0.8801464, 0.7682073, 0.8369305, 0.7564767, ...
$ Outstate    (dbl) 7440, 12280, 11250, 12960, 7560, 13500, 13290, 13868, 1...
$ Enroll      (dbl) 721, ..., 336, 137, ..., 158, 103, 489, 227, 172, 472, 4...
$ Grad.rate   (dbl) ...
$ Private     (fctr) Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ isElite     (fctr) Not Elite, Not Elite, Not Elite, Elite, Not Elite, Not ...

Distance Calculation

The first step in clustering is to define a measure of distance between samples. The most commonly used measure is Euclidean distance. However, Euclidean distance applies only to continuous variables, so this article uses a different measure: Gower distance.

Gower Distance

The definition of Gower distance is simple. First, each type of variable gets its own distance measure, one that scales the result to the interval [0, 1]. Next, the per-variable distances are combined as a weighted linear combination to produce the final distance matrix. The measure used for each type of variable is:

Continuous variables: range-normalized Manhattan distance
Ordinal variables: the variable is first ranked, then a specially adjusted Manhattan distance is applied
Nominal variables: a variable with k categories is first converted to k binary (0-1) variables, then the Dice coefficient is computed

Advantages: easy to understand and easy to compute
Disadvantages: very sensitive to outliers in continuous variables that have not been transformed, so a data transformation step is essential; requires a large amount of memory
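
As a rough illustration of the definition above, the sketch below computes the Gower distance between two rows by hand in base R. The toy data and the helper gower_pair are hypothetical, all variables get equal weight, and ordinal variables are left out. For a single nominal variable, the Dice distance on its 0-1 columns reduces to 0 when the categories match and 1 when they differ, which is what the sketch uses.

```r
# Toy data (made-up values): two continuous columns and one nominal column
toy <- data.frame(
  rate    = c(0.2, 0.5, 0.8),
  tuition = c(5000, 10000, 20000),
  private = factor(c("Yes", "Yes", "No"))
)

# Gower distance between rows i and j, with equal weights for all columns
gower_pair <- function(df, i, j) {
  parts <- vapply(df, function(col) {
    if (is.numeric(col)) {
      # Continuous: absolute difference divided by the column's range
      abs(col[i] - col[j]) / diff(range(col))
    } else {
      # Nominal: 0 if the categories match, 1 otherwise
      as.numeric(col[i] != col[j])
    }
  }, numeric(1))
  mean(parts)
}

d12 <- gower_pair(toy, 1, 2)  # (0.5 + 1/3 + 0) / 3
d13 <- gower_pair(toy, 1, 3)  # rows differ maximally on every column
print(c(d12 = d12, d13 = d13))
```

Every per-variable term lies in [0, 1], so the average does too. Rows 1 and 3 sit at opposite ends of both continuous ranges and differ on the nominal column, which gives the largest possible distance of 1.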

With the daisy function, one line of code is enough to calculate the Gower distance. Note that because the number of new students enrolled is right-skewed, it needs a log transformation. The daisy function has a built-in log transformation; see its help documentation for descriptions of the other arguments.
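
To see why the log transformation matters for a right-skewed variable, the sketch below compares the range-normalized distance between two small values before and after taking logs. The enrollment figures are made up and range_dist is a hypothetical helper; this illustrates the idea only and is not daisy's internal code.

```r
enroll <- c(100, 200, 400, 6400)  # hypothetical right-skewed values

# Range-normalized absolute difference between elements i and j of x
range_dist <- function(x, i, j) abs(x[i] - x[j]) / diff(range(x))

raw_d <- range_dist(enroll, 1, 2)       # 100 / 6300
log_d <- range_dist(log(enroll), 1, 2)  # log(2) / log(64) = 1/6

print(c(raw = raw_d, log = log_d))
```

On the raw scale the single large value stretches the range, so a doubling from 100 to 200 looks negligible. On the log scale the same doubling covers one sixth of the range.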

In [6]:

# Remove college name before clustering

gower_dist <- daisy(college_clean[, -1],
                    metric = "gower",
                    type = list(logratio = 3))

# Check attributes to ensure the correct methods are being used
# (I = interval, N = nominal)
# Note that despite logratio being called,
# the type remains coded as "I"

summary(gower_dist)

Out[6]:

301476 dissimilarities, summarized :
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0018601 0.1034400 0.2358700 0.2314500 0.3271400 0.7773500 
Metric :  mixed ;  Types = I, I, I, I, N, N 
Number of objects : 777
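
The count of 301476 dissimilarities is the number of unordered pairs among 777 objects. The quick check below also gives a rough estimate of the memory needed for the full square matrix, assuming 8 bytes per value and ignoring object overhead:

```r
n <- 777

# Number of unordered pairs of objects: n * (n - 1) / 2
n_pairs <- choose(n, 2)

# Approximate size in MB of the full n x n matrix of doubles
full_matrix_mb <- n^2 * 8 / 1e6

print(n_pairs)
print(round(full_matrix_mb, 2))
```

The number of pairs grows quadratically, so a data set with ten times as many rows needs roughly a hundred times the memory.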

In addition, we can check whether the distance measure is reasonable by looking at the most similar and the most dissimilar pairs of samples. In this case, the University of St. Thomas is most similar to John Carroll University, while the University of Science and Arts of Oklahoma and Harvard University differ the most.

In [7]:

gower_mat <- as.matrix(gower_dist)

# Output most similar pair

college_clean[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]),
        arr.ind = TRUE)[1, ], ]

Out[7]:

                           name accept_rate Outstate Enroll Grad.rate Private   isElite
682 University of St. Thomas MN   0.8784638    11712    828        89     Yes Not Elite
284     John Carroll University   0.8711276    11700    820        89     Yes Not Elite
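
The indexing pattern used above can be tried on a small matrix. In this sketch the 3 x 3 distance matrix is made up; the point is how which with arr.ind = TRUE turns a matching cell into row and column positions:

```r
# Toy symmetric distance matrix (made-up values) with zeros on the diagonal
toy_mat <- matrix(c(0.0, 0.4, 0.9,
                    0.4, 0.0, 0.2,
                    0.9, 0.2, 0.0),
                  nrow = 3, byrow = TRUE)

# The overall minimum is the diagonal zero (each object's distance to itself),
# so exclude it and take the smallest remaining distance
closest <- min(toy_mat[toy_mat != min(toy_mat)])

# arr.ind = TRUE returns (row, col) positions instead of one linear index
pairs <- which(toy_mat == closest, arr.ind = TRUE)

# The matrix is symmetric, so the pair appears twice; keep the first match
closest_pair <- unname(pairs[1, ])
print(closest_pair)
```

The two numbers in closest_pair are row positions, which is why they can be used directly to index the rows of the original data frame.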

In [8]:

# Output most dissimilar pair

college_clean[
  which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
        arr.ind = TRUE)[1, ], ]

Out[8]:

                                       name accept_rate Outstate Enroll Grad.rate Private   isElite
673 University of Sci. and Arts of Oklahoma   0.9824561     3687    208       ...      No Not Elite
251                      Harvard University   0.1561486    18485   1606       ...     Yes     Elite
