[Reading notes] Machine Learning: Practical Case Analysis (9)

Source: Internet
Author: User

Chapter 9. MDS: Visually Exploring US Senator Similarity

Clustering based on similarity: the main purpose of this chapter is to show how to measure the similarities and differences between observations, and how to understand the concept of distance.

Multidimensional scaling (MDS) is a technique for clustering observations based on a distance measure between them. It visualizes the data using only the pairwise distances between all points.

How MDS works: the input is a distance matrix holding the distance between every pair of points in the dataset. The output is a set of coordinates whose pairwise distances approximate the input distances. When the number of output dimensions is low, some information is lost, so the distances are only approximated.
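To make the "only approximate in low dimensions" point concrete, here is a minimal sketch in base R. The five 3-D points are hypothetical toy data, not taken from the chapter. It compares how well cmdscale() reproduces the input distances when it keeps all three dimensions and when it keeps only two.

```r
# Hypothetical toy data: five points in 3-D space (not from the chapter).
pts <- matrix(c(0, 0, 0,
                1, 0, 0,
                0, 2, 0,
                0, 0, 3,
                1, 2, 3),
              ncol = 3, byrow = TRUE)
d <- dist(pts)  # input to MDS: pairwise Euclidean distances

# k = 3 keeps the original dimensionality, so the distances are reproduced
# up to floating-point error.
full <- cmdscale(d, k = 3)
err.full <- max(abs(as.vector(dist(full)) - as.vector(d)))

# k = 2 drops one dimension, so the distances are only approximated.
low <- cmdscale(d, k = 2)
err.low <- max(abs(as.vector(dist(low)) - as.vector(d)))

print(c(full = err.full, low = err.low))
```

The first error should be essentially zero and the second clearly larger, which is the information loss described above.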

The following simple example builds a random user-by-rating matrix and applies MDS to it:

# Introduction to distance metrics and multidimensional scaling
# Randomly generate a "user" by "rating" matrix
set.seed(851982)
ex.matrix <- matrix(sample(c(-1, 0, 1), 24, replace = TRUE), nrow = 4, ncol = 6)
row.names(ex.matrix) <- c('A', 'B', 'C', 'D')
colnames(ex.matrix) <- c('P1', 'P2', 'P3', 'P4', 'P5', 'P6')
# Multiply the matrix by its own transpose to get the user-by-user matrix
ex.mult <- ex.matrix %*% t(ex.matrix)
# Distance matrix between the data points
ex.dist <- dist(ex.mult)
# Classical (metric) multidimensional scaling
ex.mds <- cmdscale(ex.dist)
plot(ex.mds, type = 'n')
text(ex.mds, c('A', 'B', 'C', 'D'))
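It may help to see what multiplying the matrix by its own transpose actually computes. The sketch below uses two hypothetical users and four items (1 = like, -1 = dislike, 0 = no rating). Each off-diagonal entry of the product is the number of items the two users agree on minus the number they disagree on, and each diagonal entry is the number of items a user rated.

```r
# Hypothetical ratings, not the random matrix from the example above.
votes <- rbind(A = c(1, -1, 0,  1),
               B = c(1,  1, 0, -1))

# Entry [i, j] sums the element-wise products of rows i and j:
# +1 for each agreement, -1 for each disagreement, 0 if either did not rate.
agree <- votes %*% t(votes)
print(agree)
```

Here A and B agree on one item and disagree on two, so the off-diagonal entry is 1 - 2 = -1.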

  

Clustering senators by their roll call voting records:

The idea is the same as above: encode each senator's vote on each bill as yea, nay, or abstain, multiply the resulting matrix by its transpose to get a senator-by-senator matrix, compute the distance matrix, apply multidimensional scaling, and visualize the result.

Load data:

library(foreign)
library(ggplot2)
data.dir <- "ml_for_hackers/09-mds/data/roll_call/"
data.files <- list.files(data.dir)
# Read every Stata file in the directory into a list of data frames
rollcall.data <- lapply(data.files, function(f) read.dta(paste(data.dir, f, sep = ""), convert.factors = FALSE))
# Check the number of rows and columns
# dim(rollcall.data[[1]])

  

Simple preprocessing of the data: drop observations with very few votes, and simplify the vote codes. Codes 1, 2 and 3 become a yea vote (1); codes 4, 5 and 6 become a nay vote (-1); codes 7, 8, 9 and 0 become an abstention (0).

rollcall.simplified <- function(df) {
  # State code 99 is the Vice President; drop it because he casts few votes
  no.pres <- subset(df, state < 99)
  # Codes 1-3 become yea (1); codes 4-6 become nay (-1); codes 7, 8, 9, 0 become abstain (0)
  for (i in 10:ncol(no.pres)) {
    no.pres[, i] <- ifelse(no.pres[, i] > 6, 0, no.pres[, i])
    no.pres[, i] <- ifelse(no.pres[, i] > 0 & no.pres[, i] < 4, 1, no.pres[, i])
    no.pres[, i] <- ifelse(no.pres[, i] > 1, -1, no.pres[, i])
  }
  return(as.matrix(no.pres[, 10:ncol(no.pres)]))
}
rollcall.simple <- lapply(rollcall.data, rollcall.simplified)
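The three ifelse() steps depend on their order, which is easy to check on a toy vector. The sketch below applies the same three steps to the ten possible raw codes. The helper name recode.votes is hypothetical and is not part of the original code.

```r
# Hypothetical helper: the same three-step recode, applied to a plain vector.
recode.votes <- function(v) {
  v <- ifelse(v > 6, 0, v)          # 7, 8, 9 -> 0 (0 is already 0)
  v <- ifelse(v > 0 & v < 4, 1, v)  # 1, 2, 3 -> 1
  v <- ifelse(v > 1, -1, v)         # only 4, 5, 6 are still > 1, so they -> -1
  v
}

raw <- 0:9
recoded <- recode.votes(raw)
print(rbind(raw, recoded))
```

Because codes 2 and 3 have already been turned into 1 by the second step, the final `> 1` test only catches the nay codes 4 to 6.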

  

Computing the distance matrices and applying multidimensional scaling:

The MDS coordinates are multiplied by -1 purely to make the plot more intuitive: by convention the Democratic Party is shown on the left and the Republican Party on the right.

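# Distance matrix of the senator-by-senator matrix for each Congress
rollcall.dist <- lapply(rollcall.simple, function(m) dist(m %*% t(m)))
# Two-dimensional classical MDS, with the axes flipped by multiplying by -1
rollcall.mds <- lapply(rollcall.dist, function(d) as.data.frame((cmdscale(d, k = 2)) * -1))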
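Multiplying the coordinates by -1 only mirrors the picture; it does not change any distance between points, so the MDS result is just as valid. A minimal check in base R, using hypothetical random coordinates rather than the roll call data:

```r
set.seed(1)                            # arbitrary seed, only for reproducibility
coords <- matrix(rnorm(10), ncol = 2)  # hypothetical 2-D coordinates for five points

# Pairwise distances before and after flipping the sign of every coordinate
same <- all.equal(as.vector(dist(coords)), as.vector(dist(coords * -1)))
print(same)
```

The comparison should report that the two sets of distances are equal.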

  

Light post-processing of rollcall.mds to make the later plotting easier:

congresses <- 101:111
for (i in 1:length(rollcall.mds)) {
  names(rollcall.mds[[i]]) <- c("x", "y")
  congress <- subset(rollcall.data[[i]], state < 99)
  # For a consistent format, keep only the surname and store it in congress.names
  congress.names <- sapply(as.character(congress$name), function(n) strsplit(n, "[,]")[[1]][1])
  # Attach the names, turn party into a factor, and add the Congress number
  rollcall.mds[[i]] <- transform(rollcall.mds[[i]], name = congress.names, party = as.factor(congress$party), congress = congresses[i])
}
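The surname extraction can be tried on its own. The sketch below assumes names are stored in a "SURNAME, FIRST" form. The two names are hypothetical and are not taken from the roll call files.

```r
# Hypothetical names, assumed to be in "SURNAME, FIRST" form.
full.names <- c("SMITH, JOHN", "DOE, JANE")

# Split each name on the comma and keep the first piece (the surname).
surnames <- sapply(full.names, function(n) strsplit(n, "[,]")[[1]][1])
print(unname(surnames))
```

Note that sapply() on a character vector returns a named vector (the names are the original strings), which is why unname() is used before printing.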

  

Take the 110th Congress as an example and visualize its members. Note that rollcall.mds is a list whose indices start at 1, not 0, so the 110th Congress is element 10.

First create the ggplot object, which stores the shared settings, and then draw two plots from it: one that marks each senator with a point whose shape shows the party, and one that prints each senator's name.

cong.110 <- rollcall.mds[[10]]
# Party codes: 100 = Democrat, 200 = Republican, 328 = Independent
base.110 <- ggplot(cong.110, aes(x = x, y = y)) +
  scale_size(range = c(2, 2), guide = "none") +
  scale_alpha(guide = "none") +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), panel.grid.major = element_blank()) +
  ggtitle("Roll Call Vote MDS clustering for 110th U.S. Senate") +
  xlab("") +
  ylab("") +
  scale_shape(name = "Party", breaks = c("100", "200", "328"), labels = c("Dem.", "Rep.", "Ind."), solid = FALSE) +
  scale_color_manual(name = "Party", values = c("100" = "red", "200" = "blue", "328" = "black"),
                     breaks = c("100", "200", "328"), labels = c("Dem.", "Rep.", "Ind."))
# Plot 1: one point per senator, shape by party
print(base.110 + geom_point(aes(shape = party, alpha = 0.75, size = 2)))
# Plot 2: one name per senator, color by party
print(base.110 + geom_text(aes(color = party, alpha = 0.75, label = cong.110$name, size = 2)))

  

Plot all of the Congresses together for comparison (facet_wrap() draws a separate panel for each Congress):

# Stack the per-Congress data frames into one
all.mds <- do.call(rbind, rollcall.mds)
all.plot <- ggplot(all.mds, aes(x = x, y = y)) +
  geom_point(aes(shape = party, alpha = 0.75, size = 2)) +
  scale_size(range = c(2, 2), guide = "none") +
  scale_alpha(guide = "none") +
  theme_bw() +
  theme(axis.ticks = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(),
        panel.grid.major = element_blank()) +
  ggtitle("Roll Call Vote MDS clustering for U.S. Senate (101st-111th Congress)") +
  xlab("") +
  ylab("") +
  scale_shape(name = "Party", breaks = c("100", "200", "328"), labels = c("Dem.", "Rep.", "Ind."), solid = FALSE) +
  facet_wrap(~ congress)
all.plot

 

Note that although the points for the 101st Congress look closer together, this does not mean the two parties were undifferentiated: points with the same symbol (the same party) still cluster together, apart from the other party. The panel only looks more compact than the others because all 11 panels share the same axis scales. Nor are these panel-to-panel differences enough to conclude that the 101st Congress was less polarized; they may well be driven by other factors, such as the number of observations.

 
