9. MDS: Visually Exploring Senator Similarity
Clustering based on similarity: the main purpose of this chapter is to show how to measure the similarities and differences between observations, and how to understand the concept of distance.
Multidimensional scaling (MDS) is a technique for clustering observations based on a measure of distance between them. It visualizes the data using only the distances between all pairs of points.
The MDS process: the input is a distance matrix containing the distance between every pair of points in the dataset; the output is a set of coordinates that approximates those pairwise distances. When the number of dimensions is low, some information is lost, so the result is only an approximation.
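As a small sanity check of the "distance matrix in, coordinates out" idea, here is a minimal sketch on made-up data. The four points in `pts` are hypothetical and not from the book. Because they already lie in a plane, a 2-dimensional MDS should reproduce their pairwise distances essentially exactly.

```r
# Four hypothetical points in the plane (toy data, not from the book)
pts <- matrix(c(0, 0,
                3, 0,
                3, 4,
                0, 4),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D"), c("x", "y")))

# Input to MDS: the pairwise Euclidean distances
d <- dist(pts)

# The same distance for one pair (A and C), computed by hand
by.hand <- sqrt(sum((pts["A", ] - pts["C", ])^2))

# Output of MDS: coordinates in k = 2 dimensions
coords <- cmdscale(d, k = 2)

# Distances between the recovered coordinates, to compare with the input
recovered <- dist(coords)
max.error <- max(abs(as.vector(recovered) - as.vector(d)))
```

The recovered coordinates will generally not equal `pts` themselves, since MDS only preserves distances; the point cloud may come back shifted, rotated, or mirrored.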
The following is a simple example:
# Introduction to distance metrics and multidimensional scaling
# Randomly generate a "user"-by-"rating" matrix
set.seed(851982)
ex.matrix <- matrix(sample(c(-1, 0, 1), 24, replace = TRUE), nrow = 4, ncol = 6)
row.names(ex.matrix) <- c('A', 'B', 'C', 'D')
colnames(ex.matrix) <- c('P1', 'P2', 'P3', 'P4', 'P5', 'P6')

# Multiply the matrix by its own transpose to get the "user"-by-"user" matrix
ex.mult <- ex.matrix %*% t(ex.matrix)

# Distance matrix between the data points
ex.dist <- dist(ex.mult)

# Classical (metric) multidimensional scaling
ex.mds <- cmdscale(ex.dist)
plot(ex.mds, type = 'n')
text(ex.mds, c('A', 'B', 'C', 'D'))
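To see what multiplying the matrix by its transpose produces, here is a minimal sketch with a hand-written matrix. The `votes` values are hypothetical, chosen so the arithmetic is easy to follow. Each off-diagonal entry of the product is the number of items on which two users agree minus the number on which they disagree; abstentions (0) contribute nothing.

```r
# Hypothetical ratings: 1 = like, -1 = dislike, 0 = no opinion
votes <- matrix(c( 1,  1, -1,
                   1,  0, -1,
                  -1, -1,  1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("A", "B", "C"), c("P1", "P2", "P3")))

# User-by-user matrix: entry (i, j) is the dot product of rows i and j
agreement <- votes %*% t(votes)

# A and B agree on P1 and P3, and B has no opinion on P2: 1 + 0 + 1 = 2
a.b <- agreement["A", "B"]

# A and C disagree on all three items: -1 - 1 - 1 = -3
a.c <- agreement["A", "C"]

# Euclidean distances between the rows of the user-by-user matrix
user.dist <- dist(agreement)
```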
Clustering senators by their roll call voting records:
Following the same idea as above, we take each senator's yea, nay, and abstention votes on each bill, build the senator-by-senator matrix, compute the distance matrix, apply multidimensional scaling, and then visualize the result.
Load data:
library(foreign)
library(ggplot2)

data.dir <- "ml_for_hackers/09-mds/data/roll_call/"
data.files <- list.files(data.dir)
rollcall.data <- lapply(data.files, function(f)
  read.dta(paste(data.dir, f, sep = ""), convert.factors = FALSE))

# Check the number of rows and columns
# dim(rollcall.data[[1]])
Simple preprocessing of the data: drop the observation with very few votes (the Vice President), and simplify the vote codes: codes 1, 2, 3 become yea; codes 4, 5, 6 become nay; codes 7, 8, 9, 0 become abstention.
rollcall.simplified <- function(df) {
  # State code 99 is the Vice President; drop that row because he votes so rarely
  no.pres <- subset(df, state < 99)
  # Codes 1-3 become yea (1); codes 4-6 become nay (-1); codes 7, 8, 9, 0 become abstention (0)
  for (i in 10:ncol(no.pres)) {
    no.pres[, i] <- ifelse(no.pres[, i] > 6, 0, no.pres[, i])
    no.pres[, i] <- ifelse(no.pres[, i] > 0 & no.pres[, i] < 4, 1, no.pres[, i])
    no.pres[, i] <- ifelse(no.pres[, i] > 1, -1, no.pres[, i])
  }
  return(as.matrix(no.pres[, 10:ncol(no.pres)]))
}

rollcall.simple <- lapply(rollcall.data, rollcall.simplified)
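The three chained ifelse() calls are easy to misread, so here is a minimal sketch that applies the same three steps to every possible code from 0 to 9. The vector `codes` is made up for illustration; the step order matters, because codes 1 to 3 must be collapsed to 1 before the "greater than 1" test turns the remaining codes into -1.

```r
# Every possible raw vote code (illustrative input, not real data)
codes <- 0:9

# Step 1: codes 7, 8, 9 become 0 (0 is already 0)
step1 <- ifelse(codes > 6, 0, codes)

# Step 2: codes 1, 2, 3 become 1
step2 <- ifelse(step1 > 0 & step1 < 4, 1, step1)

# Step 3: whatever is still above 1 (codes 4, 5, 6) becomes -1
step3 <- ifelse(step2 > 1, -1, step2)

# Side-by-side view of raw and simplified codes
recode.table <- data.frame(code = codes, simplified = step3)
```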
Compute the distance matrices and run multidimensional scaling:
The MDS coordinates are multiplied by -1 purely for intuition: the Democratic Party is conventionally thought of as the left and the Republican Party as the right.
rollcall.dist <- lapply(rollcall.simple, function(m) dist(m %*% t(m)))
rollcall.mds <- lapply(rollcall.dist, function(d) as.data.frame((cmdscale(d, k = 2)) * -1))
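Multiplying the coordinates by -1 is safe because MDS only preserves distances, and mirroring every point through the origin leaves all pairwise distances unchanged. A minimal sketch on random toy data (the matrix `toy` is hypothetical and unrelated to the roll call files):

```r
# Five hypothetical points in two dimensions
set.seed(1)
toy <- matrix(rnorm(10), ncol = 2)

# MDS coordinates and their mirrored version
coords <- cmdscale(dist(toy), k = 2)
flipped <- coords * -1

# Pairwise distances are the same before and after the flip
same.distances <- isTRUE(all.equal(as.vector(dist(coords)),
                                   as.vector(dist(flipped))))
```

So the flip changes only which side of the plot each party lands on, not how far apart the senators are.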
Lightly process rollcall.mds to make the later plotting easier:
congresses <- 101:111

for (i in 1:length(rollcall.mds)) {
  names(rollcall.mds[[i]]) <- c("x", "y")
  congress <- subset(rollcall.data[[i]], state < 99)
  # For a uniform format, keep only the surname and store it in congress.names
  congress.names <- sapply(as.character(congress$name),
                           function(n) strsplit(n, "[, ]")[[1]][1])
  # Attach the names, convert party to a factor, and add the congress number
  rollcall.mds[[i]] <- transform(rollcall.mds[[i]],
                                 name = congress.names,
                                 party = as.factor(congress$party),
                                 congress = congresses[i])
}
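The surname extraction can be tried on its own. This is a minimal sketch in which the strings in `raw.names` are hypothetical, and the assumption that names are stored as "SURNAME, FIRSTNAME" or as a space-padded surname is mine rather than something checked against the data files.

```r
# Hypothetical name strings (assumed layout, not taken from the data)
raw.names <- c("SMITH, JOHN", "DOE      ", "LEE, ANN")

# Split on commas or spaces and keep the first piece, i.e. the surname
surnames <- sapply(raw.names, function(n) strsplit(n, "[, ]")[[1]][1])

# sapply() keeps the input strings as names; drop them for a plain vector
surnames <- unname(surnames)
```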
Take the 110th Congress as an example and visualize its senators. Note that R lists are indexed from 1, not 0, so with sessions 101 through 111 the 110th Congress is element 10 of rollcall.mds.
First, create a ggplot object that holds the shared settings, then draw two plots from it: one showing each senator as a point whose shape encodes party, the other showing each senator's name.
cong.110 <- rollcall.mds[[10]]

# Party codes in the data: 100 = Democrat, 200 = Republican, 328 = Independent
base.110 <- ggplot(cong.110, aes(x = x, y = y)) +
  scale_size(range = c(2, 2), guide = "none") +
  scale_alpha(guide = "none") +
  theme_bw() +
  theme(axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank()) +
  ggtitle("Roll Call Vote MDS Clustering for 110th U.S. Senate") +
  xlab("") +
  ylab("") +
  scale_shape(name = "Party", breaks = c("100", "200", "328"),
              labels = c("Dem.", "Rep.", "Ind."), solid = FALSE) +
  scale_color_manual(name = "Party",
                     values = c("100" = "red", "200" = "blue", "328" = "black"),
                     breaks = c("100", "200", "328"),
                     labels = c("Dem.", "Rep.", "Ind."))

# Plot 1: senators as points, with shape encoding party
print(base.110 + geom_point(aes(shape = party, alpha = 0.75, size = 2)))
# Plot 2: senators as surnames, with color encoding party
print(base.110 + geom_text(aes(color = party, alpha = 0.75, label = cong.110$name, size = 2)))
Draw every session and compare them side by side (facet_wrap() draws a separate panel for each Congress):
all.mds <- do.call(rbind, rollcall.mds)

all.plot <- ggplot(all.mds, aes(x = x, y = y)) +
  geom_point(aes(shape = party, alpha = 0.75, size = 2)) +
  scale_size(range = c(2, 2), guide = "none") +
  scale_alpha(guide = "none") +
  theme_bw() +
  theme(axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank()) +
  ggtitle("Roll Call Vote MDS Clustering for U.S. Senate (101st-111th Congress)") +
  xlab("") +
  ylab("") +
  scale_shape(name = "Party", breaks = c("100", "200", "328"),
              labels = c("Dem.", "Rep.", "Ind."), solid = FALSE) +
  facet_wrap(~ congress)

all.plot
Note that although the parties in the 101st Congress look closer together, this does not mean the two parties were undifferentiated: points with the same symbol (the same party) still cluster together, apart from the other party. The panel only "looks closer than the others" because all 11 panels share the same axis scale. At the same time, these panel-to-panel differences are not enough to conclude that the 101st Congress was less polarized; they are likely influenced by other factors, such as the number of observations.
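The shared-axis effect can be illustrated with toy numbers. In this minimal sketch the data frame `toy.mds` is entirely made up (it is not the real MDS output): the two sessions have the same relative separation between parties, but one has coordinates ten times larger, so on a shared axis the smaller one would look squeezed together.

```r
# Hypothetical MDS x-coordinates for two sessions (toy numbers, not real results)
toy.mds <- data.frame(
  x = c(-10, -8, 8, 10, -100, -80, 80, 100),
  party = rep(c("100", "100", "200", "200"), 2),
  congress = rep(c(101, 110), each = 4)
)

# Width of each session's point cloud; a shared axis must span the widest one
x.range <- tapply(toy.mds$x, toy.mds$congress, function(v) diff(range(v)))

# Gap between the party means, relative to that session's own width
relative.gap <- sapply(split(toy.mds, toy.mds$congress), function(df) {
  means <- tapply(df$x, df$party, mean)
  unname(abs(diff(means))) / diff(range(df$x))
})
```

If the goal is to inspect each session on its own terms, `facet_wrap()` also accepts `scales = "free"`, which gives every panel its own axes at the cost of making panels no longer directly comparable.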
[Reading notes] machine learning: Practical Case Analysis (9)