Machine Learning Quick Start (3)
Abstract: This article briefly describes how to use clustering on voting records to analyze the actual political leanings of U.S. senators.
Statement: The content of this article is not original; I translated and summarized it myself. Please cite the source when reprinting.
Source: https://www.dataquest.io/mission/60/clustering-basics
The linear regression and classification used in the previous two articles are both supervised machine learning: a model is trained on existing data and then used to predict unknown data. Unsupervised learning does not try to predict anything; instead, it looks for structure in the data. One important unsupervised method is clustering. Clustering algorithms group together data points that have similar features.
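To make the idea of grouping similar points concrete, here is a minimal sketch of k-means (Lloyd's algorithm) written from scratch. It is not the implementation used later in the article; the function name simple_kmeans, the toy points, and the choice of fixed starting centers are all assumptions made for illustration.

```python
import numpy as np

def simple_kmeans(points, init_centers, n_iter=10):
    """Toy k-means: returns (labels, centers). Illustrative only."""
    centers = np.array(init_centers, dtype=float)  # work on a copy
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

# Hypothetical data: three points near (1, 1) and three near (0, 0).
points = np.array([[1.0, 1.0], [1.0, 0.9], [0.9, 1.0],
                   [0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
# Assumption: start from the first and last point instead of a random draw.
labels, centers = simple_kmeans(points, init_centers=points[[0, -1]])
print(labels)
print(centers)
```

Starting from fixed centers keeps the toy example deterministic. Real implementations usually try several random starts, because k-means can settle into a poor grouping from an unlucky start.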
Raw data presentation
When the US Senate wants to pass a law, the senators vote on it. Most senators belong to one of two political parties, the Democrats and the Republicans. The data used here is the voting record of these senators. Each row represents one senator. The party column gives the senator's party (D stands for Democrat, R for Republican, and I for independent), and each column after the third represents the vote on one bill (1 stands for yes, 0 for no, and 0.5 for abstention).
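The layout described above can be pictured with a small stand-in table. Only the party column and the 1 / 0 / 0.5 encoding come from the article; the senator names, the name and state column labels, and the bill identifiers below are hypothetical.

```python
import pandas as pd

# Hypothetical rows in the layout described above:
# three non-numeric columns first, then one column per bill.
toy_votes = pd.DataFrame({
    "name": ["Senator A", "Senator B", "Senator C", "Senator D"],
    "party": ["D", "R", "I", "R"],
    "state": ["XX", "YY", "ZZ", "XX"],
    "bill_1": [1.0, 0.0, 1.0, 0.0],
    "bill_2": [1.0, 0.0, 0.5, 0.0],
})

# Keep only the numeric vote columns (everything after the third column).
vote_columns = toy_votes.iloc[:, 3:]
# Count how often each senator abstained (encoded as 0.5).
abstentions = (vote_columns == 0.5).sum(axis=1)
print(vote_columns)
print(abstentions)
```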
import pandas
# Load the voting records
votes = pandas.read_csv('114_congress.csv')
# Count the number of senators in each party
print(votes["party"].value_counts())
from sklearn.metrics.pairwise import euclidean_distances
# The first three columns are not numeric, so they must be excluded.
# Distance from the first senator to every senator, based on the vote columns
print(euclidean_distances(votes.iloc[[0], 3:], votes.iloc[:, 3:]))
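The Euclidean distance between two voting records is the square root of the sum of the squared differences between their votes. The sketch below checks that formula by hand against scikit-learn on two hypothetical records of four votes each; the vote values are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Two hypothetical voting records (1 = yes, 0 = no, 0.5 = abstention).
# euclidean_distances expects 2-D inputs, hence the double brackets.
a = np.array([[1.0, 0.0, 1.0, 0.5]])
b = np.array([[0.0, 0.0, 1.0, 1.0]])

# By hand: square the differences, sum them, take the square root.
manual = np.sqrt(((a - b) ** 2).sum())
# Same computation through scikit-learn.
library = euclidean_distances(a, b)[0, 0]

print(manual, library)
```

The two senators differ on the first vote (by 1) and the last vote (by 0.5), so the squared differences sum to 1.25 and the distance is its square root.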
import pandas as pd
from sklearn.cluster import KMeans
# The n_clusters parameter specifies the number of groups; random_state=1 makes the result reproducible.
kmeans_model = KMeans(n_clusters=2, random_state=1)
# Use the fit_transform() method to train the model
senator_distances = kmeans_model.fit_transform(votes.iloc[:, 3:])
The result is an ndarray in which each row represents a senator. The first column is the distance from that senator to the center of the first cluster, and the second column is the distance to the center of the second cluster.
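As a quick check of this description, the sketch below runs KMeans on a tiny hypothetical vote matrix and compares the output of fit_transform() with distances computed by hand from cluster_centers_. The toy matrix is made up, and n_init=10 is set explicitly only to keep behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical vote matrix: 4 senators x 3 bills.
toy = np.array([[1.0, 1.0, 1.0],
                [1.0, 1.0, 0.5],
                [0.0, 0.0, 0.0],
                [0.0, 0.5, 0.0]])

model = KMeans(n_clusters=2, random_state=1, n_init=10)
distances = model.fit_transform(toy)  # shape: (n_senators, n_clusters)

# Recompute by hand: distance from every row to every cluster center.
manual = np.linalg.norm(
    toy[:, None, :] - model.cluster_centers_[None, :, :], axis=2)

print(distances.shape)
print(np.allclose(distances, manual, atol=1e-6))
```

Each senator's cluster label is simply the column with the smaller distance in that row.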
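labels = kmeans_model.labels_
# Cross-tabulate cluster labels against party membership
print(pd.crosstab(labels, votes["party"]))
# Democrats who were assigned to the second cluster (label 1)
democratic_outliers = votes[(labels == 1) & (votes["party"] == "D")]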
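To see what the cross-tabulation and the outlier filter produce, here is a small sketch on made-up labels and parties. The six senators and their cluster assignments are hypothetical and are not results from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical party memberships and cluster labels for six senators.
toy_party = pd.Series(["D", "D", "D", "R", "R", "I"], name="party")
toy_labels = np.array([0, 0, 1, 1, 1, 0])

# Rows are cluster labels, columns are parties, cells are counts.
table = pd.crosstab(toy_labels, toy_party)
print(table)

# Senators who are Democrats but landed in cluster 1.
outliers = toy_party[(toy_party == "D") & (toy_labels == 1)]
print(outliers.index.tolist())
```

Which cluster number ends up holding which party is arbitrary, so on real data it is worth reading the table before deciding which label marks the outliers.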
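import matplotlib.pyplot as plt
# Plot each senator by distance to the two cluster centers, colored by cluster label
plt.scatter(x=senator_distances[:, 0], y=senator_distances[:, 1], c=labels)
plt.show()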
extremism = (senator_distances ** 3).sum(axis=1)
votes["extremism"] = extremism
# Sort in descending order of extremism
votes.sort_values("extremism", inplace=True, ascending=False)
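One plausible reading of why the distances are cubed before summing: a senator who sits between the two clusters has moderate distances to both, and a plain sum can rank that senator above one who is very far from a single cluster. Cubing stretches the large distances so the latter comes out on top. The sketch below illustrates this with two hypothetical distance rows; the numbers and the names "partisan" and "moderate" are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical distances to the two cluster centers.
# Row 0: very far from one cluster, very close to the other.
# Row 1: moderately far from both.
toy_distances = np.array([[3.4, 0.24],
                          [2.6, 2.0]])

plain_sum = toy_distances.sum(axis=1)         # ranks the moderate higher
cubed_sum = (toy_distances ** 3).sum(axis=1)  # ranks the partisan higher

toy = pd.DataFrame({"name": ["partisan", "moderate"],
                    "plain": plain_sum,
                    "extremism": cubed_sum})
# sort_values returns a new, sorted DataFrame here.
ranked = toy.sort_values("extremism", ascending=False)
print(ranked)
```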
Summary
Clustering is a powerful method for finding structure in data. When supervised machine learning methods are not making progress, you can try unsupervised learning methods. In general, exploring the data with unsupervised learning is a good first step before applying supervised learning methods.