Machine Learning Quick Start (3)

Source: Internet
Author: User

Abstract: This article briefly describes how to use clustering to analyze the actual political leanings of US senators based on their voting records.

Statement: The content of this article is not original; I translated and summarized it myself. Please cite the source when reprinting.

Source of this article's content: https://www.dataquest.io/mission/60/clustering-basics

The linear regression and classification used in the previous two articles are both supervised machine learning: a model is trained on existing data and then used to predict unknown data. Unsupervised learning does not try to predict anything; it looks for structure in the data. An important unsupervised method is clustering. Clustering algorithms group together data points that have similar features.
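
To make the idea of grouping similar points concrete, here is a minimal sketch of the k-means procedure on one-dimensional data. The function name kmeans_1d, the sample values, and the starting centers are all made up for illustration; real libraries use more careful initialization and stopping rules.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for 1-D data, starting from the given centers."""
    centers = list(centers)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [v for v, a in zip(values, assignments) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = sum(members) / len(members)
    return assignments, centers


# Hypothetical data: two obvious groups, one near 1 and one near 5.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
assignments, centers = kmeans_1d(values, centers=[0.0, 6.0])
print(assignments)
print(centers)
```

No labels are supplied anywhere: the groups come only from how close the values are to each other.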

Raw data presentation

When the US Senate wants to pass a law, the senators vote on it. Most senators belong to one of two parties, the Democrats and the Republicans. The data used here is the voting record of these senators. Each row describes one senator. The party column gives the senator's party: D stands for Democrat, R stands for Republican, and I stands for independent. The first three columns are not numeric; each column from the fourth onward records the vote on one bill, where 1 means yes, 0 means no, and 0.5 means abstention.

import pandas
votes = pandas.read_csv('114_congress.csv')

# Count how many senators belong to each party.
print(votes["party"].value_counts())

from sklearn.metrics.pairwise import euclidean_distances
# The first three columns are not numeric, so they must be excluded.
# Distance from the first senator to every senator, based on the vote columns.
print(euclidean_distances(votes.iloc[[0], 3:], votes.iloc[:, 3:]))
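
The Euclidean distance used here is the square root of the sum of squared differences between two vote vectors. The following is a minimal sketch that computes it by hand; the helper name euclidean_distance and the two vote lists are hypothetical and are not taken from the data file.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vote vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical votes on four bills (1 = yes, 0 = no, 0.5 = abstain).
senator_a = [1, 0, 1, 0.5]
senator_b = [0, 0, 1, 1]

distance = euclidean_distance(senator_a, senator_b)
print(distance)  # square root of 1.25
```

Senators who vote the same way on every bill have a distance of 0; the more their votes differ, the larger the distance.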

import pandas as pd
from sklearn.cluster import KMeans
# The n_clusters parameter specifies the number of clusters.
# random_state=1 makes the result reproducible.
kmeans_model = KMeans(n_clusters=2, random_state=1)
# Use the fit_transform() method to train the model.
senator_distances = kmeans_model.fit_transform(votes.iloc[:, 3:])

fit_transform() returns an ndarray. Each row represents a senator. The first column is the distance between that senator and the center of the first cluster, and the second column is the distance to the center of the second cluster.
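
As a rough check of what these distances mean, the sketch below fits KMeans on a tiny made-up vote matrix (not the senate data) and recomputes the distances by hand from cluster_centers_. The array X and the explicit n_init=10 are assumptions for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical votes of six members on three bills.
X = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.5, 0.0],
    [1.0, 1.0, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 1.0],
])

model = KMeans(n_clusters=2, random_state=1, n_init=10)
distances = model.fit_transform(X)

# Recompute the same matrix by hand: one row per member, one column per center.
manual = np.linalg.norm(
    X[:, None, :] - model.cluster_centers_[None, :, :], axis=2
)

print(distances.shape)
print(np.allclose(distances, manual))
```

Each member is assigned to the cluster whose column holds the smaller distance.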

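# labels_ holds the cluster assigned to each senator.
labels = kmeans_model.labels_
# Cross-tabulate cluster against party to see how well they line up.
print(pd.crosstab(labels, votes["party"]))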
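
To show how to read such a cross-tabulation, here is a minimal sketch with hand-written cluster labels and parties. The values are invented for illustration and say nothing about the real result.

```python
import pandas as pd

# Hypothetical cluster labels and parties for six members.
labels = [0, 0, 1, 1, 1, 0]
party = pd.Series(["D", "D", "R", "R", "D", "I"], name="party")

# Rows are clusters, columns are parties, cells are member counts.
table = pd.crosstab(pd.Series(labels, name="cluster"), party)
print(table)
```

In this toy table, a D counted in the row dominated by R members is a member who votes more like the other party.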

# Democrats who were placed in cluster 1 instead of with most of their party.
democratic_outliers = votes[(labels == 1) & (votes["party"] == "D")]
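
The filter above combines two boolean conditions element by element. The sketch below shows the same pattern on a small made-up table; the names and labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical members and the clusters they were assigned to.
votes = pd.DataFrame({
    "name": ["Member 1", "Member 2", "Member 3", "Member 4"],
    "party": ["D", "D", "R", "D"],
})
labels = np.array([0, 1, 1, 1])

# == compares element by element; & requires both conditions to hold.
mask = (labels == 1) & (votes["party"] == "D")
democratic_outliers = votes[mask]
print(democratic_outliers)
```

Note that == is a comparison while a single = is an assignment, and each condition needs its own parentheses because & binds more tightly than ==.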

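import matplotlib.pyplot as plt
# Plot each senator by distance to the two cluster centers, colored by cluster.
plt.scatter(x=senator_distances[:, 0], y=senator_distances[:, 1], c=labels)
plt.show()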

# Cube the distances so that being far from one cluster counts much more.
extremism = (senator_distances ** 3).sum(axis=1)
votes["extremism"] = extremism
# Sort in descending order of extremism.
votes.sort_values("extremism", inplace=True, ascending=False)
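
Cubing matters because a plain sum of distances cannot tell a moderate from an extremist. The sketch below uses two made-up distance pairs with the same plain sum; the helper name extremism_score is hypothetical.

```python
def extremism_score(distances):
    """Sum of cubed distances to the cluster centers."""
    return sum(d ** 3 for d in distances)


# Hypothetical distances to the two cluster centers.
moderate = [3.0, 3.0]  # equally far from both clusters
extreme = [1.0, 5.0]   # very close to one cluster, far from the other

print(sum(moderate), sum(extreme))  # plain sums are identical
print(extremism_score(moderate), extremism_score(extreme))
```

Both members have the same total distance, but after cubing, the member who sits far from one cluster scores much higher.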

 

Summary

Clustering is a powerful way to discover structure in data. When supervised machine learning methods are not making progress, you can try unsupervised methods. Exploring the data with unsupervised learning before applying supervised methods is generally a good starting point.
