K-Means Clustering Example with Python and Scikit-learn


https://www.pythonprogramming.net/flat-clustering-machine-learning-python-scikit-learn/

Unsupervised Machine Learning: Flat Clustering - K-Means Clustering Example with Python and Scikit-learn



This series concerns "unsupervised machine learning." The difference between supervised and unsupervised machine learning is whether or not we, the scientists, provide the machine with labeled data.

Unsupervised machine learning is where the scientist does not provide the machine with labeled data, and the machine is expected to derive structure from the data all on its own.

There are many forms of this, though the main form of unsupervised machine learning is clustering. Within clustering, you have "flat" clustering and "hierarchical" clustering.

Flat Clustering

Flat clustering is where the scientist tells the machine how many categories to cluster the data into.

Hierarchical Clustering

Hierarchical clustering is where the machine is allowed to decide how many clusters to create based on its own algorithms.

This page covers a flat clustering example, and the next tutorial covers a hierarchical clustering example.

Now, what can we use unsupervised machine learning for? In general, unsupervised machine learning can actually solve the exact same problems as supervised machine learning, though it may not be as efficient or accurate.

Unsupervised machine learning is most often applied to questions of underlying structure. Genomics, for example, is an area where we do not truly understand the underlying structure. Thus, we use unsupervised machine learning to help us figure out the structure.

Unsupervised learning can also aid in "feature reduction." A term we will cover eventually here is "Principal Component Analysis," or PCA, which is another form of feature reduction, used frequently with unsupervised machine learning. PCA attempts to locate linearly uncorrelated variables, calling these the principal components, since these are the more "unique" elements that differentiate or describe whatever the object of the analysis is.
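
As a quick, hedged illustration of feature reduction with PCA (the data and variable names below are invented for this sketch, not taken from the tutorial):

import numpy as np
from sklearn.decomposition import PCA

# A tiny made-up feature matrix: 6 samples, 4 partly redundant features.
features = np.array([[1.0, 2.0, 1.1, 2.1],
                     [5.0, 8.0, 5.2, 8.1],
                     [1.5, 1.8, 1.4, 1.9],
                     [8.0, 8.0, 7.9, 8.2],
                     [1.0, 0.6, 1.1, 0.5],
                     [9.0, 11.0, 9.1, 10.9]])

# Keep only the top 2 principal components (feature reduction).
pca = PCA(n_components=2)
reduced = pca.fit_transform(features)

print(reduced.shape)                  # (6, 2)
print(pca.explained_variance_ratio_)  # fraction of variance each component keeps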

There is also a meshing of supervised and unsupervised machine learning, often called semi-supervised machine learning. You will often find that things get more complicated with real world examples. You might find, for example, that first you want to use unsupervised machine learning for feature reduction, and then you will shift to supervised machine learning once you have used, for example, flat clustering to group your data into clusters, which are now going to be your labels for supervised learning.

What might an actual example be? How about trying to differentiate between male and female faces. We are not already sure what the defining differences between a male and a female face are, so we turn to unsupervised machine learning first. Our hopeful goal is to create an algorithm that naturally groups the faces into two groups. We are likely to use flat clustering for this, and then we will likely test the algorithm to see if it is indeed accurate, using labeled data only for testing, not for training. If we find the machine was successful, we now actually have our labels and features for a male face. We can then use PCA, or maybe we already did, either way, to try to get the feature count down. Once we have done this, we use these labels with their principal components as features, which we can then feed into a supervised machine learning algorithm for actual future identification.
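
As a rough, hedged sketch of that pipeline (the data, the classifier choice, and the names here are my own, purely for illustration): cluster the unlabeled points, treat the cluster assignments as labels, then train a supervised classifier on them.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Invented 2D "feature" data standing in for the real face features.
points = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]])

# Step 1: unsupervised - let flat clustering propose two groups.
pseudo_labels = KMeans(n_clusters=2).fit_predict(points)

# Step 2: supervised - train a classifier on the cluster assignments.
clf = SVC()
clf.fit(points, pseudo_labels)

# Step 3: classify a new, unseen point.
print(clf.predict([[0.8, 1.1]]))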

Now that you know some of the uses and some key terms, let's see an actual example with Python and the Scikit-learn (sklearn) module.

Don't have Python or sklearn?

Python is a programming language, and the language this entire website covers tutorials on. If you need Python, click on the link to python.org and download the latest version of Python.

Scikit-learn (sklearn) is a popular machine learning module for the Python programming language.

The Scikit-learn module depends on Matplotlib, SciPy, and NumPy as well. You can use pip to install all of these once you have Python.
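
For example, assuming the standard PyPI package names, everything can usually be installed with a single command:

pip install numpy scipy matplotlib scikit-learn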

Don't know what pip is or how to install modules?

Pip is probably the easiest way to install packages. Once you install Python, you should be able to open your command prompt, like cmd.exe on Windows or bash on Linux, and type:

pip install scikit-learn

Having trouble still? No problem, there's a tutorial for that: the pip install Python modules tutorial.

If you're still having trouble, feel free to contact us using the "contact" link in the footer of this website.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from sklearn.cluster import KMeans

Here, we're just doing our basic imports. We're importing NumPy, which is a useful number crunching module, then Matplotlib for graphing, and then KMeans from sklearn.

Confused about imports and modules? (other than "KMeans")

If you're confused about imports, you may need to first run through the Python 3 Basics series, or specifically the module import syntax tutorial.



The KMeans import from sklearn.cluster is a reference to the K-Means clustering algorithm. The general idea of clustering is to cluster data points together using various methods. You can probably guess that K-Means has something to do with means. What ends up happening is that a centroid, or prototype point, is identified, and data points are "clustered" into their groups by the centroid they are closest to.
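
To make that assign-then-update idea concrete, here is a minimal, hedged sketch of the K-Means iteration in plain NumPy, ignoring edge cases such as empty clusters; it illustrates the idea rather than scikit-learn's actual implementation:

import numpy as np

points = np.array([[1, 2], [5, 8], [1.5, 1.8], [8, 8], [1, 0.6], [9, 11]], dtype=float)

# Pick 2 of the data points as starting centroids.
centroids = points[np.random.choice(len(points), 2, replace=False)]

for _ in range(10):
    # Assignment step: each point joins the cluster of its nearest centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points.
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(centroids)
print(labels)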

Clusters being called cells, or Voronoi cells, and references to Lloyd's algorithm

One of the things that makes any new topic confusing is a lot of complex-sounding terms. I do my best to keep things simple, but not everyone is as kind as me.

Some people will refer to the style of clusters you wind up seeing as "Voronoi" cells. Usually the "clusters" have defining "edges" to them that, when shaded or colored in, look like geometrical polygons, or cells, like this:

The K-means algorithm gets its origins from "Lloyd's algorithm," which basically does the exact same thing.


x = [1, 5, 1.5, 8, 1, 9]
y = [2, 8, 1.8, 8, 0.6, 11]

plt.scatter(x, y)
plt.show()

This block of code is not required for machine learning. What we're doing is plotting and visualizing our data before feeding it into the machine learning algorithm.

Running the code up to this point will provide you with the following graph:

This is the same set of data and graph that we used for our Support Vector Machine / Linear SVC example with supervised machine learning.

You can probably look at this graph and group this data all on your own. Imagine if this graph were 3D. It would be a little harder. Now imagine this graph were 50-dimensional. Suddenly you're immobilized!

In the supervised machine learning example with this data, we were allowed to feed this data to the machine along with labels. Thus, the lower left group had a label, and the upper right group did too. Our task there was to then accept future points and properly group them according to those groups. Easy enough.

Our task here, however, is a bit different. We do not have labeled data, and we want the machine to figure out on its own how it needs to group the data.

For now, since we're doing flat clustering, our task is a bit easier since we can tell the machine that we want the data categorized into two groups. Still, however, how might you do this?

K-Means approaches the problem by finding similar means: it repeatedly tries to find centroids that match with the least variance within groups.

This repeated trying ends up leaving this algorithm with fairly poor performance, though performance is an issue with all machine learning algorithms. This is why it is usually suggested that you use a highly streamlined and efficient algorithm that has already been heavily tested, rather than creating your own.

You also have to decide, as the scientist, how highly you value precision as compared to speed. There is always a balance between precision and speed/performance. More on this later, however. Moving on with the code:
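
As a hedged aside, not something this tutorial tunes: scikit-learn's KMeans exposes a few parameters that trade precision against speed, such as how many random restarts to attempt (n_init), how many iterations each restart may run (max_iter), and the convergence tolerance (tol).

from sklearn.cluster import KMeans

# More restarts and iterations generally give better centroids at the cost
# of compute; a looser tolerance stops each run earlier.
kmeans = KMeans(n_clusters=2, n_init=10, max_iter=300, tol=1e-4)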

X = np.array([[1, 2],
              [5, 8],
              [1.5, 1.8],
              [8, 8],
              [1, 0.6],
              [9, 11]])

Here, we're simply converting our data to a NumPy array. See the video if you're confused. You should see that each of the brackets here holds the same x, y coordinates as before. We're doing this because a NumPy array of features is what Scikit-learn/sklearn expects.

kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

Here, we initialize kmeans to be the KMeans algorithm (flat clustering), with the required parameter of how many clusters we want (n_clusters).

Next, we use .fit() to fit the data (learning).

Next, we're grabbing the values found for the centroids, based on the fitment, as well as the labels assigned to each data point.

Here is the "labels" here is labels, the machine have assigned on its own, and same with the CENTROIDS.

Now we're going to actually plot and visualize the machine's findings based on our data and the fitment, according to the number of clusters we said to find.

Colors= ["g.","r.","c.","y."]ForIInRange(Len(X)): Print("coordinate:",X[I], "label:",Labels[I])Plt.Plot(X[I][0],X[I][1],Colors[Labels[I]],Markersize= 10)Plt.scatter (centroids[:, 0],centroids[:, 1], marker =  "x" , S=150 , linewidths = 5, ZOrder = 10 plt. ()               

The above code is all "visualization" code, having nothing more to do with machine learning than just showing us some results.

first, we have a "colors" List. This list would be used to being iterated through to get some custom colors for the resulting graph. Just a nice box of colors to Use.

We only need two colors at first, but soon we're going to ask the machine to classify into other numbers of groups just for learning purposes, so I decided to put four choices here. The period after the letters is just the type of "plot marker" to use.

Now, we're using a for loop to iterate through our plots.

If you're confused about the for loop, you may need to first run through the Python 3 Basics series, or specifically the for loop basics tutorial.

If you're confused about the actual code being used, especially with iterating through this loop, or the scatter plotting code slices that look like this: [:, 0], then check out the video. I explain them there.

The resulting graph, after the code above, just shows the points, and should look like this:

Do you see the Voronoi cells? I hope not, we didn't draw them. Remember, those are the polygons that mark the divisions between clusters. Here, we have each of the points marked, by color, with the group it belongs to. We have also marked the centroids as big blue "x" shapes.

As we can see, the machine was very successful! Now, I encourage you to play with the n_clusters variable. First, decide how many clusters you will use, then try to predict where the centroids will be.

We do this in the video, choosing 3 and 4 clusters. It's relatively easy to predict with these points if you understand how the algorithm works, and it makes for a good learning exercise.
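
If you want to try several cluster counts in one go, a minimal sketch, reusing the X, colors, and imports defined above, might look like this (the outer loop and the title call are my own additions, not part of the original script):

for n in (2, 3, 4):
    kmeans = KMeans(n_clusters=n)
    kmeans.fit(X)
    centroids = kmeans.cluster_centers_
    labels = kmeans.labels_

    # Re-plot the points colored by their new cluster assignments.
    for i in range(len(X)):
        plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
    plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
    plt.title("n_clusters = %d" % n)
    plt.show()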

That's it for my flat clustering example for unsupervised learning. What about hierarchical clustering? That's next.
