K-means Clustering: How to Use the K-means Clustering Algorithm

Source: Internet
Author: User

Prerequisites

Domain expertise: none required

Professional experience: no industry experience required

Knowledge of machine learning is not required, but the reader should be familiar with basic data analysis (e.g., descriptive statistics). To work through the example, the reader should also be familiar with Python.

Introduction to K-means Clustering

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of the K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:

1. The centroids of the K clusters, which can be used to label new data

2. Labels for the training data (each data point is assigned to a single cluster)

Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The "Select K" section below describes how the number of groups can be determined.

Each centroid of a cluster is a collection of feature values which define the resulting groups. Examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents.

This introduction to the K-means clustering algorithm covers:

    • Common business cases where K-means is used

    • The steps involved in running the algorithm

    • A Python example using delivery fleet data

Business uses

The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist, or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can easily be assigned to the correct group.

This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:

    • Behavioral segmentation

      • Segment by purchase history

      • Segment by activity on an application, website, or platform

      • Define personas based on interests

      • Create profiles based on activity monitoring

    • Inventory categorization

      • Group inventory by sales activity

      • Group inventory by manufacturing metrics

    • Sorting sensor measurements

      • Detect activity types in motion sensors

      • Group images

      • Separate audio

      • Identify groups in health monitoring

    • Detecting bots or anomalies

      • Separate valid activity groups from bots

      • Group valid activity to clean up outlier detection

In addition, monitoring whether a tracked data point switches between groups over time can be used to detect meaningful changes in the data.

Algorithm

The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm inputs are the number of clusters K and the data set. The data set is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set. The algorithm then iterates between the following two steps:

1. Data assignment step

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if c_i is a centroid in the collection C, then each data point x is assigned to a cluster based on

    argmin_{c_i ∈ C} dist(c_i, x)²

where dist(·) is the standard (L2) Euclidean distance. Let S_i denote the set of data point assignments for the ith cluster centroid.

2. Centroid update step

In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to each centroid's cluster:

    c_i = (1 / |S_i|) Σ_{x ∈ S_i} x

The algorithm iterates between steps 1 and 2 until a stopping criterion is met (e.g., no data points change clusters, the sum of the distances is minimized, or a maximum number of iterations is reached).

This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e., not necessarily the best possible outcome), so assessing multiple runs of the algorithm with randomized starting centroids may give a better outcome.
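The two steps above can be sketched as a minimal NumPy implementation. This is illustrative only (in practice you would typically use a library such as scikit-learn, as the example later in this article does):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by randomly selecting k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 1 (data assignment): assign each point to the nearest
        # centroid by squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2 (centroid update): recompute each centroid as the mean
        # of the points assigned to it (keep old centroid if cluster is empty).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stopping criterion: centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Because the result can be a local optimum, running this several times with different seeds and keeping the run with the lowest total distance is a common refinement.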

Select K

The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no exact method for determining K, but an accurate estimate can be obtained using the following techniques.

One of the metrics commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K equals the number of data points. Thus, this metric cannot be used as the sole target. Instead, the mean distance to the centroid is plotted as a function of K, and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine K.

A number of other techniques exist for validating K, including cross-validation, information criteria, the information-theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each K.
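The elbow technique can be sketched as follows, using scikit-learn's inertia_ attribute (the sum of squared distances to the nearest centroid) on synthetic stand-in data with three well-separated groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 5, 10)])

# Mean squared distance to the assigned centroid for each candidate K.
scores = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = km.inertia_ / len(X)

# The metric always decreases as K grows; the "elbow," where the rate of
# decrease levels off sharply, suggests the number of clusters (here, 3).
for k, s in scores.items():
    print(k, round(s, 3))
```

Plotting scores against K (e.g., with matplotlib) makes the elbow easier to spot visually.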

Example: applying K-means Clustering to distribution fleet data

As an example, we'll show how the K-means algorithm works with a sample data set of delivery fleet driver data. For the sake of simplicity, we'll only be looking at two driver features: mean distance driven per day and the mean percentage of time a driver was >5 mph over the speed limit. In general, this algorithm can be used for any number of features, so long as the number of data samples is much greater than the number of features.

Step 1: Clean and transform the data

For this example, we've already cleaned up the data and done some simple transformations. A sample of the data as a pandas DataFrame is shown below.

The chart below shows the data set for 4,000 drivers, with the distance feature on the x-axis and the speeding feature on the y-axis.
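The original data sample is not reproduced here, but a hypothetical stand-in with a similar shape can be built as follows (the column names mean_dist_day and pct_speeding are illustrative assumptions, not the original schema):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned fleet data: 4,000 drivers with the
# two features discussed above (column names are illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mean_dist_day": np.concatenate([rng.normal(50, 10, 2000),     # urban-like
                                     rng.normal(180, 20, 2000)]),  # rural-like
    "pct_speeding": np.concatenate([rng.normal(9, 3, 2000),
                                    rng.normal(18, 4, 2000)]).clip(0, 100),
})
print(df.describe())
```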

Step 2: Select K and run the algorithm

Start by choosing K=2. For this example, use the Python packages scikit-learn and NumPy for the computations, as follows:
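A minimal sketch of this step might look like the following (the small X array here is an illustrative stand-in for the full 4,000-driver feature matrix):

```python
import numpy as np
from sklearn.cluster import KMeans

# X holds one row per driver: [mean distance per day, pct time speeding].
# A tiny illustrative sample stands in for the full data set.
X = np.array([
    [45.0,   8.0], [52.0,  10.0], [48.0,   7.5],   # urban-like drivers
    [175.0, 16.0], [185.0, 19.0], [178.0, 18.0],   # rural-like drivers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.labels_)           # cluster assignment for each driver
```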

The cluster labels are returned with kmeans.labels_.

Step 3: Review the results

The results are shown below. You can see that, based on the distance feature, the K-means algorithm has split the data into two groups. Each cluster centroid is marked with a star.

    • Group 1 centroid = (49.62, 8.72)

    • Group 2 centroid = (179.73, 17.95)

Using domain knowledge of the data set, we can infer that Group 1 is urban drivers and Group 2 is rural drivers.

Step 4: Iterate over values of K

Now test the results for K=4. To do this, all that needs to change is the number of clusters in the KMeans() function.
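A sketch of that change, reusing the scikit-learn call from the previous step (the four-group synthetic data here is an illustrative stand-in for the fleet data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in feature matrix with four plausible driver groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([50, 5],   [8, 1.5],  (100, 2)),   # urban, within limit
    rng.normal([50, 20],  [8, 3.0],  (100, 2)),   # urban, speeding
    rng.normal([180, 8],  [15, 2.0], (100, 2)),   # rural, within limit
    rng.normal([180, 30], [15, 4.0], (100, 2)),   # rural, speeding
])

# Only the number of clusters changes compared to the K=2 run.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
```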

The chart below shows the resulting clusters. We see that the algorithm has identified four distinct groups; in addition to the rural vs. urban split, drivers who speed are now separated from those who follow the speed limit. The speeding threshold of the urban driving group is lower than that of the rural drivers, likely because urban drivers spend more time at intersections and in stop-and-go traffic.

Additional Notes and alternatives

Feature Engineering

Feature engineering is the process of using domain knowledge to choose which data metrics to input as features into a machine learning algorithm. Feature engineering plays a key role in K-means clustering; using meaningful features that capture the variability of the data is essential for the algorithm to find all of the naturally occurring groups.

Categorical data (i.e., category labels such as gender, country, or browser type) needs to be encoded or separated in a way that can still work with the algorithm.
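One common encoding approach is one-hot encoding with pandas (the column names here are hypothetical):

```python
import pandas as pd

# Illustrative records with one numeric and one categorical feature.
df = pd.DataFrame({
    "mean_dist_day": [45.0, 180.0, 52.0],
    "browser":       ["chrome", "firefox", "chrome"],
})

# One-hot encode the categorical column so K-means receives numeric input:
# each browser value becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["browser"])
print(encoded.columns.tolist())
```

Note that one-hot encoded features interact with Euclidean distance differently than continuous features, so scaling the features afterward is often worthwhile.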

Feature transformations, particularly to represent rates rather than raw measurements, can help to normalize the data. For example, in the delivery fleet example above, if total distance driven had been used rather than mean distance driven per day, then drivers would have been grouped by how long they had been with the company rather than by rural vs. urban.

Alternatives

A number of alternative clustering algorithms exist, including DBSCAN, spectral clustering, and modeling with Gaussian mixtures. A dimensionality reduction technique, such as principal component analysis, can be used to separate groups of patterns in the data. One possible outcome is that there are no organic clusters in the data; instead, all of the data falls along the continuous feature ranges of one single group. In this case, you may need to revisit the data features to see if different measurements or feature transformations are needed to better represent the variability in the data. In addition, you may want to impose categories or labels based on domain knowledge and modify your analysis approach accordingly.
