Prerequisites
Specific subject-matter experience: none required
Professional experience: no industry experience required
Knowledge of machine learning is not required, but readers should be familiar with basic data analysis (e.g., descriptive statistics). To follow along with the example, readers should also be familiar with Python.
Introduction to K-means Clustering
K-means clustering is an unsupervised learning algorithm, used when you have unlabeled data (i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of the K groups based on the features provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze groups that have formed organically. The "Select K" section below describes how the number of groups can be determined.
Each centroid of a cluster is a collection of feature values that define the resulting group. Examining the centroid feature weights can be used to qualitatively interpret what kind of group each cluster represents.
This overview of the K-means clustering algorithm covers:
Common business cases where K-means is used
The steps involved in running the algorithm
A Python example using delivery fleet data
Business uses
The K-means clustering algorithm is used to find groups that have not been explicitly labeled in the data. It can be used to confirm business assumptions about what types of groups exist, or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can easily be assigned to the correct group.
This is a versatile algorithm that can be used for any kind of grouping.
In addition, monitoring whether a tracked data point switches between groups over time can be used to detect meaningful changes in the data.
Algorithm
The K-means clustering algorithm uses iterative refinement to produce a final result. The algorithm's inputs are the number of clusters K and the data set. The data set is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set. The algorithm then iterates between two steps:
1. Data assignment step:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if $c_i$ is one of the centroids in the set $C$, then each data point $x$ is assigned to a cluster based on

$$\underset{c_i \in C}{\arg\min}\; \mathrm{dist}(c_i, x)^2$$

where $\mathrm{dist}(\cdot)$ is the standard ($L_2$) Euclidean distance. Let $S_i$ denote the set of data points assigned to the $i$th cluster centroid.
2. Centroid update step:
In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to each centroid's cluster:

$$c_i = \frac{1}{|S_i|} \sum_{x_i \in S_i} x_i$$
The algorithm iterates between steps 1 and 2 until a stopping criterion is met (for example, no data points change clusters, the sum of distances is minimized, or a maximum number of iterations is reached).
The algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e., not necessarily the best possible outcome), so assessing multiple runs of the algorithm with randomized starting centroids may give a better result.
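The two-step iteration above can be sketched in plain NumPy. The data set and K below are illustrative assumptions (two synthetic blobs), not the fleet data used later in this post:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # synthetic blob around (0, 0)
               rng.normal(5, 1, (50, 2))])   # synthetic blob around (5, 5)
K = 2

# Initialize centroids by randomly selecting K data points.
centroids = X[rng.choice(len(X), K, replace=False)]

for _ in range(100):
    # Step 1 (data assignment): assign each point to its nearest centroid,
    # using the squared Euclidean distance.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)

    # Step 2 (centroid update): recompute each centroid as the mean of the
    # points assigned to it (assumes every cluster keeps at least one point,
    # which holds for this well-separated data).
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

    # Stopping criterion: no centroid moved, so no assignments will change.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```

This converges on the two synthetic blobs; on harder data, multiple restarts with different random initializations help avoid poor local optima, as noted above.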
Select K
The algorithm described above finds the clusters and data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no method for determining the exact value of K, but an accurate estimate can be obtained using the following techniques.
One of the metrics commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K equals the number of data points. Thus, this metric cannot be used as the sole target. Instead, the mean distance to the centroid is plotted as a function of K, and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine K.
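A minimal sketch of this elbow heuristic, assuming synthetic data with three true clusters (scikit-learn's `inertia_` is the sum of squared distances to the closest centroid, which we divide by the number of points to get a mean-distance-style metric):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic data: three well-separated clusters for illustration.
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 4, 8)])

mean_dists = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Mean squared distance to the closest centroid.
    mean_dists.append(km.inertia_ / len(X))

# The metric always decreases with K; the sharp flattening after K=3
# (the "elbow") suggests three clusters.
print([round(d, 2) for d in mean_dists])
```

In practice you would plot `mean_dists` against K and look for the elbow visually.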
A number of other techniques exist for validating K, including cross-validation, information criteria, the information-theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each K.
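One of the techniques named above, the silhouette method, is readily available in scikit-learn: the silhouette score peaks at the K that best matches the data's structure. This sketch assumes synthetic two-cluster data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Synthetic data with two tight, well-separated clusters.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(4, 0.5, (50, 2))])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette score: higher means points sit well inside their cluster.
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # K with the highest silhouette score
```

Here `best_k` recovers the two synthetic clusters; on real data the score curve is usually inspected alongside the elbow plot rather than trusted alone.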
Example: Applying K-means clustering to delivery fleet data
As an example, we'll show how the K-means algorithm works on a sample data set of delivery fleet driver data. For the sake of simplicity, we'll only look at two driver features: mean distance driven per day and the mean percentage of time a driver was >5 mph over the speed limit. In general, the algorithm can be used with any number of features, provided the number of data samples is much greater than the number of features.
Step 1: Clean and transform the data
For this example, we've already cleaned the data and carried out some simple transformations. A sample of the data as a pandas DataFrame is shown below.
The chart below shows the data set for 4,000 drivers, with the distance feature on the x-axis and the speeding feature on the y-axis.
Step 2: Select K and run the algorithm
First, choose K=2. For this example, use the Python packages scikit-learn and NumPy for the computations, as follows:
The cluster labels are returned in kmeans.labels_.
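The original code listing is not reproduced here, but the scikit-learn call pattern it describes looks like the following sketch. `X` stands in for the fleet data as an (n_drivers, 2) array of [mean distance, speeding percentage] values; the synthetic numbers below are assumptions, not the article's data set:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Placeholder fleet-like data: short-distance and long-distance driver groups.
X = np.vstack([
    np.column_stack([rng.normal(50, 10, 200), rng.normal(9, 2, 200)]),
    np.column_stack([rng.normal(180, 20, 200), rng.normal(18, 3, 200)]),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

labels = kmeans.labels_               # cluster assignment for each driver
centroids = kmeans.cluster_centers_   # one (distance, speeding) pair per cluster
```

With real fleet data, `X` would instead be built from the cleaned DataFrame's two feature columns.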
Step 3: Review the results
The chart below shows the results. Visually, you can see that the K-means algorithm splits the data into two groups based on the distance feature. Each cluster centroid is marked with a star.
Group 1 centroid = (49.62, 8.72)
Group 2 centroid = (179.73, 17.95)
Using domain knowledge of the data set, we can infer that Group 1 is urban drivers and Group 2 is rural drivers.
Step 4: Iterate over the values of K
Test the results for K=4. To do this, all that needs to change is the number of clusters passed to the KMeans() function.
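A sketch of that one-parameter change, again using placeholder data rather than the article's fleet data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Placeholder driver features; only n_clusters differs from the K=2 run.
X = rng.normal(size=(400, 2)) * [60, 5] + [100, 12]

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_.shape)  # -> (4, 2): one centroid per group
```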
The chart below shows the resulting clusters. We see that the algorithm identified four distinct groups; in addition to the rural/urban split, drivers who speed are now separated from those who follow the speed limit. The speeding threshold for the urban driver groups is lower than for the rural groups, likely because urban drivers spend more time at intersections and in stop-and-go traffic.
Additional notes and alternatives
Feature Engineering
Feature engineering is the process of using domain knowledge to choose which data metrics to input as features into a machine learning algorithm. Feature engineering plays a key role in K-means clustering; using meaningful features that capture the variability of the data is essential for the algorithm to find all of the naturally occurring groups.
Categorical data (i.e., category labels such as gender, country, or browser type) needs to be encoded or separated out in a way that still works with the algorithm.
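One common encoding is one-hot encoding, which replaces a categorical column with one 0/1 column per level so that Euclidean distance remains meaningful. The column names below are illustrative, not from the article's data:

```python
import pandas as pd

df = pd.DataFrame({
    "mean_distance": [50.1, 180.3, 49.7],
    "browser": ["chrome", "firefox", "chrome"],  # categorical feature
})

# get_dummies replaces the categorical column with one indicator column
# per category level (browser_chrome, browser_firefox).
encoded = pd.get_dummies(df, columns=["browser"])
```

Note that one-hot encoding inflates the feature count, so scaling the indicator columns relative to the numeric features may be needed before clustering.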
Feature transformations, particularly representing rates rather than raw measurements, can help normalize the data. For example, in the delivery fleet example above, if total distance driven had been used rather than mean distance per day, drivers would have been grouped by how long they had been with the company rather than by the rural/urban split.
Alternatives
A number of alternative clustering algorithms exist, including DBSCAN, spectral clustering, and modeling with Gaussian mixtures. A dimensionality reduction technique, such as principal component analysis, can be used to separate groups of patterns in the data. One possible outcome is that there are no organic clusters in the data; instead, all of the data fall along a continuous range of feature values within a single group. In that case, you may need to revisit the data features to see whether different measurements or feature transformations are needed to better represent the variability in the data. Additionally, you may want to impose categories or labels based on domain knowledge and modify your analysis approach accordingly.
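As a taste of one alternative named above, DBSCAN groups points by density and can label sparse points as noise (-1) rather than forcing every point into a cluster, and it does not require choosing K. The data and parameter values below are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(5)
# Two dense synthetic blobs plus one isolated outlier.
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               [[20.0, 20.0]]])

# eps is the neighborhood radius; min_samples the density threshold.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
# The two blobs become clusters; the outlier gets the noise label -1.
```

This noise-labeling behavior makes DBSCAN a useful cross-check when you suspect the data contains outliers that K-means would absorb into a group.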