Data mining-cluster analysis summary


Cluster Analysis

I. Concepts

Cluster analysis classifies individuals according to their characteristics, so that individuals within the same category are highly similar while different categories differ markedly from one another.

Cluster analysis belongs to unsupervised learning.

According to the objects being clustered, clustering can be divided into Q-type clustering and R-type clustering.

Q-type clustering: clustering of samples/records, with distance as the similarity measure (Euclidean distance, squared Euclidean distance, Mahalanobis distance, Minkowski distance, etc.)

R-type clustering: clustering of indicators/variables, with a similarity coefficient as the similarity measure (Pearson correlation coefficient, cosine of the included angle, exponential correlation coefficient, etc.)
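As a quick illustration (a toy example of my own, not taken from the original text; the vector values are made up), the snippet below computes a few of the measures named above for two small vectors:

import numpy as np
from scipy.spatial.distance import euclidean, minkowski, cosine
from scipy.stats import pearsonr

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 7.0])

print(euclidean(a, b))         # Q-type: Euclidean distance between two samples
print(euclidean(a, b) ** 2)    # squared Euclidean distance
print(minkowski(a, b, p=3))    # Minkowski distance with p = 3
print(1 - cosine(a, b))        # R-type: cosine of the included angle (scipy's cosine() is a distance)
print(pearsonr(a, b)[0])       # R-type: Pearson correlation coefficient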

 

II. Common Clustering Algorithms

  • K-means (partitioning method)
  • Hierarchical clustering
  • DBSCAN (density method)

1. K-means partitioning

In the name, K indicates the number of classes and "means" refers to the mean: K-means is an algorithm that uses class means to divide the data into K classes.

The objective of the K-means algorithm is to divide n sample points into K classes so that each point is assigned to the class whose centroid (the mean of all the sample points in that class) is closest to it; this is the clustering criterion.
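In symbols (a standard formulation added here for reference, not spelled out in the original text), K-means seeks the partition $S_1, \dots, S_K$ that minimizes the within-cluster sum of squared distances to the centroids $\mu_k$:

$$\min_{S_1,\dots,S_K} \sum_{k=1}^{K} \sum_{x_i \in S_k} \lVert x_i - \mu_k \rVert^2, \qquad \mu_k = \frac{1}{\lvert S_k \rvert} \sum_{x_i \in S_k} x_i$$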

For the algorithm principles, see http://www.aboutyun.com/thread-18178-1-1.html

K-means algorithm calculation steps

  • Obtain K initial centers: randomly select K points from the data as the initial cluster centers, one for each class
  • Assign each point to a class: following the minimum Euclidean distance principle, assign each point to the class whose center is closest to it
  • Recalculate the centroids: recompute the centroid of each class, for example as the mean of its points
  • Iterate: repeat steps 2 and 3
  • Clustering is complete once the cluster centers no longer move (a minimal sketch of these steps follows the list)
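To make the steps concrete, here is a minimal NumPy sketch of the loop (my own illustration: the function name, the random initialization, and the stopping test are assumptions; the sklearn KMeans used below is what you would normally call):

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: K random initial centers
    for _ in range(n_iter):
        # step 2: assign every point to its nearest center (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its points (empty clusters not handled)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                 # step 5: centers stopped moving
            break
        centers = new_centers                                 # step 4: iterate
    return labels, centers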

Implementation Based on sklearn package

Import the data below. From the scatter plots and the correlation coefficients between the variables, we can see that there is a strong positive correlation between weekday call duration and total call duration.

Select the variables to model and reduce the dimensionality:

# Imports assumed by this and the following snippets
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Feature columns of the call-record dataset (column names as given in the source data)
cloumns_fix1 = ['working day call duration', 'working day call duration',
                'weekend call duration', 'international call duration', 'average call duration']

# Dimensionality reduction
pca_2 = PCA(n_components=2)
data_pca_2 = pd.DataFrame(pca_2.fit_transform(data[cloumns_fix1]))

Build a model using the K-means method in the sklearn package

# Draw a scatter plot to get an overview of the data points
plt.scatter(data_pca_2[0], data_pca_2[1])

# The data points are expected to fall into three classes
kmmodel = KMeans(n_clusters=3)                   # create the model
kmmodel = kmmodel.fit(data[cloumns_fix1])        # train the model
pTARGET = kmmodel.predict(data[cloumns_fix1])    # label the original data
pd.crosstab(pTARGET, pTARGET)                    # count the samples in each class with a cross tabulation

plt.scatter(data_pca_2[0], data_pca_2[1], c=pTARGET)   # view the cluster distribution

Finally, you can use a histogram to view the differences between clusters.

# View the differences between the classes
dmean = pd.DataFrame(columns=cloumns_fix1 + ['category'])   # will hold the mean of each class
data_gb = data[cloumns_fix1].groupby(pTARGET)               # group by the cluster labels
i = 0
for g in data_gb.groups:
    # mean value of every feature within class g
    rmean = data_gb.get_group(g).mean()
    rmean['category'] = g
    dmean = dmean.append(rmean, ignore_index=True)
    # histogram of every feature for this class
    subdata = data_gb.get_group(g)
    for column in cloumns_fix1:
        i = i + 1
        p = plt.subplot(3, 5, i)
        p.set_title(column)
        p.set_ylabel(str(g) + " class")
        plt.hist(subdata[column], bins=20)

 

 

2. Hierarchical Clustering

  Hierarchical clustering is also called tree clustering. Based on the distances between data points, it uses a layered structure to merge the data repeatedly, building a hierarchy that decomposes the given dataset. Hierarchical clustering is often used for the automatic grouping of one-dimensional data.

Hierarchical clustering is a very intuitive algorithm. The basic idea is to connect the nodes one after another according to the similarity between the data points, ordered from the highest similarity to the lowest; the whole process builds up a tree structure.

Steps of the hierarchical clustering algorithm:

  • Start with each data point as a separate class
  • Calculate the distance (similarity) between every pair of points
  • Connect the pairs in order of distance from smallest to largest (similarity from strongest to weakest); after each connection, the average of the two connected points serves as the new class. The result is a tree structure (see the toy example after this list).
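As a concrete illustration (a toy example of my own, not from the original article), the snippet below clusters a handful of one-dimensional points and prints the merge order recorded in the linkage matrix:

import numpy as np
from scipy.cluster import hierarchy as hcluster

# five one-dimensional points; nearby points should be merged first
points = np.array([[1.0], [1.2], [1.1], [8.0], [8.3]])

linkage = hcluster.linkage(points, method='average')
print(linkage)                  # each row: the two clusters merged, their distance, and the new cluster size
hcluster.dendrogram(linkage)    # draw the resulting tree structure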

Implementation Based on sklearn package

  Using the same data as in the K-means clustering case

import scipy.cluster.hierarchy as hcluster

cloumns_fix1 = ['working day call duration', 'working day call duration',
                'weekend call duration', 'international call duration', 'average call duration']

linkage = hcluster.linkage(data[cloumns_fix1], method='centroid')   # compute centroid linkage and obtain the linkage matrix

# The same function via its full module path, here with single linkage
import scipy.cluster.hierarchy
linkage = scipy.cluster.hierarchy.linkage(data, method='single')

The method parameter selects the formula used to measure the distance between clusters; three common values are:

  • single: the shortest distance between points of the two classes
  • complete: the distance between the farthest points of the two classes
  • centroid: the distance between the centroids of the two classes
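If you are unsure which method to use, one common heuristic (my own addition, not part of the original article) is to compare the cophenetic correlation of each linkage matrix with the original pairwise distances; a value closer to 1 means the tree preserves the distances better:

from scipy.spatial.distance import pdist

X = data[cloumns_fix1]
for m in ('single', 'complete', 'centroid'):
    Z = hcluster.linkage(X, method=m)
    c, _ = hcluster.cophenet(Z, pdist(X))    # correlation between cophenetic and original distances
    print(m, round(c, 3))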
# hcluster.dendrogram(linkage)   # with no parameters, every point is drawn as a leaf of the tree

# Because the data volume is large, limit the number of leaf nodes drawn: keep only the last 12 merged clusters; the numbers in brackets are the counts of child nodes contained in each collapsed node
hcluster.dendrogram(linkage, truncate_mode='lastp', p=12, leaf_font_size=12)

 

# Flatten the hierarchy into cluster labels: pass the linkage matrix, the desired number of clusters, and the partitioning criterion ('maxclust' = maximum number of clusters)
pTARGET = hcluster.fcluster(linkage, 3, criterion='maxclust')
# View the number of samples in each class
pd.crosstab(pTARGET, pTARGET)

 

Drawing

# Use principal component analysis to reduce the data to two dimensions
pca_2 = PCA(n_components=2)
data_pca_2 = pd.DataFrame(pca_2.fit_transform(data[cloumns_fix1]))
plt.scatter(data_pca_2[0], data_pca_2[1], c=pTARGET)   # plot the clusters

 

3. DBSCAN density method 

Concept:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines a cluster as the largest set of density-connected points. It can partition regions of sufficient density into clusters and can find clusters of arbitrary shape in a noisy spatial dataset.

Density: the density of a point is the number of points contained in the circular region centered on that point with radius eps.

Neighborhood: the neighborhood of a point is the set of points contained in the circular region centered on that point with radius eps.

Core point: a point whose density is greater than the given threshold MinPts is called a core point (a point with density below MinPts is called a boundary point).

Noise point: a point that is neither a core point nor a boundary point.
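To make these definitions concrete, here is a small sketch (my own illustration with made-up points and thresholds) that counts each point's density within eps and labels it as core, boundary, or noise:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                   [0.2, 0.1], [0.35, 0.0], [5.0, 5.0]])
eps, min_pts = 0.3, 4

dist = euclidean_distances(points)
density = (dist < eps).sum(axis=1)        # number of points within eps of each point (itself included)
is_core = density > min_pts               # core points: density above the MinPts threshold
is_border = (~is_core) & (density > 1)    # boundary points: have neighbours but not enough to be core
is_noise = ~(is_core | is_border)         # noise points: neither core nor boundary
print(density, is_core, is_border, is_noise)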

Steps of the DBSCAN algorithm:

  • Search for clusters by checking the eps-neighborhood of each point in the dataset: if the eps-neighborhood of a point P contains more than MinPts points, create a cluster with P as its core
  • Iteratively aggregate the points within eps of these core points, merging them into new clusters where necessary
  • The clustering is complete when no new point can be added to any cluster

Advantages of the DBSCAN algorithm:

  • Fast clustering that handles noise effectively and can detect spatial clusters of arbitrary shape
  • The number of clusters does not need to be specified in advance
  • No bias with respect to cluster shape
  • Noise can be filtered out when necessary

Disadvantages of the DBSCAN algorithm:

  • Large data volumes require a large amount of memory and computation time
  • When the density of the clusters is uneven and the spacing between clusters varies widely, clustering quality is poor (MinPts and eps are difficult to choose; a common eps heuristic is sketched after this list)
  • The result depends on the distance formula; in practice Euclidean distance is usually used, and for high-dimensional data this runs into the "curse of dimensionality" (https://baike.baidu.com/item/vender/6788619?fr=aladdin)
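One common way to tackle the eps-selection problem mentioned above (my own addition, not part of the original article) is a k-distance plot: sort every point's distance to its k-th nearest neighbour and look for an "elbow" in the curve, which suggests a reasonable eps when MinPts = k:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=5):
    # ask for k + 1 neighbours because each point is its own nearest neighbour at distance 0
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nbrs.kneighbors(X)
    kth = np.sort(dist[:, -1])            # distance to the k-th real neighbour, sorted ascending
    plt.plot(kth)
    plt.xlabel('points sorted by k-distance')
    plt.ylabel('distance to %d-th nearest neighbour' % k)
    plt.show()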

Python implementation

1) Implementation from the mathematical principles

  Import a set of data points with the following distribution

# Compute the distance matrix between all points
from sklearn.metrics.pairwise import euclidean_distances
dist = euclidean_distances(data)

Classify every point to obtain the core, boundary, and noise points.

# Set eps and MinPts
import numpy as np
eps = 0.2
minpts = 5

ptses = []
for row in dist:
    # density: the number of points within eps of this point (itself included)
    density = np.sum(row < eps)
    if density > minpts:       # core point: density greater than 5
        pts = 1
    elif density > 1:          # boundary point: density greater than 1 but not more than 5
        pts = 2
    else:                      # noise point: density of 1 (only the point itself)
        pts = 0
    ptses.append(pts)
# ptses now holds the classification of every point

Next, filter out the noise points and compute a new distance matrix.

# Filter out the noise points: they cannot be clustered and form a class of their own
corepoints = data[pd.Series(ptses) != 0]
coreDist = euclidean_distances(corepoints)

Take each core point in turn and obtain its neighborhood.

cluster = dict()
i = 0
for row in coreDist:
    # the neighborhood of core point i: indices of the core points within eps of it
    cluster[i] = np.where(row < eps)[0]
    i = i + 1

Then merge the neighborhoods that intersect into new, larger neighborhoods.

for i in range(len(cluster)):
    for j in range(len(cluster)):
        # if two different neighborhoods share any point, merge them
        if len(set(cluster[j]) & set(cluster[i])) > 0 and i != j:
            cluster[i] = list(set(cluster[i]) | set(cluster[j]))
            cluster[j] = list()

Finally, the remaining independent (non-intersecting) neighborhoods are the final clustering result.

result = dict()
j = 0
for i in range(len(cluster)):
    if len(cluster[i]) > 0:
        result[j] = cluster[i]
        j = j + 1

# Find the cluster each point belongs to and mark it as the final clustering result
for i in range(len(result)):
    for j in result[i]:
        data.at[j, 'type'] = i
plt.scatter(data['x'], data['y'], c=data['type'])

 

2) Implementation based on the sklearn package

from sklearn.cluster import DBSCAN

eps = 0.2
MinPts = 5
model = DBSCAN(eps=eps, min_samples=MinPts)    # eps and MinPts as chosen above
data['type'] = model.fit_predict(data)
plt.scatter(data['x'], data['y'], c=data['type'])

 
