What is the difference between clustering and discriminant classification? Clustering groups samples without predefined labels, while discriminant classification assigns samples to classes that are known in advance.
Clustering scenario: finding high-value customers
The 80/20 rule is everywhere:
20% of customers generate 80% of a bank's profit
20% of subscribers account for 80% of a telecom operator's call charges
20% of a company's employees complete 80% of the work
20% of the people in society hold 80% of the public voice
Clustering scenario: recommender systems
Key metric: distance
Definition of distance
Common distances (Shiry book, p. 469; illustrated in the code sketch below)
Absolute (Manhattan) distance
Euclidean distance
Minkowski distance
Chebyshev distance
Mahalanobis distance
Lance-Williams distance
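A quick illustrative sketch, not from the Shiry book: scipy.spatial.distance offers ready-made implementations of the distances listed above (the sample vectors and data are arbitrary assumptions):

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(distance.cityblock(x, y))       # absolute (Manhattan) distance
print(distance.euclidean(x, y))       # Euclidean distance
print(distance.minkowski(x, y, p=3))  # Minkowski distance with p = 3
print(distance.chebyshev(x, y))       # Chebyshev distance
print(distance.canberra(x, y))        # Canberra form of the Lance-Williams distance (non-negative data)

# The Mahalanobis distance needs the inverse covariance matrix of the data.
data = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, y, VI))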
Distance calculation for discrete variables
Metric for clustering variables: similarity coefficients
Distance: used to cluster samples
Similarity coefficients: used to cluster variables
Common similarity coefficients: angle cosine, correlation coefficient (Shiry book, p. 475)
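A small NumPy illustration of the two coefficients (the example vectors are assumptions):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 5.0])

# angle cosine: cos(theta) = <x, y> / (||x|| * ||y||)
cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# correlation coefficient: the angle cosine of the mean-centered vectors
r = np.corrcoef(x, y)[0, 1]

print(cos_sim, r)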
Hierarchical clustering: computing the distance between clusters (see the scipy sketch after the list)
Shiry book, p. 476
Shortest distance method (single linkage)
Longest distance method (complete linkage)
Median distance method (median linkage)
Class average method (average linkage)
Center of gravity method (centroid linkage)
Sum-of-squared-deviations method (Ward's method)
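These between-cluster distance rules correspond one-to-one to the method argument of scipy.cluster.hierarchy.linkage; a hedged sketch on random data (all values are illustrative assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))

for method in ["single",    # shortest distance
               "complete",  # longest distance
               "median",    # median distance
               "average",   # class average
               "centroid",  # center of gravity
               "ward"]:     # sum of squared deviations
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, labels)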
Dynamic clustering: the K-means method
Algorithm (a minimal code sketch follows the steps):
1. Select K points as the initial centroids.
2. Assign each point to the nearest centroid, forming K clusters.
3. Recompute the centroid of each cluster.
4. Repeat steps 2-3 until the centroids no longer change.
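A minimal NumPy sketch of the four steps, an illustrative assumption rather than the course's reference code; for simplicity it assumes no cluster ever becomes empty:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # step 1: initial centroids
    for _ in range(max_iter):
        # step 2: assign each point to the nearest centroid
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        # step 3: recompute each centroid as the mean of its cluster
        centroids_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(centroids_new, centroids):         # step 4: stop when stable
            break
        centroids = centroids_new
    return centroids, labels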
Advantages and disadvantages of the K-means algorithm
Efficient, but the result is sensitive to the choice of initial centroids
Unable to handle non-spherical clusters
Cannot handle clusters of different sizes and densities
Outliers can heavily distort the result (so they should be removed first)
Techniques based on representative points: the K-medoids clustering method
Algorithm steps (a rough code sketch follows):
1. Randomly select K points as the "medoids" (center points).
2. Compute the distance from each remaining point to the K medoids and assign it to the nearest medoid, forming clusters.
3. Randomly select a non-medoid point O_random and compute the total cost S of substituting it for an existing medoid O_j.
4. If S < 0, replace O_j with O_random to form a new set of K medoids.
5. Repeat from step 2 until the set of medoids no longer changes.
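A rough NumPy sketch of steps 1-5, an assumed illustration; the classical PAM algorithm scans every medoid/non-medoid swap, whereas this sketch samples random swaps as the steps describe:

import numpy as np

def k_medoids(X, k, max_trials=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)     # pairwise distances

    def cost(meds):                                      # total distance to nearest medoid
        return D[:, meds].min(axis=1).sum()

    medoids = list(rng.choice(n, k, replace=False))      # step 1
    for _ in range(max_trials):
        o_random = rng.choice([i for i in range(n) if i not in medoids])  # step 3
        j = int(rng.integers(k))                         # medoid O_j to swap out
        trial = medoids.copy()
        trial[j] = o_random
        s = cost(trial) - cost(medoids)                  # total cost S of the substitution
        if s < 0:                                        # step 4: keep improving swaps
            medoids = trial
    labels = D[:, medoids].argmin(axis=1)                # step 2: final assignment
    return np.array(medoids), labels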
CLARA
CLARA (Clustering LARge Applications): fast clustering of large data sets
Three basic ideas of big-data processing; keywords: sampling, accuracy, performance
Algorithm idea (a code sketch follows the steps):
1. Draw a small sample from the large data set.
2. Run PAM clustering on the sample.
3. Take the cluster centers (medoids) obtained in step 2 and use them to cluster the full data set, assigning each sample point to the cluster whose center is nearest.
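A sketch of these three steps, reusing the hypothetical k_medoids() function from the previous snippet; the sample size is an illustrative assumption, and full CLARA typically repeats the procedure over several samples and keeps the best medoids:

import numpy as np

def clara(X, k, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), min(sample_size, len(X)), replace=False)  # step 1: sample
    medoid_rows, _ = k_medoids(X[idx], k)                # step 2: PAM on the sample
    centers = X[idx][medoid_rows]
    # step 3: assign every point of the full data set to the nearest medoid
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    return centers, labels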
Density-based clustering method: DBSCAN
DBSCAN = Density-Based Spatial Clustering of Applications with Noise
The algorithm groups regions of sufficient density into clusters and can discover clusters of arbitrary shape
Basic idea of the algorithm
1. Choose appropriate values for R (the neighborhood radius) and M (the minimum number of points).
2. Examine all sample points; if the R-neighborhood of a point p contains more than M points, create a new cluster with p as a core point.
3. Iteratively find the points that are directly density-reachable from these core points (which may merge in points that are merely density-reachable) and add them to the corresponding cluster; merge clusters whose core points satisfy the "density-connected" condition.
4. The algorithm ends when no new point can be added to any cluster.
R-neighborhood: the region within radius R of a given point
Core point: if the R-neighborhood of a point contains at least the minimum number M of points, the point is called a core point
Directly density-reachable: if a point p lies in the R-neighborhood of a core point q, then p is directly density-reachable from q
Density-reachable: if there is a chain of points p1, p2, ..., pn with p1 = q and pn = p such that each p(i+1) is directly density-reachable from p(i) with respect to R and M, then p is density-reachable from q with respect to R and M
Density-connected: if there exists a point o in the sample set D such that both p and q are density-reachable from o with respect to R and M, then p and q are density-connected with respect to R and M
Input: a database containing n objects, radius Eps, minimum number MinPts;
Output: all clusters that meet the density requirement.
(1) REPEAT
(2) Take an unprocessed point from the database;
(3) IF the point is a core point THEN find all objects density-reachable from it, forming a cluster;
(4) ELSE the point is an edge point (a non-core object); leave this iteration and move on to the next point;
(5) UNTIL all points have been processed.
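A hedged usage example with scikit-learn's DBSCAN implementation, where eps plays the role of the radius R/Eps above and min_samples that of M/MinPts; the data set and parameter values are assumptions for illustration:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two non-spherical clusters
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(np.unique(labels))   # cluster ids; label -1 marks noise points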
DBSCAN is sensitive to the user-defined parameters: slightly different settings can produce very different results, and there is no general rule for choosing the parameters; they can only be determined by experience.
Machine Learning, Week 9 (炼数成金 course): Clustering