What is the difference between clustering and discriminant classification? Clustering groups samples without predefined labels, while discriminant classification assigns samples to classes that are known in advance.
Clustering scenario: finding high-value customers
The 80/20 rule is everywhere:
20% of customers generate 80% of a bank's profit
20% of subscribers account for 80% of a telecom operator's call charges
20% of a company's employees complete 80% of the work
20% of the people in society hold 80% of the public voice
Clustering scenario: recommender systems
Key metric: distance
Definition of distance
Common distances (Shiry book, p. 469; illustrated in the code sketch below)
Absolute (Manhattan) distance
Euclidean distance
Minkowski distance
Chebyshev distance
Mahalanobis distance
Lance-Williams distance
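A quick illustrative sketch, not from the Shiry book: scipy.spatial.distance offers ready-made implementations of the distances listed above (the sample vectors and data are arbitrary assumptions):

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(distance.cityblock(x, y))       # absolute (Manhattan) distance
print(distance.euclidean(x, y))       # Euclidean distance
print(distance.minkowski(x, y, p=3))  # Minkowski distance with p = 3
print(distance.chebyshev(x, y))       # Chebyshev distance
print(distance.canberra(x, y))        # Canberra form of the Lance-Williams distance (non-negative data)

# The Mahalanobis distance needs the inverse covariance matrix of the data.
data = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, y, VI))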
Distance calculation for discrete variables
Metric for clustering variables: similarity coefficients
Distance: used to cluster samples
Similarity coefficients: used to cluster variables
Common similarity coefficients: angle cosine, correlation coefficient (Shiry book, p. 475)
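A small NumPy illustration of the two coefficients (the example vectors are assumptions):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 5.0])

# angle cosine: cos(theta) = <x, y> / (||x|| * ||y||)
cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# correlation coefficient: the angle cosine of the mean-centered vectors
r = np.corrcoef(x, y)[0, 1]

print(cos_sim, r)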
Hierarchical clustering: computing the distance between clusters (see the scipy sketch after the list)
Shiry book, p. 476
Shortest distance method (single linkage)
Longest distance method (complete linkage)
Median distance method (median linkage)
Class average method (average linkage)
Center of gravity method (centroid linkage)
Sum-of-squared-deviations method (Ward's method)
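These between-cluster distance rules correspond one-to-one to the method argument of scipy.cluster.hierarchy.linkage; a hedged sketch on random data (all values are illustrative assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))

for method in ["single",    # shortest distance
               "complete",  # longest distance
               "median",    # median distance
               "average",   # class average
               "centroid",  # center of gravity
               "ward"]:     # sum of squared deviations
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, labels)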
Dynamic clustering: the K-means method
Algorithm (a minimal code sketch follows the steps):
1. Select K points as the initial centroids.
2. Assign each point to the nearest centroid, forming K clusters.
3. Recompute the centroid of each cluster.
4. Repeat steps 2-3 until the centroids no longer change.
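A minimal NumPy sketch of the four steps, an illustrative assumption rather than the course's reference code; for simplicity it assumes no cluster ever becomes empty:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # step 1: initial centroids
    for _ in range(max_iter):
        # step 2: assign each point to the nearest centroid
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        # step 3: recompute each centroid as the mean of its cluster
        centroids_new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(centroids_new, centroids):         # step 4: stop when stable
            break
        centroids = centroids_new
    return centroids, labels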
Advantages and disadvantages of the K-means algorithm
Efficient, but the result is sensitive to the choice of initial centroids
Unable to handle non-spherical clusters
Cannot handle clusters of different sizes and densities
Outliers can heavily distort the result (so they should be removed first)
Techniques based on representative points: the K-medoids clustering method
Algorithm steps (a rough code sketch follows):
1. Randomly select K points as the "medoids" (center points).
2. Compute the distance from each remaining point to the K medoids and assign it to the nearest medoid, forming clusters.
3. Randomly select a non-medoid point O_random and compute the total cost S of substituting it for an existing medoid O_j.
4. If S < 0, replace O_j with O_random to form a new set of K medoids.
5. Repeat from step 2 until the set of medoids no longer changes.
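A rough NumPy sketch of steps 1-5, an assumed illustration; the classical PAM algorithm scans every medoid/non-medoid swap, whereas this sketch samples random swaps as the steps describe:

import numpy as np

def k_medoids(X, k, max_trials=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)     # pairwise distances

    def cost(meds):                                      # total distance to nearest medoid
        return D[:, meds].min(axis=1).sum()

    medoids = list(rng.choice(n, k, replace=False))      # step 1
    for _ in range(max_trials):
        o_random = rng.choice([i for i in range(n) if i not in medoids])  # step 3
        j = int(rng.integers(k))                         # medoid O_j to swap out
        trial = medoids.copy()
        trial[j] = o_random
        s = cost(trial) - cost(medoids)                  # total cost S of the substitution
        if s < 0:                                        # step 4: keep improving swaps
            medoids = trial
    labels = D[:, medoids].argmin(axis=1)                # step 2: final assignment
    return np.array(medoids), labels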
CLARA
CLARA (Clustering LARge Applications): fast clustering of large data sets
Three basic ideas of big-data processing; keywords: sampling, accuracy, performance
Algorithm idea (a code sketch follows the steps):
1. Draw a small sample from the large data set.
2. Run PAM clustering on the sample.
3. Take the cluster centers (medoids) obtained in step 2 and use them to cluster the full data set, assigning each sample point to the cluster whose center is nearest.
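A sketch of these three steps, reusing the hypothetical k_medoids() function from the previous snippet; the sample size is an illustrative assumption, and full CLARA typically repeats the procedure over several samples and keeps the best medoids:

import numpy as np

def clara(X, k, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), min(sample_size, len(X)), replace=False)  # step 1: sample
    medoid_rows, _ = k_medoids(X[idx], k)                # step 2: PAM on the sample
    centers = X[idx][medoid_rows]
    # step 3: assign every point of the full data set to the nearest medoid
    labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
    return centers, labels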
Density-based clustering method: DBSCAN
DBSCAN = Density-Based Spatial Clustering of Applications with Noise
The algorithm groups regions of sufficient density into clusters and can discover clusters of arbitrary shape
Basic idea of the algorithm
1. Choose appropriate values for R (the neighborhood radius) and M (the minimum number of points).
2. Examine all sample points; if the R-neighborhood of a point p contains more than M points, create a new cluster with p as a core point.
3. Iteratively find the points that are directly density-reachable from these core points (which may merge in points that are merely density-reachable) and add them to the corresponding cluster; merge clusters whose core points satisfy the "density-connected" condition.
4. The algorithm ends when no new point can be added to any cluster.
R-neighborhood: the region within radius R of a given point
Core point: if the R-neighborhood of a point contains at least the minimum number M of points, the point is called a core point
Directly density-reachable: if a point p lies in the R-neighborhood of a core point q, then p is directly density-reachable from q
Density-reachable: if there is a chain of points p1, p2, ..., pn with p1 = q and pn = p such that each p(i+1) is directly density-reachable from p(i) with respect to R and M, then p is density-reachable from q with respect to R and M
Density-connected: if there exists a point o in the sample set D such that both p and q are density-reachable from o with respect to R and M, then p and q are density-connected with respect to R and M
Input: a database containing n objects, radius Eps, minimum number MinPts;
Output: all clusters that meet the density requirement.
(1) REPEAT
(2) Take an unprocessed point from the database;
(3) IF the point is a core point THEN find all objects density-reachable from it, forming a cluster;
(4) ELSE the point is an edge point (a non-core object); leave this iteration and move on to the next point;
(5) UNTIL all points have been processed.
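A hedged usage example with scikit-learn's DBSCAN implementation, where eps plays the role of the radius R/Eps above and min_samples that of M/MinPts; the data set and parameter values are assumptions for illustration:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two non-spherical clusters
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(np.unique(labels))   # cluster ids; label -1 marks noise points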
DBSCAN is sensitive to the user-defined parameters: slightly different settings can produce very different results, and there is no general rule for choosing the parameters; they can only be determined by experience.
Machine Learning, Week 9 (炼数成金 course): Clustering