Machine Learning Week 9 - Smelting Numbers into Gold - Clustering

Source: Internet
Author: User

What is the difference between clustering and classification (discriminant analysis)? Clustering is unsupervised: the groups are not known in advance, whereas classification assigns samples to predefined categories.

Clustering scenario: finding high-value customers

The 80/20 rule (Pareto principle) is everywhere:
20% of users provide 80% of a bank's profits
20% of users generate 80% of a telecom operator's phone charges
20% of the employees in a company complete 80% of the work
20% of the people in society hold 80% of the public voice

Clustering scenario: recommender systems

Key metric: distance

Definition of distance
Common distances (Shiry book, p. 469)

Absolute (Manhattan) distance
Euclidean distance
Minkowski distance
Chebyshev distance
Mahalanobis distance
Lance-Williams distance
Distance calculation for discrete variables
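The continuous-variable distances above can be illustrated with a small NumPy sketch (the function names and sample points are my own, not from the course; the Minkowski distance reduces to the absolute distance at p = 1 and the Euclidean distance at p = 2):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance; p=1 gives the absolute (Manhattan) distance, p=2 the Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def chebyshev(x, y):
    """Chebyshev distance: the largest coordinate-wise difference."""
    return np.max(np.abs(x - y))

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski(x, y, 1))   # absolute distance: 7.0
print(minkowski(x, y, 2))   # Euclidean distance: 5.0
print(chebyshev(x, y))      # Chebyshev distance: 4.0
```

The Mahalanobis distance additionally weights coordinates by the inverse covariance matrix of the data, which is why it is scale-invariant where these simpler distances are not.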

Metrics for clustering variables: similarity coefficients
Distance: used to cluster samples
Similarity coefficient: used to cluster variables
Common similarity coefficients: cosine of the angle, correlation coefficient (Shiry book, p. 475)
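Both similarity coefficients are one-liners in NumPy (a minimal sketch with toy vectors of my own choosing; note that the correlation coefficient is just the cosine of the angle between the mean-centered vectors):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between vectors u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def corr_coef(u, v):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    return cosine_sim(u - u.mean(), v - v.mean())

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])
print(cosine_sim(u, v))  # 1.0: same direction
print(corr_coef(u, v))   # 1.0: perfectly linearly correlated
```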

Hierarchical clustering: how to measure the distance between clusters

Shiry book, p. 476
Shortest distance method (single linkage)
Longest distance method (complete linkage)
Median distance method
Class average method (average linkage)
Center of gravity (centroid) method
Sum of squared deviations method (Ward's method)
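Several of the linkage rules above can be sketched in one small function (a minimal illustration; the function name, cluster arrays, and method labels are my own assumptions, and the median and Ward rules are omitted since they need the recursive Lance-Williams update rather than a direct formula over point pairs):

```python
import numpy as np

def cluster_distance(A, B, method):
    """Distance between clusters A and B, given as (n, d) arrays of points."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # all pairwise distances
    if method == "single":     # shortest distance method
        return d.min()
    if method == "complete":   # longest distance method
        return d.max()
    if method == "average":    # class average method
        return d.mean()
    if method == "centroid":   # center of gravity method
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(f"unknown method: {method}")

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(cluster_distance(A, B, "single"))    # 3.0
print(cluster_distance(A, B, "complete"))  # 6.0
print(cluster_distance(A, B, "average"))   # 4.5
print(cluster_distance(A, B, "centroid"))  # 4.5
```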

Dynamic clustering: the K-means method

Algorithm:
1. Select K points as the initial centroids
2. Assign each point to its nearest centroid, forming K clusters
3. Recalculate the centroid of each cluster
4. Repeat steps 2-3 until the centroids no longer change
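The four steps above can be sketched in Python with NumPy (a minimal illustration; the function name, toy data, fixed iteration cap, and random initialization are my own assumptions, and empty clusters are not handled as a production implementation would need to):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k data points as initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # step 2: assign each point to its nearest centroid
        labels = np.linalg.norm(X[:, None] - centroids[None, :], axis=2).argmin(axis=1)
        # step 3: recompute the centroid of each cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centroids no longer change
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# two well-separated blobs of three points each
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(X, 2)
```

On this toy data the two blobs end up in separate clusters regardless of which points are drawn as initial centroids.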

Advantages and disadvantages of the K-means algorithm

Efficient, but the result is sensitive to the choice of initial centroids
Cannot handle non-spherical clusters
Cannot handle clusters of different sizes and densities
Outliers can cause large disturbances (so they should be removed beforehand)

Clustering based on representative points: the K-medoids method

Algorithm steps
1. Randomly select K points as the "medoids" (center points)
2. Compute the distance from every remaining point to the K medoids, and assign each point to its nearest medoid to form the clusters
3. Randomly select a non-medoid point O_random and tentatively use it to replace an existing medoid O_j, computing the total cost S of this swap
4. If S < 0, replace O_j with O_random, forming a new set of K medoids
5. Repeat steps 2-4 until the set of medoids no longer changes
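The swap loop above (the PAM algorithm) can be sketched as follows (a minimal illustration; the function names and toy data are my own assumptions, and for simplicity every possible swap is tried each pass rather than a single random one):

```python
import numpy as np
from itertools import product

def total_cost(X, medoids):
    """Sum of distances from every point to its nearest medoid."""
    d = np.linalg.norm(X[:, None] - X[np.array(medoids)][None, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), k, replace=False))   # step 1: random medoids
    improved = True
    while improved:                                        # step 5: until stable
        improved = False
        for j, o in product(range(k), range(len(X))):      # step 3: try swaps
            if o in medoids:
                continue
            trial = medoids.copy()
            trial[j] = o                                   # replace medoid O_j with O_random
            # step 4: keep the swap if the total cost decreases (S < 0)
            if total_cost(X, trial) < total_cost(X, medoids):
                medoids, improved = trial, True
    medoids = sorted(medoids)
    # step 2: final assignment of every point to its nearest medoid
    labels = np.linalg.norm(X[:, None] - X[np.array(medoids)][None, :], axis=2).argmin(axis=1)
    return labels, medoids

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, medoids = pam(X, 2)
```

Because the medoids are actual data points rather than means, K-medoids is less disturbed by outliers than K-means.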

CLARA
Clustering LARge Applications: fast clustering of large data sets
Three basic ideas for big-data processing. Keywords: sampling, accuracy, performance
Algorithm idea:
1. Draw a small sample from the large data set
2. Run PAM clustering on the sample
3. Use the cluster centers obtained in step 2 to cluster the entire large data set: each point is assigned to the group whose cluster center is nearest
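The sampling idea can be sketched as follows (a minimal illustration with my own function names and toy data; since the sample is tiny, a brute-force search over candidate medoids stands in for the PAM step, and, as in the real CLARA, candidate centers are scored against the full data set):

```python
import numpy as np
from itertools import combinations

def clara(X, k, sample_size, n_samples=5, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        # step 1: draw a small sample from the large data set
        sample = X[rng.choice(len(X), sample_size, replace=False)]
        # step 2: cluster the sample (brute force stands in for PAM here)
        for combo in combinations(range(sample_size), k):
            medoids = sample[list(combo)]
            # score the candidate centers on the FULL data set, keep the best
            cost = np.linalg.norm(X[:, None] - medoids[None, :], axis=2).min(axis=1).sum()
            if cost < best_cost:
                best_medoids, best_cost = medoids, cost
    # step 3: assign every point to its nearest cluster center
    labels = np.linalg.norm(X[:, None] - best_medoids[None, :], axis=2).argmin(axis=1)
    return labels, best_medoids

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = clara(X, k=2, sample_size=4)
```

The trade-off named in the keywords is visible here: clustering only the sample is fast, but accuracy depends on the sample containing good representatives of every cluster, which is why several samples are drawn.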

Density-based clustering: DBSCAN

DBSCAN = Density-Based Spatial Clustering of Applications with Noise
The algorithm grows clusters in regions of sufficiently high density, and can therefore discover clusters of arbitrary shape.

DBSCAN
Basic idea of the algorithm
1. Specify appropriate values of R and M
2. Scan all sample points; if the R-neighborhood of a point P contains more than M points, create a new cluster with P as a core point
3. Repeatedly find the points that are directly density-reachable from these core points (some of these may themselves turn out to be core points), and add them to the corresponding cluster; clusters whose core points satisfy the "density-connected" condition are merged
4. The algorithm ends when no new point can be added to any cluster

R-neighborhood: the area within radius R of a given point
Core point: if the R-neighborhood of a point contains at least a minimum number M of points, the point is called a core point
Directly density-reachable: if a point P lies in the R-neighborhood of a core point Q, then P is directly density-reachable from Q
Density-reachable: if there is a chain of points p1, p2, ..., pn with p1 = Q and pn = P such that each p(i+1) is directly density-reachable from p(i) with respect to R and M, then P is density-reachable from Q with respect to R and M
Density-connected: if there exists a point O in the sample set D such that both P and Q are density-reachable from O with respect to R and M, then P and Q are density-connected with respect to R and M

Input: a database containing n objects, radius R, minimum number of points M
Output: all clusters that satisfy the density requirement
(1) REPEAT
(2) Extract an unprocessed point from the database;
(3) IF the point is a core point THEN find all objects density-reachable from it, forming a cluster;
(4) ELSE the point is a border point (not a core object); skip it and look for the next point;
(5) UNTIL all points have been processed.
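The pseudocode above corresponds to this minimal Python sketch (the function name and toy data are my own assumptions; points that no core point reaches keep the label -1, i.e. they are noise):

```python
import numpy as np

def dbscan(X, r, m):
    """Minimal DBSCAN: r = neighborhood radius, m = minimum points for a core point."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # R-neighborhood of each point (a point counts as its own neighbor)
    neighbors = [np.flatnonzero(dist[i] <= r) for i in range(n)]
    labels = np.full(n, -1)          # -1 marks unassigned / noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < m:
            continue                 # already assigned, or not a core point
        labels[i] = cluster          # start a new cluster from core point i
        frontier = list(neighbors[i])
        while frontier:              # expand through density-reachable points
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= m:       # j is itself a core point
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

# two dense 4-point blobs plus one isolated outlier
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [5, 5]], dtype=float)
labels = dbscan(X, r=1.5, m=3)
```

With these parameters each blob forms one cluster and the point at (5, 5) is left as noise, illustrating the "with Noise" part of the algorithm's name.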
DBSCAN is sensitive to its user-defined parameters: slightly different settings of R and M may produce very different results, and there is no fixed rule for choosing them; they can only be determined by experience.

