Programming Collective Intelligence: Discovering Groups


This chapter describes how to use clustering algorithms to classify blogs.

First, construct the data: collect a group of blogs, each containing a group of words. This forms the data structure (blog name, word counts).

While building this data structure, you also need to delete words that appear too widely. To do this, count the number of blogs each word appears in and divide by the total number of blogs; if this ratio exceeds a chosen threshold, the word is too common and is dropped.
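A minimal sketch of this filter in Python. The function name, the dict-of-dicts layout, and the thresholds 0.1 and 0.5 are illustrative assumptions, not taken from the text:

```python
def select_words(blog_words, lower=0.1, upper=0.5):
    """Keep words whose blog-frequency lies strictly between lower and upper.

    blog_words maps blog name -> {word: count}.
    """
    appear_in = {}  # word -> number of blogs it appears in
    for words in blog_words.values():
        for word, count in words.items():
            if count > 0:
                appear_in[word] = appear_in.get(word, 0) + 1
    n_blogs = len(blog_words)
    return [w for w, c in appear_in.items() if lower < c / n_blogs < upper]
```

Words that appear in almost every blog (like "the") carry no grouping information, while extremely rare words only add noise, so both ends are cut.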

 

Next, calculate the distance between blogs. As in the previous chapter, there are two options: Euclidean distance and the Pearson correlation coefficient.
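Both metrics can be sketched as follows (standard formulations; the function names are my own). The Pearson variant is returned as 1 minus the correlation, so perfectly correlated vectors are at distance 0:

```python
from math import sqrt

def euclidean_distance(v1, v2):
    """Straight-line distance between two word-count vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def pearson_distance(v1, v2):
    """1 - Pearson correlation: 0 for perfectly correlated vectors."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(x * x for x in v2)
    p_sum = sum(x * y for x, y in zip(v1, v2))
    num = p_sum - sum1 * sum2 / n
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        return 0.0
    return 1.0 - num / den
```

Pearson is often preferred here because it corrects for "grade inflation": a verbose blog that uses every word more often can still be perfectly correlated with a terse one.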

 

Then comes the clustering algorithm. It can be applied in two ways: row-based (clustering blogs) and column-based (clustering words); the hierarchical procedure below is the same in both cases.

Initialization: each blog starts as its own cluster, giving n clusters in total; these form the clusters vector.

Clustering algorithm:

    for each cluster:
        for each other cluster:
            calculate the distance between them
    find the minimum of these distances and the pair of clusters that produces it
    remove the pair's two elements from clusters, and add the pair as a single
    new cluster containing the contents of both elements
    repeat until only one cluster remains
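The merge loop above can be sketched as runnable Python. I assume the merged cluster's vector is the average of its two members, and I represent the resulting tree as nested tuples of the original indices; both choices are my own simplifications:

```python
def hcluster(vectors, distance):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Returns a nested-tuple tree over the input indices, e.g. ((0, 1), 2).
    """
    # Each cluster is (tree, representative vector); initially one per input.
    clusters = [(i, list(v)) for i, v in enumerate(vectors)]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged_tree = (clusters[i][0], clusters[j][0])
        merged_vec = [(a + b) / 2
                      for a, b in zip(clusters[i][1], clusters[j][1])]
        clusters[i] = (merged_tree, merged_vec)
        del clusters[j]  # j > i, so deleting j leaves i's slot intact
    return clusters[0][0]
```

With points 0, 1, and 10 on a line, the two nearby points merge first and the outlier joins last, giving the tree ((0, 1), 2).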

 

Through this algorithm, we can obtain the relatedness between blogs: for example, A and B may turn out to be relatively close to each other, while D is closer to C.

The diagram used to display clustering results is called a dendrogram, which shows the clusters in a hierarchical manner. How can we draw a cluster tree as a dendrogram? A key observation is that all data items sit at the leaves, so the number of leaf nodes determines the height of the image (and the tree's depth its width). This gives the following recursive layout:

Suppose the parent's position is (x, y); let up and bot be the numbers of leaf nodes in its upper and lower subtrees, let each leaf occupy a constant height h, and let up_d and bot_d be the merge distances from the parent to its upper and lower children. Then:

The upper child's x is parent.x + up_d × (a constant horizontal scale).

The upper child's y is parent.y − bot × h / 2, which centers it in the band occupied by the upper subtree's leaves.

The lower child is computed symmetrically (x = parent.x + bot_d × scale, y = parent.y + up × h / 2). Applying this recursively yields the position of every node in the tree.
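The recursion might be sketched like this, assuming the nested-tuple tree from the clustering sketch, screen coordinates with y growing downward, a fixed offset dx standing in for the per-merge distance, and an illustrative leaf height h:

```python
def leaf_count(tree):
    """Number of leaves under a node; a leaf is anything that is not a tuple."""
    if not isinstance(tree, tuple):
        return 1
    return leaf_count(tree[0]) + leaf_count(tree[1])

def layout(tree, x=0.0, y=0.0, h=20.0, dx=10.0, positions=None):
    """Assign an (x, y) position to every node of the tree.

    Each leaf occupies a band of height h; a parent sits at the center of
    its band, and each child is centered in its own subtree's band.
    """
    if positions is None:
        positions = {}
    positions[tree] = (x, y)
    if isinstance(tree, tuple):
        up, bot = tree
        n_up, n_bot = leaf_count(up), leaf_count(bot)
        layout(up, x + dx, y - n_bot * h / 2, h, dx, positions)
        layout(bot, x + dx, y + n_up * h / 2, h, dx, positions)
    return positions
```

Once every node has a position, drawing the dendrogram is just a matter of connecting each parent to its two children with lines.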

 

-----------------------------------

The section above describes how to find relationships between blogs based on the words they contain, but sometimes it is interesting to find relationships between the words themselves. If one word appears in blog1 and blog2, and another word also appears in blog1 and blog2, we consider the two words related.

The algorithm for finding these clusters is simple: it is the same algorithm as above, applied after transposing the data matrix (so rows become words and columns become blogs).
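The transpose is a one-liner in Python (the function name is my own):

```python
def rotate_matrix(data):
    """Transpose a row-per-blog matrix into a row-per-word matrix,
    so the same clustering code groups words instead of blogs."""
    return [list(col) for col in zip(*data)]
```

Feeding the rotated matrix to the same hierarchical clustering routine then groups words by the blogs they co-occur in.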

----------------------------------

Next, let's look at the k-means algorithm.

Initialization: k points are randomly chosen as center points, and a map best_match: {center → [points nearest that center]} is created.

Iteration:

    for each vector:
        among all centers, find the closest center x
        add the vector to the set best_match[x]
    for each value of best_match (each value is a set of points):
        find the centroid of the set and use it to replace the old center
    compute the sum of distances between each point and its corresponding
    center under the new best_match
    if the sum keeps decreasing, continue; once it increases, stop.
    The goal is to minimize this sum.
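The loop can be sketched as follows. Note one deliberate substitution: for simplicity this version stops when the assignments no longer change, rather than tracking the distance sum described above; all names are my own:

```python
import random

def kmeans(points, k, distance, max_iters=100):
    """Basic k-means. best_match maps each center index to the points
    currently assigned to it; each center then moves to the mean of its
    assigned points. Stops when assignments stabilize."""
    centers = [list(p) for p in random.sample(points, k)]
    last_assignment = None
    for _ in range(max_iters):
        best_match = {i: [] for i in range(k)}
        assignment = []
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centers[i]))
            best_match[nearest].append(p)
            assignment.append(nearest)
        if assignment == last_assignment:
            break
        last_assignment = assignment
        for i, members in best_match.items():
            if members:  # leave an empty cluster's center where it is
                centers[i] = [sum(vals) / len(members)
                              for vals in zip(*members)]
    return centers, last_assignment
```

Because the starting centers are random, different runs can land in different local configurations; running the algorithm several times and keeping the best result is a common remedy.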

----------------------------------

A quick complexity analysis: the hierarchical (row-based) algorithm merges one pair per round for a total of about n rounds, and each round computes the distance between every pair of clusters, about n² distances, giving a total complexity of O(n³).

The number of rounds in k-means is not fixed, but each round costs about n × k distance computations, where k is the number of clusters. So when n is large, k-means is much cheaper.

----------------------------------

How can we display the k-means result graphically? Suppose there are n points and the real distance between points i and j is real(i, j). How can we place them on a plane?

You can use the gradient descent method.

First, each point i is assigned a random position (xi, yi);

Then iterate:

    for each i:
        for each j:
            compute the planar distance dist(i, j), compare it with
            real(i, j), and add the absolute error to total_error
            compute the gradient vector grad(i, j) along the direction
            from i to j
    adjust the position of each point: if it is too far from another point,
    move it closer along grad(i, j); otherwise move it farther away.

(What if A, B, and C lie on a line, with A too far from B but too close to C? Moving A toward B makes the error with respect to C larger; could this loop forever? Not quite: B and C are themselves being moved at the same time, which breaks the deadlock.)

However, gradient descent may not be able to obtain the global optimal solution.
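The whole procedure, a form of multidimensional scaling, can be sketched as follows. The learning rate, iteration cap, and relative-error weighting are illustrative choices, and as the text notes the outcome depends on the random starting positions:

```python
import random
from math import sqrt

def scale_down(real_dist, rate=0.01, max_iters=1000):
    """Place n items on a plane so pairwise planar distances approximate
    real_dist[i][j], using gradient descent. Stops when the total error
    starts rising; may settle in a local optimum."""
    n = len(real_dist)
    loc = [[random.random(), random.random()] for _ in range(n)]
    last_error = None
    for _ in range(max_iters):
        # Current planar distances between all pairs.
        fake = [[sqrt(sum((loc[i][d] - loc[j][d]) ** 2 for d in range(2)))
                 for j in range(n)] for i in range(n)]
        grad = [[0.0, 0.0] for _ in range(n)]
        total_error = 0.0
        for i in range(n):
            for j in range(n):
                if i == j or real_dist[i][j] == 0:
                    continue
                # Relative error: positive means i is too far from j.
                error = (fake[i][j] - real_dist[i][j]) / real_dist[i][j]
                total_error += abs(error)
                if fake[i][j] > 0:
                    for d in range(2):
                        grad[i][d] += ((loc[i][d] - loc[j][d])
                                       / fake[i][j] * error)
        if last_error is not None and total_error > last_error:
            break  # error started rising: stop
        last_error = total_error
        for i in range(n):
            for d in range(2):
                loc[i][d] -= rate * grad[i][d]  # step against the gradient
    return loc
```

Each point is pulled toward the points it is too far from and pushed away from the ones it is too close to, a little at a time, until the total error stops improving.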

------------------------------------

In this way, we can implement the clustering algorithms above and display the clustering results on a plane.
