Introduction to Text Clustering Algorithms


For reprints, please cite the source: http://blog.csdn.net/xiaojimanman/article/details/44977889

http://www.llwjy.com/blogdetail/41b268618a679a6ec9652f3635432057.html

My personal blog is now online at www.llwjy.com ~ comments and feedback are welcome ~
-------------------------------------------------------------------------------------------------


This blog post surveys the current, relatively mature clustering algorithms and discusses how to apply clustering to unstructured data (documents). The first part is sourced from Baidu Encyclopedia; the second part introduces the ideas behind the text clustering algorithm. For various reasons, no concrete code implementation is given here; if you are interested, feel free to leave a comment to discuss.


###################################################################################
# # # # # The following section introduces clustering and is sourced from Baidu Encyclopedia; if you are already familiar with it, skip ahead to the next part
###################################################################################


Clustering Concepts
Cluster analysis, also known as group analysis, is a statistical method for studying classification problems over samples or indicators, and it is also an important algorithm in data mining. A cluster is made up of similar patterns; a pattern is usually a vector of measurements, or equivalently a point in a multidimensional space. Cluster analysis is based on similarity: patterns within the same cluster are more similar to each other than patterns belonging to different clusters.

Algorithmic Use
In business, clustering can help market analysts separate distinct consumer groups from a consumer database and summarize the consumption patterns or habits of each group. As a module in data mining, it can be used as a stand-alone tool to discover information hidden deep in the database, to summarize the characteristics of each class, or to focus attention on a particular class for further analysis; cluster analysis can also serve as a preprocessing step for other data mining algorithms.
Clustering algorithms can be divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

Algorithm Classification
It is difficult to give a concise classification of clustering methods because the categories overlap, so one method may have the characteristics of several categories. Nevertheless, it is still useful to give a relatively organized description of the different clustering methods. The main approaches to cluster analysis are as follows:

Partitioning Methods
Partitioning methods: given a data set containing n tuples or records, a partitioning method constructs K groups, each of which represents a cluster, where K < n. These K groups satisfy the following conditions:
(1) each group contains at least one data record;
(2) each data record belongs to exactly one group (note: this requirement can be relaxed in some fuzzy clustering algorithms).
For a given K, the algorithm first produces an initial grouping and then iteratively relocates records between groups so that each new grouping scheme is better than the previous one. The criterion for "better" is that records in the same group should be as close to each other as possible, while records in different groups should be as far apart as possible.
Most partitioning methods are distance-based. Given K, the number of partitions to build, a partitioning method first creates an initial partition and then applies an iterative relocation technique that moves objects from one group to another. The general criterion for a good partition is that objects in the same cluster are as close or as related to each other as possible, while objects in different clusters are as far apart or as different as possible; many other criteria for judging partition quality also exist. Traditional partitioning methods can be extended to subspace clustering instead of searching the entire data space, which is useful when there are many attributes and the data is sparse. Reaching the global optimum with partition-based clustering may require exhaustive enumeration and is computationally very expensive, so in practice most applications use popular heuristic methods such as k-means and k-medoids, which improve clustering quality and approximate a local optimum. These heuristic methods are well suited to discovering spherical clusters in small and medium-sized databases; to discover clusters with complex shapes and to cluster very large data sets, partition-based methods need to be extended further.
Algorithms based on this idea include the k-means algorithm, the k-medoids algorithm, and the CLARANS algorithm.
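To make the iterative-relocation idea concrete, here is a minimal k-means sketch in Python; the data, function name, and parameter values are illustrative assumptions and not taken from the original post.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    recompute centroids, and repeat until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid -> index of nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# toy 2-D data with two obvious groups
data = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                 [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]])
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```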

Hierarchical Methods
Hierarchical methods decompose a given data set hierarchically until some condition is satisfied. They can be divided into "bottom-up" (agglomerative) and "top-down" (divisive) schemes.
For example, in the bottom-up scheme, each data record initially forms its own group; in each subsequent iteration the groups closest to each other are merged, until all records end up in a single group or some condition is met.
Hierarchical clustering methods can be distance-based or density- and connectivity-based. Some extensions of hierarchical clustering also consider subspace clustering. The drawback of the hierarchical approach is that once a step (a merge or a split) has been carried out, it cannot be undone. This rigidity is useful because it avoids having to consider a combinatorial number of alternative choices and therefore keeps the computational overhead small; however, the technique cannot correct a wrong decision. Several methods have been proposed to improve the quality of hierarchical clustering.
Representative algorithms include the BIRCH algorithm, the CURE algorithm, and the Chameleon algorithm.
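As an illustration only (not the author's code), the bottom-up scheme can be sketched with SciPy's hierarchical clustering; the toy data and the linkage choice are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 2-D data with two obvious groups
data = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                 [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]])

# bottom-up merging: each point starts as its own cluster and
# the closest pair of clusters is merged at every step
merge_tree = linkage(data, method="average", metric="euclidean")

# cut the merge tree so that exactly 2 clusters remain
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```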

Density-Based Methods
A fundamental difference between density-based methods and the other methods is that they are based not on distances of various kinds but on density. This overcomes the drawback that distance-based algorithms can only find roughly "spherical" clusters.
The guiding principle of this approach is that as long as the density of points in a region exceeds a certain threshold, the region is added to the nearby cluster.
Representative algorithms include the DBSCAN algorithm, the OPTICS algorithm, and the DENCLUE algorithm.
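A small, hedged DBSCAN example using scikit-learn; the data and the eps and min_samples values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups plus one isolated point (noise)
data = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
                 [8.0, 8.0], [8.1, 8.1], [7.9, 8.2],
                 [50.0, 50.0]])

# eps: neighborhood radius; min_samples: density threshold per neighborhood
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(data)
print(labels)   # noise points are labeled -1
```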

Graph-Theoretic Methods
The first step of graph-theoretic clustering is to build a graph corresponding to the problem: the nodes of the graph correspond to the smallest units of the data being analyzed, and the edges (or arcs) correspond to similarity measures between those units. Every pair of minimum processing units therefore has a measurable relationship, which ensures that the local characteristics of the data are relatively easy to handle. Graph-theoretic clustering takes the local connectivity of the sample data as its main source of information, so its main advantage is that it handles local data characteristics well.
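A rough sketch of the graph-theoretic idea, assuming a simple distance threshold for drawing edges and treating connected components as clusters; the data and threshold are illustrative.

```python
import numpy as np
import networkx as nx

# connect points whose distance is below a threshold,
# then treat each connected component as one cluster
data = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                 [8.0, 8.0], [8.2, 8.1]])

g = nx.Graph()
g.add_nodes_from(range(len(data)))
threshold = 1.0
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        if np.linalg.norm(data[i] - data[j]) < threshold:
            g.add_edge(i, j)

clusters = list(nx.connected_components(g))
print(clusters)   # e.g. [{0, 1, 2}, {3, 4}]
```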

Grid-Based Methods
Grid-based methods first divide the data space into a grid structure with a finite number of cells; all processing then operates on individual cells. A notable advantage of this approach is its very fast processing speed, which is usually independent of the number of records in the target database and depends only on the number of cells into which the data space is divided.
Representative algorithms include the STING algorithm, the CLIQUE algorithm, and the WaveCluster algorithm.
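A tiny grid-binning sketch of the idea, assuming a fixed cell size and a minimum point count per "dense" cell; both values are illustrative and do not follow any specific algorithm.

```python
import numpy as np
from collections import defaultdict

# assign every point to a grid cell, then keep only the dense cells
data = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                 [8.0, 8.0], [8.2, 8.1], [50.0, 50.0]])

cell_size = 2.0
cells = defaultdict(list)
for idx, point in enumerate(data):
    cell = tuple(int(c) for c in point // cell_size)
    cells[cell].append(idx)

min_points = 2
dense_cells = {cell: pts for cell, pts in cells.items() if len(pts) >= min_points}
print(dense_cells)   # e.g. {(0, 0): [0, 1, 2], (4, 4): [3, 4]}
```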

Model-Based Methods
Model-based methods assume a model for each cluster and then look for data sets that fit the model well. Such a model might be, for example, a density distribution function of the data points in space. An underlying assumption is that the target data set is generated by a mixture of probability distributions.
There are usually two directions: statistical approaches and neural-network approaches.
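A brief illustration of the statistical direction using scikit-learn's Gaussian mixture model; the data and the number of components are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# assume the data is generated by a mixture of 2 Gaussian distributions
data = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
                 [8.0, 8.0], [8.1, 8.2], [7.8, 7.9]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.predict(data))   # cluster assignment of each point
print(gmm.means_)          # fitted cluster centers
```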


###################################################################################
# # # # The following section is an analysis of text clustering methods
###################################################################################


Text Clustering
      Most current research on clustering algorithms targets structured data, and comparatively little work addresses unstructured data, so what follows is my own exploration. Because releasing the source code currently involves a number of issues, I only introduce the ideas here and do not provide code; if you want to know more, you can leave a comment below.
      Text (document) clustering rests mainly on the well-known clustering hypothesis: documents in the same cluster are highly similar to each other, while documents in different clusters are much less similar. As an unsupervised machine-learning method, clustering requires no training process and no manual labeling of documents in advance, so it has become an important means of effectively organizing, summarizing, and navigating text information.
The focus of this post is how to cluster unstructured text.

Text Clustering Approach
      Since clustering of structured data is already very mature, we will try to transform the unstructured data into structured data first; after that, the problem becomes much easier to handle.
      Because my own work is in search engines, some of the algorithm ideas here are based on that background. For how to convert unstructured data into structured data, you can refer to the blog post "Lucene-based Case Development: Index Mathematical Model."
The algorithm is described below:
First step: word segmentation
To simplify the model, we assume that each text has only one field. In this step we perform an initial analysis of all documents and collect the following statistics: which terms document n contains, the total number of terms in document n, how many times term m occurs in document n, how many documents term m appears in, and how many times term m appears across all documents. This step relies on word segmentation; when handling Chinese, a dedicated Chinese analyzer such as IK is recommended, because general-purpose analyzers do not handle Chinese well. This step converts each document into document = {term1, term2, term3, ..., termN}.
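A minimal sketch of the statistics collected in this step, assuming the documents have already been tokenized (for Chinese, an analyzer such as IK would produce the terms; here they are hard-coded for illustration).

```python
from collections import Counter

# illustrative documents, already split into terms
docs = {
    "doc1": ["text", "clustering", "groups", "similar", "text"],
    "doc2": ["text", "mining", "finds", "patterns"],
    "doc3": ["clustering", "is", "unsupervised"],
}

term_freq = {}          # per-document term counts
doc_len = {}            # number of terms in each document
doc_freq = Counter()    # number of documents each term appears in
total_freq = Counter()  # number of times each term appears overall

for doc_id, terms in docs.items():
    counts = Counter(terms)
    term_freq[doc_id] = counts
    doc_len[doc_id] = len(terms)
    doc_freq.update(counts.keys())
    total_freq.update(counts)

print(term_freq["doc1"]["text"])   # 2: "text" occurs twice in doc1
print(doc_freq["text"])            # 2: "text" appears in two documents
```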
Second step: calculate weights
The weight calculation used here is a slight variation on the usual approach; the exact formula is given as an image in the original post.

With this step, we convert document = {term1, term2, term3, ..., termN} into documentVector = {weight1, weight2, weight3, ..., weightN}.
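Because the post's exact weight formula is not reproduced here, the sketch below uses the standard TF-IDF weight as a stand-in; treat it as an assumption rather than the author's formula.

```python
import math

def tf_idf(term_count, doc_len, doc_freq, n_docs):
    """Standard TF-IDF stand-in:
    weight = (term count / doc length) * log(total docs / docs containing term)."""
    tf = term_count / doc_len
    idf = math.log(n_docs / doc_freq)
    return tf * idf

# doc1 from the sketch above: "text" occurs 2 times out of 5 terms,
# and appears in 2 of the 3 documents
print(tf_idf(term_count=2, doc_len=5, doc_freq=2, n_docs=3))
```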
Third step: n-dimensional vector space model
We place the documentVector from the second step into an n-dimensional vector space model (where n is the total number of distinct terms): the coordinate of document d on the m-th axis is the weight of the m-th term in document d, i.e. d = (weight1, weight2, ..., weightN).
Fourth step: find the two most similar documents
In the n-dimensional vector space model, we take the angle between two document vectors as the measure of similarity: the smaller the angle, the more similar the two documents. In this step we find the two vectors with the smallest angle between them, i.e. the two most similar documents.
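A small cosine-similarity sketch for finding the most similar pair of document vectors; the vectors and weights are made up.

```python
import numpy as np
from itertools import combinations

def cosine_similarity(a, b):
    """cos(theta) between two vectors: 1.0 means identical direction (angle 0)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# illustrative 4-term document vectors
vectors = {
    "doc1": np.array([0.4, 0.0, 0.3, 0.0]),
    "doc2": np.array([0.5, 0.1, 0.2, 0.0]),
    "doc3": np.array([0.0, 0.6, 0.0, 0.7]),
}

# the pair with the largest cosine has the smallest angle -> most similar pair
best_pair = max(combinations(vectors, 2),
                key=lambda pair: cosine_similarity(vectors[pair[0]], vectors[pair[1]]))
print(best_pair)   # ('doc1', 'doc2')
```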
Fifth step: merge documents
Treat the two documents found in the fourth step as a single document (these two documents are now regarded as one category).
Sixth step: check the stopping condition
Check whether the current number of documents satisfies the requirement (i.e. the number of remaining documents equals the desired number of clusters). If it does, the algorithm ends; if not, jump back to the second step and repeat steps 2, 3, 4, 5, and 6.
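A hedged sketch of the whole loop under simplifying assumptions: document vectors are precomputed (steps 1 to 3), and a merged "document" is represented by the mean of its members' vectors instead of re-running the earlier steps. Function names and data are illustrative.

```python
import numpy as np
from itertools import combinations

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_texts(doc_vectors, target_clusters):
    # each document starts as its own cluster: (member ids, vector)
    clusters = [([doc_id], vec) for doc_id, vec in doc_vectors.items()]
    while len(clusters) > target_clusters:                       # step 6
        # step 4: find the two clusters whose vectors have the smallest angle
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]))
        # step 5: merge them into a single "document"
        members = clusters[i][0] + clusters[j][0]
        merged_vec = (clusters[i][1] + clusters[j][1]) / 2.0
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, merged_vec))
    return [members for members, _ in clusters]

doc_vectors = {
    "doc1": np.array([0.4, 0.0, 0.3, 0.0]),
    "doc2": np.array([0.5, 0.1, 0.2, 0.0]),
    "doc3": np.array([0.0, 0.6, 0.0, 0.7]),
    "doc4": np.array([0.1, 0.5, 0.0, 0.6]),
}
print(cluster_texts(doc_vectors, target_clusters=2))
# e.g. [['doc3', 'doc4'], ['doc1', 'doc2']]
```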


Algorithm Evaluation

On my work laptop (an ordinary configuration with 4 GB of RAM), clustering 10,000 documents currently takes about 40-50 seconds. The original post shows a screenshot of the clustering result for 10 sample records.

