Introduction to text clustering algorithms



Please credit the source when reprinting: http://blog.csdn.net/xiaojimanman/article/details/44977889

http://www.llwjy.com/blogdetail/41b268618a679a6ec9652f3635432057.html

My personal blog is now live at www.llwjy.com ~ thank you ~
-------------------------------------------------------------------------------------------------


This post shows how a clustering algorithm for unstructured data (documents) can be built by drawing on mature clustering algorithms. The first part is taken from Baidu Encyclopedia; the second part introduces the idea behind the text clustering algorithm. For various reasons, no concrete code implementation is provided; if you are interested, leave a comment and we can discuss it later.


###################################################################################
##### The following is background on clustering, taken from Baidu Encyclopedia. If you already know it, skip to the next section.
###################################################################################


Clustering concept
Cluster analysis, also known as group analysis, is a statistical method for studying classification problems (of samples or indicators) and is also an important data mining algorithm. The input of cluster analysis is a set of patterns; a pattern is usually a measurement vector, i.e. a point in a multi-dimensional space. Cluster analysis is based on similarity: patterns within one cluster are more similar to each other than patterns in different clusters.

Algorithm usage
In business, clustering helps market analysts distinguish different consumer groups in a customer database and summarize the consumption patterns or habits of each group. As a data mining module, it can be used as a standalone tool to discover information hidden deep in a database and to summarize the characteristics of each category, or to focus further analysis on one particular category. Cluster analysis can also serve as a preprocessing step for other data mining algorithms.
Clustering algorithms can be divided into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

Algorithm Classification
It is difficult to give a crisp taxonomy of clustering methods, because the categories overlap and a single method may have features of several types. Nevertheless, a relatively organized description of the different clustering methods is still useful. The main families are the following:

Partitioning
Partitioning methods. Given a dataset of N tuples (records), a partitioning method constructs K groups, each representing a cluster, with K < N. The K groups satisfy the following conditions:
(1) Each group should contain at least one data record;
(2) Each data record belongs to exactly one group (note: this requirement can be relaxed in some fuzzy clustering algorithms);
For a given K, the algorithm first produces an initial grouping and then improves it through iteration, so that each new grouping scheme is better than the previous one. The criterion for "better" is that records within the same group become closer together while records in different groups move further apart.
Most partitioning methods are distance-based. Given k, the number of partitions to build, a partitioning method first creates an initial partition and then uses an iterative relocation technique that improves the partition by moving objects from one group to another. The general criterion for a good partition is that objects in the same cluster are as close or as related as possible, while objects in different clusters are as far apart or as different as possible. Many other criteria for judging partition quality exist as well. Traditional partitioning methods can be extended to subspace clustering instead of searching the entire data space, which is useful when there are many attributes and the data is sparse. Reaching the global optimum would require enumerating all possible partitions, which is computationally prohibitive, so most applications adopt popular heuristics such as the k-means and k-medoids algorithms, which improve the clustering quality incrementally and converge to a local optimum. These heuristic clustering methods are well suited to discovering spherical clusters in small and medium-sized databases. To find clusters with complex shapes or to cluster very large datasets, partitioning-based methods need to be extended further.
Algorithms using this basic idea include: K-MEANS algorithm, K-MEDOIDS algorithm, CLARANS algorithm;
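To make the k-means idea above concrete, here is a minimal sketch (an illustration only, not the original author's implementation; class and method names are made up). It assumes plain double[] feature vectors, Euclidean distance, and naive initialization from the first k points; a real implementation would add smarter seeding and a convergence check.

```java
import java.util.*;

public class KMeansSketch {

    /** Returns the cluster index assigned to each point (assumes k <= points.length). */
    static int[] cluster(double[][] points, int k, int maxIter) {
        int n = points.length, dim = points[0].length;
        double[][] centroids = new double[k][dim];
        // naive initialization: use the first k points as starting centroids
        for (int i = 0; i < k; i++) centroids[i] = points[i].clone();

        int[] assign = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            // assignment step: nearest centroid by (squared) Euclidean distance
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < dim; j++) {
                        double diff = points[i][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                assign[i] = best;
            }
            // update step: move each centroid to the mean of its members
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assign[i]]++;
                for (int j = 0; j < dim; j++) sums[assign[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;   // leave an empty cluster's centroid in place
                for (int j = 0; j < dim; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return assign;
    }
}
```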

Hierarchy
Hierarchical methods perform a hierarchical decomposition of a given dataset until some condition is met. They can follow either a "bottom-up" (agglomerative) or a "top-down" (divisive) scheme.
For example, in the "bottom-up" scheme, every data record initially forms its own group; in each subsequent iteration, nearby groups are merged, until all records form a single group or some stopping condition is met.
Hierarchical clustering can be based on distance, density, or connectivity. Some extensions of hierarchical clustering also consider subspace clustering. The drawback of hierarchical methods is that once a step (a merge or a split) has been performed, it cannot be undone. This rigidity is useful in that it avoids worrying about a combinatorial number of choices and keeps the computational overhead small, but the technique cannot correct a wrong decision. Several methods have been proposed to improve the quality of hierarchical clustering.
Representative algorithms include BIRCH algorithm, CURE algorithm, and CHAMELEON algorithm;
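As a rough illustration of the "bottom-up" scheme (again not the original author's code; the names and the single-linkage choice are my own assumptions), the sketch below starts with every point in its own cluster and repeatedly merges the closest pair until k clusters remain, given a precomputed distance matrix.

```java
import java.util.*;

public class AgglomerativeSketch {
    /** dist[i][j] is the distance between points i and j; returns k groups of point indices. */
    static List<List<Integer>> cluster(double[][] dist, int k) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) clusters.add(new ArrayList<>(List.of(i)));

        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double best = Double.MAX_VALUE;
            // find the pair of clusters with the smallest single-link distance
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = Double.MAX_VALUE;
                    for (int i : clusters.get(a))
                        for (int j : clusters.get(b))
                            d = Math.min(d, dist[i][j]);
                    if (d < best) { best = d; bestA = a; bestB = b; }
                }
            }
            // merge the closest pair; as noted above, this step cannot be undone
            clusters.get(bestA).addAll(clusters.get(bestB));
            clusters.remove(bestB);
        }
        return clusters;
    }
}
```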

Density Algorithm
Density-based methods. The fundamental difference between density-based methods and the others is that they are based not on distances of various kinds but on density. This overcomes a drawback of distance-based algorithms, which can only find roughly "spherical" clusters.
The guiding idea of this family is that as long as the density of points in a region exceeds a certain threshold, those points are added to the nearby cluster.
Representative algorithms include the DBSCAN algorithm, OPTICS algorithm, and DENCLUE algorithm;
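The following is a compact DBSCAN-style sketch, offered only as an illustration of the density idea (parameter and method names are my own). Points whose epsilon-neighborhood contains at least minPts members become core points, and density-reachable points are pulled into the same cluster.

```java
import java.util.*;

public class DbscanSketch {
    /** Returns a label per point: -1 for noise, otherwise a cluster id starting at 1. */
    static int[] cluster(double[][] pts, double eps, int minPts) {
        int n = pts.length;
        int[] label = new int[n];              // 0 = unvisited, -1 = noise, >0 = cluster id
        int clusterId = 0;
        for (int i = 0; i < n; i++) {
            if (label[i] != 0) continue;
            List<Integer> neighbors = regionQuery(pts, i, eps);
            if (neighbors.size() < minPts) { label[i] = -1; continue; }  // noise, may be claimed later
            clusterId++;
            label[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {
                int p = queue.poll();
                if (label[p] == -1) label[p] = clusterId;   // border point previously marked noise
                if (label[p] != 0) continue;
                label[p] = clusterId;
                List<Integer> pn = regionQuery(pts, p, eps);
                if (pn.size() >= minPts) queue.addAll(pn);  // p is itself a core point: keep expanding
            }
        }
        return label;
    }

    /** Indices of all points within eps of point i (including i itself). */
    static List<Integer> regionQuery(double[][] pts, int i, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++) {
            double d = 0;
            for (int k = 0; k < pts[i].length; k++) {
                double diff = pts[i][k] - pts[j][k];
                d += diff * diff;
            }
            if (Math.sqrt(d) <= eps) out.add(j);
        }
        return out;
    }
}
```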

Graph Theory Clustering
The first step in graph-theoretic clustering is to build a graph suited to the problem. Graph nodes correspond to the smallest units of the analyzed data, and graph edges (or arcs) correspond to similarity measures between those minimal processing units. Each minimal processing unit therefore has a metric representation, which makes the local features of the data easier to handle. Graph-theoretic clustering uses the local connection structure of the sample data as its main source of clustering information, so it is particularly good at handling local data characteristics.

Grid Algorithm
Grid-based methods divide the data space into a grid structure with a finite number of cells, and all processing is carried out on individual cells. A major advantage of this approach is its speed, which is usually independent of the number of records in the target database and depends only on the number of cells into which the data space is divided.
Representative algorithms include the STING algorithm, CLIQUE algorithm, and WAVE-CLUSTER algorithm;
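As a small illustration of the grid idea (not one of the algorithms named above; the names and the 2-D assumption are mine), the sketch below hashes points into fixed-size cells and reports the cells whose point count reaches a threshold, so the work depends on the number of occupied cells rather than on pairwise distances.

```java
import java.util.*;

public class GridSketch {
    /** Maps "cellX,cellY" -> point count for cells that hold at least `threshold` points. */
    static Map<String, Integer> denseCells(double[][] points, double cellSize, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (double[] p : points) {
            long cx = (long) Math.floor(p[0] / cellSize);
            long cy = (long) Math.floor(p[1] / cellSize);
            counts.merge(cx + "," + cy, 1, Integer::sum);   // one counter per grid cell
        }
        Map<String, Integer> dense = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= threshold) dense.put(e.getKey(), e.getValue());
        return dense;
    }
}
```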

Model Algorithm
Model-based methods assume a model for each cluster and then look for datasets that fit the model well. Such a model might be a density distribution function of the data points in space, or some other function. An underlying assumption is that the target dataset is generated by a series of probability distributions.
There are usually two approaches: statistical methods and neural network methods.


###################################################################################
##### Analysis of the text clustering method
###################################################################################


Text clustering
Most existing research on clustering algorithms targets structured data; relatively little targets unstructured data. Here I introduce my own work on this topic. Because of a series of issues around the source code, I will only describe the idea and will not provide the code; if you want to learn more, leave a comment below.
Document clustering is mainly based on the well-known clustering hypothesis: documents of the same class are highly similar to each other, while documents of different classes are much less similar. As an unsupervised machine learning method, clustering requires no training process and no manual pre-labeling of documents, so it offers flexibility and a high degree of automation, and it has become an important means of organizing, summarizing, and navigating text information.
The focus of this post is how to cluster unstructured text.

Text clustering algorithm
Since clustering of structured data is already well studied, we need a way to convert unstructured data into structured data so that it can be handled by existing techniques.
Because I work on search engines, some of these ideas come from that field. For how to convert unstructured data into structured data, see the blog post "Lucene-based case development: the index's mathematical model".
The specific algorithm is described step by step below:
Step 1: Word segmentation
To simplify the model, assume by default that each document has only one text attribute. In this step we initialize and analyze all documents, computing the following values: which of the N documents contain term M, in how many of the N documents term M appears, how many times term M appears within each document, and how many times term M appears across all documents combined. This step relies on word segmentation; when processing Chinese, segmenters such as IK are recommended, because general-purpose tokenizers do not handle Chinese very well. After this step each document is represented as Document = {term1, term2, term3 ... termN};
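A rough sketch of step 1 might look like the following. It uses a trivial whitespace tokenizer purely as a stand-in for a real Chinese segmenter such as IK, and the class, field, and method names are assumptions of mine, not the author's.

```java
import java.util.*;

public class TermStatsSketch {
    // one entry per document: term -> count within that document (term frequency)
    static List<Map<String, Integer>> termFreq = new ArrayList<>();
    // term -> number of documents containing it (document frequency)
    static Map<String, Integer> docFreq = new HashMap<>();

    static void analyze(List<String> documents) {
        for (String doc : documents) {
            Map<String, Integer> tf = new HashMap<>();
            // stand-in tokenizer; a Chinese segmenter such as IK would be used in practice
            for (String term : doc.toLowerCase().split("\\s+")) {
                if (term.isEmpty()) continue;
                tf.merge(term, 1, Integer::sum);      // count occurrences in this document
            }
            for (String term : tf.keySet()) {
                docFreq.merge(term, 1, Integer::sum); // count documents containing the term
            }
            termFreq.add(tf);
        }
    }
}
```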
Step 2: Calculate the weight
The weight calculation used here is slightly different from the one described previously. The formula is as follows (the formula image from the original post is not reproduced in this reprint):
After this step, Document = {term1, term2, term3 ... termN} is converted into DocumentVector = {weight1, weight2, weight3 ... weightN}.
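Since the author's exact weighting formula is not shown above, the sketch below uses the standard TF-IDF weight (term frequency times log(N/df)) purely as a stand-in; the post only says the real formula is slightly different. It reuses the statistics gathered in the step 1 sketch.

```java
import java.util.*;

public class WeightSketch {
    /** Converts one document's term counts into a term -> weight map (TF-IDF stand-in). */
    static Map<String, Double> toVector(Map<String, Integer> tf,
                                        Map<String, Integer> docFreq,
                                        int totalDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) totalDocs / df);   // rarer terms weigh more
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }
}
```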
Step 3: n-dimensional space vector model
We place the DocumentVector obtained in step 2 into an N-dimensional vector space model (N is the total number of distinct terms). The coordinate of document D on the M-th axis is the weight of term M in document D. For example:
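(The figure from the original post is not reproduced here. As a simple made-up illustration: suppose the whole corpus contains only three terms t1, t2, and t3. A document in which t1 has weight 0.8, t2 does not occur, and t3 has weight 0.3 is mapped to the point (0.8, 0, 0.3) in the 3-dimensional space, and every document becomes one such point.)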


Step 4: most similar documents
In the N-dimensional vector space model, the smaller the angle between two document vectors, the more similar the two documents. In this step we find the two vectors with the smallest angle between them (that is, the two most similar documents);
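A sketch of step 4 follows (the names are mine, not the author's). Because a smaller angle corresponds to a larger cosine, it computes the cosine similarity between sparse document vectors and does a brute-force search for the pair with the largest cosine.

```java
import java.util.*;

public class SimilaritySketch {
    /** Cosine similarity between two sparse term -> weight vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // only shared terms contribute to the dot product
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the indices of the two most similar document vectors. */
    static int[] mostSimilarPair(List<Map<String, Double>> vectors) {
        int[] best = {0, 1};
        double bestSim = -1;
        for (int i = 0; i < vectors.size(); i++)
            for (int j = i + 1; j < vectors.size(); j++) {
                double sim = cosine(vectors.get(i), vectors.get(j));
                if (sim > bestSim) { bestSim = sim; best = new int[]{i, j}; }
            }
        return best;
    }
}
```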
Step 5: Merge documents
Treat the two documents obtained in step 4 as a single document (that is, regard the two documents as belonging to the same category);
Step 6: Verify
Check whether the number of documents meets the requirement (that is, whether the number of remaining documents equals the desired number of clusters); if not, return to step 2 and repeat steps 2, 3, 4, 5, and 6.
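Putting steps 4 through 6 together, a simplified sketch of the overall loop might look like this. Two assumptions of mine: a merged "document" is represented by simply summing the two vectors, and the weights are not recomputed on each pass (the original algorithm returns to step 2), so this is only an approximation of the described procedure. It reuses the helpers from the step 4 sketch.

```java
import java.util.*;

public class TextClusteringSketch {
    /** Merges document vectors until only k clusters remain. */
    static List<Map<String, Double>> cluster(List<Map<String, Double>> vectors, int k) {
        List<Map<String, Double>> docs = new ArrayList<>(vectors);
        while (docs.size() > k) {
            int[] pair = SimilaritySketch.mostSimilarPair(docs);        // step 4: most similar pair
            Map<String, Double> merged = new HashMap<>(docs.get(pair[0]));
            for (Map.Entry<String, Double> e : docs.get(pair[1]).entrySet())
                merged.merge(e.getKey(), e.getValue(), Double::sum);     // step 5: merge the two documents
            docs.remove(pair[1]);        // pair[1] > pair[0], so removing it keeps pair[0] valid
            docs.set(pair[0], merged);
            // step 6: loop until only k "documents" (clusters) remain
        }
        return docs;
    }
}
```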


Algorithm Evaluation

Currently, on a laptop with an ordinary configuration (4 GB of RAM), clustering 10,000 documents takes roughly 40 to 50 seconds. The clustering result for 10 documents is shown in a screenshot in the original post, which is not reproduced in this reprint.

