Data Clustering Overview

Source: Internet
Author: User

[Introduction]

My research on data clustering aims to predict file access patterns. Many systems treat data access requests as independent events. In fact, requests are not completely random: they are driven by user or program behavior and follow specific access patterns. Similar users tend to have similar access patterns, and similar files are more likely to be accessed at the same time. Files in the same working set (which can be regarded as a class) are often accessed within a single transaction. Users or files can therefore be clustered based on historical file access information, and future accesses can then be predicted from the clusters to reduce file access latency and improve system performance.
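As a rough illustration of the prediction idea, the sketch below assumes a clustering of files already exists and treats the other members of an accessed file's cluster as prefetch candidates. The file names, the `file_to_cluster` mapping, and both function names are hypothetical; the original text does not specify how predictions are derived from clusters.

```python
from collections import defaultdict


def build_cluster_index(file_to_cluster):
    """Invert a file -> cluster-id mapping into cluster-id -> set of files."""
    members = defaultdict(set)
    for path, cluster_id in file_to_cluster.items():
        members[cluster_id].add(path)
    return members


def predict_next_accesses(accessed_file, file_to_cluster, members):
    """Return the other files in the accessed file's cluster (sorted)."""
    cluster_id = file_to_cluster.get(accessed_file)
    if cluster_id is None:
        return []  # file never seen in the access history: no prediction
    return sorted(members[cluster_id] - {accessed_file})


# Hypothetical clustering result obtained from historical access data.
file_to_cluster = {
    "a.c": 0, "a.h": 0, "Makefile": 0,
    "notes.txt": 1, "todo.txt": 1,
}
members = build_cluster_index(file_to_cluster)

# When "a.c" is opened, the rest of its cluster becomes prefetch candidates.
candidates = predict_next_accesses("a.c", file_to_cluster, members)
```

A real system would likely rank candidates and limit how many are prefetched; this sketch only shows the lookup step.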

[Body]

"Things of a kind come together." Human understanding of the world often starts with classifying known objects, so clustering is one of the most basic cognitive activities. Proper clustering makes it possible to study things and to grasp their underlying regularities. Clustering means grouping things (samples, objects, or variables) step by step according to how similar certain of their properties are, so that the similarity between classes is as small as possible and the similarity within each class is as large as possible.

From a statistical point of view, cluster analysis is a method for simplifying data through data modeling, and it plays an important role in multivariate data analysis. From a machine learning point of view, clustering is an unsupervised learning process, while classification is a supervised learning process. The difference is that classification requires the classes to be known in advance, whereas a clustering algorithm discovers the classes automatically. From a practical point of view, clustering has important applications in economics, biology, meteorology, medicine, information engineering, and engineering technology. It also plays an important role in data mining tasks such as scientific data exploration, information retrieval, text mining, spatial database analysis, web data analysis, and customer relationship management.

Clustering requires abstracting the attributes of things, which fall into numerical (quantitative) attributes and symbolic (qualitative) attributes. Symbolic attributes should be converted to numerical form before processing. The basic idea of cluster analysis is to define a similarity coefficient or a distance between samples that represents how similar they are, and then to classify the samples one by one based on that degree of similarity. Each object is characterized by a set of indicators, which may come from direct observation or from historical statistical data. Common traditional clustering algorithms include fuzzy clustering, K-means clustering, hierarchical clustering, and competitive clustering. The cluster analysis process consists of the following steps:

(1) Define sample indicators and obtain relevant statistical data
The choice of indicators is critical: the indicators, whether one or several, must properly capture the similarity of things. For example, in web clustering, page access frequency, access time, and access path can be used as indicators. Indicator data mainly comes from historical statistics or from direct observation.
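To make this step concrete, here is a minimal sketch that turns a raw access log into one indicator vector per page. The log format, the page names, and the function `build_indicator_table` are assumptions made for illustration, and "access time" is interpreted here as seconds spent on the page, which the original text does not specify.

```python
from collections import defaultdict


def build_indicator_table(log):
    """Aggregate (page, seconds) records into {page: [access_count, mean_seconds]}."""
    counts = defaultdict(int)
    total_seconds = defaultdict(float)
    for page, seconds in log:
        counts[page] += 1
        total_seconds[page] += seconds
    return {
        page: [counts[page], total_seconds[page] / counts[page]]
        for page in counts
    }


# Hypothetical historical access records: (page, seconds spent on the page).
log = [
    ("/home", 10.0),
    ("/home", 20.0),
    ("/docs", 120.0),
    ("/home", 30.0),
    ("/docs", 80.0),
]

# Each page is now described by two indicators: frequency and mean duration.
indicators = build_indicator_table(log)
```

The two indicators have very different numeric ranges, which is why the next step matters.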

(2) Standardize the indicators
To make indicator data easier to analyze and compare, and to keep an indicator's contribution from being lost because of differences in scale, the indicator data needs to be standardized and normalized. Many standardization methods exist; choose one based on the actual situation.
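Two widely used options are z-score standardization and min-max normalization. The sketch below shows both, applied column by column to a small made-up sample table; the function names and the data are illustrative assumptions, and the text does not say which method the author used.

```python
import math


def z_score(column):
    """Standardize a column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0:
        return [0.0] * n  # a constant indicator carries no information
    return [(x - mean) / std for x in column]


def min_max(column):
    """Rescale a column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]


def standardize_rows(samples, method=min_max):
    """Apply a column-wise method to a list of equal-length sample vectors."""
    columns = [method(list(col)) for col in zip(*samples)]
    return [list(row) for row in zip(*columns)]


# Made-up samples: [access count, mean seconds per access].
samples = [[3, 20.0], [2, 100.0], [7, 60.0]]
scaled = standardize_rows(samples)            # min-max, each column in [0, 1]
z_scaled = standardize_rows(samples, z_score)  # z-score alternative
```

Min-max keeps values in a fixed range but is sensitive to outliers; z-score is often preferred when indicators are roughly bell-shaped.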

(3) Construct a similarity matrix and modify it as needed
A similarity coefficient is calculated for each pair of samples from their indicator values, and an N x N similarity matrix is constructed, where N is the number of samples. Methods for calculating the similarity coefficient include the scalar product method, correlation coefficient method, exponential similarity coefficient method, max-min method, arithmetic-mean minimum method, geometric-mean minimum method, absolute-value exponent method, cosine (angle) method, average gap method, expert scoring method, and distance method; choose one according to the actual problem. Some clustering algorithms place special requirements on the similarity matrix, such as reflexivity, symmetry, and transitivity, which means the similarity matrix must be transformed appropriately.
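The sketch below illustrates this step with the cosine method and then applies one common transformation: the max-min transitive closure, which fuzzy clustering typically uses to make a reflexive, symmetric similarity matrix transitive as well. The function names and sample data are assumptions; the closure routine assumes non-negative indicator values so that similarities lie in [0, 1].

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, clamped to at most 1.0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return min(1.0, dot / (norm_u * norm_v))


def similarity_matrix(samples, similarity=cosine_similarity):
    """Build a symmetric N x N matrix with 1.0 on the diagonal (reflexive)."""
    n = len(samples)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = similarity(samples[i], samples[j])
            matrix[i][j] = matrix[j][i] = s
    return matrix


def max_min_compose(r, s):
    """Max-min composition of two square fuzzy relations."""
    n = len(r)
    return [
        [max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(matrix):
    """Square the matrix under max-min composition until it stops changing."""
    current = matrix
    for _ in range(max(1, len(matrix))):  # bounded; a reflexive input converges sooner
        squared = max_min_compose(current, current)
        if squared == current:
            break
        current = squared
    return current


samples = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
matrix = similarity_matrix(samples)
closure = transitive_closure(matrix)
```

In this example samples 0 and 2 have similarity 0 in the raw matrix, but the closure links them through sample 1 at the level of the weaker of the two intermediate similarities.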

(4) Apply a suitable clustering method and test the result
Algorithms differ mainly in clustering quality, performance, and complexity, and each has its own areas of application, so choose an algorithm that fits the application. The clustering results also need to be analyzed and tested to demonstrate the merits of the chosen algorithm.
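As one example of this step, the following sketch runs a plain K-means (one of the algorithms named earlier) and then checks the result by comparing the mean within-cluster distance with the mean between-cluster distance. Seeding the centroids with the first k samples, the sample data, and the function names are simplifying assumptions made for illustration; production code would normally use a better initialization and a standard validity index.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def k_means(samples, k, max_iterations=100):
    """Plain K-means; centroids are seeded with the first k samples (naive)."""
    centroids = [list(s) for s in samples[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each sample goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(s, centroids[c]))
            for s in samples
        ]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def mean_pair_distances(samples, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            d = euclidean(samples[i], samples[j])
            (within if labels[i] == labels[j] else between).append(d)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(within), mean(between)


# Made-up, already standardized samples forming two obvious groups.
samples = [
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
]
labels, centroids = k_means(samples, k=2)
within, between = mean_pair_distances(samples, labels)
```

A good clustering shows a within-cluster distance much smaller than the between-cluster distance, which matches the goal of high similarity inside a class and low similarity between classes.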

[Further Reading]

If you are interested in data clustering and want to learn more, refer to the paper "Survey of Clustering Algorithms" by Rui Xu, Student Member, IEEE, and Donald Wunsch II, Fellow, IEEE. It is a survey that covers almost every area of clustering algorithms. It discusses the algorithms themselves (hierarchical, partitional, large-dataset, graph-based, text clustering, fuzzy clustering, etc.) and issues related to clustering (how to calculate distance, how to determine the number of clusters, how to evaluate clustering results, etc.).

(Liu aigui/aiguille. Liu)
