What is the difference between text classification and clustering?

Last Update:2018-12-05 Source: Internet

Author: User

Put simply, classification assigns an article or text to one of a set of predefined categories. Clustering compares the similarity among a group of articles or texts and puts similar ones into the same group. Both classification and clustering group similar objects together. The difference is that in classification the categories are defined in advance and their number is fixed. The classifier must be trained on a manually labeled training corpus, so classification is a form of supervised learning. Clustering has no predefined categories, and the number of categories is not known in advance. It requires neither manual labeling nor a pre-trained classifier; the categories are generated automatically during clustering.

Classification suits situations where the categories or the classification system already exist, such as classifying books according to a national library classification scheme. Clustering suits situations where there is no classification system or the number of categories is uncertain. It is generally used as a front end for other applications, for example multi-document summarization and the clustering of search engine results after retrieval (meta-search).
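The contrast can be illustrated with a minimal sketch in Python. It is not a production method: the bag-of-words representation, the cosine similarity measure, the nearest-example rule, the greedy grouping, the `threshold` value, and all sample texts and labels are assumptions chosen only for illustration. `classify` needs manually labeled examples, while `cluster` forms groups from the texts alone.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-case the text and count whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(text, labeled_docs):
    """Classification: return the label of the most similar labeled document."""
    vec = bag_of_words(text)
    best_label, _ = max(
        ((label, cosine(vec, bag_of_words(doc))) for doc, label in labeled_docs),
        key=lambda pair: pair[1],
    )
    return best_label

def cluster(texts, threshold=0.3):
    """Clustering: greedily put each text into the first group whose first
    member is similar enough; otherwise start a new group. No labels used."""
    groups = []
    for text in texts:
        vec = bag_of_words(text)
        for group in groups:
            if cosine(vec, bag_of_words(group[0])) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups

# Hypothetical labeled corpus: the categories are fixed in advance.
labeled_docs = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the central bank raised interest rates", "finance"),
    ("stock markets fell after the rates decision", "finance"),
]
print(classify("the bank cut interest rates", labeled_docs))

# Hypothetical unlabeled texts: the groups emerge from the data.
unlabeled = [
    "football match ends in a draw",
    "interest rates rise again",
    "another football match tonight",
    "bank holds interest rates",
]
print(cluster(unlabeled))
```

Note that the number of groups returned by `cluster` depends on the data and the threshold, whereas `classify` can only ever return one of the labels it was given.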

Classification finds a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. Classification is an important task in data mining and is currently the most widely applied one in business. Its purpose is to learn a classification function or classification model (also known as a classifier) that maps data items in a database to one of a given set of categories.

To construct a classifier, you need a training dataset as input. A training set consists of database records, or tuples, each of which is a feature vector made up of the values of the relevant fields (also known as attributes or features). Each training sample also carries a category label. A sample can be written as (v1, v2, ..., vn; c), where vi is the value of the i-th field and c is the category. Methods for constructing classifiers include statistical methods, machine learning methods, and neural network methods.
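As a rough sketch of how labeled samples of the form (v1, ..., vn; c) turn into a classifier, the Python snippet below trains a nearest-centroid classifier. The choice of nearest-centroid, the function names `train` and `predict`, and the sample values are assumptions made for illustration; the article does not prescribe a particular method.

```python
from collections import defaultdict

# Each training sample is (v1, ..., vn; c): a feature vector plus a category label.
# The values below are made up for illustration.
training_set = [
    ((1.0, 1.2), "A"),
    ((0.8, 1.0), "A"),
    ((4.0, 4.2), "B"),
    ((4.2, 3.8), "B"),
]

def train(samples):
    """Learn one centroid (mean feature vector) per category."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*vectors))
        for label, vectors in grouped.items()
    }

def predict(centroids, features):
    """Map a feature vector to the category with the closest centroid."""
    def squared_distance(label):
        return sum((x - y) ** 2 for x, y in zip(features, centroids[label]))
    return min(centroids, key=squared_distance)

model = train(training_set)
print(predict(model, (1.1, 0.9)))
print(predict(model, (3.9, 4.0)))
```

Here `model` plays the role of the classification function: it maps any new feature vector to one of the categories seen during training.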

Different classifiers have different characteristics. Three criteria are used to evaluate or compare classifiers: 1) prediction accuracy; 2) computational complexity; 3) conciseness of the model description. Prediction accuracy is the most widely used criterion, especially for predictive classification tasks. Computational complexity depends on implementation details and the hardware environment. Because data mining operates on massive amounts of data, space and time complexity are very important. For descriptive classification tasks, the more concise the model description, the better it is received.
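Prediction accuracy, the first criterion, is commonly computed as the fraction of test samples whose predicted category matches the true one. A minimal Python sketch follows; the function name `accuracy` and the label lists are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical held-out labels and classifier output: 4 of 5 match.
y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham"]
print(accuracy(y_true, y_pred))  # 0.8
```

Accuracy should be measured on samples that were not used for training, otherwise it overstates how well the classifier handles unknown objects.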

Note also that classification performance generally depends on the characteristics of the data. Some data are noisy, some have missing values, some are sparse, and some have strongly correlated fields or attributes; some attributes are discrete, while others are continuous or mixed. It is widely believed that no single method suits data with all of these characteristics.

Clustering groups unlabeled samples into different groups on the principle that similar objects belong together. Each such set of data objects is called a cluster, and each cluster is then described. The goal is for samples in the same cluster to be similar to each other and for samples in different clusters to be dissimilar. Unlike classification, before clustering you do not know how many groups the data will be divided into, what those groups are, or which spatial rules are used to distinguish them. The objective is to discover the functional relationships between the attributes of spatial objects; the mined knowledge is expressed as mathematical equations in which the attributes are the variables. Clustering is developing rapidly in data mining, statistics, machine learning, spatial database technology, biology, and marketing, and cluster analysis has become an active research topic in data mining. Common clustering algorithms include K-means, K-medoids, CLARANS, BIRCH, CLIQUE, and DBSCAN.
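Of the algorithms listed, K-means is the simplest to sketch. The Python version below is a minimal illustration under stated assumptions: the initial centroids are simply the first k points (practical implementations usually use random or smarter initialization), the distance is Euclidean, and the sample points are invented. Unlike the other aspects of clustering described above, K-means does require the number of clusters k to be chosen up front.

```python
import math

def kmeans(points, k, iterations=20):
    """Minimal k-means. Returns (centroids, assignments)."""
    # Naive initialization (an assumption for this sketch): the first k points.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignments

# Invented 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.9, 1.1), (4.9, 5.1)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

No labels are supplied anywhere: the two clusters are discovered purely from the distances between points.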

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
