Classification and clustering

Source: Internet
Author: User

Classification

Classification finds a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. Classification analysis is an important task in data mining and is currently most widely applied in business. The purpose of classification is to learn a classification function or classification model (also known as a classifier) that maps each data item in a database to one of a given set of classes.

Both classification and regression can be used for prediction. Both aim to derive a generalized description of the given data automatically from historical records, so that future data can be predicted. The difference is in the output: classification produces a discrete class value, while regression produces a continuous value. Both kinds of model are often represented as decision trees. To classify a record, start at the root of the tree, follow at each node the branch whose condition the record's values satisfy, and read the category off the leaf that is reached.

Constructing a classifier requires a training sample dataset as input. A training set consists of a set of database records or tuples, each of which is a feature vector made up of the values of the relevant fields (also known as attributes or features). In addition, each training sample carries a category label. A sample can be written as (v1, v2, ..., vn; c), where vi is the value of the i-th field and c is the category.

Classifier construction methods include statistical methods, machine learning methods, and neural network methods, and different classifiers have different characteristics. Classifiers are evaluated or compared on three criteria: 1) prediction accuracy; 2) computational complexity; 3) conciseness of the model description. Prediction accuracy is the most widely used criterion, especially for predictive classification tasks.
Computational complexity depends on the specific implementation details and the hardware environment. In data mining, because the data being processed is massive, space and time complexity are very important considerations. For descriptive classification tasks, the more concise the model description, the more it is preferred. It should also be noted that classification performance generally depends on the characteristics of the data. Some data is noisy, some has missing values, some is sparse, and some has strongly correlated fields or attributes; some attributes are discrete, while others are continuous or mixed. It is currently widely believed that no single method suits data with every kind of characteristic.
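The ideas above (labeled samples of the form (v1, v2, ..., vn; c), walking a decision tree from root to leaf, and measuring prediction accuracy) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the fields (age, income), the thresholds, the labels, and the tree itself, which is written by hand rather than learned from data. Accuracy is measured on the same samples only to keep the example short.

```python
# Hypothetical training samples in the form (v1, v2, ..., vn; c).
# Each feature vector is (age, income); c is the category label.
samples = [
    ((25, 30000), "no"),
    ((45, 80000), "yes"),
    ((35, 60000), "yes"),
    ((22, 70000), "no"),
    ((50, 40000), "yes"),  # this sample disagrees with the tree below
]

# A hand-written decision tree (illustrative, not learned).
# Internal nodes test one field against a threshold; leaves hold a label.
tree = {
    "field": 1, "threshold": 50000,          # test income first
    "below": {"label": "no"},
    "at_or_above": {
        "field": 0, "threshold": 30,         # then test age
        "below": {"label": "no"},
        "at_or_above": {"label": "yes"},
    },
}

def classify(node, features):
    """Walk from the root to a leaf, taking the branch the values satisfy."""
    while "label" not in node:
        value = features[node["field"]]
        node = node["below"] if value < node["threshold"] else node["at_or_above"]
    return node["label"]

def accuracy(node, labeled_samples):
    """Fraction of samples whose predicted label equals the given label."""
    correct = sum(1 for features, label in labeled_samples
                  if classify(node, features) == label)
    return correct / len(labeled_samples)

print(classify(tree, (45, 80000)))
print(accuracy(tree, samples))
```

In practice the tree would be learned from the training set, and accuracy would be estimated on samples that were held out from training.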

Clustering

Clustering divides a collection of unlabeled samples into different groups on the principle that similar objects belong together. Each such group of data objects is called a cluster, and each cluster is then described. The aim is for samples in the same cluster to be similar to one another and for samples in different clusters to be dissimilar. Unlike classification, before clustering you do not know how many groups the data will be divided into, what those groups are, or which rules will be used to tell them apart. The objective is to discover functional relationships between the attributes of spatial objects; the mined knowledge is expressed as mathematical equations in which those attributes are the variables. Clustering technology is currently developing rapidly, spanning data mining, statistics, machine learning, spatial database technology, biology, marketing, and other fields, and cluster analysis has become an active research topic in data mining. Common algorithms include K-means, K-center, CLARANS, BIRCH, CLIQUE, and DBSCAN.
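As a rough illustration of grouping unlabeled samples by similarity, here is a minimal sketch of K-means, one of the algorithms named above. The data points, the choice of k = 2, and the naive initialization (the first k points serve as starting centroids) are assumptions made for this example; practical implementations usually initialize more carefully.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, max_iterations=100):
    # Naive initialization (an assumption of this sketch): first k points.
    centroids = [points[i] for i in range(k)]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # no change, so the algorithm has converged
            break
        centroids = new_centroids
    assignments = [nearest_centroid(p, centroids) for p in points]
    return centroids, assignments

# Hypothetical unlabeled 2-D samples forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

No class labels are supplied; the cluster indices come out of the data alone, which is the contrast with classification drawn above.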
