Classification Algorithms: Overview and Comparison


Classification is an important research area in data mining, machine learning, and pattern recognition. By analyzing and comparing representative classification algorithms currently used in data mining, this article summarizes the characteristics of each algorithm, providing a basis for users to select an algorithm and for researchers to improve one.

I. Classification Algorithm Overview

There are many ways to solve classification problems. Single classification methods include decision trees, Bayesian classifiers, artificial neural networks, K-nearest neighbors, support vector machines, and classification based on association rules; there are also ensemble learning algorithms that combine single classifiers, such as bagging and boosting.

(1) Decision Tree
Decision trees are one of the main techniques used for classification and prediction. Decision tree learning is an instance-based inductive learning algorithm that aims to extract classification rules, represented as a decision tree, from a set of unordered instances. The purpose of constructing a decision tree is to find the relationship between attributes and classes and use it to predict the class of future records whose class is unknown. The tree is built top-down and recursively: attributes are compared at the internal nodes, branches are followed downward according to the attribute values, and a conclusion is reached at the leaf nodes.
The main decision tree algorithms include ID3, C4.5 (C5.0), CART, PUBLIC, SLIQ, and SPRINT. They differ in the criterion used to select test attributes, the structure of the generated tree, the pruning method, and whether (and when) they can handle large datasets.
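As a minimal sketch of the idea (assuming scikit-learn and its bundled iris dataset, neither of which comes from this article), the following trains a small CART-style tree with entropy-based splits and prints the extracted classification rules:

```python
# Minimal decision tree sketch using scikit-learn (assumed available).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Entropy-based splitting, limited depth to keep the rules readable.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print(export_text(clf))  # the classification rules learned by the tree
```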

(2) Artificial Neural Networks
An Artificial Neural Network (ANN) is a mathematical model that processes information with a structure analogous to the neural networks of the brain. In this model, a large number of nodes ("neurons" or "units") are connected to one another to form a network that processes information. A neural network usually needs to be trained, and training is the network's learning process: it adjusts the connection weights between nodes so that the network can classify. The trained network can then be used for object recognition.
At present there are hundreds of different neural network models, including BP (back-propagation) networks, radial basis function (RBF) networks, local neural networks, stochastic neural networks (Boltzmann machines), and competitive neural networks (Hamming networks, self-organizing maps). However, neural networks still have disadvantages such as slow convergence, heavy computation, long training time, and poor interpretability.
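A minimal sketch of a BP-style network, using scikit-learn's multilayer perceptron on the iris dataset; the single hidden layer of 16 units and the iteration limit are illustrative assumptions, not recommendations from the article:

```python
# Minimal BP-style neural network sketch with scikit-learn's MLPClassifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling helps convergence; the hidden-layer size is an arbitrary choice.
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))
```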

(3) Support Vector Machine
The Support Vector Machine (SVM) is a learning method proposed by Vapnik based on statistical learning theory. Its most important feature is structural risk minimization: the optimal separating hyperplane is constructed with maximum margin to improve the generalization ability of the learning machine, and the problems of non-linearity, high dimensionality, and local minima are handled well. For a classification problem, the SVM algorithm computes a decision surface from the training samples and then uses it to determine the class of unknown samples.
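A minimal sketch of an SVM classifier, assuming scikit-learn and the iris dataset; the RBF kernel and default regularization below are illustrative choices (parameter tuning is discussed later in the comparison section):

```python
# Minimal SVM sketch: an RBF-kernel SVC on scaled features.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
```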

(4) VSM Method

The VSM (vector space model) method was proposed by Salton and others in the late 1960s and is the earliest and best-known mathematical model for information retrieval. Its basic idea is to represent a document as a weighted feature vector, D = D(T1, W1; T2, W2; ...; Tn, Wn), and then determine the class of a sample by computing text similarity. When text is represented as a space vector, text similarity can be expressed by the inner product between feature vectors.

In practical applications, the VSM method generally first builds a class vector space from the training samples and the classification scheme of the corpus. To classify a new sample, one only needs to compute the similarity (i.e., the inner product) between the sample and each class vector, and then assign the sample to the class with the highest similarity.

In the VSM method, the class space vectors must be computed in advance, and their quality depends largely on the feature terms they contain. Studies show that the more non-zero feature terms a class contains, the weaker the ability of each individual feature term to express that class. Therefore, the VSM method is better suited than other classification methods to the classification of professional documents with a specialized, well-defined vocabulary.
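A toy sketch of the VSM idea, assuming scikit-learn's TfidfVectorizer; the corpus, labels, and test document below are hypothetical. Class vectors are built as the mean of the training document vectors for each class, and a new document is assigned to the class whose vector has the highest inner product with it:

```python
# Minimal VSM sketch: class vectors from TF-IDF, classification by inner product.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus (illustrative data, not from the article).
train_docs = [
    "stock market shares trading",
    "bank interest rates loans",
    "football match goal score",
    "tennis tournament final match",
]
train_labels = ["finance", "finance", "sports", "sports"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs).toarray()

# Build one class vector per label as the mean of that class's document vectors.
classes = sorted(set(train_labels))
class_vectors = np.vstack([
    X[[i for i, lab in enumerate(train_labels) if lab == c]].mean(axis=0)
    for c in classes
])

# Classify a new document by the largest inner product with the class vectors.
new_doc = vectorizer.transform(["the final score of the match"]).toarray()[0]
scores = class_vectors @ new_doc
print(dict(zip(classes, scores)), "->", classes[int(np.argmax(scores))])
```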

 

(5) Bayes
Bayesian classification algorithms are a family of algorithms that use probability and statistics for classification; naive Bayes is a typical example. These algorithms use Bayes' theorem to predict the probability that an unknown sample belongs to each class, and the class with the highest probability is chosen as the sample's final class. Bayesian classification relies on a strong conditional-independence assumption that often does not hold in practice, which reduces classification accuracy. For this reason, many Bayesian classification algorithms that relax the independence assumption have been proposed, such as TAN (Tree Augmented Naive Bayes), which adds dependencies between attribute pairs on top of the naive Bayesian network structure.
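A minimal naive Bayes sketch, assuming scikit-learn and the iris dataset; Gaussian naive Bayes is used here simply because the features are continuous:

```python
# Minimal naive Bayes sketch with scikit-learn's GaussianNB.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)

# predict_proba gives the posterior probability of each class for a sample;
# the predicted class is simply the one with the highest posterior.
print("Posteriors for first test sample:", nb.predict_proba(X_test[:1]))
print("Test accuracy:", nb.score(X_test, y_test))
```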

(6) K-Nearest Neighbors (KNN)
The K-nearest neighbors (KNN) algorithm is an instance-based classification method. To classify an unknown sample X, it finds the K training samples closest to X and assigns X to the class to which most of those neighbors belong. KNN is a lazy learning method: it simply stores the samples and defers all computation until a classification is actually requested. If the sample set is large or complex, this can cause high computational overhead, so KNN is not suitable for applications with strict real-time requirements.
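A minimal KNN sketch, assuming scikit-learn and the iris dataset; K = 5 is an arbitrary illustrative choice:

```python
# Minimal KNN sketch with scikit-learn's KNeighborsClassifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Lazy" learning: fit() only stores the training data; all distance
# computations happen at prediction time.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```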

(7) Classification Based on Association Rules
Association rule mining is an important research area in data mining, and in recent years scholars have extensively studied how to apply it to classification. Associative classification mines rules of the form condset → C, where condset is a set of items (or attribute-value pairs) and C is a class label; such rules are called class association rules (CARs). Associative classification generally consists of two steps: first, an association rule mining algorithm extracts from the training data all class association rules that satisfy the specified support and confidence thresholds; second, a heuristic method selects a set of high-quality rules from the extracted rules and uses them for classification, as sketched below. The main associative classification algorithms include CBA, ADT, and CMAR.
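A toy sketch of the second step (rule selection and application), assuming the class association rules have already been mined; the rules, attributes, and default class below are entirely hypothetical:

```python
# Toy sketch of applying class association rules (CARs): each rule is
# (condset, class_label, confidence); a sample is assigned the class of the
# highest-confidence rule whose condset it satisfies.
rules = [
    ({("outlook", "sunny"), ("humidity", "high")}, "no", 0.92),
    ({("outlook", "overcast")}, "yes", 0.88),
    ({("windy", "false")}, "yes", 0.70),
]
rules.sort(key=lambda r: r[2], reverse=True)  # prefer higher-confidence rules

def classify(sample, rules, default="yes"):
    items = set(sample.items())
    for condset, label, _conf in rules:
        if condset <= items:  # the rule fires if its condset is contained in the sample
            return label
    return default            # fall back to a default class when no rule matches

print(classify({"outlook": "sunny", "humidity": "high", "windy": "false"}, rules))
```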

(8) Ensemble Learning
The complexity of real applications and the diversity of data often make a single classification method ineffective. Therefore, scholars have extensively studied the combination of multiple classification methods, that is, ensemble learning. Ensemble learning has become a hot topic in the international machine learning community and is regarded as one of the four main research directions of machine learning.
Ensemble learning is a machine learning paradigm. It repeatedly runs a single learning algorithm to obtain different base learners and then combines these learners according to some rule to solve the same problem, which can significantly improve the generalization ability of the learning system. The base learners are usually combined by (weighted) voting; common algorithms include bagging and boosting.
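A minimal sketch of both approaches, assuming scikit-learn and the iris dataset; the number of estimators and the tree base learner are illustrative choices:

```python
# Minimal ensemble sketch: bagging and boosting combined by voting.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: train base trees on bootstrap samples and combine them by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# Boosting: train base learners sequentially, reweighting misclassified samples.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```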

II. Classification Algorithm Comparison

When no further background information is available, SVM is generally preferred if prediction accuracy is the goal, while a decision tree is generally preferred if the model needs to be interpretable. When using SVM, the Gaussian (RBF) kernel is the usual choice, and the model parameters should be selected by cross-validation.
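A minimal sketch of this recommendation, assuming scikit-learn and the iris dataset; the grid of C and gamma values below is an illustrative assumption:

```python
# Minimal sketch: RBF-kernel SVM with cross-validated parameter selection.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1, "scale"]}

# 5-fold cross-validation over the grid of C and gamma values.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```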

A table comparing common classification algorithms can be found in Kotsiantis, S. B., "Supervised Machine Learning: A Review of Classification Techniques," Informatica, 2007, 31, 249-268.

Similar conclusions appear in the classic text: Hastie, T., Tibshirani, R., & Friedman, J., The Elements of Statistical Learning, Second Edition, Springer, 2009.
