Weka Practice: Mastering the Use of Open-Source Data Mining Tools

Source: Internet
Author: User
Keywords: data mining, open source, Weka

To meet the growing demand for extracting knowledge from data, data mining technology has made great progress. Classification is one of the most important tasks in data mining and is currently the most widely used in commercial applications. This article compares the effectiveness of different classification algorithms through simple experiments with the open-source data mining tool Weka, to help newcomers understand the characteristics of different classification algorithms and master the use of open-source data mining tools.

A classification algorithm is a method for solving classification problems, and an important research area in data mining, machine learning, and pattern recognition. Classification algorithms predict the categories of new data by analyzing training sets of known classes and discovering classification rules. They are widely used in bank risk assessment, customer classification, text retrieval and search-engine classification, intrusion detection in the security domain, software project applications, and so on.

Introduction to classification algorithms

The typical classification algorithms are described below.

Bayes

A Bayesian classifier uses Bayes' formula to compute the posterior probability of an object, that is, the probability that the object belongs to each class, and assigns the object to the class with the maximum posterior probability. There are currently four main kinds of Bayesian classifier: naive Bayes, TAN, BAN, and GBN.
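As a concrete illustration of the maximum-posterior rule just described, here is a minimal sketch in plain Python (not Weka code), using a single binary feature and invented spam/ham probabilities:

```python
# Hypothetical example: classify an email as "spam" or "ham" using Bayes'
# formula P(class | x) ∝ P(x | class) * P(class), then pick the class with
# the maximum posterior probability. All probabilities below are invented.

priors = {"spam": 0.4, "ham": 0.6}   # P(class)
likelihoods = {                      # P(word "offer" appears | class)
    "spam": 0.7,
    "ham": 0.1,
}

# Unnormalized posteriors for an email containing the word "offer".
scores = {c: likelihoods[c] * priors[c] for c in priors}

# Normalize so the posteriors sum to 1.
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}

predicted = max(posteriors, key=posteriors.get)
print(predicted, round(posteriors[predicted], 3))   # spam 0.824
```

Note that normalization does not change which class wins; a classifier that only needs the decision can compare the unnormalized scores directly.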

Bayesian Network (BayesNet)

A Bayesian network is a directed acyclic graph annotated with probabilities. Each node in the graph represents a random variable; an arc between two nodes indicates that the corresponding random variables are probabilistically dependent, while the absence of an arc means they are conditionally independent. Every node X in the network has a conditional probability table (CPT), which gives the conditional probability of X taking each possible value given the values of its parent nodes. If X has no parent, its CPT is simply its prior probability distribution. The structure of a Bayesian network together with the CPTs of its nodes defines the probability distribution of every variable in the network.

Classifying with a Bayesian network classifier has two stages. The first stage is learning: the classifier is constructed from sample data, which includes both structure learning and CPT learning. The second stage is inference: the conditional probability of the class node is computed and the data to be classified are assigned to classes. The time complexity of both stages depends on the degree of dependence among the attribute values and can in the worst case be NP-complete, so in practical applications the Bayesian network classifier must be simplified. Different assumptions about the degree of correlation among attribute values yield the various kinds of Bayesian classifier.
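The prior/CPT distinction above can be sketched with a tiny two-node network in plain Python (not Weka code; the network and all probabilities are invented):

```python
# A two-node Bayesian network: Rain -> WetGrass. Rain has no parent, so its
# "CPT" is just a prior distribution; WetGrass has a CPT indexed by the value
# of its parent. All probabilities below are invented for illustration.

p_rain = {True: 0.2, False: 0.8}            # prior for the root node Rain

cpt_wet = {                                 # P(WetGrass = w | Rain = r)
    True:  {True: 0.9, False: 0.1},         # when Rain is True
    False: {True: 0.2, False: 0.8},         # when Rain is False
}

def p_rain_given_wet(wet=True):
    """Infer P(Rain = True | WetGrass = wet) by enumerating the joint."""
    joint = {r: p_rain[r] * cpt_wet[r][wet] for r in (True, False)}
    return joint[True] / sum(joint.values())

print(round(p_rain_given_wet(True), 3))     # 0.529
```

Enumerating the joint distribution like this is exponential in the number of variables, which is the source of the NP-completeness mentioned above; real inference engines exploit the network structure instead.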

Naive Bayes (NaiveBayes)

The naive Bayes classifier (NBC) originates in classical mathematical theory and has a solid mathematical foundation and stable classification efficiency. The NBC model has very few parameters, is not sensitive to missing data, and the algorithm is simple. It assumes that attributes are independent of one another; this assumption often does not hold in practice, which affects the model's classification accuracy to some extent. When the number of attributes is large or the correlation between attributes is strong, the classification efficiency of the NBC model is inferior to that of a decision tree model; when attribute correlation is small, the NBC model performs best.
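To make the independence assumption concrete, here is a from-scratch sketch of naive Bayes for discrete attributes: the class-conditional probability of a sample is taken to be the product of per-attribute probabilities. The tiny weather dataset is invented for the demo; Weka's own implementation is the NaiveBayes classifier.

```python
from collections import Counter, defaultdict

def train(samples, labels):
    """Count class priors and per-attribute value frequencies per class."""
    priors = Counter(labels)
    cond = defaultdict(Counter)              # per class: (attr index, value) -> count
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            cond[y][(i, v)] += 1
    return priors, cond, len(labels)

def predict(x, priors, cond, n):
    """Pick the class maximizing P(class) * product of P(attr value | class)."""
    best, best_score = None, -1.0
    for y, count in priors.items():
        score = count / n
        for i, v in enumerate(x):
            # Laplace smoothing so an unseen value doesn't zero out the product.
            score *= (cond[y][(i, v)] + 1) / (count + 2)
        if score > best_score:
            best, best_score = y, score
    return best

samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train(samples, labels)
print(predict(("rainy", "mild"), *model))    # yes
```

The per-attribute product is exactly where the independence assumption enters: correlated attributes get counted as if they carried independent evidence, which is why strong attribute correlation hurts this model.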

Lazy Learning

Compared with other inductive learning algorithms, a lazy learning method merely stores the training set until a test sample arrives; the decision model is generated only at that point. Compared with other classification algorithms, this lets the classifier build a model from the samples most relevant to each test sample, so the learned model can better fit local sample characteristics. The k-nearest-neighbor (KNN) algorithm is simple and intuitive: if the majority of the k most similar samples in feature space (that is, the k nearest neighbors) belong to a certain category, the sample is assigned to that category. The basic procedure is: when a test sample arrives, find the k training samples nearest to it, then take the most common category among those neighbors as the category of the test sample. Weka provides two KNN algorithms: IB1 and IBk.

IB1: 1-Nearest Neighbor

IB1 uses the single nearest neighbor to determine the category of a test sample.

IBk: k-Nearest Neighbors

IBk determines the category of a test sample from the k nearest neighbors around it.

When there are many noise points (noisy samples) in the data, relying on a single neighbor clearly performs worse, because there is a greater chance of hitting an error. In that case IBk becomes the better choice. This raises the question of how to determine the value of k; in general, k is chosen empirically.
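The effect of a noise point on the vote can be sketched in plain Python (not Weka code; the toy 2-D points and labels below are invented, with (1, 1) planted as a deliberate noise point inside class "a" territory):

```python
from collections import Counter
import math

def knn_predict(train_set, query, k):
    """train_set: list of ((x, y), label); return the majority label of the k nearest."""
    nearest = sorted(train_set, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train_set = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
             ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
             ((1, 1), "b")]                    # noise point labeled "b" among the "a" cluster

print(knn_predict(train_set, (0.9, 0.9), 1))   # b: 1-NN hits the noise point
print(knn_predict(train_set, (0.9, 0.9), 3))   # a: 3-NN votes the noise point down
```

With k = 1 the classifier trusts whichever sample happens to be closest, noise included; with k = 3 the two clean neighbors outvote the noisy one, which is the behavior that makes IBk preferable on noisy data.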
