Abstract
Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. However, they tend to perform poorly when learned in the standard way. This is attributable to a mismatch between the objective function used (likelihood or a function thereof) and the goal of classification (maximizing accuracy or conditional likelihood). Unfortunately, the computational cost of optimizing structure and parameters for conditional likelihood is prohibitive. In this paper we show that a simple approximation, in which structures are chosen by maximizing conditional likelihood while parameters are set by maximum likelihood, yields good results. On a large suite of benchmark datasets, this approach produces better class probability estimates than naive Bayes, TAN, and generatively-trained Bayesian networks.
1. Introduction
The simplicity and surprisingly high accuracy of the naive Bayes classifier have led to its wide use, and to many attempts to extend it. In particular, naive Bayes is a special case of a Bayesian network, and learning the structure and parameters of an unrestricted Bayesian network would appear to be a logical means of improvement. However, Friedman et al. found that naive Bayes easily outperforms such unrestricted Bayesian network classifiers on a large sample of benchmark datasets. Their explanation is that the scoring functions used in standard Bayesian network learning attempt to optimize the likelihood of the entire data, rather than just the conditional likelihood of the class given the attributes. Such scoring results in suboptimal choices during the search process whenever the two functions favor differing changes to the network. The natural solution would then be to use conditional likelihood as the objective function. Unfortunately, Friedman et al. observed that, while maximum likelihood parameters can be efficiently computed in closed form, this is not true of conditional likelihood. The latter must be optimized using numerical methods, and doing so at each search step would be prohibitively expensive. Friedman et al. thus abandoned this avenue, leaving the investigation of possible heuristic alternatives as an important direction for future research. In this paper, we show that the simple heuristic of setting the parameters by maximum likelihood while choosing the structure by conditional likelihood is accurate and efficient.
Friedman et al. chose instead to extend naive Bayes by allowing a slightly less restricted structure (one parent per variable in addition to the class) while still optimizing likelihood. They showed that TAN, the resulting algorithm, is indeed more accurate than naive Bayes on benchmark datasets. We compare our algorithm to TAN and naive Bayes on the same datasets, and show that it outperforms both in the accuracy of class probability estimates, while outperforming naive Bayes and tying TAN in classification error.
2. Bayesian Networks
A Bayesian network encodes the joint probability distribution of a set of variables $\{X_1, \ldots, X_n\}$ as a directed acyclic graph and a set of conditional probability tables (CPTs). Each node corresponds to a variable, and the CPT associated with it contains the probability of each state of the variable given every possible combination of states of its parents. The set of parents of $X_i$, denoted $\Pi_i$, is the set of nodes with an arc to $X_i$ in the graph. The structure of the network encodes the assertion that each node is conditionally independent of its non-descendants given its parents. Thus the probability of an arbitrary event $X = (X_1, \ldots, X_n)$ can be computed as $P(X) = \prod_{i=1}^{n} P(X_i \mid \Pi_i)$. In general, encoding the joint distribution of a set of $n$ discrete variables requires space exponential in $n$; Bayesian networks reduce this to space exponential in the maximum number of parents of any node.
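As a minimal illustration (not from the paper), the factorization $P(X) = \prod_i P(X_i \mid \Pi_i)$ can be coded directly; the two-node network, the CPT layout, and the function name log_joint below are all hypothetical.

```python
import math

# Hypothetical two-node network: X0 has no parents, X1 has X0 as its parent.
parents = {0: [], 1: [0]}
# Each CPT maps (state of the node, tuple of parent states) -> probability.
cpts = {
    0: {(0, ()): 0.6, (1, ()): 0.4},
    1: {(0, (0,)): 0.9, (1, (0,)): 0.1,
        (0, (1,)): 0.3, (1, (1,)): 0.7},
}

def log_joint(x):
    """log P(x) = sum_i log P(x_i | parents of x_i)."""
    total = 0.0
    for i, pa in parents.items():
        pa_states = tuple(x[p] for p in pa)
        total += math.log(cpts[i][(x[i], pa_states)])
    return total

print(math.exp(log_joint({0: 1, 1: 0})))  # P(X0=1, X1=0) = 0.4 * 0.3 = 0.12
```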
2.1 Learning Bayesian Networks
Given an i.i.d. training set $D = \{x_1, \ldots, x_N\}$, where each example $x_\ell$ is an assignment of values to $(X_1, \ldots, X_n)$, the goal of learning is to find the Bayesian network that best represents the joint distribution. One approach is to find the network $B$ that maximizes the likelihood of the data or (more conveniently) its logarithm:
$$\log L_B(D) \;=\; \sum_{\ell=1}^{N} \log P_B(x_\ell) \;=\; \sum_{\ell=1}^{N} \sum_{i=1}^{n} \log P_B(x_{\ell,i} \mid \Pi_i) \qquad (1)$$
When the structure of the network is known, this reduces to estimating $p_{ijk}$, the probability that variable $X_i$ is in state $k$ given that its parents are in their $j$th state, for all $i$, $j$, $k$. When there are no examples with missing values in the training set and we assume parameter independence, the maximum likelihood estimates are simply the observed frequency estimates $\hat{p}_{ijk} = n_{ijk}/n_{ij}$, where $n_{ijk}$ is the number of occurrences in the training set of the $k$th state of $X_i$ with the $j$th state of its parents, and $n_{ij}$ is the sum of $n_{ijk}$ over all $k$.
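A sketch of these frequency estimates, assuming discrete data stored as dictionaries and a fixed structure given by parent lists (the data layout and function name are illustrative, not the paper's):

```python
from collections import defaultdict

def ml_estimates(data, parents):
    """Observed frequency estimates p_ijk = n_ijk / n_ij for discrete data.

    `data` is a list of dicts mapping variable -> state; `parents` maps each
    variable to the list of its parent variables."""
    n_ijk = defaultdict(float)  # counts keyed by (variable, parent config, state)
    n_ij = defaultdict(float)   # counts keyed by (variable, parent config)
    for x in data:
        for i, pa in parents.items():
            j = tuple(x[p] for p in pa)  # joint parent configuration
            n_ijk[(i, j, x[i])] += 1.0
            n_ij[(i, j)] += 1.0
    return {key: n_ijk[key] / n_ij[(key[0], key[1])] for key in n_ijk}
```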
Since on average adding an arc never decreases likelihood on the training data, using the log likelihood as the scoring function can lead to severe overfitting. This problem can be overcome in a number of ways. The simplest one, which is often surprisingly effective, is to limit the number of parents a variable can have. Another alternative is to add a complexity penalty to the log likelihood. For example, the MDL method minimizes $\mathrm{MDL}(B \mid D) = \frac{1}{2} m \log N - \log L_B(D)$, where $m$ is the number of parameters in the network. In both these approaches, the parameters of each candidate network are set by maximum likelihood, as in the known-structure case. Finally, the full Bayesian approach maximizes the Bayesian Dirichlet (BD) score:
$$P(B_S, D) \;=\; P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(n'_{ij})}{\Gamma(n_{ij} + n'_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(n_{ijk} + n'_{ijk})}{\Gamma(n'_{ijk})} \qquad (2)$$
where $B_S$ is the structure of network $B$, $\Gamma(\cdot)$ is the gamma function, $q_i$ is the number of states of the Cartesian product of $X_i$'s parents, and $r_i$ is the number of states of $X_i$. $P(B_S)$ is the prior probability of the structure, which Heckerman et al. set to an exponentially decreasing function of the number of differing arcs between $B_S$ and the initial (prior) network. Each multinomial distribution for $X_i$ given a state of its parents has an associated Dirichlet prior distribution with parameters $n'_{ijk}$, with $n'_{ij} = \sum_{k=1}^{r_i} n'_{ijk}$. These parameters can be thought of as equivalent to seeing $n'_{ijk}$ occurrences of the corresponding states in advance of the training examples. In this approach, the network parameters are not set to specific values; rather, their entire posterior distribution is implicitly maintained and used. The BD score is the result of integrating over this distribution.
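For concreteness, a sketch of Equation 2 evaluated in log space with math.lgamma, assuming the counts $n_{ijk}$ and priors $n'_{ijk}$ have already been tabulated (the nested-dictionary layout and function name are assumptions):

```python
from math import lgamma

def log_bd_score(log_structure_prior, counts, prior_counts):
    """Log of the BD score (Equation 2).

    `counts[i][j]` and `prior_counts[i][j]` are lists of n_ijk and n'_ijk over
    the states k of variable i, for each parent configuration j; tabulating
    them follows the same counting loop as the frequency-estimate sketch."""
    score = log_structure_prior  # log P(B_S)
    for i in counts:
        for j in counts[i]:
            n_ij = sum(counts[i][j])
            n_ij_prime = sum(prior_counts[i][j])
            score += lgamma(n_ij_prime) - lgamma(n_ij + n_ij_prime)
            for n, n_prime in zip(counts[i][j], prior_counts[i][j]):
                score += lgamma(n + n_prime) - lgamma(n_prime)
    return score
```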
2.2 Bayesian Network Classifiers
The goal of classification is to correctly predict the value of a designated discrete class variable $Y$ given a vector of predictors or attributes $(X_1, \ldots, X_n)$. If the performance measure is accuracy (i.e., the fraction of correct predictions made on a test sample), the optimal prediction for a given attribute vector is the class that maximizes $P(Y \mid X_1, \ldots, X_n)$. If we have a Bayesian network for $(Y, X_1, \ldots, X_n)$, these probabilities can be computed by inference over it. In particular, the naive Bayes classifier is a Bayesian network where the class has no parents and each attribute has the class as its sole parent. Friedman et al.'s TAN algorithm uses a variant of the Chow and Liu method to produce a network where each variable has one other parent in addition to the class. More generally, a Bayesian network learned using any of the methods described above can be used as a classifier. All of these are generative models, in the sense that they are learned by maximizing the log likelihood of the entire data being generated by the model, $\log L_B(D)$, or a related function. However, for classification purposes only the conditional log likelihood $\mathrm{CLL}_B(D)$ of the class given the attributes is relevant, where
$$\mathrm{CLL}_B(D) \;=\; \sum_{\ell=1}^{N} \log P_B(y_\ell \mid x_{\ell,1}, \ldots, x_{\ell,n}) \qquad (3)$$
Notice that $\log L_B(D) = \mathrm{CLL}_B(D) + \sum_{\ell=1}^{N} \log P_B(x_{\ell,1}, \ldots, x_{\ell,n})$. Maximizing $\log L_B(D)$ can lead to underperforming classifiers, particularly since in practice the contribution of $\mathrm{CLL}_B(D)$ is likely to be swamped by the generally much larger (in absolute value) second term. A better approach would presumably be to use $\mathrm{CLL}_B(D)$ by itself as the objective function. This would be a form of discriminative learning, because it would focus on correctly discriminating between classes. The problem with this approach is that, unlike $\log L_B(D)$ (Equation 1), $\mathrm{CLL}_B(D)$ does not decompose into a separate term for each variable, and as a result there is no known closed form for the optimal parameter estimates. When the structure is known, locally optimal estimates can be found by a numeric method such as conjugate gradient with line search, and this is what Greiner and Zhou's ELR algorithm does. When the structure is unknown, a new gradient descent is required for each candidate network at each search step. The computational cost of this is presumably prohibitive.
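A sketch of Equation 3, assuming the log_joint routine from the earlier network sketch and a designated class variable; $P(y \mid x)$ is obtained by normalizing the joint over all class values (the log-sum-exp step is only for numerical stability):

```python
import math

def class_log_posterior(attrs, y, class_var, class_states):
    """log P(y | attributes), obtained by normalizing the joint over classes."""
    logs = {c: log_joint({**attrs, class_var: c}) for c in class_states}
    m = max(logs.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in logs.values()))
    return logs[y] - log_norm

def cll(data, class_var, class_states):
    """Equation 3: sum over examples of log P(y_l | x_l,1, ..., x_l,n)."""
    total = 0.0
    for example in data:
        attrs = {v: s for v, s in example.items() if v != class_var}
        total += class_log_posterior(attrs, example[class_var],
                                     class_var, class_states)
    return total
```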
3. The BNC Algorithm
We now introduce BNC, an algorithm for learning the structure of a Bayesian network classifier by maximizing conditional likelihood. BNC is similar to the hill-climbing algorithm of Heckerman et al., except that it uses the conditional log likelihood of the class as the primary objective function. BNC starts from an empty network, and at each step considers adding each possible new arc (i.e., all those that do not create cycles) and deleting or reversing each current arc. BNC pre-discretizes continuous values and ignores missing values in the same way that TAN does.
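The search loop can be sketched as follows; the helper functions legal_moves, apply_move, undo_move, and score (which fits maximum likelihood parameters and returns the conditional log likelihood) are assumptions standing in for details the paper describes in prose:

```python
def bnc_search(data, variables, score, legal_moves, apply_move, undo_move):
    """Greedy structure search: start from the empty network and repeatedly
    take the single arc addition, deletion, or reversal that most improves
    the score, stopping when no move helps."""
    parents = {v: [] for v in variables}  # empty network
    best = score(parents, data)           # fit ML parameters, evaluate CLL
    while True:
        best_move, best_gain = None, 0.0
        for move in legal_moves(parents):  # only moves that keep the graph acyclic
            apply_move(parents, move)
            gain = score(parents, data) - best
            undo_move(parents, move)
            if gain > best_gain:
                best_move, best_gain = move, gain
        if best_move is None:
            return parents
        apply_move(parents, best_move)
        best += best_gain
```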
We consider two versions of BNC. The first avoids overfitting by limiting its networks to a fixed maximum number of parents per variable. Parameters in each network are set to their maximum likelihood values. The network is then scored using the conditional log likelihood (Equation 3). The rationale for this approach is that computing maximum likelihood parameter estimates is extremely fast and, for an optimal structure, they are asymptotically equivalent to the maximum conditional likelihood ones.
The second version is the same as the first, except that instead of limiting the number of parents, the scoring function used is the conditional log likelihood penalized by an MDL-style term, $\mathrm{CLL}_B(D) - \frac{1}{2} m \log N$, where $m$ is the number of parameters in the network and $N$ is the training set size.
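A sketch of the two scoring rules, with illustrative function names (not the paper's own names for the variants); conditional_ll is assumed to be computed as in the CLL sketch above:

```python
import math

def score_parent_limited(parents, conditional_ll, max_parents):
    """First variant: plain CLL, with candidates exceeding the parent cap
    rejected outright."""
    if any(len(pa) > max_parents for pa in parents.values()):
        return float("-inf")
    return conditional_ll

def score_penalized(conditional_ll, m, N):
    """Second variant: CLL minus the (1/2) m log N complexity penalty."""
    return conditional_ll - 0.5 * m * math.log(N)
```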
The goal of BNC is to provide accurate class probability estimates. If only correct class predictions are required, a Bayesian network classifier could in principle be learned simply by using the training-set accuracy as the objective function (together with some overfitting avoidance scheme). While trying this is an interesting item for future work, we note that even in this case the conditional likelihood could be preferable, because it is a more informative and more smoothly varying measure, potentially leading to an easier optimization problem.