PRML 4: Generative Models


From the perspective of probability theory, a classification problem is usually divided into two stages: (I) the inference stage, in which we build a probabilistic model of some form and obtain the posterior probability distribution $p(c_k \mid \vec{x})$ in some way; (II) the decision stage, in which we use that posterior distribution to make predictions for feature vectors whose labels are unknown. In practice there are two schools of thought: (I) the frequentist school typically estimates the model parameters by MLE or MAP, obtains a single definite probability distribution in the inference stage, and then in the decision stage chooses the prediction that minimizes the expected loss; (II) the Bayesian school learns (or approximates) a distribution over the parameters and then marginalizes over all possible parameter values to obtain the prediction, e.g.

$p(c_k \mid \vec{x}_{n+1}, X, \vec{t}) = \int p(c_k \mid \vec{x}_{n+1}, \vec{w}) \cdot p(\vec{w} \mid X, \vec{t}) \, d\vec{w}$, which requires the posterior $p(\vec{w} \mid X, \vec{t})$; when a conjugate prior is chosen, this posterior can often be updated with an online (sequential) algorithm.
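To make the marginalization concrete, here is a minimal sketch that approximates the predictive integral by Monte Carlo sampling. It assumes a two-class problem with a logistic likelihood $p(c_1 \mid \vec{x}, \vec{w}) = \sigma(\vec{w}^T\vec{x})$ and a Gaussian approximation to the posterior $p(\vec{w} \mid X, \vec{t})$ (e.g. obtained via the Laplace approximation); these modeling choices and all names in the code are illustrative, not taken from the text above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(x_new, w_mean, w_cov, n_samples=10000, seed=0):
    """Monte Carlo estimate of p(c_1 | x_new, X, t) = integral of sigmoid(w.x_new) p(w | X, t) dw,
    assuming p(w | X, t) is (approximately) Gaussian with the given mean and covariance."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(w_mean, w_cov, size=n_samples)  # samples w_s ~ p(w | X, t)
    return sigmoid(W @ x_new).mean()                            # average of p(c_1 | x_new, w_s)
```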

Regardless of the school, the inference stage can be approached with two kinds of models: (I) a generative model builds a parametric model of the class-conditional distribution $p(\vec{x} \mid c_k)$, first obtains the joint distribution $p(c_k, \vec{x}) = p(\vec{x} \mid c_k)\, p(c_k)$, and then computes the posterior $p(c_k \mid \vec{x}) = p(c_k, \vec{x}) / \sum_j p(c_j, \vec{x})$; (II) a discriminative model directly models and learns the posterior probability distribution, the most classic example being logistic regression.

The discriminant function mentioned earlier is a shortcut: instead of being grounded in probability theory, it models the classification decision directly and then constructs an appropriate objective function to optimize. Compared with that approach, the probability-based methods have the following advantages: (I) the expected loss (loss matrix) can be revised at any time without retraining, since the posterior is kept; (II) the imbalance between positive and negative samples can be addressed by constructing a training set with an appropriate prior distribution; (III) independently trained models on separate sets of features can be fused, e.g. $p(c_k \mid \vec{x}, \vec{y}) = \frac{p(c_k \mid \vec{x})\, p(c_k \mid \vec{y})}{p(c_k)}$.
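The fusion formula in (III) follows from assuming that $\vec{x}$ and $\vec{y}$ are conditionally independent given the class; a short derivation (not spelled out above) is:

$p(c_k \mid \vec{x}, \vec{y}) \propto p(\vec{x}, \vec{y} \mid c_k)\, p(c_k) = p(\vec{x} \mid c_k)\, p(\vec{y} \mid c_k)\, p(c_k) = \frac{p(c_k \mid \vec{x})\, p(\vec{x})}{p(c_k)} \cdot \frac{p(c_k \mid \vec{y})\, p(\vec{y})}{p(c_k)} \cdot p(c_k) \propto \frac{p(c_k \mid \vec{x})\, p(c_k \mid \vec{y})}{p(c_k)}$,

where the factors $p(\vec{x})$ and $p(\vec{y})$ are dropped because they do not depend on $c_k$.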

A typical generative model is the Naive Bayes classifier with Laplace smoothing: given the class label, we assume the feature components are conditionally independent, i.e. $p(\vec{x}^{(i)} = a_{is}, \vec{x}^{(j)} = a_{jr} \mid c_k) = p(\vec{x}^{(i)} = a_{is} \mid c_k) \cdot p(\vec{x}^{(j)} = a_{jr} \mid c_k)$.

(1) Prior: $p(c_k) = \frac{\sum_{n=1}^N I(y_n = c_k) + 1}{N + K}$ for $0 \leq k < K$;

(2) Likelihood: $p(\vec{x}^{(j)} = a_{jl} \mid c_k) = \frac{\sum_{n=1}^N I(\vec{x}_n^{(j)} = a_{jl},\, y_n = c_k) + 1}{\sum_{n=1}^N I(y_n = c_k) + S_j}$, where $S_j$ is the number of distinct values the $j$-th feature component can take;

(3) Prediction: $y = \mathop{\arg\max}_{c_k}\, p(c_k) \cdot \prod_{j=0}^{D-1} p(\vec{x}^{(j)} = \vec{x}_{n+1}^{(j)} \mid c_k)$, where $D$ is the number of feature components.
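As a concrete illustration, here is a minimal Python sketch of the classifier defined by steps (1)-(3); all names (`NaiveBayes`, `fit`, `predict`, `n_classes`, ...) are illustrative, and the features are assumed to be integer-coded categorical values.

```python
import numpy as np

class NaiveBayes:
    def fit(self, X, y, n_classes):
        """X: (N, D) integer-coded categorical features; y: (N,) labels in 0..K-1."""
        N, D = X.shape
        counts = np.bincount(y, minlength=n_classes)
        # Prior with Laplace smoothing: p(c_k) = (count_k + 1) / (N + K)
        self.log_prior = np.log((counts + 1) / (N + n_classes))
        # One likelihood table per feature: p(x^(j) = a | c_k) with "+1" smoothing
        self.log_lik = []
        for j in range(D):
            S_j = X[:, j].max() + 1             # number of values feature j can take
            table = np.ones((n_classes, S_j))   # Laplace "+1" pseudo-counts
            for n in range(N):
                table[y[n], X[n, j]] += 1
            table /= (counts + S_j)[:, None]    # denominator: count_k + S_j
            self.log_lik.append(np.log(table))
        return self

    def predict(self, x):
        """argmax_k [ log p(c_k) + sum_j log p(x^(j) | c_k) ] for a single sample x."""
        scores = self.log_prior.copy()
        for j, table in enumerate(self.log_lik):
            scores += table[:, x[j]]
        return int(np.argmax(scores))
```

Working in log space avoids underflow when the product over many feature components becomes very small.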

Gaussian Discriminant Analysis (GDA) is another example, which predicts by choosing the class with maximum posterior probability (a MAP decision): given the class label, we assume the feature vector is Gaussian distributed. Here we take $K = 2$ as an example.

(1) Prior: $p(c_k) = \frac{1}{N} \sum_{n=1}^N I(y_n = c_k)$ for $k = 0, 1$;

(2) Likelihood: $p(\vec{x} \mid c_k) = \mathcal{N}(\vec{x} \mid \vec{\mu}_k, \Sigma)$, where by MLE we have

$\vec{\mu}_k = \frac{\sum_{n=1}^N I(y_n = c_k)\, \vec{x}_n}{\sum_{n=1}^N I(y_n = c_k)}$, and $\Sigma = \frac{1}{N} \sum_{n=1}^N (\vec{x}_n - \vec{\mu}_{y_n})(\vec{x}_n - \vec{\mu}_{y_n})^T$;

(3) Prediction: $y = \mathop{\arg\max}_{c_k}\, p(c_k) \cdot p(\vec{x}_{n+1} \mid c_k)$.
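Below is a minimal Python sketch of two-class GDA following steps (1)-(3); the class and function names are illustrative, and the Gaussian's normalization constant is dropped in the prediction because the covariance $\Sigma$ is shared between the two classes.

```python
import numpy as np

class GDA:
    def fit(self, X, y):
        """X: (N, D) real-valued features; y: (N,) labels in {0, 1}."""
        N, D = X.shape
        self.prior = np.array([np.mean(y == k) for k in (0, 1)])      # p(c_k)
        self.mu = np.array([X[y == k].mean(axis=0) for k in (0, 1)])  # class means mu_k
        centred = X - self.mu[y]                                      # x_n - mu_{y_n}
        self.sigma = centred.T @ centred / N                          # shared covariance Sigma
        self.sigma_inv = np.linalg.inv(self.sigma)
        return self

    def predict(self, x):
        """argmax_k [ log p(c_k) - (1/2)(x - mu_k)^T Sigma^{-1} (x - mu_k) ],
        i.e. the log of p(c_k) * N(x | mu_k, Sigma) with the shared constant dropped."""
        scores = []
        for k in (0, 1):
            d = x - self.mu[k]
            scores.append(np.log(self.prior[k]) - 0.5 * d @ self.sigma_inv @ d)
        return int(np.argmax(scores))
```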

The difference between GDA and logistic regression is that logistic regression makes weaker model assumptions: it does not require the data to follow class-conditional Gaussians with a shared covariance, so it applies more broadly. The advantage of GDA is that, when the data does obey its assumptions, it is more accurate than logistic regression and reaches the same performance with fewer training samples. The relationship between Naive Bayes and Softmax regression is similar: Softmax regression does not require the feature components to be conditionally independent, and it can in fact replace any generative model whose class-conditional distributions satisfy $p(\vec{x} \mid c_k) \propto e^{\vec{w}_k^T \vec{x}}$. In general, generative models tend to have high bias and low variance (prone to underfitting) and suit small training sets, while discriminative models tend to have low bias and high variance (prone to overfitting) and suit large training sets.

References:

1. Bishop, Christopher M. Pattern Recognition and Machine Learning [M]. Singapore: Springer, 2006.
