The Beauty of Mathematics: The Ordinary yet Magical Bayesian Method (3)

3.2 Model Comparison and Bayesian Occam's Razor
In fact, model comparison means asking which model (i.e., which guess) is most likely to be hiding behind the observed data. The basic idea was already explained in the spelling-correction example: the words the user might actually have intended to type are the candidate models, and the misspelled word the user actually typed is the observed data. We use
P(h | D) ∝ P(h) * P(D | h)

to decide which model is the most reliable.
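As a minimal, hedged sketch of what that comparison looks like in practice, suppose the user typed the non-word "thew"; the candidate intended words, their priors, and the likelihoods below are all invented numbers, used only to show the mechanics rather than real corpus statistics.

```python
# Hypothetical model comparison for spelling correction.
# P(h): prior probability of each intended word (e.g., from corpus frequency).
# P(D|h): probability that typing this word would produce the observed typo "thew".
# All numbers below are invented for illustration.

candidates = {
    # word: (prior P(h), likelihood P(D|h))
    "the":   (0.05,    0.001),   # very common word, plausible slip of the finger
    "thew":  (0.00001, 0.95),    # rare word, but matches the observation exactly
    "threw": (0.001,   0.01),    # moderately common, less likely typo pattern
}

# Posterior is proportional to prior * likelihood; normalize over the candidates.
unnormalized = {w: p * l for w, (p, l) in candidates.items()}
total = sum(unnormalized.values())
posterior = {w: v / total for w, v in unnormalized.items()}

for word, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"P({word!r} | D) = {prob:.3f}")
```

With these made-up numbers, "the" wins despite "thew" matching the input exactly, because its prior is so much larger; that is precisely the tug of war between prior and likelihood discussed next.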

As mentioned above, relying on P(D | h) alone (the "likelihood") is not enough; sometimes we also need to bring in the prior probability P(h). Occam's razor says that the model with the larger P(h) has the advantage, while maximum likelihood says that the model that best fits the observed data (i.e., maximizes P(D | h)) has the advantage. Model comparison is a tug of war between these two forces. A simple example illustrates the spirit: you pick up a coin, toss it once, and observe the result, which is either "heads" or "tails". Suppose you observe "heads". Based on this observation, how should you infer the coin's probability of landing heads? Maximum likelihood estimation tells you to guess that the probability is 1, because that is the guess which maximizes P(D | h). Yet everyone would shake their heads: obviously the chance that a coin you pick up at random is a two-headed coin with no tails side is next to nothing. We have prior knowledge about whether a randomly chosen coin is biased: the vast majority of coins are basically fair, and heavily biased coins are rare (a beta distribution can express this prior). Folding this prior distribution p(θ) into the problem (where θ is the coin's probability of landing heads, and lowercase p denotes a probability density), we no longer maximize P(D | θ) alone but instead maximize P(D | θ) * p(θ). Clearly θ = 1 is now a bad choice, because p(θ = 1) is 0, which drives the whole product to 0. In fact, taking the derivative of this expression and setting it to zero is all it takes to find the maximizing point.
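Here is a minimal sketch of that calculation, assuming (purely for illustration) a Beta(5, 5) prior on θ and a single observed "heads"; the hyperparameter values are made up, and any reasonably fair-leaning prior gives the same qualitative result.

```python
import numpy as np

# One observed toss: "heads". The likelihood of the data given theta is simply theta.
# Prior: Beta(alpha, beta) density over theta, expressing "most coins are roughly fair".
alpha, beta = 5.0, 5.0   # assumed hyperparameters, chosen only for illustration

thetas = np.linspace(0.0, 1.0, 10001)

def beta_density(t, a, b):
    # Unnormalized Beta density; the normalizing constant doesn't affect the argmax.
    return t ** (a - 1) * (1 - t) ** (b - 1)

likelihood = thetas                                          # P(D | theta) for a single "heads"
posterior = likelihood * beta_density(thetas, alpha, beta)   # proportional to P(D | theta) * p(theta)

print("Maximum likelihood estimate of theta:", thetas[np.argmax(likelihood)])  # 1.0
print("MAP estimate of theta with the prior:", thetas[np.argmax(posterior)])   # about 5/9 ~= 0.556
```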

The above shows that when we do know the prior probability P(h), relying on maximum likelihood alone is unreliable, because the maximum-likelihood guess may have a tiny prior. In some cases, however, we know nothing about the prior and can only assume that every hypothesis is equally probable a priori; then maximum likelihood is all we have to go on. There is, in fact, a long-running debate here between frequentist statisticians and Bayesians. The frequentists say: let the data speak for itself (the implication being that the prior should be discarded). The Bayesians reply: data can carry all sorts of biases, and a reliable prior can make the inference robust to such random noise. As it turns out, the Bayesians won, and the key to the victory is that the so-called prior is itself the result of accumulated empirical statistics. Why, for example, do we believe the vast majority of coins are basically fair? Why do we believe most people's weight falls in a moderate range? Why do we believe skin color is related to race while body weight is not? The "prior" in prior probability does not mean prior to all experience whatsoever; it means prior to the particular observation currently in front of us. In the coin example, the prior simply reflects our past experience of coin-toss results, not anything "innate".
Having said that, we sometimes have to admit that even with all our past experience, the prior at hand is still uniform; in that case we can only fall back on maximum likelihood. Let's use the natural-language ambiguity question left over from earlier to illustrate this point:
The girl saw the boy with a telescope.
Is the structure the girl saw-with-a-telescope the boy, or the girl saw the-boy-with-a-telescope? The two parses are about equally common. (You may feel that the second structure is less common, but that is hindsight bias; just consider "the girl saw the boy with a book". Admittedly, large-scale corpus statistics show the second structure is indeed a tiny bit less common, but certainly not by enough to explain our strong preference for the first one.) So why do we prefer the first?

Let's take a look at a beautiful example in the book:
[Figure: a box partially hidden behind a tree trunk]
How many boxes are in the figure? In particular, is it one box behind the tree? Two boxes? Three? Or more? You probably think there must be a single box behind the tree, but why couldn't it be two? For example:
[Figure: the same scene with two boxes of identical color and height, one on each side of the tree, their gap hidden exactly by the trunk]
Very simple, you will say: if there really were two boxes, wouldn't it be strange that the two boxes happen to be exactly the same color and exactly the same height?
In the language of probability theory, what you just said translates to: I guess this hypothesis h is not true, because P(D | h) is too small (too much of a coincidence). Our intuition is that coincidences (small-probability events) simply do not happen. So when a hypothesis (guess) turns our observation into a small-probability event, we say: "That's odd, how could it be such a coincidence?!"
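To put rough numbers on that intuition, here is a minimal sketch under invented assumptions: box colors and heights are discrete, independent, and uniformly distributed, with the counts below chosen purely for illustration.

```python
# Rough likelihoods for the "box(es) behind the tree" example.
# Assumptions (invented for illustration): colors and heights are discrete,
# independent, and uniformly distributed.
n_colors = 8    # hypothetical number of possible box colors
n_heights = 10  # hypothetical number of possible box heights

# Observed data D: the visible parts on both sides of the trunk show the same
# color c and the same height y.

# Hypothesis h1: one box. It just needs to have color c and height y.
p_d_given_one_box = (1 / n_colors) * (1 / n_heights)

# Hypothesis h2: two independent boxes. Each must separately have color c and height y.
p_d_given_two_boxes = p_d_given_one_box ** 2

print("P(D | one box)   =", p_d_given_one_box)        # 0.0125
print("P(D | two boxes) =", p_d_given_two_boxes)      # 0.00015625
print("likelihood ratio =", p_d_given_one_box / p_d_given_two_boxes)  # 80x in favor of one box
```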
Now we can go back to the natural-language ambiguity example and give it a perfect explanation: if the structure really were the girl saw the-boy-with-a-telescope, then why would the thing the boy happens to be holding be a telescope, an instrument that can be used to see with, of all objects? That, too, is a small-probability coincidence. Why isn't he holding a book, or anything else? The only way out is to say there must be some necessity behind this "coincidence", and the necessity is this: if we parse the sentence as the girl saw-with-a-telescope the boy, the structure fits the data perfectly; since the girl used something to look at the boy, that something being a telescope is completely explained (no longer a small-probability event).

Natural-language ambiguity is very common. For example, the sentence:

See "decision making and judgment" and "rationality for mortals" Chapter 12th: children can also solve Bayesian problems.
It has two readings: does "Chapter 12" refer to chapter 12 of both books, or only to chapter 12 of the second book? If it meant chapter 12 of both books, that would be strange: how could both books happen to have a chapter 12, both about the same question, and, stranger still, with the same title?
Note that the reasoning above uses only likelihood (only the size of P(D | h)), with no prior probability at all. Through these two examples, especially the box-behind-the-tree one, we can see that the likelihood already contains an Occam's razor of its own: the more boxes behind the tree, the more complex the model; a single box is the simplest model. Likelihood estimation by itself ends up choosing the simpler model.
This is called the Bayesian Occam's razor because the razor operates on the likelihood term P(D | h) of the Bayesian formula, rather than on the prior probability P(h) of the model itself; the latter is the traditional Occam's razor. To see the Bayesian Occam's razor at work, consider the curve-fitting example mentioned earlier: suppose there are N points on the plane that approximately form a straight line, but never lie exactly on one. We could fit them with a straight line (Model 1), a second-order polynomial (Model 2), a third-order polynomial (Model 3), and so on; in particular, an (N-1)-order polynomial is guaranteed to pass through all N data points exactly. Which of these possible models is the most plausible? As mentioned above, one yardstick is Occam's razor: the higher the order of the polynomial, the more complex and the less common the model. But we do not actually need to rely on this prior form of Occam's razor, because someone might object: how can you claim that higher-order polynomials are less common? I think every order is equally possible a priori. Very well, then let's set P(h) aside and see what P(D | h) has to tell us. We notice that the higher the order of the polynomial, the more violently its curve bends; by the eighth or ninth order it is practically shooting straight up and down. So we should ask: what is the probability that a bunch of N points randomly generated by, say, an eighth-order polynomial happens to lie approximately along a straight line, i.e., how big is P(D | h)? Tiny. Conversely, if the model is a straight line, the probability that it generates a bunch of points lying close to a straight line is much larger. This is the Bayesian Occam's razor.
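Below is a minimal Monte Carlo sketch of that argument, under assumptions made up for illustration: polynomial coefficients are drawn from a standard normal prior, a little observation noise is added, and "looks approximately like a straight line" is operationalized as a high R² for a least-squares line fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def looks_like_a_line(x, y, r2_threshold=0.9):
    """Crude test: does a least-squares straight line explain the points well?"""
    coef = np.polyfit(x, y, 1)
    residuals = y - np.polyval(coef, x)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return ss_res <= (1 - r2_threshold) * ss_tot

def p_data_looks_linear(degree, n_points=10, n_trials=5000, noise=0.02):
    """Estimate how often a polynomial of the given degree, with coefficients drawn
    from a standard normal prior, generates a near-linear set of points."""
    x = np.linspace(-1, 1, n_points)
    hits = 0
    for _ in range(n_trials):
        coeffs = rng.standard_normal(degree + 1)                      # prior over parameters
        y = np.polyval(coeffs, x) + noise * rng.standard_normal(n_points)
        hits += looks_like_a_line(x, y)
    return hits / n_trials

for degree in (1, 3, 8):
    print(f"degree {degree}: P(data looks like a line) ~= {p_data_looks_linear(degree):.3f}")

# The straight-line model concentrates its probability mass on line-like data sets,
# while a high-order polynomial spreads its mass over many wiggly shapes, so the
# probability of producing near-linear data falls off sharply as the degree grows.
```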
What is given here is only a popular-science account of the Bayesian Occam's razor, emphasizing the intuitive explanation. For the theory and the formulas, see Chapter 28 of Information Theory, Inference, and Learning Algorithms.

3.3 The Minimum Description Length Principle

Bayesian model comparison has an interesting connection with information theory:
P(h | D) ∝ P(h) * P(D | h)
Take the logarithm of both sides, turning the product on the right into a sum:
ln P(h | D) = ln P(h) + ln P(D | h) + const.
Obviously, maximizing P(h | D) also means maximizing ln P(h | D), i.e., maximizing ln P(h) + ln P(D | h). Flipping the signs, -ln P(h) can be interpreted as the encoding length of the model (or "hypothesis", or "guess") h, and -ln P(D | h) as the encoding length of the data D under that model. The model that makes this total length smallest is the best model.
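As a minimal sketch of this equivalence, with invented priors and likelihoods for three hypothetical models: choosing the model with the shortest total description length (in bits, so using log base 2) picks out exactly the same model as maximizing the posterior.

```python
import math

# Hypothetical candidate models with invented prior and likelihood values.
models = {
    "h1": {"prior": 0.50, "likelihood": 0.010},
    "h2": {"prior": 0.10, "likelihood": 0.040},
    "h3": {"prior": 0.01, "likelihood": 0.300},
}

def description_length_bits(m):
    # L(h) + L(D | h) = -log2 P(h) - log2 P(D | h)
    return -math.log2(m["prior"]) - math.log2(m["likelihood"])

def unnormalized_posterior(m):
    return m["prior"] * m["likelihood"]

best_by_mdl = min(models, key=lambda name: description_length_bits(models[name]))
best_by_map = max(models, key=lambda name: unnormalized_posterior(models[name]))

for name, m in models.items():
    print(name,
          f"total code length = {description_length_bits(m):.2f} bits,",
          f"prior*likelihood = {unnormalized_posterior(m):.4f}")
print("Shortest description:", best_by_mdl, "| Highest posterior:", best_by_map)  # same model
```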
How exactly to define the encoding length of a model, and of the data under a model, is a question in its own right. For more details, see Section 6.6 of Mitchell's Machine Learning, or Section 28.3 of MacKay's book.
3.4 Optimal Bayesian Inference
So-called reasoning involves two steps: first, build a model based on the observed data; second, use that model to predict the probability of unknown phenomena. The discussion so far has been about finding the most plausible model for the observed data. Often, however, even though one model is the most plausible of all, the other models are not entirely without a chance. For example, given the observed data, the first model might have probability 0.5, the second 0.4, and the third 0.1. If all we want to know is which model best explains the data, we simply take the first one and the story ends there. But frequently we build models in order to predict some unknown event, and each of the three models then has its own prediction for that event's probability; listening only to the single highest-scoring model, just because its probability is a bit higher, would be rather undemocratic. So-called optimal Bayesian inference takes a weighted average of the three models' predictions about the unknown data, with each model's probability as its weight. Obviously, this kind of inference occupies the theoretical high ground and cannot be improved upon, because it has already taken every possibility into account.
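Here is a minimal sketch of that weighted average, using the 0.5 / 0.4 / 0.1 model probabilities from the paragraph above and invented per-model predictions for the unknown event:

```python
# Bayesian model averaging: weight each model's prediction by its posterior probability.
# The posterior model probabilities come from the text; the per-model predictions
# of the unknown event are invented for illustration.
models = [
    {"posterior": 0.5, "p_event": 0.90},
    {"posterior": 0.4, "p_event": 0.20},
    {"posterior": 0.1, "p_event": 0.50},
]

map_only = max(models, key=lambda m: m["posterior"])["p_event"]
averaged = sum(m["posterior"] * m["p_event"] for m in models)

print("Prediction using only the single best model:", map_only)            # 0.9
print("Optimal Bayesian (model-averaged) prediction:", round(averaged, 3)) # 0.58
```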
In practice, however, this framework is almost never used as is, because summing over all the models can be very time-consuming, and the model space may even be continuous, containing infinitely many models (in which case a probability density over models has to be computed), which is more expensive still. So it is treated mainly as a theoretical gold standard.
From: http://mindhacks.cn/2008/09/21/the-magical-bayesian-method/
