The Beauty of Mathematics: The Ordinary yet Magical Bayesian Method (4)

4. The ubiquitous Bayes
Here we give some practical examples to illustrate how universal the Bayesian method is. The examples focus mainly on machine learning because I am not an economics student; otherwise you could easily find a pile of examples from economics as well.
4.1 Chinese Word Segmentation
The Bayesian method is one of the core methods of machine learning. For example, it is used in Chinese word segmentation. Google researcher Wu Jun introduced Chinese word segmentation in his "The Beauty of Mathematics" series, so here we only present the core idea and do not repeat the details.
The word segmentation problem can be described as follows: given a sentence (a character string), for example the classic 南京市长江大桥 ("Nanjing Yangtze River Bridge"), what is the most reliable way to segment it into a word string? For example:
1. Nanjing City / Yangtze River Bridge
2. Nanjing / Mayor / Jiang Bridge (reading the tail as a person's name)
Which of these two segmentations is more reliable? We can use the Bayes formula to describe the problem formally: let X be the string (the sentence) and Y be a word string (a particular segmentation hypothesis). We only need to find the Y that maximizes P(Y|X), and applying Bayes gives:
P(Y|X) ∝ P(Y) * P(X|Y)
In plain language: the probability of this segmentation (word string), times the probability that this word string generates our sentence. We can further observe that P(X|Y) can be treated as a constant equal to 1, because any hypothetical segmentation always generates the sentence exactly (just erase the boundaries between the words). So the problem reduces to maximizing P(Y), i.e. finding the segmentation that makes the word string W1, W2, W3, W4, ... most probable. How do we compute the probability of such a word string? According to the formula for joint probability, we can expand it:

P(W1, W2, W3, W4, ...) = P(W1) * P(W2|W1) * P(W3|W1, W2) * P(W4|W1, W2, W3) * ...

So we can compute the whole joint probability as a product of a series of conditional probabilities (the right-hand side). Unfortunately, as the number of conditioning words grows (P(Wn | Wn-1, Wn-2, ..., W1) conditions on n-1 words), the data sparsity problem becomes more and more severe: no matter how large the corpus, we cannot obtain reliable counts for P(Wn | Wn-1, Wn-2, ..., W1). To mitigate this problem, computer scientists once again resort to the "naive" assumption: assume that the probability of a word in a sentence depends only on a limited number k of words preceding it (k usually does not exceed 3; if a word depends only on the single word before it, we have a 2-gram language model, and similarly there are 3-gram, 4-gram, and so on). This is the so-called "limited horizon" assumption. Although the assumption looks silly and naive, its results are often very good and powerful. The assumption used by the naive Bayes method mentioned later is completely consistent with it in spirit, and we will explain later why such a naive assumption can produce powerful results. For now we only need to know that, with this assumption, the product above can be rewritten as:

P(W1) * P(W2|W1) * P(W3|W2) * P(W4|W3) * ...

(assuming each word depends only on the one word before it). Estimating a statistic like P(W2|W1) is no longer plagued by data sparsity. For the example "南京市长江大桥" mentioned above, a greedy left-to-right matcher would produce "Nanjing Mayor / Jiang Bridge". But if we segment with Bayes (say using a 3-gram model), then because the frequency of "Nanjing Mayor" followed by "Jiang Bridge" in the corpus is 0, the probability of that whole sentence is judged to be 0, and the segmentation "Nanjing City / Yangtze River Bridge" wins.
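To make the mechanism concrete, here is a minimal sketch in Python of Bayesian segmentation with a 2-gram model (the example above uses a 3-gram, but the mechanism is the same). The candidate segmentations, the toy counts, and all identifiers are invented for illustration; a real segmenter would enumerate candidates and estimate counts from a large corpus.

```python
# Minimal sketch: score candidate segmentations with a 2-gram language model.
# Counts are hypothetical and tiny, purely for illustration.
from math import log

unigram = {"<s>": 2, "Nanjing-City": 1, "Yangtze-River-Bridge": 1,
           "Nanjing": 1, "Mayor": 1, "Jiang-Bridge": 0}
bigram = {("<s>", "Nanjing-City"): 1,
          ("Nanjing-City", "Yangtze-River-Bridge"): 1,
          ("<s>", "Nanjing"): 1,
          ("Nanjing", "Mayor"): 1,
          ("Mayor", "Jiang-Bridge"): 0}   # never observed together in the corpus

def log_prob(words):
    """log P(W1..Wn) under the 2-gram approximation; -inf if any count is zero."""
    score, prev = 0.0, "<s>"
    for w in words:
        num, den = bigram.get((prev, w), 0), unigram.get(prev, 0)
        if num == 0 or den == 0:
            return float("-inf")          # a zero-frequency bigram kills the hypothesis
        score += log(num / den)
        prev = w
    return score

candidates = [
    ["Nanjing-City", "Yangtze-River-Bridge"],   # Nanjing City / Yangtze River Bridge
    ["Nanjing", "Mayor", "Jiang-Bridge"],       # Nanjing / Mayor / Jiang Bridge
]
print(max(candidates, key=log_prob))   # the first segmentation wins
```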

Note: Some may wonder whether we humans also reason with such naive assumptions. No. In fact, statistical machine learning methods often work at a rather shallow level, and at that level machine learning can only see surface phenomena; anyone with some experience of scientific research knows that the closer to the surface, the more complicated and changeable the world appears. From the viewpoint of machine learning at this level, the features can be numerous, possibly hundreds of dimensions; with that many features, the curse of dimensionality sets in and the data becomes sparse and insufficient. Our human level of observation is plainly more refined than that. To avoid data sparsity we keep inventing instruments (the microscope being a typical one) that let us probe deeper into things and observe more essential relationships, rather than doing statistical induction over superficial phenomena. A simple example: through statistics on a large corpus, machine learning may discover a rule such as "none of the 'he's wears a bra, while all the 'she's do". However, as a man, you need no statistical learning at all to know you will not wear one, because the deeper laws already determine it. Whether machine learning can carry out the latter kind of reasoning (as humans do) is a classic problem in AI. At least until that day, the claim that statistical learning methods will put an end to scientific research (as in that original article) is purely what outsiders say.
4.2 Statistical Machine Translation
Statistical machine translation has become the de facto standard for machine translation because it is simple and automatic (no hand-written rules are required), and its core algorithm again uses the Bayesian method.
The problem of statistical machine translation can be described as follows: given a sentence e, which of its possible foreign-language translations f is the most reliable? In other words, we need to compute P(f|e). As soon as a conditional probability appears, Bayes steps forward as always:
P(f|e) ∝ P(f) * P(e|f)

The two factors on the right-hand side are easy to interpret: the foreign sentence f that is both more probable a priori and more likely to generate our sentence e wins. The prior P(f) only requires simple counting (combined with the n-gram language model mentioned above) to tell us how probable any foreign sentence f is. P(e|f), however, is not so easy to obtain: given a candidate foreign sentence f, what is the probability that it generates (or "corresponds to") sentence e? We have to define what "corresponds" means, and for that we need a word-aligned parallel corpus; the details are in Chapter 13 of Foundations of Statistical Natural Language Processing. Here is an example from that chapter: suppose e is "John loves Mary", and the first f we examine is "Jean aime Marie" (French). To determine the size of P(e|f) we consider the ways e and f can be aligned; for instance, John (Jean) loves (aime) Mary (Marie) is one (and the most plausible) alignment. Why is alignment needed? Because once the sentences are aligned, it is easy to compute how large P(e|f) is under that alignment: we just compute

P(John | Jean) * P(loves | aime) * P(Mary | Marie)
and we are done. We then enumerate all possible alignments and sum the translation probabilities obtained under each alignment, which gives the total P(e|f).
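As a rough sketch of this "sum over alignments" computation (an illustration only, not the actual IBM translation models from the book), assume we already have a table of word-level translation probabilities; the probabilities below are made up, and only one-to-one alignments are enumerated for brevity:

```python
# Toy sketch: P(e|f) as a sum over word alignments, each alignment contributing
# a product of word translation probabilities. The table t(e|f) is hypothetical;
# real systems learn it from a word-aligned parallel corpus.
from itertools import permutations

e_words = ["John", "loves", "Mary"]
f_words = ["Jean", "aime", "Marie"]

t = {("John", "Jean"): 0.9, ("loves", "aime"): 0.8, ("Mary", "Marie"): 0.9}

def trans(e, f):
    return t.get((e, f), 0.01)   # small leakage probability for unseen pairs

p_e_given_f = 0.0
for perm in permutations(range(len(f_words))):     # every one-to-one alignment
    prod = 1.0
    for e, fi in zip(e_words, perm):
        prod *= trans(e, f_words[fi])
    p_e_given_f += prod

print(p_e_given_f)   # dominated by the "correct" alignment, 0.9 * 0.8 * 0.9
```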
Note: the same question again: is this really how we humans translate? Highly unlikely. We, who cannot even manage three-digit multiplication in our heads, are not foolish enough to use something this computationally heavy. According to cognitive neuroscience, we very likely first fold the sentence up into its meaning, abstracting layer by layer (a bottom-up process), and then unfold that meaning into the other language according to its grammar, making it concrete layer by layer (a top-down process). How to implement this process computationally is still an open challenge. (We see such symmetric bottom-up/top-down processes in many places; some people even conjecture that this is, in principle, how biological neural networks operate, and research on the visual neural system in particular supports it. Hawkins proposed the HTM (Hierarchical Temporal Memory) model in On Intelligence along these lines.)
4.3 Bayesian image recognition, analysis by synthesis

The Bayesian method is a very general reasoning framework. Its core idea can be described as: Analysis by Synthesis. A paper in a recent survey of advances in cognitive science explains visual recognition in terms of Bayesian reasoning; the process it describes is roughly as follows:


First, the visual system extracts the edge and corner features of the image; these features activate high-level abstract concepts bottom-up (for example E, F, or the equals sign); then a top-down verification step compares the candidates to decide which concept best explains the observed image.
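The following toy sketch is purely illustrative and not taken from the cited paper: hypothetical feature templates propose candidate concepts bottom-up, and a top-down step checks which candidate best explains the observed features.

```python
# Toy "analysis by synthesis": bottom-up features activate candidate concepts,
# then each candidate is synthesized top-down and compared with the observation.
# The observed features and templates are invented for illustration.
observed = {"horizontal_top", "horizontal_mid", "horizontal_bottom", "vertical_left"}

templates = {
    "E": {"horizontal_top", "horizontal_mid", "horizontal_bottom", "vertical_left"},
    "F": {"horizontal_top", "horizontal_mid", "vertical_left"},
    "=": {"horizontal_top", "horizontal_bottom"},
}

def explanation_score(concept):
    """Reward features the concept explains, penalize unexplained or spurious ones."""
    predicted = templates[concept]
    return len(predicted & observed) - len(predicted ^ observed)

# Bottom-up: any concept sharing a feature with the image is activated.
candidates = [c for c, feats in templates.items() if feats & observed]
# Top-down: verify each candidate against the observation.
print(max(candidates, key=explanation_score))   # "E" explains the image best
```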

4.4 EM algorithm and model-based clustering
Clustering is an unsupervised machine learning problem. Problem description: you are given a pile of data points and asked to divide them into the most plausible groups. There are many clustering algorithms, and different algorithms suit different problems; here we introduce only model-based clustering. The assumption this clustering algorithm makes about the data points is that they are generated at random around the cores of k normal (Gaussian) distribution sources; a figure in Jiawei Han's Data Mining: Concepts and Techniques illustrates this:


The figure shows two Gaussian sources (cores), which generate roughly two clusters of points. Given these points, our clustering algorithm must infer the cores of the two Gaussians and their distribution parameters. This is obviously yet another Bayesian problem, but this time the difference is that the answer is continuous, with infinitely many possibilities. Worse, only if we know which points belong to which Gaussian can we make a reliable estimate of that distribution's parameters, yet the two clusters are mixed together and we do not know which points belong to the first Gaussian and which to the second. Conversely, only with a reliable estimate of the distribution parameters can we tell which points belong to the first distribution and which to the second. It becomes a chicken-and-egg problem. To break the circular dependency, one side has to go first and say: never mind, I will just guess a value, see how you adjust to it, then adjust myself according to your adjustment, and we iterate back and forth until we converge to a solution. This is the EM algorithm.
EM stands for Expectation-Maximization. In this clustering problem, we first guess the parameters of the two Gaussians, for example where their cores are and what their variances are. Then we compute, for each data point, whether it is more likely to belong to the first or the second Gaussian; this is the Expectation step. With each point's membership in hand, we can re-estimate the first distribution's parameters from the points that belong to it (from the egg back to the chicken); this is the Maximization step. We repeat until the parameters stop changing. In this iterative convergence process, the Bayesian method sits in the second step: estimating the distribution parameters from the data points.
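Here is a minimal sketch of that Expectation/Maximization loop for two one-dimensional Gaussians; the data is synthetic, the mixing weights are fixed at 1/2, and no convergence check is done, so this is an illustration rather than a production implementation.

```python
# Minimal EM sketch for a mixture of two 1-D Gaussians on synthetic data.
import random
from math import exp, sqrt, pi

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(5.0, 1.5) for _ in range(200)]

def normal_pdf(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

mu, sigma = [-1.0, 1.0], [1.0, 1.0]       # initial guesses for the two cores

for _ in range(50):
    # E-step: probability that each point belongs to component 0.
    resp = []
    for x in data:
        p0 = normal_pdf(x, mu[0], sigma[0])
        p1 = normal_pdf(x, mu[1], sigma[1])
        resp.append(p0 / (p0 + p1))
    # M-step: re-estimate each component's parameters from "its" points.
    for k in (0, 1):
        w = [r if k == 0 else 1 - r for r in resp]
        total = sum(w)
        mu[k] = sum(wi * x for wi, x in zip(w, data)) / total
        sigma[k] = sqrt(sum(wi * (x - mu[k]) ** 2 for wi, x in zip(w, data)) / total)

print(mu, sigma)   # the estimated cores end up close to the true centers 0 and 5
```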

4.5 maximum likelihood and least squares


Anyone who has studied linear algebra probably knows the classic least squares method for linear regression. Problem description: given N points in the plane (suppose we want to fit these points with a straight line; regression can be seen as a special case of fitting, namely fitting that allows error), find the straight line that best describes these points.
An immediate question is how we define "best". Write each point's coordinates as (Xi, Yi). If the straight line is y = f(x), then the discrepancy between (Xi, Yi) and the line's "prediction" at that point, (Xi, f(Xi)), is ΔYi = |Yi - f(Xi)|. Least squares means finding the line that minimizes (ΔY1)^2 + (ΔY2)^2 + ... (i.e. the sum of squared errors). As for why it is the sum of squared errors rather than the sum of absolute errors, there was long no good statistical explanation; the Bayesian method, however, provides a perfect one.
Let us assume that the prediction f(Xi) the line gives for the abscissa Xi is the most reliable prediction, and that every data point whose ordinate deviates from f(Xi) does so because it contains noise; it is noise that pushes it off the perfect straight line. A reasonable assumption is that the farther a point deviates from the line, the smaller its probability, and the deviation can be modeled with a normal distribution curve centered on the line's prediction f(Xi) for Xi; the probability of the actually observed point (Xi, Yi) is then proportional to exp[-(ΔYi)^2]. (exp(...) denotes the constant e raised to that power.)
Now let us return to the Bayesian view of the problem. We want to maximize the posterior probability:
P(h|D) ∝ P(h) * P(D|h)
Bayes appears again! Here h is a particular straight line, and D is the N data points. We need to find the line h that maximizes P(h) * P(D|h). Obviously the prior P(h) is uniform, because no line is inherently better than any other, so we only need to look at P(D|h), the probability that this line generates these data points. As we just said, the probability of generating the data point (Xi, Yi) is exp[-(ΔYi)^2] times a constant. Since each data point is generated independently of the others, P(D|h) = P(d1|h) * P(d2|h) * ..., i.e. we can multiply the individual probabilities, so the probability of generating the N data points is:

exp[-(ΔY1)^2] * exp[-(ΔY2)^2] * exp[-(ΔY3)^2] * ... = exp{-[(ΔY1)^2 + (ΔY2)^2 + (ΔY3)^2 + ...]}

Maximizing this probability means minimizing (ΔY1)^2 + (ΔY2)^2 + (ΔY3)^2 + ... . Does this expression look familiar?
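As a small numerical check of this argument (synthetic data, a brute-force grid over candidate lines; all names are hypothetical), the line that maximizes the Gaussian log-likelihood is exactly the one that minimizes the sum of squared errors:

```python
# Sketch: under the Gaussian-noise assumption, the maximum-likelihood line
# coincides with the least-squares line. Data and the candidate grid are synthetic.
import random

random.seed(1)
true_a, true_b = 2.0, -1.0
pts = [(x, true_a * x + true_b + random.gauss(0, 0.5)) for x in range(20)]

def sse(a, b):
    """Sum of squared errors of the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in pts)

def log_likelihood(a, b):
    """Gaussian log-likelihood, ignoring constants that do not depend on the line."""
    return -sse(a, b)

grid = [(a / 10, b / 10) for a in range(0, 41) for b in range(-30, 11)]
best_ml = max(grid, key=lambda ab: log_likelihood(*ab))
best_ls = min(grid, key=lambda ab: sse(*ab))
print(best_ml == best_ls, best_ml)   # True: both criteria pick the same line
```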
