From Bayesian Methods to Bayesian Networks



0 Introduction

In fact, there are plenty of books on Bayes' theorem, Bayesian methods, and Bayesian inference, such as A Brief History of Mathematical Statistics and James O. Berger's Statistical Decision Theory and Bayesian Analysis. However, Chinese material on Bayesian networks is scarce: only a handful of Chinese books exist, and most of the literature is in English. When a beginner is handed a pile of English papers at the outset, weak fundamentals plus the language barrier often make the papers too hard to read, and he regrettably gives up (of course, with some foundation in place, reading more of the English literature becomes feasible).

On the morning of November 9, in the 9th session of the machine learning class, Zou Bo lectured on Bayesian networks, which helped distill several key points: the definition of a Bayesian network, chain networks, tree networks, factor graphs, the idea of converting a non-tree network into a tree network, the Sum-Product algorithm, and so on. With these threads in hand, I believe the English papers become much easier to understand.

This article is therefore based on the slides on Bayesian networks from Zou Bo's 9th lecture, together with related references. Starting from Bayesian methods, it focuses on Bayesian networks, and it can still be regarded as a set of reading or lecture notes. If you spot any problems, please do not hesitate to point them out. Thanks.



1 Bayesian Method

For a long time, people held that the probability that a thing happens or does not happen is fixed: the probability θ might be unknown, but it is at least a single definite value. For example, if someone asked, "A bag contains some white balls and some black balls. What is the probability of drawing a white ball from the bag?", they would answer without a second thought that the probability of drawing a white ball is 1/2: either you draw a white ball or you do not, so θ can take only one value; and no matter how many times you draw, the probability θ of drawing a white ball is always 1/2, that is, it does not change with the observed result X.

This frequentist viewpoint dominated people's thinking for a long time, until a figure named Thomas Bayes came along.

1.1 The Proposal of the Bayesian Method

Thomas Bayes (1702-1763) was not well known in his day. He rarely published papers or books and had little contact with the academic community of his time; in today's terms, Bayes was an amateur "folk scholar." Yet this outsider eventually published a paper titled "An essay towards solving a problem in the doctrine of chances," that is, an essay on solving a problem in the theory of chances. You might expect me to say that the publication of this paper caused an instant sensation and thereby secured Bayes's place in the history of scholarship.

In fact, the paper had little impact when it was first published; only in the 20th century did it gradually come to people's attention. In this respect Bayes resembles Van Gogh, whose paintings were worth little during his lifetime yet became priceless after his death.

Return to the example above: "A bag contains some white balls and some black balls. What is the probability θ of drawing a white ball from the bag?" Bayes held that the probability of drawing a white ball is an uncertain value, because it contains an element of chance. For example, suppose a friend starts a business. You know there are only two possible outcomes, success or failure, but you still cannot help estimating his chance of success. If you know him well, and he has clear thinking, perseverance, and the ability to unite the people around him, you might estimate his chance of success at 80% or more. This way of thinking, unlike the earlier "black or white, 0 or 1" view, is the Bayesian way of thinking.

Before explaining the Bayesian method in more depth, let us briefly summarize the different ways of thinking of the frequentist school and the Bayesian school:

  • The frequentist school regards the parameter θ to be inferred as a fixed unknown constant: the probability θ is unknown, but it is at least a definite value. The sample X, meanwhile, is random, so the frequentist school focuses on the sample space, and most probability calculations concern the distribution of the sample X;
  • The Bayesian view is the opposite: the parameter θ is a random variable, while the sample X is fixed. Since the sample is fixed, the focus is on the distribution of the parameter θ.

Comparatively speaking, the frequentist viewpoint is easier to understand, so the Bayesian viewpoint is described below.

Since the Bayesian school regards θ as a random variable, to compute the distribution of θ we must first know its unconditional distribution: before any sample is drawn (or before X is observed), what distribution does θ follow?

For example, suppose I throw a ball onto a billiard table. Where will it land? If the throw is unbiased, the ball is equally likely to land anywhere on the table; that is, the ball's landing position is uniformly distributed over the table. This distribution, the basic premise that holds before the experiment, is called the prior distribution, or the unconditional distribution.

At this point, Bayes and the Bayesian school have proposed a fixed pattern for thinking about problems:

  • prior distribution π(θ) + sample information X ⇒ posterior distribution π(θ|X)

This thinking pattern means that newly observed sample information will revise people's previous cognition of things. In other words, before obtaining new sample information, people's cognition of θ is the prior distribution π(θ); after obtaining the new sample information X, people's cognition of θ becomes the posterior distribution π(θ|X).

Prior information generally comes from experience and historical data. For example, when Lin Dan plays a match against some opponent, commentators usually make a rough judgment about Lin Dan's chance of winning based on the results of their previous matches. Another example: a factory needs to run quality control on its products every day in order to estimate the defect rate θ of its products. After a period of time, a large amount of historical data accumulates, and these historical data constitute prior knowledge. With this prior knowledge, we have a basis for deciding whether a product really needs daily quality inspection: if historical data show that the defect rate of a product is only 0.01%, the product can be regarded as trustworthy or inspection-exempt, with only one or two spot checks per month, saving a great deal of manpower and material resources.

The posterior distribution π(θ|X) is generally regarded as the conditional distribution of θ given the sample X, and the value that maximizes it is called the maximum a posteriori (MAP) estimate, analogous to the maximum likelihood estimate in classical statistics.
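
To make this pattern concrete, here is a minimal Python sketch (my own illustration, not from the original text) using the white-ball example with a conjugate Beta prior: the prior is combined with observed draws to yield a posterior, whose mode is the MAP estimate just mentioned.

    # A minimal sketch, assuming a Beta(a, b) prior on theta, the
    # probability of drawing a white ball (Beta is conjugate to
    # Bernoulli draws, so the posterior stays a Beta distribution).

    def posterior_params(a, b, draws):
        """Update Beta(a, b) with observed draws (1 = white, 0 = black)."""
        whites = sum(draws)
        return a + whites, b + (len(draws) - whites)

    def beta_mode(a, b):
        """MAP estimate: the mode of Beta(a, b), defined for a, b > 1."""
        return (a - 1) / (a + b - 2)

    # Uniform prior Beta(1, 1): before any draw, every theta is equally likely.
    a, b = 1.0, 1.0
    draws = [1, 0, 1, 1, 0, 1, 1, 1]       # 6 white, 2 black observed
    a_post, b_post = posterior_params(a, b, draws)

    print(beta_mode(a_post, b_post))       # MAP: 6/8 = 0.75
    print(sum(draws) / len(draws))         # MLE: also 0.75 under a uniform prior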

In short, it seems that humans at first had only scant prior knowledge of nature, but through continued observation and experiment they obtained more samples and results, and people's understanding of the laws of nature became more and more refined. The Bayesian method therefore accords not only with the way people think in daily life but also with the way people come to understand nature; after continuous development, it eventually came to occupy half of the field of statistics.

Beyond the above thinking pattern, Bayes also put forward the world-famous Bayes' theorem.

1.2 Bayes' Theorem

Before introducing Bayes' theorem, let us first review a few definitions:

  • Conditional probability is the probability that event A occurs given that another event B has occurred. It is written P(A|B) and read "the probability of A given B."
  • Joint probability is the probability that two events occur together. The joint probability of A and B is written P(A∩B) or P(A, B).
  • Marginal probability (also called prior probability) is the probability of a single event. It is obtained by merging away the events not needed in the final result: in the joint probability, sum over them for discrete random variables, or integrate over them for continuous random variables. This is called marginalization. The marginal probability of A is written P(A), and that of B is P(B). (A small numeric sketch follows this list.)
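
To make these three definitions concrete, here is a tiny sketch (the joint-table numbers are my own, purely illustrative) that computes marginal, joint, and conditional probabilities from an explicit joint distribution over two binary events.

    # Toy joint distribution over binary events A and B (made-up numbers).
    # Keys are (a, b) with 1 = occurs, 0 = does not occur.
    joint = {
        (1, 1): 0.2, (1, 0): 0.1,
        (0, 1): 0.2, (0, 0): 0.5,
    }

    # Marginalization: sum the joint over the variable we do not need.
    p_A = sum(p for (a, _), p in joint.items() if a == 1)   # P(A) = 0.3
    p_B = sum(p for (_, b), p in joint.items() if b == 1)   # P(B) = 0.4

    # Conditional probability: P(A|B) = P(A, B) / P(B).
    p_A_given_B = joint[(1, 1)] / p_B                       # 0.2 / 0.4 = 0.5

    print(p_A, p_B, p_A_given_B)   # A and B are dependent: 0.5 != 0.3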

Next, consider this question: given that B has occurred, what is the probability P(A|B) that A occurs?

Bayes' theorem rests on the following Bayes formula:

P(A|B) = P(B|A) P(A) / P(B)

The derivation of this formula is actually very simple; it follows directly from the definition of conditional probability.

According to the definition of conditional probability, the probability of event A occurring given that event B has occurred is

P(A|B) = P(A∩B) / P(B)

Similarly, the probability of event B occurring given that event A has occurred is

P(B|A) = P(A∩B) / P(A)

Rearranging and combining the two equations above, we find:

P(A|B) P(B) = P(A∩B) = P(B|A) P(A)

Dividing both sides by P(B), assuming P(B) is nonzero, we obtain the formula expression of Bayes' theorem:

P(A|B) = P(B|A) P(A) / P(B)
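
A quick numeric check of the theorem (a toy example of my own, not from the original): two bags are equally likely to be chosen and a white ball is drawn; Bayes' theorem updates the probability that the draw came from the first bag.

    # Two bags, chosen uniformly at random (illustrative numbers):
    #   Bag 1 holds 3 white and 1 black ball; Bag 2 holds 1 white and 3 black.
    # Having drawn a white ball, which bag did it likely come from?
    p_bag1 = 0.5                  # prior P(A): the chosen bag is Bag 1
    p_white_given_bag1 = 3 / 4    # likelihood P(B|A)
    p_white_given_bag2 = 1 / 4

    # Marginal P(B) by the law of total probability.
    p_white = p_bag1 * p_white_given_bag1 + (1 - p_bag1) * p_white_given_bag2

    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
    print(p_white_given_bag1 * p_bag1 / p_white)   # 0.75: up from the 0.5 prior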

This article will not dwell further on Bayes' theorem. Next, we focus on Bayesian networks.



2 Bayesian Networks

According to Wikipedia, a Bayesian network, also known as a belief network or a directed acyclic graphical model, is a probabilistic graphical model consisting of a set of random variables and their conditional probability distributions (CPDs), organized in a directed acyclic graph (DAG).

In short, a Bayesian network is formed by drawing the random variables involved in the system under study into a directed graph according to whether they are conditionally independent.

Generally, the nodes of a Bayesian network's directed acyclic graph represent random variables, which may be observable variables, latent variables, unknown parameters, and so on. An arrow connecting two nodes indicates that the two random variables are causally related (or, not conditionally independent). If a single arrow points from one node to another, the former is a "parent" and the latter a "child," and the child is assigned a conditional probability value given its parents.

2.1 Definition of a Bayesian Network

Let G = (I, E) denote a directed acyclic graph (DAG), where I is the set of all nodes in the graph and E is the set of directed edges, and let X = (X_i), i ∈ I, be the random variables represented by the nodes i of the DAG. X is called a Bayesian network relative to the DAG G if its joint probability can be expressed as

p(x) = ∏_{i∈I} p(x_i | x_pa(i))

where pa(i) denotes the "causes" of node i, that is, its parents.

In addition, for any random variables, the joint probability can be obtained by multiplying their respective local conditional probability distributions:

p(x_1, …, x_K) = p(x_K | x_1, …, x_{K-1}) ⋯ p(x_2 | x_1) p(x_1)

The figure below (omitted here) shows a simple Bayesian network: a points to b, and both a and b point to c.

Because a causes b, and a and b together cause c, we have

p(a, b, c) = p(c | a, b) p(b | a) p(a)
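
As a concrete illustration of this factorization (all CPT numbers below are my own, purely illustrative), each node stores only its local conditional table, and the joint probability is the product of the local terms:

    # Factorization p(a, b, c) = p(a) * p(b|a) * p(c|a,b); all variables binary.
    p_a = {1: 0.6, 0: 0.4}                       # p(a)
    p_b_given_a = {1: {1: 0.7, 0: 0.3},          # p(b|a), indexed as [a][b]
                   0: {1: 0.2, 0: 0.8}}
    p_c_given_ab = {(1, 1): {1: 0.9, 0: 0.1},    # p(c|a,b), keyed by (a, b)
                    (1, 0): {1: 0.5, 0: 0.5},
                    (0, 1): {1: 0.4, 0: 0.6},
                    (0, 0): {1: 0.1, 0: 0.9}}

    def joint(a, b, c):
        """Joint probability read off the Bayesian-network factorization."""
        return p_a[a] * p_b_given_a[a][b] * p_c_given_ab[(a, b)][c]

    # The eight joint probabilities must sum to 1.
    total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
    print(total)   # 1.0 (up to floating-point rounding)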

2.2 A Bayesian Network Example

Consider the Bayesian network shown in the figure below (omitted here). The meaning of each node is as follows:

  • Smoking denotes smoking, with probability P(S). Lung Cancer denotes lung cancer; the probability that a person who smokes develops lung cancer is P(C|S). X-ray denotes needing a medical X-ray. Lung cancer may lead to an X-ray, and smoking may also lead to an X-ray (so smoking is likewise a cause of X-ray); the probability of an X-ray given smoking and lung cancer is therefore written P(X|C, S).
  • Bronchitis denotes bronchitis; the probability of bronchitis given smoking is P(B|S). Dyspnoea denotes difficulty breathing. Bronchitis may cause dyspnoea, and lung cancer may also cause dyspnoea (so lung cancer is likewise a cause of dyspnoea); the probability of dyspnoea given lung cancer and bronchitis is written P(D|C, B).

Abbreviate lung Cancer as C, Bronchitis as B, and dyspnoea as D. C = 1 denotes that lung cancer occurs and C = 0 that it does not; B and D take the values 0 and 1 in the same way. The conditional probability table of dyspnoea, giving P(D|C, B) for each combination of C and B, is shown in the lower-right corner of the figure.

2.3 D-Separation

Consider the Bayesian network over x1, x2, …, x7 shown below (figure omitted).

We can intuitively see from the figure that:

  • 1. The joint distribution of x1, x2, …, x7 is

p(x1, x2, …, x7) = p(x1) p(x2) p(x3) p(x4 | x1, x2, x3) p(x5 | x1, x3) p(x6 | x4) p(x7 | x4, x5)

  • 2. x1 and x2 are independent (corresponding to head-to-head);
  • 3. x6 and x7 are conditionally independent given x4 (corresponding to tail-to-tail).

Point 1 may be easy to understand, but what exactly do the conditional independencies described in points 2 and 3 mean? To answer this question, we need to introduce the concept of D-Separation.

D-Separation is a graphical method for deciding whether variables are conditionally independent. In other words, for a directed acyclic graph (DAG) E, the D-Separation method can quickly determine whether two nodes are conditionally independent.

2.3.1 head-to-head

Consider the Bayesian network shown below (figure omitted), in which a and b each point to c:

Here P(a, b, c) = P(a) P(b) P(c | a, b) holds. Summing both sides over c, and noting that the conditional probabilities P(c | a, b) sum to 1, we obtain:

P(a, b) = P(a) P(b)

That is, when c is unknown, the path between a and b is blocked, and they are independent. This is called the head-to-head conditional independence, corresponding to "x1 and x2 are independent" in the figure at the start of this section.

2.3.2 tail-to-tail

Consider the Bayesian network shown below (figure omitted), in which c points to both a and b:

We have P(a, b, c) = P(c) P(a | c) P(b | c) and P(a, b | c) = P(a, b, c) / P(c). Substituting the first equation into the second yields:

P(a, b | c) = P(a | c) P(b | c)

That is, given c, the path between a and b is blocked, and they are independent. This is called the tail-to-tail conditional independence, corresponding to "x6 and x7 are conditionally independent given x4" (the tail-to-tail case) in the figure at the start of this section.

2.3.3 head-to-tail

Consider the Bayesian network shown below (figure omitted), in which a points to c and c points to b:

Here P(a, b, c) = P(a) P(c | a) P(b | c).

Simplifying, since P(a) P(c | a) = P(a, c) = P(a | c) P(c):

P(a, b | c) = P(a, b, c) / P(c) = P(a) P(c | a) P(b | c) / P(c) = P(a | c) P(b | c)

That is, given c, the path between a and b is blocked, and they are independent. This is called the head-to-tail conditional independence.
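
These patterns can also be checked numerically. Below is a small sketch (the CPT numbers are my own, purely illustrative) verifying the head-to-head case: a and b come out independent when c is unobserved, but dependent once c is given; the tail-to-tail and head-to-tail cases can be checked the same way by swapping in their factorizations.

    # head-to-head factorization: p(a, b, c) = p(a) * p(b) * p(c|a,b)
    from itertools import product

    p_a = {0: 0.3, 1: 0.7}
    p_b = {0: 0.6, 1: 0.4}
    p_c_ab = {(a, b): {1: 0.5 - 0.1 * (a - b), 0: 0.5 + 0.1 * (a - b)}
              for a, b in product((0, 1), repeat=2)}
    joint = {(a, b, c): p_a[a] * p_b[b] * p_c_ab[(a, b)][c]
             for a, b, c in product((0, 1), repeat=3)}

    def marg(**fixed):
        """Marginal probability of the named variables (names: a, b, c)."""
        names = ("a", "b", "c")
        return sum(v for k, v in joint.items()
                   if all(k[names.index(n)] == x for n, x in fixed.items()))

    # c unknown: a and b are independent, P(a,b) = P(a) * P(b).
    print(abs(marg(a=1, b=1) - marg(a=1) * marg(b=1)) < 1e-12)    # True

    # c given: they become dependent, P(a,b|c) != P(a|c) * P(b|c).
    lhs = marg(a=1, b=1, c=1) / marg(c=1)
    rhs = (marg(a=1, c=1) / marg(c=1)) * (marg(b=1, c=1) / marg(c=1))
    print(abs(lhs - rhs) > 1e-6)                                  # True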

Aside: this head-to-tail structure is in fact a chain network, shown below (figure omitted):

Given x_i, the distribution of x_{i+1} is conditionally independent of x1, x2, …, x_{i-1}. That is, the distribution of x_{i+1} depends only on x_i and is conditionally independent of the other variables. A random process that evolves sequentially in this way is called a Markov chain (Markov model), and it satisfies:

P(x_{i+1} | x_1, …, x_i) = P(x_{i+1} | x_i)

This article will not expand on that topic here; Markov models will be elaborated in subsequent blog posts.

OK. Generalizing from single nodes to node sets: for any disjoint node sets A, B, and C, examine every path from any node in A to any node in B. For A and B to be conditionally independent given C, all of these paths must be blocked, that is, each path must satisfy one of the following two conditions:

  • the path contains a head-to-tail or tail-to-tail node that belongs to C; or
  • the path contains a head-to-head node that does not belong to C, and none of whose descendants belongs to C.

For example, the three cases of D-Separation described above are illustrated in the figure below (omitted here).

2.4 Factor Graphs

Return to the example in section 2.2, shown again below (figure omitted).

Suppose we want the probability that a person smokes given that he has difficulty breathing, that is:

P(S = 1 | D = 1)

Let us compute it step by step. By the definition of conditional probability,

P(S = 1 | D = 1) = P(S = 1, D = 1) / P(D = 1)

where the numerator and denominator are obtained by marginalizing the joint distribution over the remaining variables (the X-ray node sums out to 1 and can be ignored for this query):

P(S = 1, D = 1) = Σ_{C,B} P(S = 1, C, B, D = 1),  P(D = 1) = Σ_{S,C,B} P(S, C, B, D = 1)
To solve this kind of problem more efficiently, we need to introduce the concept of the factor graph.

According to Wikipedia, decomposing a global function of several variables into a product of local functions, and drawing a bipartite graph based on this factorization, yields a factor graph.

For a function g(x_1, …, x_n), suppose the following factorization holds:

g(x_1, …, x_n) = ∏_{j=1}^m f_j(S_j)

where S_j ⊆ {x_1, …, x_n} is the subset of variables that the local function f_j depends on.

The corresponding factor graph contains variable nodes x_1, …, x_n, factor nodes f_1, …, f_m, and edges. The edges are obtained from the factorization: there is an edge between factor node f_j and variable node x_k if and only if x_k ∈ S_j.

Factor graphs can be obtained from Bayesian networks, and probabilities are then computed on the factor graph following the idea of message passing (the Sum-Product algorithm).
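
To preview why this helps, here is a toy sketch (my own example, not the original's) of the core idea behind Sum-Product message passing: in a factored function, sums can be pushed inside the product, so a marginal costs far fewer operations than brute-force enumeration.

    # Toy chain factorization g(x1, x2, x3) = f1(x1, x2) * f2(x2, x3),
    # with all variables binary and made-up factor values.
    from itertools import product

    f1 = {(x1, x2): 0.5 + 0.25 * x1 * x2 for x1, x2 in product((0, 1), repeat=2)}
    f2 = {(x2, x3): 0.5 + 0.25 * x2 * x3 for x2, x3 in product((0, 1), repeat=2)}

    # Brute force: enumerate every joint assignment (exponential in general).
    brute = {x1: sum(f1[(x1, x2)] * f2[(x2, x3)]
                     for x2, x3 in product((0, 1), repeat=2))
             for x1 in (0, 1)}

    # Factored: push the sum over x3 inward as a "message" from f2 to x2,
    # then sum over x2 (linear in the length of the chain).
    msg_to_x2 = {x2: sum(f2[(x2, x3)] for x3 in (0, 1)) for x2 in (0, 1)}
    factored = {x1: sum(f1[(x1, x2)] * msg_to_x2[x2] for x2 in (0, 1))
                for x1 in (0, 1)}

    print(brute == factored)   # True: same marginals, fewer multiplications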

To be continued. The next update will come at half past eight on the evening of November 11…


