http://blog.csdn.net/pipisorry/article/details/51461997
Representation of the Bayesian network graphical model
To understand the role that graphs play in describing probability distributions, first consider an arbitrary joint distribution p(a, b, c) over three variables a, b, c. At this stage we need make no further assumptions about these variables, for instance whether they are discrete or continuous; indeed, a powerful aspect of graphical models is that a single graph can describe a whole class of probability distributions. By applying the product rule of probability (1.11), we can write the joint distribution in the form
p(a, b, c) = p(c | a, b) p(a, b).    (8.1)
Applying the product rule again, this time to the second factor on the right-hand side of (8.1), we obtain
p(a, b, c) = p(c | a, b) p(b | a) p(a).    (8.2)
Note that this decomposition holds for any choice of the joint distribution.
A simple graphical model representation
We now represent the right-hand side of (8.2) by a simple graphical model, as follows. First we introduce a node for each of the random variables a, b, c and associate each node with the corresponding conditional distribution on the right-hand side of (8.2). Then, for each conditional distribution, we add directed links (arrows) to the graph, starting from the nodes corresponding to the variables on which that distribution is conditioned. Thus for the factor p(c | a, b) there are links from nodes a and b to node c, whereas for the factor p(a) there are no incoming links. The result is the graph shown in Figure 8.1. If there is a link from node a to node b, we say that node a is the parent of node b and that node b is the child of node a. Note that we do not formally distinguish between a node and the variable to which it corresponds, but simply use the same symbol to refer to both. An interesting point about (8.2) is that its left-hand side is symmetric with respect to the three variables a, b, c, whereas the right-hand side is not. Indeed, in writing down the decomposition (8.2) we have implicitly chosen a particular ordering, namely a, b, c; had we chosen a different ordering we would have obtained a different decomposition and hence a different graphical representation.
Graphical model representation of a joint distribution over K variables
We can extend the example of Figure 8.1 to the joint distribution p(x_1, ..., x_K) over K variables. By repeated application of the product rule of probability, this joint distribution can be written as a product of conditional distributions, one for each of the variables:
p(x_1, ..., x_K) = p(x_K | x_1, ..., x_{K−1}) ··· p(x_2 | x_1) p(x_1).    (8.3)
For a given choice of K, we can represent this as a directed graph having K nodes, one for each conditional distribution on the right-hand side of (8.3), where each node receives incoming links from all lower-numbered nodes. We say that this graph is fully connected because there is a link between every pair of nodes.
Graphs with missing links
So far we have been working with a completely general joint distribution, so that the decomposition, and hence the corresponding fully connected graph representation, applies to any choice of distribution. As we shall see, it is the absence of links in the graph that conveys the interesting information about the properties of the class of distributions that the graph represents. Consider the graph of Figure 8.2. This is not a fully connected graph because, for instance, there is no link from x_1 to x_2 or from x_3 to x_7.
We now write down the joint distribution implied by this graph. The joint distribution is given by a product of conditional distributions, one for each node of the graph, and each such conditional distribution is conditioned only on the parents of the corresponding node. For instance, x_5 is conditioned only on x_1 and x_3. The joint distribution of all seven variables is therefore
p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, x_3) p(x_6 | x_4) p(x_7 | x_4, x_5).    (8.4)
The general relationship between a directed graph and the distribution over its variables
We can now state the general relationship between a given directed graph and the corresponding distribution over its variables. The joint distribution defined by the graph is given by the product, over all nodes of the graph, of a conditional distribution for each node conditioned on the variables corresponding to its parents. Thus, for a graph with K nodes, the joint distribution is given by
p(x) = ∏_{k=1}^{K} p(x_k | pa_k)    (8.5)
where pa_k denotes the set of parents of x_k and x = {x_1, ..., x_K}. This key equation expresses the factorization property of the joint distribution for a directed graphical model. Although we have so far considered the case in which each node corresponds to a single variable, we can equally well associate a node with a set of variables or with a vector-valued variable. It is easy to show that, provided each conditional distribution on the right-hand side of (8.5) is normalized, this representation is always normalized.
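As a concrete illustration of the factorization (8.5), the short sketch below evaluates a joint probability as a product of per-node conditional tables. The parent structure mirrors the seven-variable example of (8.4); the conditional probability tables themselves are randomly generated toy values, not anything taken from the text.

```python
import numpy as np

# Toy illustration of equation (8.5): the joint distribution factorizes into one
# conditional distribution p(x_k | pa_k) per node.  The parent structure below follows
# the seven-variable example of (8.4); the probability tables are made up.

parents = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [1, 3], 6: [4], 7: [4, 5]}

rng = np.random.default_rng(0)

def random_cpt(n_parents, K=2):
    """Random conditional probability table p(x_k | pa_k) for K-state variables."""
    table = rng.random((K,) * n_parents + (K,))
    return table / table.sum(axis=-1, keepdims=True)   # normalize over x_k

cpts = {k: random_cpt(len(pa)) for k, pa in parents.items()}

def joint_probability(x):
    """p(x_1, ..., x_7) = prod_k p(x_k | pa_k), with x a dict {node: state}."""
    p = 1.0
    for k, pa in parents.items():
        index = tuple(x[j] for j in pa) + (x[k],)
        p *= cpts[k][index]
    return p

x = {k: 0 for k in parents}          # one particular joint configuration
print(joint_probability(x))
```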
Example: Bayesian polynomial regression (the curve fitting problem of Section 1.2.6)
In the curve fitting problem, we are given the training data x and t together with a new test point x̂, and our goal is to predict the corresponding value of t̂. We therefore wish to evaluate the predictive distribution p(t̂ | x̂, x, t). Here we shall assume that the parameters α and β are fixed and known in advance (in later chapters we shall discuss how such parameters can be inferred from data in a Bayesian setting). Put simply, a Bayesian treatment amounts to a consistent application of the sum and product rules of probability from start to finish. The predictive distribution can therefore be written in the form
p(t̂ | x̂, x, t) = ∫ p(t̂ | x̂, w) p(w | x, t) dw.
Graphical representation of the joint distribution over the random variables
As an illustration of the use of graphs to describe probability distributions, we consider the Bayesian polynomial regression model introduced in Section 1.2.6. The random variables in this model are the vector of polynomial coefficients w and the observed data t = (t_1, ..., t_N)^T. In addition, the model contains the input data x = (x_1, ..., x_N)^T, the noise variance σ², and the hyperparameter α representing the precision of the Gaussian prior over w, all of which are parameters of the model rather than random variables.
Focusing for the moment just on the random variables, we see that the joint distribution is given by the product of the prior p(w) and the N conditional distributions p(t_n | w) for n = 1, ..., N, so that
p(t, w) = p(w) ∏_{n=1}^{N} p(t_n | w).    (8.6)
(Author's note: (8.6) is proportional to the posterior p(w | x, t).)
This joint distribution can be represented by the graphical model shown in Figure 8.3.
The plate notation for more complex models
When we start to deal with more complex models, it becomes inconvenient to write out nodes t_1, ..., t_N explicitly as in Figure 8.3. We therefore introduce a graphical notation that allows such multiple nodes to be expressed more compactly: we draw a single representative node t_n and surround it with a box, called a plate, labelled with N to indicate that there are N nodes of this kind. Rewriting Figure 8.3 in this way gives the graph shown in Figure 8.4.
Explicitly representing parameters and random variables
It is sometimes helpful to make the parameters of a model, as well as its random variables, explicit. Equation (8.6) then becomes
p(t, w | x, α, σ²) = p(w | α) ∏_{n=1}^{N} p(t_n | x_n, w, σ²).    (8.7)
Correspondingly, we can make x and α explicit in the graphical representation. To do this we adopt the convention that random variables are denoted by open circles and deterministic parameters by small solid circles. If we include the deterministic parameters in Figure 8.4, we obtain the graph shown in Figure 8.5.
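To make (8.7) concrete, here is a minimal sketch that draws the random variables w and t from this joint distribution, with the inputs x and the parameters held fixed. The polynomial order M and the values of α and σ are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# A minimal sketch of drawing the random variables (w, t) from the joint distribution
# p(w | alpha) * prod_n p(t_n | x_n, w, sigma^2) of equation (8.7), with the inputs x and
# the parameters held fixed.  The polynomial order M and the values of alpha and sigma
# are illustrative assumptions.

rng = np.random.default_rng(1)

M, alpha, sigma = 3, 2.0, 0.2                    # polynomial order, prior precision, noise std
x = np.linspace(0.0, 1.0, 10)                    # deterministic inputs x_1, ..., x_N
Phi = np.vander(x, M + 1, increasing=True)       # rows phi(x_n) = (1, x_n, ..., x_n^M)

w = rng.normal(0.0, alpha ** -0.5, size=M + 1)   # w ~ N(0, alpha^{-1} I)
t = rng.normal(Phi @ w, sigma)                   # t_n ~ N(w^T phi(x_n), sigma^2)
print(w)
print(t)
```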
Observed variables and latent variables
When we apply a graphical model to a problem in machine learning or pattern recognition, we will typically set some of the random variables to specific observed values, for example the variables {t_n} from the training set in polynomial curve fitting. In a graphical model we denote such observed variables by shading the corresponding nodes. Thus, if {t_n} are observed in the graph of Figure 8.5, it becomes the graph of Figure 8.6. Note that w is not observed, and so it is an example of a latent variable, also known as a hidden variable.
The posterior distribution of the coefficients w
Having observed the values {t_n}, we can, if desired, evaluate the posterior distribution of the coefficients w. For the moment we simply note that this is a direct application of Bayes' theorem:
p(w | t) ∝ p(w) ∏_{n=1}^{N} p(t_n | w)
where again we have omitted the deterministic parameters in order to keep the notation uncluttered.
The distribution of t̂ conditioned on the observed data
In general, we are not primarily interested in parameters such as w themselves, because our ultimate goal is to make predictions for new input values. Suppose we are given a new input value x̂ and we wish to find the corresponding distribution of t̂ conditioned on the observed data. The graphical model describing this problem is shown in Figure 8.7. The joint distribution of all of the random variables in this model (t̂, t, w), conditioned on the deterministic parameters, is then
p(t̂, t, w | x̂, x, α, σ²) = [ ∏_{n=1}^{N} p(t_n | x_n, w, σ²) ] p(w | α) p(t̂ | x̂, w, σ²).
The required predictive distribution for t̂ is then obtained, using the sum rule of probability, by integrating out the model parameters w:
p(t̂ | x̂, x, t) ∝ ∫ p(t̂, t, w | x̂, x, α, σ²) dw
where we are implicitly setting the random variables in t to the specific values observed in the data set.
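For this particular model the integral over w can be evaluated in closed form, because the prior and the likelihood are both Gaussian. The sketch below is a minimal illustration under the standard conjugate assumptions (Gaussian prior N(w | 0, α⁻¹I), Gaussian noise with precision β = 1/σ²); the formulas are the usual Bayesian linear-regression results rather than anything derived in this section, and the numerical values are hypothetical.

```python
import numpy as np

# A minimal sketch of the predictive distribution p(t_hat | x_hat, x, t) under the standard
# conjugate assumptions: Gaussian prior N(w | 0, alpha^{-1} I) and Gaussian noise with
# precision beta = 1/sigma^2.  With these assumptions the integral over w is analytic and
# the predictive distribution is Gaussian; the formulas below are the usual Bayesian
# linear-regression results, and the numerical values are hypothetical.

def posterior_and_predict(Phi, t, phi_hat, alpha, beta):
    """Posterior N(w | m_N, S_N) and predictive mean/variance at the feature vector phi_hat."""
    S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t                       # posterior mean of w
    pred_mean = m_N @ phi_hat                          # predictive mean of t_hat
    pred_var = 1.0 / beta + phi_hat @ S_N @ phi_hat    # predictive variance of t_hat
    return m_N, S_N, pred_mean, pred_var

# Hypothetical usage on synthetic data.
M, alpha, beta = 3, 2.0, 25.0
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + np.random.default_rng(2).normal(0.0, 0.2, x.size)
Phi = np.vander(x, M + 1, increasing=True)
phi_hat = np.vander(np.array([0.5]), M + 1, increasing=True)[0]
m_N, S_N, pred_mean, pred_var = posterior_and_predict(Phi, t, phi_hat, alpha, beta)
print(pred_mean, pred_var)
```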
Generative models and sampling
Ancestral sampling
In many situations we wish to draw samples from a given probability distribution. Here we give a brief outline of ancestral sampling, which is particularly relevant to graphical models. Consider a joint distribution p(x_1, ..., x_K) over K variables that factorizes according to (8.5), corresponding to a directed acyclic graph. We shall suppose that the variables have been ordered such that there are no links from any node to a lower-numbered node; in other words, each node has a higher number than any of its parents. Our goal is to draw a sample x̂_1, ..., x̂_K from this joint distribution.
To do this, we start with the lowest-numbered node and draw a sample from the distribution p(x_1), which we call x̂_1. We then work through each of the nodes in order, so that for node n we draw a sample from the conditional distribution p(x_n | pa_n), in which the parent variables have been set to their sampled values. Note that at each stage these parent values are always available, because they correspond to lower-numbered nodes that have already been sampled. Techniques for sampling from specific distributions will be discussed in detail in Chapter 11. Once we have sampled from the final variable x_K, we have achieved our objective of obtaining a sample from the joint distribution. To obtain a sample from the marginal distribution corresponding to a subset of the variables, we simply take the sampled values of the required nodes and ignore the sampled values of the remaining nodes. For example, to sample from the distribution p(x_2, x_4), we sample from the full joint distribution, retain the values x̂_2, x̂_4 and discard the remaining values {x̂_j : j ≠ 2, 4}.
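A minimal sketch of this procedure follows; the graph, the number of states K and the toy conditional distributions are all hypothetical, and the only point being illustrated is that each node is sampled after its parents, with the parents clamped to their sampled values.

```python
import numpy as np

# A minimal sketch of ancestral sampling: nodes are visited in an order in which every node
# comes after its parents, and each x_n is drawn from p(x_n | pa_n) with the parents clamped
# to their already-sampled values.  The graph, the number of states K and the toy conditional
# distributions below are all hypothetical.

rng = np.random.default_rng(0)

parents = {1: [], 2: [1], 3: [1], 4: [2, 3]}      # node -> list of parents, topologically ordered

def cpd(n, parent_values, K=3):
    """Toy conditional distribution p(x_n | pa_n): a fixed normalized vector per parent setting."""
    seed = hash((n,) + tuple(parent_values)) % (2**32)
    probs = np.random.default_rng(seed).random(K)
    return probs / probs.sum()

def ancestral_sample(parents):
    sample = {}
    for n in sorted(parents):                     # ancestors are always sampled first
        probs = cpd(n, [sample[j] for j in parents[n]])
        sample[n] = rng.choice(len(probs), p=probs)
    return sample

draws = [ancestral_sample(parents) for _ in range(5)]
# Sampling from the marginal p(x_2, x_4): keep those components, discard the rest.
print([(d[2], d[4]) for d in draws])
```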
Practical applications of probabilistic models
In practical applications of probabilistic models, it will typically be the higher-numbered (terminal) nodes of the graph that correspond to observations, with the latent variables being fewer in number. The primary role of the latent variables is to allow a complicated distribution over the observed variables to be represented in terms of a model constructed from simpler (typically exponential family) conditional distributions.
We can interpret such a model as expressing the process by which the observed data arose. For instance, consider an object recognition task in which each observation corresponds to an image (a vector of pixel intensities). In this case the latent variables might be interpreted as the position and orientation of the object. Given a particular observed image, our goal is to find the posterior distribution over object identities, in which we integrate over all possible positions and orientations. This problem can be represented by the graphical model of Figure 8.8.
(Figure 8.8: the image, a vector of pixel intensities, has a probability distribution that depends on the identity of the object as well as on its position and orientation.)
Generative models
The graphical model captures the causal process (Pearl, 1988) by which the observed data were generated. For this reason, such models are often called generative models. By contrast, the polynomial regression model described by Figure 8.5 is not generative, because there is no probability distribution associated with the input variable x, and so it is not possible to generate synthetic data points from this model. By introducing a suitable prior distribution p(x) we could turn it into a generative model, at the cost of increased model complexity.
The hidden variables in a probabilistic model need not, however, have any explicit physical interpretation; they may be introduced simply to allow a more complex joint distribution to be constructed from simpler components. In either case, ancestral sampling applied to a generative model mimics the creation of the observed data, and would therefore give rise to 'fantasy' data whose probability distribution (if the model were a perfect representation of reality) would be the same as that of the observed data. In practice, generating such synthetic observations from a generative model can be helpful in understanding the form of the probability distribution that the model represents.
Discrete Variables
{How to construct a joint probability distribution over a set of discrete variables}
Many well-known probability distributions arise as special cases of the exponential family. Although exponential family distributions are relatively simple, they form useful building blocks for constructing more complex probability distributions, and the framework of graphical models is very useful for expressing the way in which these building blocks are linked together.
Such models have particularly nice properties if we choose the relationship between each parent-child pair in the graph to be conjugate, and we shall explore several examples of this shortly. Two cases are particularly worth noting, namely when the parent and child each correspond to discrete variables and when they each correspond to Gaussian variables, because in these two cases the relationship can be extended hierarchically to construct arbitrarily complex directed acyclic graphs. We begin by examining the discrete case.
Representing the distribution of a single discrete variable
For a single discrete variable x having K possible states (using the 1-of-K representation), the probability distribution p(x | μ) is given by
p(x | μ) = ∏_{k=1}^{K} μ_k^{x_k}
and is governed by the parameters μ = (μ_1, ..., μ_K)^T. (Author's note: this is the multinomial (categorical) distribution.) Because of the constraint Σ_k μ_k = 1, only K − 1 values of μ_k need to be specified in order to define the distribution.
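A one-line check of this expression, using arbitrary illustrative values of μ: because exactly one component x_k equals 1 in the 1-of-K coding, the product simply picks out the corresponding μ_k.

```python
import numpy as np

# Quick check of p(x | mu) = prod_k mu_k^{x_k} for a 1-of-K coded x: since exactly one
# component x_k equals 1, the product just picks out the corresponding mu_k.  The values
# of mu are arbitrary (they only need to be non-negative and sum to 1).

mu = np.array([0.2, 0.5, 0.3])     # K = 3 states, so K - 1 = 2 free parameters
x = np.array([0, 1, 0])            # 1-of-K coding of the second state
print(np.prod(mu ** x))            # prints 0.5, i.e. mu_2
```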
Representing the joint distribution of two or more discrete variables
Now suppose that we have two discrete variables x_1 and x_2, each of which has K states, and that we wish to model their joint distribution. We denote the probability of observing both x_{1k} = 1 and x_{2l} = 1 by the parameter μ_{kl}, where x_{1k} denotes the k-th component of x_1, and similarly for x_{2l}. The joint distribution can be written
p(x_1, x_2 | μ) = ∏_{k=1}^{K} ∏_{l=1}^{K} μ_{kl}^{x_{1k} x_{2l}}.
Because the parameters μ_{kl} are subject to the normalization constraint Σ_k Σ_l μ_{kl} = 1, this distribution is governed by K² − 1 parameters. It is easily seen that an arbitrary joint distribution over M variables requires K^M − 1 parameters, and so the number of parameters grows exponentially with the number M of variables.
Reducing the number of independent parameters in the model
1 Removing links from the graph
Using the product rule, we can factorize the joint distribution p(x_1, x_2) as p(x_2 | x_1) p(x_1), which corresponds to a two-node graph with a link from node x_1 to node x_2, as shown in Figure 8.9(a). The marginal distribution p(x_1) is, as before, governed by K − 1 parameters. Similarly, the conditional distribution p(x_2 | x_1) requires the specification of K − 1 parameters for each of the K possible values of x_1. The total number of parameters that must be specified in the joint distribution is therefore (K − 1) + K(K − 1) = K² − 1, as before.
Now suppose instead that the variables x_1 and x_2 are independent, corresponding to the graphical model of Figure 8.9(b). Each variable is then described by a separate multinomial distribution, and the total number of parameters is 2(K − 1). For a distribution over M independent discrete variables, each having K states, the total number of parameters is M(K − 1), which grows only linearly with the number of variables. From a graphical perspective, we have reduced the number of parameters by dropping links in the graph, at the price of restricting the class of distributions that can be represented.
More generally, if we have M discrete variables x_1, ..., x_M, we can model the joint distribution using a directed graph with one node per variable. The conditional distribution at each node is given by a set of non-negative parameters subject to a normalization constraint. If the graph is fully connected, we have a completely general distribution having K^M − 1 parameters, whereas if there are no links in the graph the joint distribution factorizes into the product of the marginals, and the total number of parameters is M(K − 1). Graphs of intermediate levels of connectivity allow for more general distributions than the fully factorized one, while requiring fewer parameters than the general joint distribution. As an illustration, consider the chain of nodes shown in Figure 8.10. The marginal distribution p(x_1) requires K − 1 parameters, whereas each of the M − 1 conditional distributions p(x_i | x_{i−1}), for i = 2, ..., M, requires K(K − 1) parameters. The total number of parameters is therefore K − 1 + (M − 1)K(K − 1), which is quadratic in K and which grows linearly (rather than exponentially) with the length M of the chain.
2 Parameter sharing
Another way of reducing the number of independent parameters in a model is parameter sharing, also known as parameter tying. For instance, in the chain example of Figure 8.10, we can constrain all of the conditional distributions p(x_i | x_{i−1}), for i = 2, ..., M, to be governed by the same set of K(K − 1) parameters. Together with the K − 1 parameters governing x_1, the total number of parameters needed to specify the joint distribution is then K² − 1.
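The following short sketch simply evaluates the parameter counts discussed above for one arbitrary illustrative choice of K and M.

```python
# Parameter counts for M discrete variables with K states each, as discussed above.
# This just evaluates the formulas from the text for one arbitrary illustrative choice of K and M.

K, M = 3, 10
fully_connected = K**M - 1                          # general joint distribution
independent = M * (K - 1)                           # no links: product of marginals
chain = (K - 1) + (M - 1) * K * (K - 1)             # Markov chain of Figure 8.10
chain_tied = (K - 1) + K * (K - 1)                  # chain with tied parameters, i.e. K^2 - 1
print(fully_connected, independent, chain, chain_tied)   # 59048 20 56 8
```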
Bayesian models: introducing Dirichlet priors over the parameters
We can turn a graph over discrete variables into a Bayesian model by introducing Dirichlet priors for the parameters. From a graphical point of view, each node then acquires an additional parent representing the parameters governing the corresponding discrete distribution. This is illustrated in Figure 8.11.
If the parameters governing the conditional distributions p(x_i | x_{i−1}), for i = 2, ..., M, are tied (shared), the corresponding model is shown in Figure 8.12.
3 Using parameterized models for the conditional distributions
Another way of controlling the exponential growth in the number of parameters in models of discrete variables is to use parameterized models for the conditional distributions instead of complete tables of conditional probability values. To illustrate this idea, consider the graph of Figure 8.13, in which all of the nodes represent binary variables. Each of the parent variables x_i is governed by a single parameter μ_i representing the probability p(x_i = 1), giving M parameters in total for the parents. The conditional distribution p(y | x_1, ..., x_M), however, would in general require 2^M parameters, one specifying the probability p(y = 1) for each of the 2^M possible settings of the parent variables, so the number of parameters needed to specify this conditional distribution grows exponentially with M. We can obtain a more parsimonious form for the conditional distribution by applying a logistic sigmoid function to a linear combination of the parent variables:
p(y = 1 | x_1, ..., x_M) = σ(w_0 + Σ_{i=1}^{M} w_i x_i) = σ(w^T x)
where σ(a) = (1 + exp(−a))^{−1} is the logistic sigmoid function, x = (x_0, x_1, ..., x_M)^T is an (M + 1)-dimensional vector of parent states augmented with an additional variable x_0 whose value is clamped to 1, and w = (w_0, w_1, ..., w_M)^T is a vector of M + 1 parameters. This is a more restricted form of conditional distribution than the general case, but the number of parameters now grows linearly with M. In this sense it is analogous to choosing a restricted form (such as a diagonal matrix) for the covariance matrix of a multivariate Gaussian.
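A minimal sketch of this parameterized conditional follows; the parent states and the weights are arbitrary illustrative values, and the point is simply that only M + 1 numbers are needed rather than 2^M.

```python
import numpy as np

# The parameterized conditional distribution described above: p(y = 1 | x_1, ..., x_M) =
# sigma(w^T x), with x_0 clamped to 1 so that w_0 acts as a bias.  The parent states and the
# weights are arbitrary illustrative values; only M + 1 parameters are required instead of 2^M.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

parent_states = np.array([1, 0, 1, 1, 0])            # M = 5 binary parent variables
w = np.array([-0.5, 1.2, -0.3, 0.8, 0.1, -2.0])      # w_0 (bias) followed by w_1, ..., w_5
x = np.concatenate(([1], parent_states))             # prepend x_0 = 1
print(sigmoid(w @ x))                                # p(y = 1 | x_1, ..., x_M)
```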
Linear Gaussian model
{How a multivariate Gaussian distribution can be expressed as a directed graph corresponding to a linear-Gaussian model over its component variables}
This section shows how a multivariate Gaussian distribution can be expressed as a directed graph corresponding to a linear-Gaussian model over the component variables. This allows us to impose interesting structure on the distribution, with the general Gaussian and the diagonal-covariance Gaussian representing two opposite extremes. Several widely used techniques are examples of linear-Gaussian models, such as probabilistic principal component analysis, factor analysis and linear dynamical systems (Roweis and Ghahramani, 1999).
The joint distribution p(x) as a multivariate Gaussian
Consider an arbitrary directed acyclic graph over D variables, in which node i represents a single continuous random variable x_i having a Gaussian distribution. The mean of this distribution is taken to be a linear combination of the states of the parent nodes pa_i of node i:
p(x_i | pa_i) = N(x_i | Σ_{j ∈ pa_i} w_{ij} x_j + b_i, v_i)    (8.11)
where w_{ij} and b_i are parameters governing the mean, and v_i is the variance of the conditional distribution for x_i. The log of the joint distribution is then the log of the product of these conditional distributions over all nodes in the graph, and hence takes the form
ln p(x) = Σ_{i=1}^{D} ln p(x_i | pa_i) = − Σ_{i=1}^{D} (1 / (2 v_i)) (x_i − Σ_{j ∈ pa_i} w_{ij} x_j − b_i)² + const
where x = (x_1, ..., x_D)^T and 'const' denotes terms independent of x. We see that this is a quadratic function of the components of x, and hence the joint distribution p(x) is a multivariate Gaussian.
Recursive evaluation of the mean and covariance of the joint distribution
We can determine the mean and covariance of the joint distribution recursively, as follows. Conditioned on the states of its parents, each variable x_i has a Gaussian distribution of the form (8.11), and so we can write
x_i = Σ_{j ∈ pa_i} w_{ij} x_j + b_i + √(v_i) ε_i    (8.14)
where ε_i is a zero-mean, unit-variance Gaussian random variable satisfying E[ε_i] = 0 and E[ε_i ε_j] = I_{ij}, where I_{ij} is the (i, j) element of the identity matrix. Taking the expectation of (8.14), we have
E[x_i] = Σ_{j ∈ pa_i} w_{ij} E[x_j] + b_i.    (8.15)
Thus we can find the components of E[x] = (E[x_1], ..., E[x_D])^T by starting at the lowest-numbered node and working recursively through the graph (here we again assume that every node has a higher number than its parents). Similarly, using (8.14) and (8.15) we can obtain the (i, j) element of the covariance matrix of p(x) in a recursive manner:
cov[x_i, x_j] = Σ_{k ∈ pa_j} w_{jk} cov[x_i, x_k] + I_{ij} v_j.    (8.16)
The covariance can therefore also be evaluated recursively, again starting from the lowest-numbered node.
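The sketch below implements the recursions (8.15) and (8.16) for a small hypothetical graph (the w, b, v values are arbitrary), and checks the result against the closed form x = (I − W)⁻¹(b + √v·ε), which follows directly from (8.14).

```python
import numpy as np

# The recursions (8.15) and (8.16) for a small hypothetical linear-Gaussian graph, assuming
# the nodes are numbered so that parents always precede children (w is then strictly lower
# triangular).  The w, b, v values are arbitrary; the final check uses the closed form
# x = (I - W)^{-1} (b + sqrt(v) * eps), which follows directly from (8.14).

D = 3
w = np.array([[0.0, 0.0, 0.0],      # x_1 has no parents
              [0.7, 0.0, 0.0],      # x_2 depends on x_1
              [0.0, 0.5, 0.0]])     # x_3 depends on x_2 (no x_1 -> x_3 link)
b = np.array([1.0, -0.5, 2.0])
v = np.array([0.4, 0.3, 0.2])

# Mean: E[x_i] = sum_j w_ij E[x_j] + b_i, evaluated in node order -- equation (8.15).
mean = np.zeros(D)
for i in range(D):
    mean[i] = w[i] @ mean + b[i]

# Covariance: cov[x_i, x_j] = sum_k w_jk cov[x_i, x_k] + I_ij v_j -- equation (8.16).
cov = np.zeros((D, D))
for j in range(D):
    for i in range(j + 1):
        cov[i, j] = w[j] @ cov[i] + (v[j] if i == j else 0.0)
        cov[j, i] = cov[i, j]

# Consistency check against the closed-form mean and covariance.
A = np.linalg.inv(np.eye(D) - w)
assert np.allclose(mean, A @ b)
assert np.allclose(cov, A @ np.diag(v) @ A.T)
print(mean)
print(cov)
```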
Two extreme cases
Let us consider two extreme cases.
First, suppose there are no links in the graph, so that it consists of D isolated nodes. In this case there are no parameters w_{ij}, and we have only the D parameters b_i and the D parameters v_i. From the recursion relations (8.15) and (8.16) we see that the mean of p(x) is given by (b_1, ..., b_D)^T, while the covariance matrix is diagonal, of the form diag(v_1, ..., v_D). The joint distribution thus has a total of 2D parameters and represents a set of D independent univariate Gaussian distributions.
Now consider a fully connected graph in which every node has all lower-numbered nodes as its parents. The i-th row of the matrix of parameters w_{ij} then has i − 1 entries, so that the matrix is lower triangular (with no entries on the leading diagonal). The number of parameters w_{ij} is obtained by taking the number D² of elements of a D × D matrix, subtracting D to account for the missing diagonal, and dividing by 2 because the matrix has entries only below the diagonal, giving D(D − 1)/2 in total. Together with the D variances {v_i}, the independent parameters {w_{ij}} and {v_i} correspond to D(D + 1)/2 independent entries in the covariance matrix, which is the number required to describe a general symmetric covariance matrix.
Intermediate levels of complexity
Graphs having intermediate levels of complexity correspond to joint Gaussian distributions whose covariance matrices take particular restricted forms. Consider the graph of Figure 8.14, in which the link between variables x_1 and x_3 is missing. Using the recursion relations (8.15) and (8.16), the mean and covariance of the joint Gaussian distribution can be evaluated for this graph.
We can readily extend the linear-Gaussian graphical model to the case in which the nodes of the graph represent multivariate Gaussian variables. The conditional distribution for node i can then be written in the form
p(x_i | pa_i) = N(x_i | Σ_{j ∈ pa_i} W_{ij} x_j + b_i, Σ_i)
where W_{ij} is now a matrix, which is non-square if x_i and x_j have different dimensionalities. As before, it is easily verified that the joint distribution over all of the variables is Gaussian.
Note that we have already seen that the conjugate prior for the mean μ of a Gaussian variable x is itself a Gaussian distribution over μ, so we have already encountered a specific example of a linear-Gaussian relationship. The joint distribution over x and μ is therefore Gaussian. This corresponds to a simple two-node graph in which the node representing μ is the parent of the node representing x. The mean of the distribution over μ is a parameter controlling the prior and so can be regarded as a hyperparameter. Because the value of this hyperparameter may itself be unknown, we can again adopt a Bayesian viewpoint and introduce a prior over the hyperparameter, sometimes called a hyperprior, which is again given by a Gaussian distribution. This type of construction can, in principle, be extended to any level, and is an example of a hierarchical Bayesian model.
from:http://blog.csdn.net/pipisorry/article/details/51461997
Ref: PRML, Chapter 8: Graphical Models
PGM: Bayesian Network