http://blog.csdn.net/pipisorry/article/details/51525308
Implementation of Gibbs sampling
This article explains how to classify (cluster) documents with Gibbs sampling; a more complex application, sampling the topic distribution of the LDA topic model, shows the same machinery at larger scale [Topic model: Latent Dirichlet Allocation (LDA)].
The earlier introduction [Random sampling and random simulation: Gibbs sampling (why)] gives a detailed description of what Gibbs sampling is, but does not explain how a Gibbs sampler is actually implemented (how).
That is, how do you sample the following formula?
How do I handle continuous parameter problems in the model?
How do we compute the expected values of the quantities we are interested in, instead of just watching the sampler do a random walk?
Building a Gibbs sampler for naive Bayes [Probabilistic graphical models: Bayesian networks and naive Bayes] involves two big questions: how to use a conjugate prior, and how to do the actual probability sampling for the conditional distribution in equation (14).
Phi Blog
The Gibbs sampler for the naive Bayes model
Problem
Within the naive Bayes framework, we classify documents (both unsupervised and supervised) via Gibbs sampling. The features are the words in each document, and we want to predict a document-level label (e.g. a sentiment label) taking the value 0 or 1.
We first describe naive Bayes sampling for unsupervised data; the simplification for supervised data is explained later.
Following Pedersen [T. Pedersen. Knowledge lean word sense disambiguation. In AAAI/IAAI, page 814, 1997; T. Pedersen. Learning Probabilistic Models of Word Sense Disambiguation. PhD thesis, Southern Methodist University, 1998. http://arxiv.org/abs/0707.3972], we describe the Gibbs sampler in a completely unsupervised setting where no labels at all are provided as training data.
Model representation
The plate diagram of the naive Bayes model:
The meaning of each variable is given in the following table:
Given a document, we want to select the document label L that maximizes the following probability:
Document generation Process
Label generation for document J
Bag-of-words generation for document j
Priors
Where does π come from?
Hyperparameters: parameters of a prior, which is itself used to pick parameters of the model.
Our generative story assumes that before this whole process began, we also picked π randomly. Specifically, we assume that π is sampled from a Beta distribution with parameters γπ1 and γπ0.
In Figure 4 we represent these two hyperparameters as a single two-dimensional vector γπ = (γπ1, γπ0). When γπ1 = γπ0 = 1, Beta(γπ1, γπ0) is just a uniform distribution, which means that any value for π is equally likely. For this reason we call Beta(1, 1) an "uninformed prior".
Where do θ0 and θ1 come from?
Let γθ be a V-dimensional vector where the value of every dimension equals 1. If θ0 is sampled from Dirichlet(γθ), every probability distribution over words is equally likely. Similarly, we assume θ1 is sampled from Dirichlet(γθ).
Note: θ0 is the word distribution for documents with label 0, and θ1 is the word distribution for documents with label 1. θ0 and θ1 are sampled separately; there is no assumption that they are related to each other at all.
2.3 State space and initialization of the model
State space
Variable definition of state space in naive Bayesian model
- One scalar-valued variable π (the probability of label 1 for a document)
- Two vector-valued variables, θ0 and θ1
- Binary label variables L, one for each of the N documents
We also have one vector variable Wj for each of the N documents, but these are observed variables, i.e. their values are already known (which is why the Wjk are shaded in Figure 4).
Initialization
Pick a value π by sampling from the Beta(γπ1, γπ0) distribution. The label of document j then follows the Bernoulli distribution with parameters (π, 1 − π).
Then, for each j, flip a coin with success probability π, and assign label Lj(0), that is, the label of document j at the 0th iteration, based on the outcome of the coin flip. Each document's label is thus a draw from the Bernoulli distribution of the previous step.
Similarly, you also need to initialize θ0 and θ1 by sampling from Dirichlet(γθ).
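As an illustration (not code from the original article), the initialization steps above can be sketched in Python with NumPy; the number of documents N, vocabulary size V, and hyperparameter values are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10                            # number of documents (assumed)
V = 50                            # vocabulary size (assumed)
gamma_pi = (1.0, 1.0)             # Beta hyperparameters: the uninformed prior
gamma_theta = np.ones(V)          # Dirichlet hyperparameter vector, all ones

# Step 1: draw pi from Beta(gamma_pi1, gamma_pi0)
pi = rng.beta(gamma_pi[0], gamma_pi[1])

# Step 2: for each document j, flip a coin with success probability pi
L = rng.binomial(1, pi, size=N)   # L[j] is the 0th-iteration label of document j

# Step 3: draw theta_0 and theta_1 independently from Dirichlet(gamma_theta)
theta0 = rng.dirichlet(gamma_theta)
theta1 = rng.dirichlet(gamma_theta)
```

Note that θ0 and θ1 are two independent draws, matching the model's assumption that the two class distributions are unrelated.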
Deriving the joint distribution
For each iteration t = 1 ... T of sampling, we update every variable defining the state space by sampling from its conditional distribution given the other variables, as described in equation (14).
The procedure:
- We define the joint distribution of all the variables, corresponding to the numerator in (14).
- We simplify our expression for the joint distribution.
- We use our final expression of the joint distribution to define how to sample from the conditional distribution in (14).
- We give the final form of the sampler as pseudocode.
Representing and simplifying the joint distribution
The joint distribution of the model over the entire document set is
Note: to the right of the semicolon are the parameters of the joint distribution, meaning that the variables on the left side of the semicolon are conditioned on the hyperparameters on the right.
The joint distribution can be decomposed (via the graphical model) into:
Factor 1:
Factor 2:
Factor 3:
Probability of distribution of words
Factor 4:
P(C0 | θ0, L) and P(C1 | θ1, L): the probabilities of generating the contents of the bags of words in each of the two document classes.
Let θ = θ_Ln (the word distribution of document n's class):
W_ni: the frequency of word i in document Wn.
Documents are independent of each other, so the joint probability of the documents in the same class is:
Note: N_Cx(i) is the count of word i across documents with class label x.
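For concreteness, here is a small sketch of how the counts N_Cx(i) could be computed from a document-term matrix; the matrix W and labels L below are made up for illustration:

```python
import numpy as np

# Hypothetical document-term count matrix: W[j, i] is the frequency of
# word i in document j (5 documents, vocabulary of 4 words).
W = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 2, 2],
              [1, 0, 0, 1]])
L = np.array([0, 1, 0, 1, 0])     # current labels of the 5 documents

# N_Cx(i): total count of word i across documents with label x
N_C0 = W[L == 0].sum(axis=0)      # -> [4, 1, 1, 1]
N_C1 = W[L == 1].sum(axis=0)      # -> [0, 3, 2, 3]
```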
Writing out the joint distribution, and the reason for the choice of priors
Using (19) and (21):
Using (24) and (25):
Using the words from all documents (that is, using (24) and (27)):
The posterior distribution (30) is an unnormalized Beta distribution with parameters C1 + γπ1 and C0 + γπ0, and formula (32) is an unnormalized Dirichlet distribution with parameter vector N_Cx(i) + γθi for 1 ≤ i ≤ V.
That is, the prior and posterior distributions have the same form: the Beta distribution is a conjugate prior of the binomial (and Bernoulli) distribution, and the Dirichlet distribution is a conjugate prior of the multinomial distribution.
The hyperparameters act as pseudo-counts, as if they were observed evidence.
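A minimal numeric sketch of this pseudo-count view, with assumed label counts C1 and C0: by conjugacy, the posterior over π is again a Beta distribution whose parameters are the prior hyperparameters plus the observed counts:

```python
gamma_pi1, gamma_pi0 = 1.0, 1.0    # uninformed Beta prior
C1, C0 = 7, 3                      # assumed counts of documents with labels 1 and 0

# Posterior over pi is Beta(C1 + gamma_pi1, C0 + gamma_pi0):
# the hyperparameters behave exactly like extra observed documents.
post_a = C1 + gamma_pi1
post_b = C0 + gamma_pi0

# Posterior mean of pi under Beta(post_a, post_b) is a / (a + b)
post_mean = post_a / (post_a + post_b)   # (7 + 1) / (10 + 2) = 2/3
```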
With this, the joint distribution of the entire document set can be expressed as:
Integrating out π
Why: we can reduce the number of variables in the model by integrating π out. This has the effect of taking all possible values of π into account in our sampler, without representing it as a variable explicitly and having to sample it at every iteration. Intuitively, "integrating out" a variable is an application of precisely the same principle as computing the marginal probability for a discrete distribution. As a result, π is "there" conceptually, in terms of our understanding of the model, but we don't need to deal with manipulating it explicitly as a parameter.
Note: integrating π out means that
The marginal of the joint distribution is therefore:
Considering only the integral term:
The integral term in (38) is over an unnormalized Beta distribution with parameters C1 + γπ1 and C0 + γπ0, and its value is the normalizing constant of Beta(C1 + γπ1, C0 + γπ0):
Let n = c 0 + C 1
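Written out in full (a reconstruction from the definitions above, with B denoting the Beta function):

```latex
\int_0^1 \pi^{C_1+\gamma_{\pi 1}-1}\,(1-\pi)^{C_0+\gamma_{\pi 0}-1}\,d\pi
  = B\!\left(C_1+\gamma_{\pi 1},\; C_0+\gamma_{\pi 0}\right)
  = \frac{\Gamma(C_1+\gamma_{\pi 1})\,\Gamma(C_0+\gamma_{\pi 0})}
         {\Gamma(N+\gamma_{\pi 1}+\gamma_{\pi 0})},
\qquad N = C_0 + C_1 .
```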
Formula (38) can then be written as:
The joint distribution of the entire document set (three factors) is:
where N = c 0 + C 1
Constructing the Gibbs sampler
Gibbs sampling assigns a new value to each Zi by sampling from its conditional probability given everything else.
To sample a new label, we need to compute the conditional distribution
Note: there is no superscript on the bags of words C because they are fully observed and don't change from iteration to iteration.
To sample θ0, we need to compute the conditional distribution
Intuitively, before each iteration t begins, we have the following current information: the word counts for each document, the number of documents labeled 0, the number of documents labeled 1, the current label of each document, θ0 and θ1, and so on.
Sampling criteria
Sample label: when we want to sample the new label for document j, we temporarily remove all information (i.e. word counts and label information) about this document from the collection. Then we look at the conditional probability that Lj = 0 given all the remaining information, and the conditional probability that Lj = 1 given the same information, and we sample the new label Lj(t+1) by choosing randomly according to the relative weights of those two conditional probabilities.
Sample θ: sampling to get the new value of θ operates according to the same principle.
2.5.1 Sampling of document labels
Defining conditional probabilities
L(−j) denotes all the document labels except Lj, and C(−j) denotes all the documents except Wj.
The numerator is the full joint probability distribution, and the denominator is the same expression with the Wj information removed, so we only need to consider the three factors of formula (40).
In fact, all we have to do is consider what changes after Wj is removed.
Factor 1
Since factor 1 depends only on the hyperparameters, it is identical in numerator and denominator and cancels, so only factors 2 and 3 of formula (40) need to be considered.
Factor 2
The denominator of factor 2 in (42) depends on what Lj was in the previous iteration.
However, the corpus size always changes from N to N − 1, and the count of one of the two document categories decreases by 1. For example, if Lj = 0 then only C0 changes, so
Let x be the class for which Cx(−j) = Cx − 1; factor 2 of formula (42) can then be rewritten as:
Using the identity Γ(a + 1) = aΓ(a), which holds for all a,
factor 2 of formula (42) simplifies to:
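The Gamma-function identity used here can be checked numerically (a quick sanity check, not part of the original derivation):

```python
import math

# Numeric check of the identity Gamma(a + 1) = a * Gamma(a),
# which lets the ratio of Gamma terms in factor 2 collapse.
a = 4.5
lhs = math.gamma(a + 1)
rhs = a * math.gamma(a)
```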
Factor 3
As with factor 2, there is always one class whose term is unchanged; that is, in factor 3 of formula (42), one of θ0 and θ1 appears identically in the numerator and the denominator.
Merge
For x ∈ {0, 1}, the final merged conditional distribution for sampling the document label is
Formula (49) shows how the label of the document is selected:
Factor 1 of (49): the probability that Lj = x considering only the distribution of the other labels.
Factor 2 of (49): acts like a goodness-of-fit term, an indication of how well the words in Wj "fit" each of the two word distributions.
The sampling process for the conditional distribution in (42) is as follows:
Note: step 3 normalizes the probability distribution over the two labels.
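The label-sampling step described above might be sketched as follows. This is a hypothetical implementation assuming a document-term count matrix W and strictly positive θ rows; factor 2 is computed in log space for numerical stability, and the shared denominator of factor 1 is omitted since it cancels when normalizing:

```python
import numpy as np

rng = np.random.default_rng(1)

def resample_label(j, W, L, theta, gamma_pi=(1.0, 1.0)):
    """One Gibbs step for the label of document j, following formula (49).

    W     : (N, V) document-term count matrix
    L     : (N,) current labels, modified in place
    theta : (2, V) current word distributions theta_0 and theta_1
    """
    # Remove document j's information: class counts without document j
    C = np.array([(L == 0).sum(), (L == 1).sum()], dtype=float)
    C[L[j]] -= 1.0

    log_w = np.empty(2)
    for x in (0, 1):
        # Factor 1: C_x(-j) + gamma_pi_x  (shared denominator cancels)
        # Factor 2: prod_i theta_{x,i} ** W[j, i], computed in log space
        log_w[x] = np.log(C[x] + gamma_pi[x]) + W[j] @ np.log(theta[x])

    p = np.exp(log_w - log_w.max())
    p /= p.sum()                      # step 3: normalize over the two labels
    L[j] = rng.choice(2, p=p)         # draw the new label L_j(t+1)
    return p
```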
For supervised data
When using labeled documents, just don't sample Lj for those documents! Always keep Lj equal to the observed label.
These documents effectively serve as "ground truth" evidence for the distributions that created them. Since we never sample their labels, they always contribute to the counts and are never subtracted out.
2.5.2 Sampling θ
Since θ0 and θ1 are estimated independently, we drop the subscript on θ in what follows.
Obviously
Since we used conjugate priors, this posterior, like the prior, works out to be a Dirichlet distribution. We actually derived the full expression, but we don't need it here. All we need to do to sample a new distribution is to make another draw from a Dirichlet distribution, this time with parameters N_Cx(i) + γθi for each i in V.
Define the V dimensional vector t such that each:
Sampling formula for newθ
Implementing sampling from a Dirichlet distribution
To sample a random vector a = <a1, ..., aV> from the V-dimensional Dirichlet distribution with parameters <α1, ..., αV>,
the fastest implementation is to draw V independent samples y1, ..., yV from Gamma distributions, each with density
and then set (that is, normalize the Gamma samples):
[http://en.wikipedia.org/wiki/Dirichlet_distribution]
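Following that recipe, a sketch of Dirichlet sampling via normalized Gamma draws, here also applied to the θ-resampling step; the word counts below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def dirichlet_via_gamma(alpha):
    """Sample from Dirichlet(alpha) by normalizing independent Gamma draws:
    y_i ~ Gamma(alpha_i, 1), then a_i = y_i / sum(y)."""
    y = rng.gamma(shape=np.asarray(alpha, dtype=float), scale=1.0)
    return y / y.sum()

# Resampling theta_x: Dirichlet parameters are the per-word counts
# N_Cx(i) plus the pseudo-counts gamma_theta_i (hypothetical values).
counts = np.array([5.0, 0.0, 2.0, 1.0])   # N_Cx(i) for each word i
gamma_theta = np.ones(4)
theta_x = dirichlet_via_gamma(counts + gamma_theta)
```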
Summary: the prior definitions in the Gibbs sampling framework for the complete naive Bayes model:
γπ = <1, 1>: uninformed prior (uniform distribution)
Let γθ be a V-dimensional vector where the value of every dimension equals 1 (uninformed prior).
Model initialization:
Pick a value π by sampling from the Beta(γπ1, γπ0) distribution. The label of document j then follows the Bernoulli distribution with parameters (π, 1 − π).
Then, for each j, flip a coin with success probability π, and assign label Lj(0), that is, the label of document j at the 0th iteration, based on the outcome of the coin flip.
Similarly, you also need to initialize θ0 and θ1 by sampling from Dirichlet(γθ).
Model iterations:
2.5.1 Sampling formula for the label of document j
The 3rd step in the algorithm seems to be wrong; shouldn't it be removed?
Note: as soon as a new label for Lj is assigned, this changes the counts that will affect the labeling of the subsequent documents. This is, in fact, the whole principle behind a Gibbs sampler!
Generating values from the Gibbs sample
Both the initialization and the sampling iterations of the Gibbs sampling algorithm produce values for each variable (for iterations t = 1, 2, ..., T). In theory, the approximate expected value of any variable Zi can be obtained simply by calculating:
As we know, the Gibbs sampling iteration reaches its stationary distribution only after convergence, so the sum in formula (59) generally runs not from 1 but from B + 1 through T, discarding the sampling results with t ≤ B (the burn-in period).
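A minimal sketch of this burn-in rule; the sample values below are made up for illustration:

```python
import numpy as np

def posterior_mean(samples, burn_in):
    """Approximate E[Z] from Gibbs samples, discarding the first
    `burn_in` iterations (formula (59), summed from B + 1 to T)."""
    samples = np.asarray(samples, dtype=float)
    kept = samples[burn_in:]          # drop the t <= B samples
    return kept.mean(axis=0)

# Example: label samples for one document over T = 8 iterations;
# the early, pre-convergence iterations are discarded.
draws = [0, 0, 1, 1, 1, 0, 1, 1]
est = posterior_mean(draws, burn_in=3)   # mean of [1, 1, 0, 1, 1] = 0.8
```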
In this context, Jordan Boyd-Graber (personal communication) also recommends looking at Neal's discussion of likelihood as a metric of convergence.
PS
2.6 Optional: a note on integrating out continuous parameters
In Section 3 we discuss how to actually obtain values from a Gibbs sampler, as opposed to merely watching it walk around the state space (which might be entertaining, but isn't really the point). Our discussion includes convergence and burn-in, auto-correlation and lags, and other practical issues.
In Section 4 we conclude with pointers to other things you might find useful to read, as well as an invitation to tell us how we could make this document more accurate or more useful.
Finally, the author is left with a question: can Gibbs sampling be used to sample from a continuous n-dimensional Gaussian distribution? If so, how would it be implemented, via a Markov blanket?
from:http://blog.csdn.net/pipisorry/article/details/51525308
Ref: Philip Resnik, "Gibbs Sampling for the Uninitiated"
Random sampling and random simulation: the implementation of Gibbs sampling Gibbs sampling