Author-Topic Model Analysis


Generative Model for Documents

A document here consists of two parts: the set of authors of the article and the set of words in its content.

Let a_d denote the author set of document d and w_d its word set. For example:

Document 1: a_1 = {Zhang San, Li Si}, w_1 = {good, study, day, up};

Document 2: a_2 = {Wang Wu, Li Si}, w_2 = {search, knowledge, read, more, books}.

A corpus C of D documents can then be written as:

C = {(a_1, w_1), (a_2, w_2), ..., (a_D, w_D)}.

What is a generative model for documents?

In the author-topic model, each author is associated with a multinomial distribution over topics, and each topic is associated with a multinomial distribution over words.

How is the content of a document d generated?

For each word in document d, first select an author uniformly at random from the document's author set a_d, then select a topic according to that author's probability distribution over topics, and finally generate the word according to that topic's probability distribution over words.

If document d contains N_d words, this process is repeated N_d times.

Repeating it for every document generates the entire dataset.

Let's walk through this process with an example.

Document 1: a_1 = {Zhang San, Li Si}, w_1 = {good, study, day, up}.

The first word "good" is generated as follows. An author is chosen at random from a_1 = {Zhang San, Li Si}; suppose "Zhang San" is selected. A topic is then chosen according to Zhang San's probability distribution over topics, say (0.1, 0.6, 0.3); topic2 is the most likely, so suppose topic2 is selected. Finally, a word is drawn according to topic2's distribution over words; this time, "good" is drawn.
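The three-step draw just described (author, then topic, then word) can be sketched in a few lines of Python. Only Zhang San's topic distribution (0.1, 0.6, 0.3) comes from the example above; every other number and word probability below is a toy value invented for illustration.

```python
import random

# Toy parameters. Only Zhang San's (0.1, 0.6, 0.3) is from the text;
# all other values are made up for illustration.
author_topic = {
    "Zhang San": [0.1, 0.6, 0.3],
    "Li Si":     [0.4, 0.2, 0.4],
}
topic_word = [
    {"study": 0.5, "read": 0.5},      # topic1
    {"good": 0.7, "day": 0.3},        # topic2
    {"up": 0.6, "knowledge": 0.4},    # topic3
]

def generate_word(authors, rng=random):
    """Generate one word of a document and return (author, topic, word)."""
    author = rng.choice(authors)  # uniform over the document's author set
    topic = rng.choices(range(3), weights=author_topic[author])[0]
    words = list(topic_word[topic])
    word = rng.choices(words, weights=[topic_word[topic][w] for w in words])[0]
    return author, topic, word

# A document's content is generated by repeating the draw once per word.
doc = [generate_word(["Zhang San", "Li Si"])[2] for _ in range(4)]
print(doc)
```

With Zhang San's weight on topic2 and topic2's weight on "good", draws like the one in the example above are the most likely outcomes.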

The author-topic model assumes that all document content is generated in this way. The problem we face is the inverse: accepting that our dataset was generated by this process, we must infer the process itself, that is, estimate each author's probability distribution over topics and each topic's probability distribution over words.

Graphical Model

Next we discuss the graphical model of the author-topic model.

We can understand this figure as follows:

Arrows represent conditional dependencies between variables: an arrow from z to w means that w conditionally depends on z, i.e., w is generated probabilistically from z. A box (plate) indicates repeated sampling, with the number of repetitions marked in its lower-right corner; for example, a plate labeled N means the enclosed variables are sampled N times.

In the author-topic model, both the words w of a document and its co-author set are known, i.e., observed. In the graph they are drawn as filled circles.

Each author has a multinomial distribution θ over topics, and each topic has a multinomial distribution φ over words.

Both have symmetric Dirichlet priors; the hyperparameters α and β encode our prior knowledge. A denotes the total number of authors in the dataset, and T denotes the number of topics.

For each word in a document, we sample an author x from a uniform distribution over the document's author set, then sample a topic z from the multinomial distribution corresponding to x, and then sample a word from the multinomial distribution corresponding to topic z. Repeating this sampling process generates a document, and repeating it for all D documents generates the entire dataset.

Bayesian Estimation of the Model Parameters

In the author-topic model there are two groups of unknown parameters: the author-topic distributions θ and the topic-word distributions φ. We estimate them using Gibbs sampling.

For each word, sample an author and a topic according to the following formula:

P(z_i = j, x_i = k | w_i = m, z_-i, x_-i, w_-i, a_d) ∝ (C^WT_mj + β) / (Σ_m' C^WT_m'j + Vβ) × (C^AT_kj + α) / (Σ_j' C^AT_kj' + Tα)    (1)

Here z_i = j, x_i = k means that word i in a document is assigned to topic j and author k, and w_i = m means the i-th word is the m-th word in the vocabulary. z_-i and x_-i denote the topic and author assignments of all words other than word i.

C^WT_mj is the number of times word m has been assigned to topic j, and C^AT_kj is the number of times topic j has been assigned to author k, both counted excluding the current assignment.

V is the vocabulary size (the number of distinct words in the dataset) and T is the number of topics.
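The sampling formula can be evaluated directly from the two count matrices. A minimal sketch follows; the function name, toy counts, and hyperparameter values are ours, not from the original:

```python
def candidate_probs(word_m, authors, CWT, CAT, alpha, beta):
    """Unnormalized P(z_i = j, x_i = k | ...) from equation (1) for every
    (author k, topic j) pair. CWT is the V x T word-by-topic count matrix,
    CAT the author-by-topic count matrix, both excluding the current word."""
    V, T = len(CWT), len(CWT[0])
    topic_totals = [sum(CWT[m][j] for m in range(V)) for j in range(T)]
    probs = {}
    for k in authors:
        author_total = sum(CAT[k])
        for j in range(T):
            word_term = (CWT[word_m][j] + beta) / (topic_totals[j] + V * beta)
            author_term = (CAT[k][j] + alpha) / (author_total + T * alpha)
            probs[(k, j)] = word_term * author_term
    return probs

# Toy example: V = 4 words, T = 2 topics, 2 candidate authors.
CWT = [[2, 0], [1, 1], [0, 3], [1, 0]]
CAT = [[3, 1], [1, 4]]
probs = candidate_probs(0, [0, 1], CWT, CAT, alpha=0.1, beta=0.01)
print(probs)
```

Normalizing the returned values gives the distribution from which the new (author, topic) pair is drawn.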

During estimation, the algorithm only needs to track two count matrices: a V × T (word-by-topic) matrix and an A × T (author-by-topic) matrix.

Finally, the author-topic distributions and topic-word distributions are estimated from these two count matrices.

φ_mj = (C^WT_mj + β) / (Σ_m' C^WT_m'j + Vβ)    (2)

is the probability of word m under topic j, and

θ_kj = (C^AT_kj + α) / (Σ_j' C^AT_kj' + Tα)    (3)

is the probability of topic j for author k.
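Equations (2) and (3) are just smoothed normalizations of the two count matrices. A sketch, with names of our own choosing:

```python
def estimate_phi_theta(CWT, CAT, alpha, beta):
    """Equations (2) and (3): phi[m][j] is the probability of word m under
    topic j; theta[k][j] is the probability of topic j for author k."""
    V, T = len(CWT), len(CWT[0])
    topic_totals = [sum(CWT[m][j] for m in range(V)) for j in range(T)]
    phi = [[(CWT[m][j] + beta) / (topic_totals[j] + V * beta)
            for j in range(T)] for m in range(V)]
    theta = [[(CAT[k][j] + alpha) / (sum(CAT[k]) + T * alpha)
              for j in range(T)] for k in range(len(CAT))]
    return phi, theta

# Toy counts: V = 3 words, T = 2 topics, A = 2 authors.
phi, theta = estimate_phi_theta([[2, 0], [1, 1], [0, 3]], [[3, 1], [1, 4]],
                                alpha=0.1, beta=0.01)
```

Each column of phi and each row of theta sums to 1, i.e., each is a proper probability distribution.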

At the beginning, the algorithm randomly assigns a topic to each word of each article, and randomly assigns it an author (from that article's author set). Then, according to equation (1), every word in every article is resampled, and this sweep is repeated for I iterations.

Assume that the V × T (word-by-topic) count matrix and the A × T (author-by-topic) count matrix are as follows:

[count-matrix tables from the original are not reproduced here]

Document 1: a_1 = {Zhang San, Li Si}, w_1 = {good, study, day, up}.

How do we assign a topic and an author to the word "good"?

There are six possibilities in total: {Zhang San, topic1}, {Zhang San, topic2}, {Zhang San, topic3}, {Li Si, topic1}, {Li Si, topic2}, {Li Si, topic3}.

The probabilities of these six cases differ and are computed according to equation (1).

Similarly, we can compute the probabilities of the other cases. A random assignment is then drawn from this probability distribution.

Repeat this process to sample each word.

Finally, θ and φ are obtained from equations (2) and (3).
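The whole procedure (random initialization, repeated sweeps of equation (1), then equations (2) and (3) applied to the count matrices) can be sketched as a small collapsed Gibbs sampler. Everything below (names, toy corpus, hyperparameter values) is our own illustration, not code from the original:

```python
import random

def gibbs(docs, V, T, A, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for the author-topic model (sketch).
    docs: list of (author_ids, word_ids). Returns the word-by-topic and
    author-by-topic count matrices from which phi and theta are estimated."""
    rng = random.Random(seed)
    CWT = [[0] * T for _ in range(V)]   # word-by-topic counts
    CAT = [[0] * T for _ in range(A)]   # author-by-topic counts
    assign = []                          # (topic, author) per word per doc
    for authors, words in docs:          # random initialization
        za = []
        for w in words:
            z, x = rng.randrange(T), rng.choice(authors)
            CWT[w][z] += 1; CAT[x][z] += 1
            za.append((z, x))
        assign.append(za)
    topic_totals = [sum(CWT[w][j] for w in range(V)) for j in range(T)]
    for _ in range(iters):
        for d, (authors, words) in enumerate(docs):
            for i, w in enumerate(words):
                z, x = assign[d][i]      # remove the current assignment
                CWT[w][z] -= 1; CAT[x][z] -= 1; topic_totals[z] -= 1
                cands, weights = [], []
                for k in authors:        # equation (1) over (author, topic)
                    ak = sum(CAT[k])
                    for j in range(T):
                        p = ((CWT[w][j] + beta) / (topic_totals[j] + V * beta)
                             * (CAT[k][j] + alpha) / (ak + T * alpha))
                        cands.append((j, k)); weights.append(p)
                z, x = rng.choices(cands, weights=weights)[0]
                CWT[w][z] += 1; CAT[x][z] += 1; topic_totals[z] += 1
                assign[d][i] = (z, x)
    return CWT, CAT

# Tiny corpus: 2 documents, vocabulary of 5 words, 2 authors, 2 topics.
docs = [([0, 1], [0, 1, 2, 3]), ([1], [2, 3, 4])]
CWT, CAT = gibbs(docs, V=5, T=2, A=2, iters=5)
```

After the sweeps, passing CWT and CAT through equations (2) and (3) yields the estimated topic-word and author-topic distributions.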

Appendix:

The original paper contains a Bayesian network diagram. Understanding it requires only basic knowledge of Bayesian networks, so we list the essentials here.

For example, the joint probability P(a, b, c) = P(c | a, b) P(b | a) P(a) can be represented as a graph in which each arrow indicates a conditional dependency and each circle represents a random variable. In this way, we can easily draw the Bayesian network corresponding to a factorized joint probability.

For more complex probability models, such as one containing N conditionally generated variables t_1, ..., t_N, it is unrealistic to draw every random variable individually when N is very large. Instead, the repeated random variable is drawn inside a box (a plate):

the plate indicates that t_n is repeated N times.

In a probabilistic model, some random variables are observed while others must be estimated, and the two kinds are distinguished in the graph: a filled circle indicates an observed random variable, fixed to its observed value.
