We produce a great deal of text in our daily lives. If each text is stored as a document, then from the point of view of a human observer each document is an ordered sequence of words $d = (w_1, w_2, \cdots, w_n)$.
A corpus containing $M$ documents
The purpose of statistical text modeling is to ask how the word sequences in a corpus are generated. Statistics has been described as the game of guessing how God plays dice: all the text corpora produced by human beings can be viewed as generated by a great God in heaven throwing dice. We observe only the outcome of God's game, the word sequences that make up the corpus, while the process of the game is a black box to us. So in statistical text modeling we want to guess how God plays this game. Concretely, there are two core questions: what kind of dice does God have, and how does God throw them?
The first question asks what the parameters of the model are; the probability of each face of a die corresponds to a parameter of the model. The second question asks what the rules of the game are: God may have several different kinds of dice, and God throws them according to certain rules to produce a word sequence.
God rolls the dice
4.1 Unigram Model
Suppose our vocabulary contains $V$ words $v_1, v_2, \cdots, v_V$. The simplest Unigram Model assumes that God produces text according to the following rules of the game.
God has only one die, and the probability of each of its faces is $\vec{p} = (p_1, p_2, \cdots, p_V)$. Each throw of the die is therefore like a coin toss in a Bernoulli experiment, only with $V$ outcomes instead of two; we write it as $w \sim \mathrm{Mult}(w|\vec{p})$.
God throws a V-sided die
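To make the rules concrete, here is a minimal sketch of this generative process in Python; the toy vocabulary, the face probabilities, and the use of NumPy are illustrative assumptions, not part of the original article.

```python
# Unigram Model: God repeatedly throws a single V-sided die, one word per throw.
# The vocabulary and probabilities below are made-up toy values.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["grammar", "probability", "memory", "model", "sentence"]   # V = 5 faces
p = np.array([0.30, 0.25, 0.20, 0.15, 0.10])                        # face probabilities p_k

def generate_document(n_words: int) -> list[str]:
    """Throw the die n_words times; each throw independently produces one word."""
    faces = rng.choice(len(vocab), size=n_words, p=p)
    return [vocab[i] for i in faces]

print(generate_document(8))
```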
For a document $d = \vec{w} = (w_1, w_2, \cdots, w_n)$, the probability that this document is generated is
$$p(\vec{w}) = p(w_1, w_2, \cdots, w_n) = p(w_1)\, p(w_2) \cdots p(w_n)$$
We also regard documents as independent of one another, so if the corpus contains multiple documents $W = (\vec{w}_1, \vec{w}_2, \cdots, \vec{w}_m)$, the probability of the corpus is
$$p(W) = p(\vec{w}_1)\, p(\vec{w}_2) \cdots p(\vec{w}_m)$$
In the Unigram Model, we assume that documents are independent and exchangeable, and that the words within a document are also independent and exchangeable. A document is therefore equivalent to a bag containing some words, and the order of the words carries no information; such a model is also called a bag-of-words model.
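Because only the word counts matter, the probability of a document does not change when its words are shuffled. A minimal sketch of this fact, reusing the toy vocabulary and die above:

```python
# Bag-of-words: p(w_1,...,w_n) depends only on how often each word occurs.
from collections import Counter
import numpy as np

vocab = ["grammar", "probability", "memory", "model", "sentence"]
p = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
index = {w: i for i, w in enumerate(vocab)}

def log_prob(doc: list[str]) -> float:
    """log p(doc) = sum_k n_k * log p_k, independent of word order."""
    counts = Counter(doc)
    return sum(n * np.log(p[index[w]]) for w, n in counts.items())

d1 = ["model", "grammar", "grammar", "sentence"]
d2 = ["grammar", "sentence", "model", "grammar"]    # same bag, different order
assert np.isclose(log_prob(d1), log_prob(d2))
```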
Suppose the total number of words in the corpus is $N$. If, among these $N$ words, we track the number of occurrences $n_i$ of each word $v_i$, then $\vec{n} = (n_1, n_2, \cdots, n_V)$ follows a multinomial distribution
$$p(\vec{n}) = \mathrm{Mult}(\vec{n}|\vec{p}, N) = \binom{N}{\vec{n}} \prod_{k=1}^{V} p_k^{n_k}$$
At this point, the probability of the corpus is
$$p(W) = p(\vec{w}_1)\, p(\vec{w}_2) \cdots p(\vec{w}_m) = \prod_{k=1}^{V} p_k^{n_k}$$
Of course, one of our important tasks is to estimate the parameters of the model, that is, to ask what the face probabilities $\vec{p}$ of the die are. Following the frequentist school of statistics, we maximize $p(W)$ by maximum likelihood estimation, and the resulting estimate of the parameter $p_i$ is
$$\hat{p}_i = \frac{n_i}{N}$$
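A minimal sketch of this maximum likelihood estimate on a toy corpus (the documents below are made up for illustration):

```python
# Frequentist estimate: p_i = n_i / N, where n_i is the count of word v_i
# and N is the total number of words in the corpus.
from collections import Counter

corpus = [
    ["grammar", "sentence", "grammar"],
    ["probability", "model", "grammar"],
]

counts = Counter(w for doc in corpus for w in doc)    # n_i for each observed word
N = sum(counts.values())                              # total word count N = 6
p_mle = {w: n / N for w, n in counts.items()}

print(p_mle)   # {'grammar': 0.5, 'sentence': 0.1667, ...}; unseen words get 0
```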
Bayesian statisticians will disagree with the model above; they will sharply criticize the assumption that God has only one fixed die as unreasonable. In the Bayesian view, all parameters are random variables: the die $\vec{p}$ in the model above is not a fixed constant but itself a random variable. So, according to the philosophy of the Bayesian school, God plays the game by the following process.
God has a jar containing infinitely many dice. Some types of dice are more numerous in the jar and others less so; from the point of view of probability, the dice $\vec{p}$ in the jar follow a probability distribution $p(\vec{p})$, which is called the prior distribution of the parameter $\vec{p}$.
Unigram Model under the Bayesian perspective
Under the game rules of the Bayesian school, how do we compute the probability of producing the corpus $W$? Since we do not know which die $\vec{p}$ God used, every die could have been used, with a probability of use determined by the prior distribution $p(\vec{p})$. For each specific die $\vec{p}$, the probability of producing the data is $p(W|\vec{p})$, so the overall probability of generating the data is obtained by summing (integrating) the probability of the data over every possible die $\vec{p}$:
$$p(W) = \int p(W|\vec{p})\, p(\vec{p})\, d\vec{p}$$
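To build intuition for this integral, here is a minimal Monte Carlo sketch that draws many dice from the jar and averages the likelihoods; it jumps slightly ahead by using a Dirichlet prior (the choice introduced just below), and the counts and alpha values are toy assumptions.

```python
# Monte Carlo approximation of p(W) = integral of p(W|p) p(p) dp:
# sample dice p from the prior, average the corpus likelihoods p(W|p).
import numpy as np

rng = np.random.default_rng(0)

n = np.array([3, 1, 1, 1, 0])        # observed word counts n_k in the corpus
alpha = np.ones(5)                   # symmetric Dirichlet prior over dice

samples = rng.dirichlet(alpha, size=100_000)     # dice drawn from God's jar
likelihoods = np.prod(samples ** n, axis=1)      # p(W|p) for each sampled die
print("p(W) ≈", likelihoods.mean())              # ≈ 3.97e-4 for these toy values
```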
Within the Bayesian analysis framework, the prior distribution $p(\vec{p})$ can be chosen in many ways. Noting that
$$p(\vec{n}) = \mathrm{Mult}(\vec{n}|\vec{p}, N)$$
is in fact the probability of a multinomial distribution, a good choice of prior is the conjugate distribution of the multinomial, namely the Dirichlet distribution
$$\mathrm{Dir}(\vec{p}|\vec{\alpha}) = \frac{1}{\Delta(\vec{\alpha})} \prod_{k=1}^{V} p_k^{\alpha_k - 1}, \qquad \vec{\alpha} = (\alpha_1, \cdots, \alpha_V)$$
Here $\Delta(\vec{\alpha})$ is the normalization factor of the Dirichlet distribution, i.e.
$$\Delta(\vec{\alpha}) = \int \prod_{k=1}^{V} p_k^{\alpha_k - 1}\, d\vec{p}.$$
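It is worth recalling (this is the standard closed form covered in the earlier sections of this series on the Gamma function and the Dirichlet distribution) that this normalization factor can be written in terms of Gamma functions:

$$\Delta(\vec{\alpha}) = \frac{\prod_{k=1}^{V} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{V} \alpha_k\right)}$$

which is what makes the corpus probability in equation (3) below easy to evaluate.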
The Unigram Model under the Dirichlet prior
Probabilistic graphical model of the Unigram Model
Recall from the previous section what we know about the Dirichlet distribution; the key property is
Dirichlet prior + multinomial data → Dirichlet posterior
$$\mathrm{Dir}(\vec{p}|\vec{\alpha}) + \mathrm{MultCount}(\vec{n}) = \mathrm{Dir}(\vec{p}|\vec{\alpha} + \vec{n})$$
Thus, given the prior distribution $\mathrm{Dir}(\vec{p}|\vec{\alpha})$ for the parameter $\vec{p}$, and knowing that the word counts in the data follow a multinomial distribution $\vec{n} \sim \mathrm{Mult}(\vec{n}|\vec{p}, N)$, we can write down the posterior distribution without any further calculation:
$$p(\vec{p}\,|\,W, \vec{\alpha}) = \mathrm{Dir}(\vec{p}\,|\,\vec{n} + \vec{\alpha}) = \frac{1}{\Delta(\vec{n} + \vec{\alpha})} \prod_{k=1}^{V} p_k^{n_k + \alpha_k - 1} \qquad (1)$$
In the Bayesian framework, how should the parameter $\vec{p}$ be estimated? Since we already have its posterior distribution, a reasonable approach is to take either the mode of the posterior or the mean of the parameter under the posterior. In this article we take the posterior mean as the estimate of the parameter. Using the conclusion of the previous section, since the posterior distribution of $\vec{p}$ is $\mathrm{Dir}(\vec{p}|\vec{n} + \vec{\alpha})$, we have
$$E(\vec{p}) = \left( \frac{n_1 + \alpha_1}{\sum_{i=1}^{V}(n_i + \alpha_i)},\ \frac{n_2 + \alpha_2}{\sum_{i=1}^{V}(n_i + \alpha_i)},\ \cdots,\ \frac{n_V + \alpha_V}{\sum_{i=1}^{V}(n_i + \alpha_i)} \right)$$
In other words, for each $p_i$ we use the following formula as its estimate:
$$\hat{p}_i = \frac{n_i + \alpha_i}{\sum_{i=1}^{V}(n_i + \alpha_i)} \qquad (2)$$
Considering that the physical meaning of $\alpha_i$ in the Dirichlet distribution is the prior pseudo-count of the corresponding event, this estimate has a straightforward interpretation: the estimate of each parameter is the proportion of its event's prior pseudo-count plus its observed count in the data, relative to the total of all such counts.
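A minimal sketch contrasting the frequentist estimate with this posterior-mean (pseudo-count) estimate; the counts and the choice $\alpha_i = 1$ are toy assumptions.

```python
# Equation (2): p_i = (n_i + alpha_i) / sum_k (n_k + alpha_k).
# Pseudo-counts keep every word's probability strictly positive.
import numpy as np

n = np.array([3, 1, 1, 1, 0], dtype=float)   # observed counts n_i
alpha = np.ones_like(n)                      # prior pseudo-counts alpha_i

p_mle = n / n.sum()                          # frequentist estimate n_i / N
p_bayes = (n + alpha) / (n + alpha).sum()    # Bayesian posterior mean, eq. (2)

print(p_mle)     # [0.5   0.167 0.167 0.167 0.   ] -- unseen word gets 0
print(p_bayes)   # [0.364 0.182 0.182 0.182 0.091] -- unseen word gets 1/11
```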
Furthermore, we can compute the probability of the text corpus as
$$
\begin{aligned}
p(W|\vec{\alpha}) &= \int p(W|\vec{p})\, p(\vec{p}|\vec{\alpha})\, d\vec{p} \\
&= \int \prod_{k=1}^{V} p_k^{n_k}\, \mathrm{Dir}(\vec{p}|\vec{\alpha})\, d\vec{p} \\
&= \int \prod_{k=1}^{V} p_k^{n_k}\, \frac{1}{\Delta(\vec{\alpha})} \prod_{k=1}^{V} p_k^{\alpha_k - 1}\, d\vec{p} \\
&= \frac{1}{\Delta(\vec{\alpha})} \int \prod_{k=1}^{V} p_k^{n_k + \alpha_k - 1}\, d\vec{p} \\
&= \frac{\Delta(\vec{n} + \vec{\alpha})}{\Delta(\vec{\alpha})}
\end{aligned}
\qquad (3)
$$
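Using the Gamma-function form of $\Delta(\vec{\alpha})$ recalled above, equation (3) is easy to evaluate numerically in log space; here is a minimal sketch with the same toy counts as before (the values are illustrative only):

```python
# Equation (3): p(W|alpha) = Delta(n + alpha) / Delta(alpha), computed via
# log Delta(a) = sum_k log Gamma(a_k) - log Gamma(sum_k a_k).
import numpy as np
from scipy.special import gammaln

def log_delta(a: np.ndarray) -> float:
    return gammaln(a).sum() - gammaln(a.sum())

n = np.array([3.0, 1.0, 1.0, 1.0, 0.0])
alpha = np.ones(5)

log_evidence = log_delta(n + alpha) - log_delta(alpha)
print(np.exp(log_evidence))   # ≈ 3.97e-4, matching the Monte Carlo sketch earlier
```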
4.2 Topic Model and pLSA
The Unigram Model above is a very simple model, and its assumptions look too naive: they are quite far from the process by which people actually write and produce each word. Is there a better model?
Let us consider how people actually think when writing an article in everyday life. If we are going to write an article, we usually first decide which topics to write about. For instance, when planning an article related to natural language processing, perhaps 40% of it will discuss linguistics, 30% probability and statistics, 20% computers, and 10% other topics. When it comes to linguistics, the words that readily come to mind include: grammar, sentence, Chomsky, syntactic analysis, subject, ... When talking about probability and statistics, it is easy to think of: probability, model, mean, variance, proof, independent, Markov chain, ... When talking about computers, the words that come to mind are: memory, hard disk, programming, binary, object, algorithm, complexity, ...
The reason we can think of these words right away is that they have a high probability of appearing under the corresponding topic. It is then natural to see that an article is usually composed of multiple topics, and each topic can be described by the words that occur with high frequency under it.
This intuitive idea was first given a mathematical formulation in the pLSA (Probabilistic Latent Semantic Analysis) model proposed by Hofmann in 1999. Hofmann held that a document can be a mixture of multiple topics, that each topic is a probability distribution over words, and that each word in an article is generated by one particular topic. The figure below shows examples of several topics in English.
A topic is a probability distribution over the vocabulary.
All human thinking and writing can be regarded as God's behavior, so let us return to the God analogy. In the pLSA model, Hofmann assumed that God generates text according to the following rules of the game.
The document generation process of the above pLSA model can be represented graphically as
Document generation process of the pLSA model
We can see that under these game rules, documents remain independent and exchangeable, and the words within a document are also independent and exchangeable; this is still a bag-of-words model. There are $K$ topic-word dice in the game, which we denote $\vec{\varphi}_1, \cdots, \vec{\varphi}_K$. For each document $d_m$ in the corpus $C = (d_1, d_2, \cdots, d_M)$ of $M$ documents, there is a specific doc-topic die $\vec{\theta}_m$, and all such dice are denoted $\vec{\theta}_1, \cdots, \vec{\theta}_M$. For convenience, we assume that each word $w$ is a number corresponding to a face of the topic-word dice. Then in the pLSA model, the generation probability of each word $w$ in the $m$-th document $d_m$ is
$$p(w|d_m) = \sum_{z=1}^{K} p(w|z)\, p(z|d_m) = \sum_{z=1}^{K} \varphi_{zw}\, \theta_{mz}$$
So the generation probability of the entire document is
$$p(\vec{w}|d_m) = \prod_{i=1}^{n} \sum_{z=1}^{K} p(w_i|z)\, p(z|d_m) = \prod_{i=1}^{n} \sum_{z=1}^{K} \varphi_{z w_i}\, \theta_{mz}$$
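A minimal sketch of these two formulas with made-up topic-word dice $\varphi$ and a made-up doc-topic die $\theta_m$ (the matrices and word ids are toy assumptions):

```python
# pLSA: p(w|d_m) = sum_z phi_{zw} * theta_{mz};
# a document's probability is the product of its words' probabilities.
import numpy as np

phi = np.array([[0.4, 0.3, 0.1, 0.1, 0.1],    # K=2 topic-word dice, rows sum to 1
                [0.1, 0.1, 0.2, 0.3, 0.3]])
theta_m = np.array([0.7, 0.3])                # doc-topic die for document d_m

def p_word(w: int) -> float:
    """p(w|d_m) = sum over topics z of p(w|z) * p(z|d_m)."""
    return float(theta_m @ phi[:, w])

doc = [0, 3, 1, 1, 4]                         # word ids of document d_m
log_p_doc = sum(np.log(p_word(w)) for w in doc)
print(log_p_doc)
```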
Because documents are independent of each other, it is just as easy to write down the generation probability of the whole corpus. To fit this pLSA topic model, the well-known EM algorithm can be used to obtain a locally optimal solution. Since solving the model is not the focus of this article, interested readers are referred to Hofmann's original paper; we omit the details here.
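For readers who want a feel for how the fit works, here is a minimal sketch of the standard EM updates for pLSA (this is the textbook algorithm, not taken from the original article, and the document-word count matrix is a toy assumption):

```python
# EM for pLSA on a toy document-word count matrix n_dw.
# E-step: q(z|d,w) proportional to theta_{dz} * phi_{zw}.
# M-step: re-estimate theta and phi from the expected counts n_dw * q.
import numpy as np

rng = np.random.default_rng(0)
n_dw = np.array([[3, 1, 0, 0, 1],
                 [0, 0, 2, 3, 1]], dtype=float)   # M=2 docs, V=5 words
M, V = n_dw.shape
K = 2

theta = rng.dirichlet(np.ones(K), size=M)   # p(z|d), shape (M, K)
phi = rng.dirichlet(np.ones(V), size=K)     # p(w|z), shape (K, V)

for _ in range(50):
    q = theta[:, :, None] * phi[None, :, :]       # shape (M, K, V)
    q /= q.sum(axis=1, keepdims=True)             # normalize over topics z
    expected = n_dw[:, None, :] * q               # expected topic counts
    theta = expected.sum(axis=2)
    theta /= theta.sum(axis=1, keepdims=True)
    phi = expected.sum(axis=0)
    phi /= phi.sum(axis=1, keepdims=True)

print(np.round(theta, 2))   # each document's fitted doc-topic die
```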
This article is reprinted from: [LDA Mathematical Gossip - 4] Text Modeling. Source: Flickering Firelight.