When you use naive Bayes (NB) to classify documents, you rely on a generative model of the document. By Bayes' formula, $P(c \mid d) \propto P(c)\,P(d \mid c)$, and the right-hand side describes how a document is generated: first select a class $c$, then generate a document $d$ from that class with some probability. $P(c)$ needs little discussion; it follows a categorical distribution (a multinomial distribution with a single trial). $P(d \mid c)$ is more interesting. It is usually modeled with one of two distributions: the multivariate Bernoulli distribution or the multinomial distribution. This post describes the idea behind each model and how the two differ.
Modeling document generation with the multivariate Bernoulli distribution:
Assume the vocabulary is $V$ and contains $M$ terms. Generating a document with the multivariate Bernoulli model can then be seen as the following process:
Iterate through the vocabulary and decide, for each term $t_k$, whether it appears in document $d$. The result is a document.
This way of producing a document is quite crude: it only specifies which terms of the vocabulary the document contains.
The model therefore ignores the number of times a term occurs, the positions of terms, and the correlations between terms (the last of these is the NB independence assumption).
So a document can be represented as a Boolean vector, $d = \langle e_1, e_2, \ldots, e_M \rangle$, where $e_i \in \{0, 1\}$ indicates whether term $t_i$ appears in document $d$.
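The generation process described above can be sketched in a few lines of Python. This is only an illustration: the vocabulary, the class name, and the presence probabilities below are hypothetical values, not taken from any real corpus.

```python
import random

def generate_bernoulli_document(vocabulary, term_probs, rng):
    """Sample one document under the multivariate Bernoulli model.

    Returns the Boolean vector (one entry per vocabulary term)
    and the list of terms that ended up in the document.
    """
    # One independent yes/no decision per vocabulary term.
    vector = [rng.random() < term_probs[term] for term in vocabulary]
    present = [term for term, flag in zip(vocabulary, vector) if flag]
    return vector, present

# Hypothetical vocabulary and presence probabilities for a "sports" class.
vocabulary = ["the", "ball", "goal", "vote"]
sports_probs = {"the": 0.95, "ball": 0.6, "goal": 0.5, "vote": 0.05}

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
vector, present = generate_bernoulli_document(vocabulary, sports_probs, rng)
print(vector, present)
```

Note that the presence probabilities do not have to sum to 1, because each term is decided independently of the others.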
The parameters the model needs to estimate are $P(t_i \mid c)$, the probability that term $t_i$ appears in a document of class $c$.
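Decision rule: choose the class $c$ that maximizes $P(c) \prod_{i:\, e_i = 1} P(t_i \mid c)$ (the NB assumption is applied here: terms are independent given the class).
Note: this decision function differs from the full multivariate Bernoulli likelihood, $P(d \mid c) = \prod_{i=1}^{M} P(t_i \mid c)^{e_i} \, (1 - P(t_i \mid c))^{1 - e_i}$, in that it leaves out the factors for absent terms. Those factors, $1 - P(t_i \mid c)$, depend on the class, so dropping them is an approximation rather than an exact simplification; the full model keeps them when comparing classes for the same document.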
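A minimal sketch of estimation and classification under the multivariate Bernoulli model might look like the following. It makes several assumptions that are not in the text above: add-one smoothing of the estimates, scoring in log space to avoid underflow, a hypothetical `include_absent` flag to switch the absent-term factors on or off, and a tiny invented training set.

```python
import math
from collections import Counter, defaultdict

def train_bernoulli_nb(labeled_docs, vocabulary):
    """Estimate P(c) and P(t|c) from (label, terms) pairs."""
    vocab_set = set(vocabulary)
    docs_per_class = Counter()
    docs_with_term = defaultdict(Counter)
    for label, terms in labeled_docs:
        docs_per_class[label] += 1
        # Count each term at most once per document.
        for term in set(terms) & vocab_set:
            docs_with_term[label][term] += 1
    total = len(labeled_docs)
    priors = {c: n / total for c, n in docs_per_class.items()}
    # Add-one smoothing (an assumption): (docs with t + 1) / (docs in c + 2).
    cond = {
        c: {t: (docs_with_term[c][t] + 1) / (docs_per_class[c] + 2)
            for t in vocabulary}
        for c in docs_per_class
    }
    return priors, cond

def bernoulli_log_score(terms, label, priors, cond, vocabulary,
                        include_absent=True):
    """Log of P(c) times the per-term factors for one class."""
    present = set(terms)
    score = math.log(priors[label])
    for t in vocabulary:
        p = cond[label][t]
        if t in present:
            score += math.log(p)
        elif include_absent:
            score += math.log(1.0 - p)  # factor for an absent term
    return score

def classify_bernoulli(terms, priors, cond, vocabulary, include_absent=True):
    return max(priors, key=lambda c: bernoulli_log_score(
        terms, c, priors, cond, vocabulary, include_absent))

# Invented toy training set, for illustration only.
vocabulary = ["ball", "goal", "team", "vote", "law"]
training_docs = [
    ("sports", ["ball", "goal", "ball"]),
    ("sports", ["goal", "team"]),
    ("politics", ["vote", "law"]),
    ("politics", ["vote", "team"]),
]
priors, cond = train_bernoulli_nb(training_docs, vocabulary)
print(classify_bernoulli(["ball", "team"], priors, cond, vocabulary))
```

Setting `include_absent=False` gives the present-terms-only score, which makes it easy to compare the two variants on the same document.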
Modeling document generation with the multinomial distribution:
Generating a document with the multinomial model can be seen as the following process:
Suppose you have a die on which each face is a term. The faces are not equally likely, and each class has its own die, that is, its own probability distribution over the faces.
Then, for each position in a document $d$, you roll the die. Each roll produces a term with a certain probability, and all the terms produced together form the document.
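The die-rolling process can be sketched with the standard library's `random.choices`. The face probabilities and the document length below are hypothetical values chosen for illustration.

```python
import random

def generate_multinomial_document(face_probs, length, rng):
    """Roll the class-specific die once per position in the document."""
    terms = list(face_probs)
    weights = [face_probs[t] for t in terms]
    # Each position is an independent draw; the same term may repeat.
    return rng.choices(terms, weights=weights, k=length)

# Hypothetical die for a "sports" class; the face probabilities sum to 1.
sports_die = {"the": 0.4, "ball": 0.3, "goal": 0.25, "vote": 0.05}

rng = random.Random(0)  # fixed seed so repeated runs give the same sample
document = generate_multinomial_document(sports_die, 8, rng)
print(document)
```

Unlike the Bernoulli sketch, the output here is a sequence of terms of a chosen length, so term counts are part of what gets generated.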
Generating a document with a multinomial distribution is less crude than generating one with a multivariate Bernoulli distribution, although it still makes simplifications:
The model ignores the differences between positions and the correlations between terms (the NB assumption). That is, a term $t$ appearing at position A is no different from the same term appearing at position B (the bag-of-words model).
The number of times each term occurs is modeled, which the Bernoulli model does not do.
So a document can be viewed as a sequence of terms, $d = \langle t_1, t_2, \ldots, t_{n_d} \rangle$, where $t_k$ is the term at position $k$, $n_d$ is the length of the document, and the same term may appear more than once.
The parameters the model needs to estimate are $P(t_k \mid c)$, the probability of producing term $t_k$ in class $c$. To avoid confusing this parameter with the one in the multivariate Bernoulli model, remember that here it is the probability of one face of the die that generates the document, so the probabilities of all faces sum to 1. In the multivariate Bernoulli model, the parameter is the probability that a term appears in a document at all, and those probabilities need not sum to 1.
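Decision rule: choose the class $c$ that maximizes $P(c) \prod_{k=1}^{n_d} P(t_k \mid c)$, with the product running over all $n_d$ positions of $d$ (the NB assumption is applied here as well: terms are independent given the class).
Note: this decision function differs from the multinomial probability in that the multinomial coefficient is omitted. The coefficient depends only on the term counts of the document, so it is the same constant for every class.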
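A minimal sketch of estimation and classification under the multinomial model could look like this. As assumptions beyond the text, it uses add-one smoothing, works in log space, skips terms outside the vocabulary, and relies on a tiny invented training set.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs, vocabulary):
    """Estimate P(c) and the die-face probabilities P(t|c)."""
    vocab_set = set(vocabulary)
    docs_per_class = Counter()
    term_counts = defaultdict(Counter)
    for label, terms in labeled_docs:
        docs_per_class[label] += 1
        # Every occurrence counts, not just presence.
        term_counts[label].update(t for t in terms if t in vocab_set)
    total = len(labeled_docs)
    priors = {c: n / total for c, n in docs_per_class.items()}
    cond = {}
    for c in docs_per_class:
        tokens_in_class = sum(term_counts[c].values())
        # Add-one smoothing (an assumption) keeps every face nonzero.
        cond[c] = {t: (term_counts[c][t] + 1) / (tokens_in_class + len(vocabulary))
                   for t in vocabulary}
    return priors, cond

def multinomial_log_score(terms, label, priors, cond):
    """log P(c) plus one log P(t|c) per position; coefficient omitted."""
    score = math.log(priors[label])
    for t in terms:
        if t in cond[label]:  # ignore out-of-vocabulary terms
            score += math.log(cond[label][t])
    return score

def classify_multinomial(terms, priors, cond):
    return max(priors, key=lambda c: multinomial_log_score(terms, c, priors, cond))

# Invented toy training set, for illustration only.
vocabulary = ["ball", "goal", "team", "vote", "law"]
training_docs = [
    ("sports", ["ball", "goal", "ball"]),
    ("sports", ["goal", "team"]),
    ("politics", ["vote", "law"]),
    ("politics", ["vote", "team"]),
]
priors, cond = train_multinomial_nb(training_docs, vocabulary)
print(classify_multinomial(["ball", "ball", "team"], priors, cond))
```

Because the score adds one term per position, a word that occurs twice contributes twice, which is exactly what the Bernoulli model leaves out.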
Multivariate Bernoulli vs. multinomial:
So which of the two models is better? Each has its own strengths.
1. The Bernoulli model ignores how many times a term occurs, whereas the multinomial model takes it into account;
2. The Bernoulli model suits short documents, whereas the multinomial model suits longer documents;
3. The Bernoulli model works better with few features, whereas the multinomial model works better with many features;
4. The two models give very different estimates for the term "the":
The fourth point shows the difference between the two models most clearly, and it is easy to understand: the word "the" appears in almost every document, so in the multivariate Bernoulli model its probability is close to 1. In the multinomial model, "the" is just one face of the die. Its probability is large compared with other terms, but it is still only about 0.05.
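The contrast for "the" can be reproduced on a toy corpus. The three documents below are invented for illustration, and both estimates are plain unsmoothed relative frequencies; with so few tokens the multinomial estimate comes out much higher than 0.05, but the gap between the two models is still visible.

```python
# Invented toy corpus: each document is a list of tokens.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "the", "bone"],
    ["a", "bird", "saw", "the", "cat"],
]

def bernoulli_estimate(docs, term):
    """Fraction of documents that contain the term at least once."""
    return sum(term in doc for doc in docs) / len(docs)

def multinomial_estimate(docs, term):
    """Fraction of all token positions occupied by the term."""
    occurrences = sum(doc.count(term) for doc in docs)
    total_tokens = sum(len(doc) for doc in docs)
    return occurrences / total_tokens

print(bernoulli_estimate(docs, "the"))    # 1.0 (appears in all 3 documents)
print(multinomial_estimate(docs, "the"))  # 0.3125 (5 of 16 tokens)
```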