Transferred from: http://www.letiantian.me/2014-10-12-three-models-of-naive-nayes/
Naive Bayes is a very good classifier. A brief introduction to naive Bayes can be found in the earlier article on using a naive Bayes classifier to classify email.
If a sample has n features, denoted by [latex]x_{1},x_{2},...,x_{n}[/latex], then the probability that it belongs to class [latex]y_{k}[/latex], [latex]P(y_{k}|x_{1},x_{2},...,x_{n})[/latex], is:
[latex]
P(y_{k}|x_{1},x_{2},...,x_{n}) = P(y_{k})\prod_{i=1}^{n}P(x_{i}|y_{k})
[/latex]
The values on the right-hand side of the equation can be estimated from the training data. Using this formula we can compute, for a given sample, the probability of it belonging to each class (these probabilities do not necessarily sum to 1), and the sample is assigned to the class with the largest probability.
In general, if a feature [latex]x_{i}[/latex] does not appear in a sample, then [latex]P(x_{i}|y_{k})[/latex] does not participate in the calculation. The Bernoulli model described below is an exception.
The above is the basic content of naive Bayes.
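To make the decision rule concrete, here is a minimal sketch in Python (not from the original article; the priors and conditional probabilities are made-up numbers standing in for values estimated from training data): each class gets a score equal to its prior times the product of the per-feature conditional probabilities, and the class with the largest score wins.

# A minimal sketch of the naive Bayes decision rule.
# The priors and conditional probabilities are made-up numbers, standing in
# for values that would normally be estimated from training data.
priors = {'spam': 0.4, 'ham': 0.6}
cond_prob = {
    'spam': {'buy': 0.3, 'cheap': 0.2},
    'ham':  {'buy': 0.05, 'cheap': 0.01},
}

def predict(features):
    scores = {}
    for label, prior in priors.items():
        score = prior
        for x in features:
            # a word never seen during training contributes no factor here
            score *= cond_prob[label].get(x, 1.0)
        scores[label] = score
    # the scores need not sum to 1; we only compare them
    return max(scores, key=scores.get)

print(predict(['buy', 'cheap']))  # -> 'spam'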
Gaussian model
Some features may be continuous variables, such as a person's height or the length of an object. Such features can be converted to discrete values: for example, if the height is below 160cm, the feature value is 1; between 160cm and 170cm, the feature value is 2; above 170cm, the feature value is 3. Alternatively, the height can be converted into 3 features, namely F1, F2, F3: if the height is below 160cm, the values of these three features are 1, 0, 0; if the height is above 170cm, the values are 0, 0, 1. But these methods are not fine-grained enough; the Gaussian model solves this problem. The Gaussian model assumes that, within each class, the observations of a continuous feature follow a Gaussian distribution, i.e.:
[latex]
P(x_{i}|y_{k}) = \frac{1}{\sqrt{2\pi\sigma_{y_{k}}^{2}}}\exp\left(-\frac{(x_{i}-\mu_{y_{k}})^{2}}{2\sigma_{y_{k}}^{2}}\right)
[/latex]
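As a small illustration of this formula (not part of the original post, and using invented height values), one can estimate a per-class mean and variance for a single continuous feature and evaluate the Gaussian likelihood directly:

import numpy as np

# Invented heights (cm) for two classes, used only to illustrate the formula.
heights = {
    'male':   np.array([170.0, 175.0, 180.0, 168.0]),
    'female': np.array([155.0, 160.0, 165.0, 158.0]),
}

def gaussian_likelihood(x, mu, var):
    # P(x_i | y_k) under the Gaussian assumption
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for label, samples in heights.items():
    mu, var = samples.mean(), samples.var()
    print(label, gaussian_likelihood(172.0, mu, var))  # P(height = 172 | class)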
Here is an example using sklearn:
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris.feature_names  # names of the four features
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> iris.data
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       ...,
       [ 6.5,  3. ,  5.2,  2. ],
       [ 6.2,  3.4,  5.4,  2.3],
       [ 5.9,  3. ,  5.1,  1.8]])  # the type is numpy.array
>>> iris.data.size
600  # 600 / 4 = 150 samples in total
>>> iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='|S10')
>>> iris.target
array([0, 0, 0, ..., 0, 1, 1, ..., 1, 2, 2, ..., 2])
>>> iris.target.size
150
>>> from sklearn.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(iris.data, iris.target)
>>> clf.predict(iris.data[0])
array([0])  # prediction correct
>>> clf.predict(iris.data[149])
array([2])  # prediction correct
>>> import numpy
>>> data = numpy.array([6, 4, 6, 2])
>>> clf.predict(data)
array([2])  # the prediction is reasonable
Multinomial model
This model is commonly used for text classification; the features are words, and the feature values are the number of times each word occurs.
[latex]
P(x_{i}|y_{k}) = \frac{n_{y_{k},x_{i}}+\alpha}{n_{y_{k}}+\alpha N}
[/latex]
where [latex]n_{y_{k},x_{i}}[/latex] is the total number of times the feature [latex]x_{i}[/latex] occurs under category [latex]y_{k}[/latex], and [latex]n_{y_{k}}[/latex] is the total number of occurrences of all features under category [latex]y_{k}[/latex]. In terms of text classification: if the word "word" appears 5 times in a document labeled label1, the value of [latex]n_{label1,word}[/latex] increases by 5; if duplicate words are removed first (each word counted at most once per document), it increases by only 1. [latex]N[/latex] is the number of features, which in text classification is the number of distinct words after de-duplication. [latex]\alpha[/latex] takes a value in the range [0,1]; the most common choice is 1.
A feature [latex]x_{i}[/latex] of the sample to be predicted may not have occurred at all during training. In that case [latex]n_{y_{k},x_{i}}[/latex] is 0, and using it directly would make the computed probability of the sample belonging to that class 0. Adding [latex]\alpha[/latex] to the numerator and [latex]\alpha N[/latex] to the denominator solves this problem.
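A short sketch of the smoothed estimate (with made-up word counts): n_counts[label][word] plays the role of n_{y_k,x_i}, and the de-duplicated vocabulary has size N.

# Made-up word counts per class, illustrating the smoothed estimate
# P(x_i | y_k) = (n_{y_k,x_i} + alpha) / (n_{y_k} + alpha * N).
n_counts = {
    'label1': {'word': 5, 'other': 3},
    'label2': {'word': 0, 'other': 7},
}
vocab = ['word', 'other', 'rare']   # de-duplicated word list, N = 3
alpha = 1.0

def cond_prob(word, label):
    n_yk_xi = n_counts[label].get(word, 0)   # occurrences of the word in this class
    n_yk = sum(n_counts[label].values())     # occurrences of all words in this class
    return (n_yk_xi + alpha) / (n_yk + alpha * len(vocab))

print(cond_prob('word', 'label2'))  # nonzero thanks to the alpha smoothing
print(cond_prob('rare', 'label1'))  # a word unseen in training still gets a small probability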
The following code comes from an example in the sklearn documentation:
>>> import numpy as np
>>> X = np.random.randint(5, size=(6, 100))
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB()
>>> clf.fit(X, y)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
>>> print(clf.predict(X[2]))
[3]
It is important to note that the multinomial model can continue training on additional data after one dataset has been processed; there is no need to combine the two datasets and retrain from scratch. In sklearn, the partial_fit() method of the MultinomialNB class supports this kind of incremental training, which is especially well suited to situations where the training set is too large to fit in memory at once.
All class labels must be provided on the first call to partial_fit().
>>> import numpy
>>> from sklearn.naive_bayes import MultinomialNB
>>> clf = MultinomialNB()
>>> clf.partial_fit(numpy.array([1, 1]), numpy.array(['aa']), ['aa', 'bb'])
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
>>> clf.partial_fit(numpy.array([6, 1]), numpy.array(['bb']))
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
>>> clf.predict(numpy.array([9, 1]))
array(['bb'], dtype='|S2')
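For the out-of-core scenario described above, a hedged sketch of chunked training might look like the following; the batch generator and label set are invented for illustration, while partial_fit() and its classes argument are the sklearn API mentioned above.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def batches():
    # Stand-in for reading chunks of a large dataset from disk.
    for _ in range(3):
        X = np.random.randint(5, size=(100, 20))
        y = np.random.choice(['aa', 'bb'], size=100)
        yield X, y

clf = MultinomialNB()
all_labels = ['aa', 'bb']          # the full label set, required on the first call
for i, (X, y) in enumerate(batches()):
    if i == 0:
        clf.partial_fit(X, y, classes=all_labels)
    else:
        clf.partial_fit(X, y)
print(clf.predict(np.random.randint(5, size=(1, 20))))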
Bernoulli model
In the Bernoulli model, each sample is represented in terms of the global set of features (the full feature vocabulary), not just the features that happen to appear in that sample.
In the Bernoulli model, each feature takes a Boolean value, i.e. true/false or 1/0. In text classification, this corresponds to whether or not a word appears in a document.
If the feature [latex]x_{i}[/latex] takes the value 1, then
[latex]
P(x_{i}|y_{k}) = P(x_{i}=1|y_{k})
[/latex]
If the feature [latex]x_{i}[/latex] takes the value 0, then
[latex]
P(x_{i}|y_{k}) = 1-P(x_{i}=1|y_{k})
[/latex]
This means that "there is no feature" is also a feature. The following example comes from the Sklearn official documentation:
>>> import numpy as np
>>> X = np.random.randint(2, size=(6, 100))
>>> Y = np.array([1, 2, 3, 4, 4, 5])
>>> from sklearn.naive_bayes import BernoulliNB
>>> clf = BernoulliNB()
>>> clf.fit(X, Y)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
>>> print(clf.predict(X[2]))
[3]
The BernoulliNB class also provides a partial_fit() method.
Application of the multinomial and Bernoulli models in text classification
A good explanation is given in the article "Text classification algorithm based on naive Bayes".
In the multinomial model:
Let a document be d = (t1, t2, ..., tk), where the tk are the words appearing in the document, with repetition allowed. Then:
Prior probability P(c) = total number of words under class c / total number of words in the entire training sample
Class-conditional probability P(tk|c) = (sum of the number of times the word tk occurs across the documents of class c + 1) / (total number of words under class c + |V|)
V is the vocabulary of the training sample (i.e. the set of distinct words; a word that appears multiple times is counted only once), and |V| is the number of distinct words the training sample contains. P(tk|c) can be viewed as the amount of evidence the word tk provides that d belongs to class c, while P(c) can be viewed as the share of class c in the overall corpus (how likely the class is a priori).
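As a sketch with a tiny invented two-class corpus, the multinomial estimates above could be computed like this:

# Tiny invented corpus: each document is a list of tokens, repetition allowed.
corpus = {
    'sports': [['ball', 'goal', 'ball'], ['goal', 'team']],
    'tech':   [['code', 'ball', 'code']],
}
vocab = {w for docs in corpus.values() for d in docs for w in d}  # de-duplicated, |V| = 4
total_words = sum(len(d) for docs in corpus.values() for d in docs)

def prior(c):
    # P(c) = total number of words under class c / words in the whole training sample
    return sum(len(d) for d in corpus[c]) / total_words

def cond(tk, c):
    # P(tk|c) = (occurrences of tk in class-c documents + 1) / (words in class c + |V|)
    count = sum(d.count(tk) for d in corpus[c])
    n_c = sum(len(d) for d in corpus[c])
    return (count + 1) / (n_c + len(vocab))

print(prior('sports'), cond('ball', 'sports'), cond('code', 'sports'))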
In the Bernoulli model:
P(c) = number of documents under class c / total number of documents in the entire training sample
P(tk|c) = (number of documents under class c that contain the word tk + 1) / (total number of documents under class c + 2)
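Using the same invented corpus as in the previous sketch, the Bernoulli (document-level) estimates would be:

# Same tiny invented corpus, but the Bernoulli model counts documents, not words.
corpus = {
    'sports': [['ball', 'goal', 'ball'], ['goal', 'team']],
    'tech':   [['code', 'ball', 'code']],
}
total_docs = sum(len(docs) for docs in corpus.values())

def prior(c):
    # P(c) = number of class-c documents / total number of documents
    return len(corpus[c]) / total_docs

def cond(tk, c):
    # P(tk|c) = (class-c documents containing tk + 1) / (class-c documents + 2)
    containing = sum(1 for d in corpus[c] if tk in d)
    return (containing + 1) / (len(corpus[c]) + 2)

print(prior('sports'), cond('goal', 'sports'), cond('code', 'sports'))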