Reprinted by the author:
By: 88250
Blog: http://blog.csdn.net/dl88250
MSN & Email & QQ: DL88250@gmail.com
Author: ZY
Blog: http://blog.csdn.net/zyofprogrammer
By Sindy
E-mail: sindybanana@gmail.com
Part 1
The efficiency problem was solved last time, and many bugs were fixed. However, after reading some more material, I found a new theoretical problem.
Theoretical Problems
Naive Bayes text classification models are divided into two types:
- Document Type
- Word Frequency Type
Both use the following formula for classification:
c_{NB} = \arg\max_{c_j} \Big( P(c_j) \prod_i P(x_i \mid c_j) \Big)
where P(c_j) is the prior probability of class c_j, and P(x_i \mid c_j) is the class-conditional probability of feature x_i in class c_j.
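For concreteness, here is a minimal Java sketch of this decision rule; it is not our actual implementation. The maps of priors and class-conditional probabilities are hypothetical placeholders that would be filled in by the training formulas described below, and log-probabilities are summed instead of multiplying raw probabilities to avoid floating-point underflow on long documents.

```java
import java.util.List;
import java.util.Map;

// Minimal sketch of c_NB = argmax_j P(c_j) * prod_i P(x_i | c_j).
// "priors" and "condProbs" are hypothetical placeholders for values produced
// by the training step; logs are summed to avoid floating-point underflow.
public class NaiveBayesDecision {

    public static String classify(List<String> documentWords,
                                   Map<String, Double> priors,
                                   Map<String, Map<String, Double>> condProbs) {
        String bestClass = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : priors.entrySet()) {
            String c = entry.getKey();
            double score = Math.log(entry.getValue());       // log P(c_j)
            Map<String, Double> probsForClass = condProbs.get(c);
            for (String word : documentWords) {
                Double p = probsForClass.get(word);          // P(x_i | c_j)
                if (p != null) {
                    score += Math.log(p);
                }
            }
            if (score > bestScore) {
                bestScore = score;
                bestClass = c;
            }
        }
        return bestClass;
    }
}
```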
The previous classifier used the document model, and its accuracy was only about 50%. In theory, Naive Bayes classification can reach an accuracy above 80%. The low accuracy of the document model is mainly due to the poor quality of the texts in the training corpus; we are now collecting training data to improve it.
Prior Probability Calculation
There are two ways to calculate the prior probability:
- Document Type
The frequency with which words appear in each class's documents is not considered; only document counts are used. The calculation is as follows:
P(c_j) = N(c = c_j) / N
where N(c = c_j) is the number of training texts in class c_j and N is the total number of training texts.
- Word Frequency Type
The frequency with which words appear in each class's documents is considered. The formula is as follows:
P(c_j) = \frac{\sum_{k=1}^{V} TF(x = x_k, c = c_j)}{\sum_{m=1}^{W} \sum_{k=1}^{V} TF(x = x_k, c = c_m)}
where V is the total number of words (attributes) in the feature vocabulary, TF(x = x_k, c = c_j) is the total number of times attribute x_k appears in class c_j, and W is the total number of classes. (A small code sketch of both prior formulas follows the note below.)
Note: the class-conditional probability must be calculated with the same model as the prior probability. If the prior probability is calculated with the document model, the class-conditional probability must also use the document model, and likewise for the word-frequency model.
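As a rough sketch (again, not our actual code), the two prior formulas could be computed as follows in Java. The maps docCount and termFreq are hypothetical placeholders standing for counts extracted from the preprocessed training corpus.

```java
import java.util.Map;

// Minimal sketch of the two prior-probability formulas above.
// docCount maps each class to its number of training texts (document model);
// termFreq maps each class to a table of TF(x_k, c_j) values (word-frequency model).
// Both maps are hypothetical placeholders for the preprocessed training data.
public class PriorProbability {

    // Document model: P(c_j) = N(c = c_j) / N
    public static double documentPrior(String cj, Map<String, Integer> docCount) {
        int total = docCount.values().stream().mapToInt(Integer::intValue).sum();
        return (double) docCount.get(cj) / total;
    }

    // Word-frequency model: P(c_j) = sum_k TF(x_k, c_j) / sum_m sum_k TF(x_k, c_m)
    public static double wordFrequencyPrior(String cj, Map<String, Map<String, Integer>> termFreq) {
        long inClass = termFreq.get(cj).values().stream().mapToLong(Integer::longValue).sum();
        long overall = termFreq.values().stream()
                .flatMap(m -> m.values().stream())
                .mapToLong(Integer::longValue).sum();
        return (double) inClass / overall;
    }
}
```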
Class-Conditional Probability Calculation
The class-conditional probability can be calculated in two ways:
- Document Type
The frequency of a word within a document is not considered; only whether the word appears in the document matters (0 means it does not appear, 1 means it appears). The calculation is as follows:
P(x_i \mid c_j) = \frac{N(x = x_i, c = c_j) + 1}{N(c = c_j) + V}
where N(x = x_i, c = c_j) is the number of training texts in class c_j that contain attribute x_i, N(c = c_j) is the number of training texts in class c_j, and V is the total number of categories.
- Word Frequency Type
The frequency of words appearing in the documents is considered. The formula is as follows:
P(x_i \mid c_j) = \frac{TF(x = x_i, c = c_j) + 1}{V + \sum_{k=1}^{V} TF(x = x_k, c = c_j)}
where V is the total number of words (attributes) in the feature vocabulary, and TF(x = x_i, c = c_j) is the total number of times attribute x_i appears in class c_j. (A code sketch of this formula follows the notes below.)
Notes:
- The class-conditional probability must be calculated with the same model as the prior probability. If the prior probability is calculated with the document model, the class-conditional probability must also use the document model, and likewise for the word-frequency model.
- Laplace (add-one) estimation is used so that no class-conditional probability comes out as zero.
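Below is a minimal sketch of the word-frequency class-conditional probability with Laplace smoothing. The table termFreqForClass and the parameter vocabularySize are hypothetical placeholders; the document-model variant would be analogous, with document counts in place of term frequencies.

```java
import java.util.Map;

// Minimal sketch of the word-frequency class-conditional probability with
// Laplace (add-one) smoothing:
// P(x_i | c_j) = (TF(x_i, c_j) + 1) / (V + sum_k TF(x_k, c_j)).
// termFreqForClass is a hypothetical TF table for one class;
// vocabularySize is V, the number of words in the feature vocabulary.
public class ClassConditionalProbability {

    public static double wordFrequencyCondProb(String xi,
                                                Map<String, Integer> termFreqForClass,
                                                int vocabularySize) {
        int tf = termFreqForClass.getOrDefault(xi, 0);
        long totalTf = termFreqForClass.values().stream().mapToLong(Integer::longValue).sum();
        // The +1 in the numerator and the +V term in the denominator are the
        // Laplace correction, so unseen words still get a small non-zero probability.
        return (tf + 1.0) / (vocabularySize + totalTf);
    }
}
```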
Preprocessing of the training database
To improve classification efficiency and accuracy, the training corpus must be preprocessed. The main steps are as follows (a code sketch follows this list):
- Read all training texts under a given category
- Segment these texts into words
- Filter out useless words by part of speech and term length
- Save the remaining words as the feature set of this category, stored as a text file
Currently, the training-corpus preprocessor is mainly used for the word-frequency classification model.
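A rough sketch of these preprocessing steps might look like the following. The Segmenter interface and the isUseful filter are placeholders, since the actual segmentation component (and its part-of-speech tags) has not been settled yet, as discussed in the next section.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the preprocessing steps listed above. The Segmenter
// interface and the isUseful filter are hypothetical placeholders: the real
// segmentation component and the part-of-speech / length rules depend on the
// tool finally chosen (e.g. ICTCLAS, whose API is not shown here).
public class TrainingCorpusPreprocessor {

    public interface Segmenter {
        List<String> segment(String text);  // split Chinese text into words
    }

    // Keep only words long enough to be informative; part-of-speech filtering
    // would be added here once the segmenter provides tags.
    private static boolean isUseful(String word) {
        return word.length() >= 2;
    }

    public static void preprocessCategory(Path categoryDir, Path outputFile, Segmenter segmenter)
            throws IOException {
        List<String> features = new ArrayList<>();
        try (DirectoryStream<Path> texts = Files.newDirectoryStream(categoryDir)) {
            for (Path text : texts) {
                String content = new String(Files.readAllBytes(text), StandardCharsets.UTF_8);
                for (String word : segmenter.segment(content)) {
                    if (isUseful(word)) {
                        features.add(word);   // remaining words become this class's features
                    }
                }
            }
        }
        Files.write(outputFile, features, StandardCharsets.UTF_8);  // save as text
    }
}
```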
Current technical problems
The word-frequency classifier is now working well, but one technical problem remains: Chinese word segmentation in Java. The segmentation component we used originally was easy to use and segmented well, but it provided no part-of-speech tagging. ZY is studying the ICTCLAS segmentation component from the Chinese Academy of Sciences. The trial application for ICTCLAS 3.0 was sent to its author three days ago, with no reply so far, and the JNI calls in version 1.0 are also quite troublesome.
The next article will evaluate our Naive Bayes classifier. Stay tuned.