Research and Implementation of a Naive Bayes Chinese Text Classifier (2) [88250, ZY, Sindy original]

When reprinting, please retain the author information:

By: 88250

Blog: http://blog.csdn.net/dl88250

MSN & Email & QQ: DL88250@gmail.com

Author: ZY

Blog: http://blog.csdn.net/zyofprogrammer

Author: Sindy

E-mail: sindybanana@gmail.com

Part 1

The efficiency problem was solved last time, and many bugs were fixed. However, after reading more of the literature, I found a new theoretical problem.

Theoretical Problems

Naive Bayes text classification models are divided into two types:

  • Document Type
  • Word Frequency Type

Both use the following formula for classification:

    c_NB = argmax_{c_j ∈ C} P(c_j) · ∏_i P(x_i | c_j)
    where P(c_j) is the prior probability of class c_j, and P(x_i | c_j) is the class-conditional probability of feature x_i in class c_j.
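
To make the decision rule concrete, here is a minimal Java sketch of it (an illustration only, not the classifier built in this series; the map-based structures and names are assumptions). It scores each class in log space, because multiplying many small probabilities directly would underflow:

    import java.util.List;
    import java.util.Map;

    public class NaiveBayesRule {
        /**
         * Picks the class c_j maximizing log P(c_j) + Σ_i log P(x_i | c_j),
         * the log form of the formula above. 'priors' maps class -> P(c_j);
         * 'conditionals' maps class -> (word -> P(x_i | c_j)); both are
         * assumed to be precomputed from the training corpus.
         */
        public static String classify(List<String> docWords,
                                      Map<String, Double> priors,
                                      Map<String, Map<String, Double>> conditionals) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Double> entry : priors.entrySet()) {
                String c = entry.getKey();
                double score = Math.log(entry.getValue());
                Map<String, Double> cond = conditionals.get(c);
                for (String word : docWords) {
                    Double p = cond.get(word);
                    if (p != null) {          // words outside the feature
                        score += Math.log(p); // vocabulary are skipped
                    }
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = c;
                }
            }
            return best;
        }
    }

Taking logarithms does not change the argmax, since log is monotonically increasing.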

The previous article's classifier used the document model, and its accuracy was only about 50%. In theory, Naive Bayes classification can reach an accuracy above 80%. The document model's low accuracy here is mainly due to the low quality of the texts in the training corpus; we are already collecting new training data to improve the corpus quality.

Prior Probability Calculation

There are two ways to calculate the prior probability (a worked example follows the list):

  • Document Type
  • Considers only the number of training documents in each category, not word frequencies. The calculation is:
    P(c_j) = N(C = c_j) / N
    where N(C = c_j) is the number of training documents in class c_j, and N is the total number of training documents.

  • Word Frequency Type
  • Considers how often vocabulary words appear in each category's documents. The formula is:
    P(c_j) = Σ_{k=1}^{V} TF(x = x_k, c = c_j) / Σ_{m=1}^{W} Σ_{k=1}^{V} TF(x = x_k, c = c_m)
    where V is the total number of words (features) in the feature vocabulary, TF(x = x_i, c = c_j) is the total number of times word x_i appears in class c_j, and W is the total number of classes.

Note: the class-conditional probability must be calculated with the same model as the prior probability. If the prior is calculated with the document model, the class-conditional probability must also use the document model, and likewise for the word frequency model.

Class-Conditional Probability

The class-conditional probability can also be calculated in two ways (a worked example follows the list):

  • Document Type
  • Ignores how often a word occurs within a document and records only whether it occurs at all: 0 means absent, 1 means present. The calculation is:
    P(x_i | c_j) = (N(x = x_i, c = c_j) + 1) / (N(C = c_j) + M)
    where N(x = x_i, c = c_j) is the number of training documents in class c_j that contain feature x_i, N(C = c_j) is the number of training documents in class c_j, and M is the total number of classes.

  • Word Frequency Type
  • Considers how often words appear in the category's documents, as follows:
    P(x_i | c_j) = (TF(x = x_i, c = c_j) + 1) / (V + Σ_{k=1}^{V} TF(x = x_k, c = c_j))
    where V is the total number of words (features) in the feature vocabulary, and TF(x = x_i, c = c_j) is the total number of times word x_i appears in class c_j.
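
Again with made-up counts: under the document model, if class c_j has N(C = c_j) = 200 training documents, 40 of which contain word x_i, and there are M = 10 classes, then P(x_i | c_j) = (40 + 1)/(200 + 10) ≈ 0.195. Under the word frequency model, if x_i occurs 15 times in c_j, the vocabulary has V = 10,000 words, and vocabulary words occur 50,000 times in total in c_j, then P(x_i | c_j) = (15 + 1)/(10000 + 50000) ≈ 2.7 × 10⁻⁴. Note the +1 in each numerator: without it, a single word that never occurs in c_j would make the entire product for c_j zero.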

Note:

  • The class-conditional probability must be calculated with the same model as the prior probability: a document-model prior requires document-model class-conditional probabilities, and likewise for the word frequency model.
  • Laplace (add-one) estimation is used so that no class-conditional probability is ever zero.

Preprocessing of the Training Corpus

To improve classification efficiency and accuracy, the training corpus must be preprocessed. The main steps are as follows (a code sketch follows the list):

  1. Read all training texts under a category.
  2. Segment these texts into words.
  3. Filter out useless words by part of speech and word length.
  4. Save the remaining words as that category's feature set, stored as a text file.
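
A minimal sketch of these four steps in Java, assuming UTF-8 text files grouped one directory per category. The segmenter is a placeholder: segmentWithPos and TaggedWord are hypothetical stand-ins for whatever segmentation component is plugged in, and the "noun/verb, length ≥ 2" rule is an assumed instance of filtering by word class and term length:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;
    import java.util.stream.Stream;

    public class TrainingCorpusPreprocessor {

        /** Hypothetical word/part-of-speech pair produced by a segmenter. */
        record TaggedWord(String word, String pos) {}

        /** Placeholder for a Chinese word segmenter with POS tagging. */
        static List<TaggedWord> segmentWithPos(String text) {
            throw new UnsupportedOperationException("plug in a real segmenter");
        }

        /** Steps 1-4: read all texts of one category, segment, filter, save. */
        static void preprocessCategory(Path categoryDir, Path featureFile) throws IOException {
            Set<String> features = new TreeSet<>();
            try (Stream<Path> files = Files.list(categoryDir)) {
                for (Path file : (Iterable<Path>) files::iterator) {
                    String text = Files.readString(file, StandardCharsets.UTF_8);  // step 1
                    for (TaggedWord tw : segmentWithPos(text)) {                   // step 2
                        // step 3: keep content words (nouns/verbs) of length >= 2
                        if (tw.word().length() >= 2
                                && (tw.pos().startsWith("n") || tw.pos().startsWith("v"))) {
                            features.add(tw.word());
                        }
                    }
                }
            }
            Files.write(featureFile, features, StandardCharsets.UTF_8);            // step 4
        }
    }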

Currently, the corpus preprocessor is used mainly for the word frequency classification model.

Current Technical Problems

The word frequency model now works well, but one technical problem remains: Chinese word segmentation in Java. The segmentation component we used originally was easy to use and segmented well, but it provided no part-of-speech tagging. ZY is studying the ICTCLAS segmentation component from the Chinese Academy of Sciences; we sent a trial application for ICTCLAS 3.0 to its author three days ago and have received no reply. The JNI calls in version 1.0 are also very troublesome...

The next article will evaluate our Naive Bayes classifier. Stay tuned.
