Reprinted by the author:
By: 88250
Blog: http://blog.csdn.net/dl88250
MSN & Email & QQ: DL88250@gmail.com
Author: ZY
Blog: http://blog.csdn.net/zyofprogrammer
By Sindy
E-mail: sindybanana@gmail.com
Part 1
The efficiency problem was solved last time, and many bugs were fixed. However, after reading some more material, I found a new theoretical problem.
Theoretical Problems
Naive Bayes text classification models are divided into two types:
- Document Type
- Word Frequency Type
Both use the following formula for classification:
c_{NB} = \arg\max_{c_j} \Big( P(c_j) \prod_i P(x_i \mid c_j) \Big)
where P(c_j) is the prior probability of class c_j, and P(x_i \mid c_j) is the class-conditional probability of feature x_i in class c_j.
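For concreteness, here is a minimal Java sketch of this decision rule; it is not our actual implementation. The maps of priors and class-conditional probabilities are hypothetical placeholders that would be filled in by the training formulas described below, and log-probabilities are summed instead of multiplying raw probabilities to avoid floating-point underflow on long documents.

```java
import java.util.List;
import java.util.Map;

// Minimal sketch of c_NB = argmax_j P(c_j) * prod_i P(x_i | c_j).
// "priors" and "condProbs" are hypothetical placeholders for values produced
// by the training step; logs are summed to avoid floating-point underflow.
public class NaiveBayesDecision {

    public static String classify(List<String> documentWords,
                                   Map<String, Double> priors,
                                   Map<String, Map<String, Double>> condProbs) {
        String bestClass = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : priors.entrySet()) {
            String c = entry.getKey();
            double score = Math.log(entry.getValue());       // log P(c_j)
            Map<String, Double> probsForClass = condProbs.get(c);
            for (String word : documentWords) {
                Double p = probsForClass.get(word);          // P(x_i | c_j)
                if (p != null) {
                    score += Math.log(p);
                }
            }
            if (score > bestScore) {
                bestScore = score;
                bestClass = c;
            }
        }
        return bestClass;
    }
}
```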
The previous classifier used the document model, and its accuracy was only about 50%. In theory, Naive Bayes classification can reach an accuracy above 80%. The low accuracy of the document model is mainly due to the poor quality of the texts in the training corpus; we are now collecting training data to improve it.
Prior Probability Calculation
There are two ways to calculate the prior probability:
- Document Type
The frequency with which words appear in each class's documents is not considered; only document counts are used. The calculation is as follows:
P(c_j) = N(c = c_j) / N
where N(c = c_j) is the number of training texts in class c_j and N is the total number of training texts.
- Word Frequency Type
The frequency with which words appear in each class's documents is considered. The formula is as follows:
P(c_j) = \frac{\sum_{k=1}^{V} TF(x = x_k, c = c_j)}{\sum_{m=1}^{W} \sum_{k=1}^{V} TF(x = x_k, c = c_m)}
where V is the total number of words (attributes) in the feature vocabulary, TF(x = x_k, c = c_j) is the total number of times attribute x_k appears in class c_j, and W is the total number of classes. (A small code sketch of both prior formulas follows the note below.)
Note: the class-conditional probability must be calculated with the same model as the prior probability. If the prior probability is calculated with the document model, the class-conditional probability must also use the document model, and likewise for the word-frequency model.
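As a rough sketch (again, not our actual code), the two prior formulas could be computed as follows in Java. The maps docCount and termFreq are hypothetical placeholders standing for counts extracted from the preprocessed training corpus.

```java
import java.util.Map;

// Minimal sketch of the two prior-probability formulas above.
// docCount maps each class to its number of training texts (document model);
// termFreq maps each class to a table of TF(x_k, c_j) values (word-frequency model).
// Both maps are hypothetical placeholders for the preprocessed training data.
public class PriorProbability {

    // Document model: P(c_j) = N(c = c_j) / N
    public static double documentPrior(String cj, Map<String, Integer> docCount) {
        int total = docCount.values().stream().mapToInt(Integer::intValue).sum();
        return (double) docCount.get(cj) / total;
    }

    // Word-frequency model: P(c_j) = sum_k TF(x_k, c_j) / sum_m sum_k TF(x_k, c_m)
    public static double wordFrequencyPrior(String cj, Map<String, Map<String, Integer>> termFreq) {
        long inClass = termFreq.get(cj).values().stream().mapToLong(Integer::longValue).sum();
        long overall = termFreq.values().stream()
                .flatMap(m -> m.values().stream())
                .mapToLong(Integer::longValue).sum();
        return (double) inClass / overall;
    }
}
```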
Class-Conditional Probability Calculation
The class-conditional probability can be calculated in two ways:
- Document Type
The frequency of a word within a document is not considered; only whether the word appears in the document matters (0 means it does not appear, 1 means it appears). The calculation is as follows:
P(x_i \mid c_j) = \frac{N(x = x_i, c = c_j) + 1}{N(c = c_j) + V}
where N(x = x_i, c = c_j) is the number of training texts in class c_j that contain attribute x_i, N(c = c_j) is the number of training texts in class c_j, and V is the total number of categories.
- Word Frequency Type
The frequency of words appearing in the documents is considered. The formula is as follows:
P(x_i \mid c_j) = \frac{TF(x = x_i, c = c_j) + 1}{V + \sum_{k=1}^{V} TF(x = x_k, c = c_j)}
where V is the total number of words (attributes) in the feature vocabulary, and TF(x = x_i, c = c_j) is the total number of times attribute x_i appears in class c_j. (A code sketch of this formula follows the notes below.)
Notes:
- The class-conditional probability must be calculated with the same model as the prior probability. If the prior probability is calculated with the document model, the class-conditional probability must also use the document model, and likewise for the word-frequency model.
- Laplace (add-one) estimation is used so that no class-conditional probability comes out as zero.
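Below is a minimal sketch of the word-frequency class-conditional probability with Laplace smoothing. The table termFreqForClass and the parameter vocabularySize are hypothetical placeholders; the document-model variant would be analogous, with document counts in place of term frequencies.

```java
import java.util.Map;

// Minimal sketch of the word-frequency class-conditional probability with
// Laplace (add-one) smoothing:
// P(x_i | c_j) = (TF(x_i, c_j) + 1) / (V + sum_k TF(x_k, c_j)).
// termFreqForClass is a hypothetical TF table for one class;
// vocabularySize is V, the number of words in the feature vocabulary.
public class ClassConditionalProbability {

    public static double wordFrequencyCondProb(String xi,
                                                Map<String, Integer> termFreqForClass,
                                                int vocabularySize) {
        int tf = termFreqForClass.getOrDefault(xi, 0);
        long totalTf = termFreqForClass.values().stream().mapToLong(Integer::longValue).sum();
        // The +1 in the numerator and the +V term in the denominator are the
        // Laplace correction, so unseen words still get a small non-zero probability.
        return (tf + 1.0) / (vocabularySize + totalTf);
    }
}
```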
Preprocessing of the training database
To improve classification efficiency and accuracy, the training corpus must be preprocessed. The main steps are as follows (a code sketch follows this list):
- Read all training texts under a given category
- Segment these texts into words
- Filter out useless words by part of speech and term length
- Save the remaining words as the feature set of this category, stored as a text file
Currently, the training-corpus preprocessor is mainly used for the word-frequency classification model.
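A rough sketch of these preprocessing steps might look like the following. The Segmenter interface and the isUseful filter are placeholders, since the actual segmentation component (and its part-of-speech tags) has not been settled yet, as discussed in the next section.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the preprocessing steps listed above. The Segmenter
// interface and the isUseful filter are hypothetical placeholders: the real
// segmentation component and the part-of-speech / length rules depend on the
// tool finally chosen (e.g. ICTCLAS, whose API is not shown here).
public class TrainingCorpusPreprocessor {

    public interface Segmenter {
        List<String> segment(String text);  // split Chinese text into words
    }

    // Keep only words long enough to be informative; part-of-speech filtering
    // would be added here once the segmenter provides tags.
    private static boolean isUseful(String word) {
        return word.length() >= 2;
    }

    public static void preprocessCategory(Path categoryDir, Path outputFile, Segmenter segmenter)
            throws IOException {
        List<String> features = new ArrayList<>();
        try (DirectoryStream<Path> texts = Files.newDirectoryStream(categoryDir)) {
            for (Path text : texts) {
                String content = new String(Files.readAllBytes(text), StandardCharsets.UTF_8);
                for (String word : segmenter.segment(content)) {
                    if (isUseful(word)) {
                        features.add(word);   // remaining words become this class's features
                    }
                }
            }
        }
        Files.write(outputFile, features, StandardCharsets.UTF_8);  // save as text
    }
}
```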
Current technical problems
The word-frequency classifier is now working well, but one technical problem remains: Chinese word segmentation in Java. The segmentation component we used originally was easy to use and segmented well, but it provided no part-of-speech tagging. ZY is studying the ICTCLAS segmentation component from the Chinese Academy of Sciences. The trial application for ICTCLAS 3.0 was sent to its author three days ago, with no reply so far, and the JNI calls in version 1.0 are also quite troublesome.
The next article will evaluate our Naive Bayes classifier. Stay tuned.