1. Preface
Manually labeling large volumes of text for classification is tedious and time-consuming, whereas unlabeled text, such as the vast amount available on the Internet, is easy and inexpensive to obtain. The following sections show how semi-supervised learning with the expectation-maximization (EM) algorithm can exploit a large pool of unlabeled samples to improve text classification accuracy. This article uses multinomial naive Bayes as the classifier and trains it with EM on both labeled and unlabeled data. It studies how multi-class classification accuracy varies with the proportion of unlabeled data in the training set, and explores ways to reduce the computational cost of EM in order to speed up training. The results show that the semi-supervised EM-NB classifier exceeds 50% accuracy when only 2% of the training data is labeled, and exceeds 70% accuracy when 33% is labeled. This article is based on Appendix 1 in the References, where the detailed code and explanation can be found via the links.
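To make the training scheme concrete, here is a minimal sketch of EM with multinomial naive Bayes on a toy corpus. It is not the code from Appendix 1: the function names (`train_nb`, `posterior`, `em_nb`, `predict`), the smoothing parameter `alpha`, the fixed iteration count, and the four-word toy vocabulary are all assumptions made for illustration. Documents are represented as word-id-to-count dictionaries.

```python
import math

def train_nb(docs, resp, n_classes, vocab_size, alpha=1.0):
    """M-step: fit log priors and log word probabilities from (soft) labels.

    docs: list of {word_id: count}; resp: per-document list of class weights.
    alpha is an additive (Laplace) smoothing constant, assumed here.
    """
    class_w = [alpha] * n_classes
    word_w = [[alpha] * vocab_size for _ in range(n_classes)]
    for doc, r in zip(docs, resp):
        for c in range(n_classes):
            class_w[c] += r[c]
            for w, n in doc.items():
                word_w[c][w] += r[c] * n  # fractional counts for soft labels
    total = sum(class_w)
    log_prior = [math.log(x / total) for x in class_w]
    log_word = [[math.log(x / sum(row)) for x in row] for row in word_w]
    return log_prior, log_word

def posterior(doc, log_prior, log_word):
    """E-step for one document: P(class | doc), normalized in log space."""
    scores = [lp + sum(n * lw[w] for w, n in doc.items())
              for lp, lw in zip(log_prior, log_word)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def em_nb(labeled, labels, unlabeled, n_classes, vocab_size, n_iter=10):
    """Train on labeled data, then alternate E- and M-steps over all data."""
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    model = train_nb(labeled, hard, n_classes, vocab_size)
    for _ in range(n_iter):
        soft = [posterior(d, *model) for d in unlabeled]        # E-step
        model = train_nb(labeled + unlabeled, hard + soft,
                         n_classes, vocab_size)                 # M-step
    return model

def predict(doc, model):
    """Return the index of the most probable class for a document."""
    post = posterior(doc, *model)
    return max(range(len(post)), key=post.__getitem__)

# Toy vocabulary (hypothetical): 0="ball", 1="goal", 2="vote", 3="law"
labeled = [{0: 2, 1: 1}, {2: 2, 3: 1}]
labels = [0, 1]
unlabeled = [{0: 1, 1: 2}, {1: 3}, {2: 1, 3: 2}, {3: 3}]
model = em_nb(labeled, labels, unlabeled, n_classes=2, vocab_size=4)
```

Calling `predict({1: 2}, model)` then classifies a new document using parameters that were shaped by both the two labeled and the four unlabeled documents. The labeled documents keep their hard labels in every M-step, while the unlabeled ones contribute fractional counts weighted by their current posteriors. A real implementation would typically stop on convergence of the log-likelihood rather than after a fixed number of iterations.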
2. Introduction to the Model
3. Key Code Implementation
X. References
Appendix 1: Text Classification Using EM and Semi-Supervised Learning
A detailed walkthrough of a semi-supervised learning method that applies the EM algorithm to naive Bayes text classification