NBSVM
Naive Bayes (NB) and Support Vector Machines (SVM) are basic models commonly used in text classification. Their relative performance varies with the dataset, the features, and the parameters. In general, NB does better on short text, while SVM performs better on long text.
NBSVM, proposed in the paper "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification", is a classification method that combines NB and SVM. The principle can be summed up in one sentence:
"Trust NB unless SVM is very confident"
The authors experimented on several datasets and obtained better results than NB or SVM alone; the detailed results and derivation can be found in the original paper.
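For reference, the mechanism behind that sentence in the paper is an interpolation between the weights learned on NB-scaled features and their mean magnitude. The snippet below is only a sketch of that formula with placeholder values for w and beta; it is not part of the demo implemented in this article.

import numpy as np

# Sketch of w' = (1 - beta) * w_bar + beta * w from the paper; w and beta are
# placeholder values, not learned here.
w = np.array([0.2, -1.5, 3.0, 0.1])   # weights a linear classifier might learn on NB-scaled features
beta = 0.25                            # interpolation parameter
w_bar = np.abs(w).mean()               # mean magnitude of the weights
w_prime = (1 - beta) * w_bar + beta * w
# Weights the classifier is not confident about shrink toward w_bar, so the
# NB evidence already baked into the scaled features dominates for them.
print(w_prime)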
The authors also provide MATLAB code; this article uses Python to build a simple demo implementation of NBSVM.
Problem description
The task is a small Chinese news text classification problem. The training set consists of 11 categories of news; the text has already been word-segmented, each category contains 1600 documents, and the labels run from 1 to 11. The test set consists of 11*160 documents. The experimental data and the comparison results come from a Zhihu column; see the references for the address.
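As an illustration, a minimal way to read such pre-segmented data could look like the following; the directory layout and file names are assumptions for the sketch, not the original dataset's format.

import os

# Hypothetical layout: one UTF-8 file per document, grouped into per-class
# folders named "1" ... "11"; each file holds space-separated, segmented words.
def load_corpus(root):
    texts, labels = [], []
    for label in sorted(os.listdir(root)):
        class_dir = os.path.join(root, label)
        for name in os.listdir(class_dir):
            with open(os.path.join(class_dir, name), encoding='utf-8') as f:
                texts.append(f.read().strip())
            labels.append(int(label))
    return texts, labels

train_x, train_y = load_corpus('data/train')   # hypothetical paths
test_x, test_y = load_corpus('data/test')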
Algorithm steps
- Vectorization and TF-IDF feature extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_v0 = CountVectorizer()
counts_all = count_v0.fit_transform(train_x + test_x)

count_v1 = CountVectorizer(vocabulary=count_v0.vocabulary_)
counts_train = count_v1.fit_transform(train_x)
print("the shape of train is " + repr(counts_train.shape))

count_v2 = CountVectorizer(vocabulary=count_v0.vocabulary_)
counts_test = count_v2.fit_transform(test_x)
print("the shape of test is " + repr(counts_test.shape))

tfidftransformer = TfidfTransformer()
train_x = tfidftransformer.fit(counts_train).transform(counts_train)
test_x = tfidftransformer.fit(counts_test).transform(counts_test)
- Calculate the log-count ratio
To simplify the problem, we turn the n-class problem into several binary classification problems, so the labels need to be converted to a one-hot (one-vs-rest) form in which the binary label y_i takes the values 1 and 0; an end-to-end sketch of this wrapping is given after the list.
import numpy as np
from scipy import sparse

def pr(x, y_i, y):
    p = x[y == y_i].sum(0)                    # sum feature counts over the documents of one class (axis 0)
    return (p + 1) / ((y == y_i).sum() + 1)   # add-one smoothing

r = sparse.csr_matrix(np.log(pr(x, 1, y) / pr(x, 0, y)))
- Use the elementwise product x ⊙ r instead of the original training data x as the input to the linear classifier; this product is the NBSVM training data.
x_nb = x.multiply(r)
- Train the classifier on the new training data and the training labels; a logistic regression (LR) classifier is used here.
clf = LogisticRegression(C=1, dual=False, n_jobs=1).fit(x_nb, y)
- To predict on the test set, apply the same transformation to the test data, i.e. use x_test ⊙ r.
clf.predict(x_test.multiply(r))
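Putting the steps above together, here is a minimal end-to-end sketch of the one-vs-rest wrapping mentioned in the log-count-ratio step. It assumes train_x and test_x are the TF-IDF matrices and train_y the label array from the snippets above; the looping structure is an illustration, not the original author's script.

import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

def pr(x, y_i, y):
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

train_y = np.asarray(train_y)
probas = []
for c in range(1, 12):                        # one binary NBSVM per class (labels 1..11)
    y = (train_y == c).astype(int)            # one-vs-rest labels: 1 for class c, else 0
    r = sparse.csr_matrix(np.log(pr(train_x, 1, y) / pr(train_x, 0, y)))
    clf = LogisticRegression(C=1, dual=False, n_jobs=1).fit(train_x.multiply(r), y)
    probas.append(clf.predict_proba(test_x.multiply(r))[:, 1])

pred = np.argmax(np.vstack(probas), axis=0) + 1   # pick the most confident class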
Experimental results
The accuracy on the validation set reaches 0.8565. The following table shows the results that other methods achieve on the same dataset.
As can be seen, NBSVM not only improves on naive Bayes and SVM used individually, but also takes much less time than the deep learning methods while its results are in no way inferior.
Thoughts
In text classification, there is often no definite answer as to which basic model is optimal, because datasets and features differ. NBSVM addresses this by combining the two most common basic models through a trained weighting factor (which can be regarded as a form of regularization of the model), giving a basic model that is both versatile and more effective.
The author later applied NBSVM to a larger dataset (100k+ samples) and found that its performance clearly pulls away from that of NB and SVM, and is better than most shallow deep-learning models. NBSVM can therefore serve as a good baseline model for text classification.
NBSVM complete code
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression
from sklearn.utils.validation import check_X_y, check_is_fitted


class NbSvmClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0, dual=False, n_jobs=1):
        self.C = C
        self.dual = dual
        self.n_jobs = n_jobs

    def predict(self, x):
        # Verify that the model has been fitted
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict(x.multiply(self._r))

    def predict_proba(self, x):
        # Verify that the model has been fitted
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict_proba(x.multiply(self._r))

    def fit(self, x, y):
        # Check that x and y have the correct shape
        x, y = check_X_y(x, y, accept_sparse=True)

        def pr(x, y_i, y):
            # Smoothed per-feature counts for the samples whose label equals y_i
            p = x[y == y_i].sum(0)
            return (p + 1) / ((y == y_i).sum() + 1)

        # pr(x, 1, y): positive samples, pr(x, 0, y): negative samples
        self._r = sparse.csr_matrix(np.log(pr(x, 1, y) / pr(x, 0, y)))
        x_nb = x.multiply(self._r)
        self._clf = LogisticRegression(C=self.C, dual=self.dual,
                                       n_jobs=self.n_jobs).fit(x_nb, y)
        return self
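A brief usage sketch for the class above; train_x, train_y (as a NumPy array) and test_x are assumed to be the matrices and labels built earlier, and the one-vs-rest wrapping via scikit-learn is one possible way to handle the 11 classes, not the original author's script.

from sklearn.multiclass import OneVsRestClassifier

# Binary use: fit directly on 0/1 labels (here, class 1 vs the rest)
model = NbSvmClassifier(C=1.0, dual=False, n_jobs=1).fit(train_x, (train_y == 1).astype(int))
pred = model.predict(test_x)

# Multi-class use: let scikit-learn handle the one-vs-rest wrapping
ovr = OneVsRestClassifier(NbSvmClassifier(C=1.0)).fit(train_x, train_y)
pred_all = ovr.predict(test_x)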
Full code address: https://github.com/sanshibayuan/NBSVM
References:
https://github.com/sidaw/nbsvm
https://nlp.stanford.edu/~sidaw/home/projects:nbsvm
https://zhuanlan.zhihu.com/p/26729228
Python implementations of NBSVM