NBSVM
Naive Bayes (NB) and Support Vector Machines (SVM) are basic models commonly used in text classification. Their relative performance varies with the dataset, the features, and the parameters. In general, NB does better on short text, while SVM performs better on long text.
NBSVM, proposed in the paper "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification", is a classification method that combines NB and SVM. The principle can be summed up in one sentence:
"Trust NB unless SVM is very confident"
The authors experimented on several datasets and obtained better results than NB or SVM alone; the detailed results and derivation can be found in the original paper.
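For reference, the mechanism behind that sentence in the paper is an interpolation between the weights learned on NB-scaled features and their mean magnitude. The snippet below is only a sketch of that formula with placeholder values for w and beta; it is not part of the demo implemented in this article.

import numpy as np

# Sketch of w' = (1 - beta) * w_bar + beta * w from the paper; w and beta are
# placeholder values, not learned here.
w = np.array([0.2, -1.5, 3.0, 0.1])   # weights a linear classifier might learn on NB-scaled features
beta = 0.25                            # interpolation parameter
w_bar = np.abs(w).mean()               # mean magnitude of the weights
w_prime = (1 - beta) * w_bar + beta * w
# Weights the classifier is not confident about shrink toward w_bar, so the
# NB evidence already baked into the scaled features dominates for them.
print(w_prime)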
The authors also provide MATLAB code; this article uses Python to build a simple demo implementation of NBSVM.
Problem description
The task is a small Chinese news text classification problem. The training set consists of 11 categories of news; the text has already been word-segmented, each category contains 1600 documents, and the labels run from 1 to 11. The test set consists of 11*160 documents. The experimental data and the comparison results come from a Zhihu column; see the references for the address.
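As an illustration, a minimal way to read such pre-segmented data could look like the following; the directory layout and file names are assumptions for the sketch, not the original dataset's format.

import os

# Hypothetical layout: one UTF-8 file per document, grouped into per-class
# folders named "1" ... "11"; each file holds space-separated, segmented words.
def load_corpus(root):
    texts, labels = [], []
    for label in sorted(os.listdir(root)):
        class_dir = os.path.join(root, label)
        for name in os.listdir(class_dir):
            with open(os.path.join(class_dir, name), encoding='utf-8') as f:
                texts.append(f.read().strip())
            labels.append(int(label))
    return texts, labels

train_x, train_y = load_corpus('data/train')   # hypothetical paths
test_x, test_y = load_corpus('data/test')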
Algorithm steps
- Vectorization and TF-IDF feature extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_v0 = CountVectorizer()
counts_all = count_v0.fit_transform(train_x + test_x)

count_v1 = CountVectorizer(vocabulary=count_v0.vocabulary_)
counts_train = count_v1.fit_transform(train_x)
print("the shape of train is " + repr(counts_train.shape))

count_v2 = CountVectorizer(vocabulary=count_v0.vocabulary_)
counts_test = count_v2.fit_transform(test_x)
print("the shape of test is " + repr(counts_test.shape))

tfidftransformer = TfidfTransformer()
train_x = tfidftransformer.fit(counts_train).transform(counts_train)
test_x = tfidftransformer.fit(counts_test).transform(counts_test)
- Calculate the log-count ratio
To simplify the problem, we turn the n-class problem into several binary classification problems, so the labels need to be converted to a one-hot (one-vs-rest) form in which the binary label y_i takes the values 1 and 0; an end-to-end sketch of this wrapping is given after the list.
import numpy as np
from scipy import sparse

def pr(x, y_i, y):
    p = x[y == y_i].sum(0)                    # sum feature counts over the documents of one class (axis 0)
    return (p + 1) / ((y == y_i).sum() + 1)   # add-one smoothing

r = sparse.csr_matrix(np.log(pr(x, 1, y) / pr(x, 0, y)))
- Use the elementwise product x ⊙ r instead of the original training data x as the input to the linear classifier; this product is the NBSVM training data.
x_nb = x.multiply(r)
- Train the classifier on the new training data and the training labels; a logistic regression (LR) classifier is used here.
clf = LogisticRegression(C=1, dual=False, n_jobs=1).fit(x_nb, y)
- To predict on the test set, apply the same transformation to the test data, i.e. use x_test ⊙ r.
clf.predict(x_test.multiply(r))
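Putting the steps above together, here is a minimal end-to-end sketch of the one-vs-rest wrapping mentioned in the log-count-ratio step. It assumes train_x and test_x are the TF-IDF matrices and train_y the label array from the snippets above; the looping structure is an illustration, not the original author's script.

import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

def pr(x, y_i, y):
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

train_y = np.asarray(train_y)
probas = []
for c in range(1, 12):                        # one binary NBSVM per class (labels 1..11)
    y = (train_y == c).astype(int)            # one-vs-rest labels: 1 for class c, else 0
    r = sparse.csr_matrix(np.log(pr(train_x, 1, y) / pr(train_x, 0, y)))
    clf = LogisticRegression(C=1, dual=False, n_jobs=1).fit(train_x.multiply(r), y)
    probas.append(clf.predict_proba(test_x.multiply(r))[:, 1])

pred = np.argmax(np.vstack(probas), axis=0) + 1   # pick the most confident class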
Experimental results
The accuracy on the validation set reaches 0.8565. The following table shows the results that other methods achieve on the same dataset.
As can be seen, NBSVM not only improves on naive Bayes and SVM used individually, but also takes much less time than the deep learning methods while its results are in no way inferior.
Thoughts
In text classification, there is often no definite answer as to which basic model is optimal, because datasets and features differ. NBSVM addresses this by combining the two most common basic models through a trained weighting factor (which can be regarded as a form of regularization of the model), giving a basic model that is both versatile and more effective.
The author later applied NBSVM to a larger dataset (100k+ samples) and found that its performance clearly pulls away from that of NB and SVM, and is better than most shallow deep-learning models. NBSVM can therefore serve as a good baseline model for text classification.
NBSVM complete code
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression
from sklearn.utils.validation import check_X_y, check_is_fitted


class NbSvmClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0, dual=False, n_jobs=1):
        self.C = C
        self.dual = dual
        self.n_jobs = n_jobs

    def predict(self, x):
        # Verify that the model has been fitted
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict(x.multiply(self._r))

    def predict_proba(self, x):
        # Verify that the model has been fitted
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict_proba(x.multiply(self._r))

    def fit(self, x, y):
        # Check that x and y have the correct shape
        x, y = check_X_y(x, y, accept_sparse=True)

        def pr(x, y_i, y):
            # Smoothed per-feature counts for the samples whose label equals y_i
            p = x[y == y_i].sum(0)
            return (p + 1) / ((y == y_i).sum() + 1)

        # pr(x, 1, y): positive samples, pr(x, 0, y): negative samples
        self._r = sparse.csr_matrix(np.log(pr(x, 1, y) / pr(x, 0, y)))
        x_nb = x.multiply(self._r)
        self._clf = LogisticRegression(C=self.C, dual=self.dual,
                                       n_jobs=self.n_jobs).fit(x_nb, y)
        return self
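A brief usage sketch for the class above; train_x, train_y (as a NumPy array) and test_x are assumed to be the matrices and labels built earlier, and the one-vs-rest wrapping via scikit-learn is one possible way to handle the 11 classes, not the original author's script.

from sklearn.multiclass import OneVsRestClassifier

# Binary use: fit directly on 0/1 labels (here, class 1 vs the rest)
model = NbSvmClassifier(C=1.0, dual=False, n_jobs=1).fit(train_x, (train_y == 1).astype(int))
pred = model.predict(test_x)

# Multi-class use: let scikit-learn handle the one-vs-rest wrapping
ovr = OneVsRestClassifier(NbSvmClassifier(C=1.0)).fit(train_x, train_y)
pred_all = ovr.predict(test_x)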
Full code address: https://github.com/sanshibayuan/NBSVM
References:
https://github.com/sidaw/nbsvm
https://nlp.stanford.edu/~sidaw/home/projects:nbsvm
https://zhuanlan.zhihu.com/p/26729228
Python implementations of NBSVM