Python implementations of NBSVM

Source: Internet
Author: User
Tags: svm

NBSVM

Naive Bayes (NB) and Support Vector Machines (SVM) are basic models commonly used in text classification. Their relative performance varies with the dataset, the features, and the parameters. In general, NB does better on short texts, while SVM performs better on long texts.

NBSVM, proposed in the paper "Baselines and Bigrams: Simple, Good Sentiment and Topic Classification", is a classification method that combines NB and SVM. Its principle can be summed up in one sentence:

"Trust NB unless SVM is very confident"

The authors experimented on different datasets and obtained better results than either NB or SVM alone; the detailed results and principles can be found in the original paper.
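For reference, here is a restatement of the paper's binary formulation (notation follows the paper, not this post: $f^{(i)}$ is the feature-count vector of document $i$, $\alpha$ a smoothing parameter). The log-count ratio $\mathbf{r}$ is

$$\mathbf{p} = \alpha + \sum_{i:\,y^{(i)}=1} f^{(i)}, \qquad \mathbf{q} = \alpha + \sum_{i:\,y^{(i)}=0} f^{(i)}, \qquad \mathbf{r} = \log\frac{\mathbf{p}/\|\mathbf{p}\|_1}{\mathbf{q}/\|\mathbf{q}\|_1}$$

and the linear classifier is trained on the element-wise product $\mathbf{r} \circ f^{(i)}$ instead of the raw features, which is what the steps below implement.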

The authors also provide MATLAB code; this article uses Python for a simple demo implementation of NBSVM.

Problem description

The task is a small Chinese news text classification problem. The training set consists of 11 categories of news with labels 1 to 11; the texts are already word-segmented, and each category contains 1,600 documents. The test set contains 11 × 160 documents. The experimental data and the comparison results come from a Zhihu column; see the reference for the detailed address.

Algorithm steps

  1. Vectorization and TF-IDF feature extraction:

         from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

         count_v0 = CountVectorizer()
         counts_all = count_v0.fit_transform(train_x + test_x)  # fit the vocabulary on all texts
         count_v1 = CountVectorizer(vocabulary=count_v0.vocabulary_)
         counts_train = count_v1.fit_transform(train_x)
         print("the shape of train is " + repr(counts_train.shape))
         count_v2 = CountVectorizer(vocabulary=count_v0.vocabulary_)
         counts_test = count_v2.fit_transform(test_x)
         print("the shape of test is " + repr(counts_test.shape))
         tfidftransformer = TfidfTransformer()
         train_x = tfidftransformer.fit(counts_train).transform(counts_train)
         test_x = tfidftransformer.fit(counts_test).transform(counts_test)
  2. Calculate the log-count ratio r.

     To simplify the problem, we turn the multiclass (n-way) problem into several binary classification problems, so the labels are converted to one-hot encoded form and the corresponding y_i values are 1 and 0.

         import numpy as np
         from scipy import sparse

         def pr(x, y_i, y):
             p = x[y == y_i].sum(0)  # sum counts over axis 0
             return (p + 1) / ((y == y_i).sum() + 1)  # add-one smoothing

         r = sparse.csr_matrix(np.log(pr(x, 1, y) / pr(x, 0, y)))
  3. Use x ∘ r (the element-wise product) in place of the original training data x as the training data for the NBSVM's linear classifier:

         x_nb = x.multiply(r)
  4. Train the classifier on the new training data and the training labels; a logistic regression (LR) classifier is used here:

         from sklearn.linear_model import LogisticRegression

         clf = LogisticRegression(C=1, dual=False, n_jobs=1).fit(x_nb, y)
  5. When predicting on the test set, apply the same x_test ∘ r transformation to the test data (a consolidated sketch of all five steps follows this list):

         clf.predict(x_test.multiply(r))
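Putting the five steps together for the 11-class task, here is a minimal one-vs-rest sketch. It is an illustration under stated assumptions, not the post's exact script: train_x and test_x are the TF-IDF matrices from step 1, train_y is assumed to be a NumPy array of integer labels 1 to 11, and C=1 simply mirrors step 4.

    import numpy as np
    from scipy import sparse
    from sklearn.linear_model import LogisticRegression

    def pr(x, y_i, y):
        # smoothed per-feature counts for the class y == y_i (step 2)
        p = x[y == y_i].sum(0)
        return (p + 1) / ((y == y_i).sum() + 1)

    probs = []
    for label in range(1, 12):                 # labels 1..11, one binary problem each
        y = (train_y == label).astype(int)     # one-vs-rest 0/1 labels
        r = sparse.csr_matrix(np.log(pr(train_x, 1, y) / pr(train_x, 0, y)))
        clf = LogisticRegression(C=1, dual=False, n_jobs=1).fit(train_x.multiply(r), y)  # steps 3-4
        probs.append(clf.predict_proba(test_x.multiply(r))[:, 1])                        # step 5

    pred = np.argmax(np.stack(probs, axis=1), axis=1) + 1  # most confident class, mapped back to 1..11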

Experimental results

The accuracy on the validation set reaches 0.8565. The results of other classification methods on this dataset can be found in the Zhihu column listed in the references.

As the comparison shows, NBSVM not only improves on naive Bayes and SVM individually, but also takes less time than the deep learning methods while giving results that are in no way inferior.

Thoughts

In text classification, there is often no definite answer to which basic model is optimal, because datasets and features differ. To address this, NBSVM combines the two most common basic classification models through a trained weighting factor (which can be regarded as a regularization of the model), forming a basic model that is both versatile and effective.
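Concretely, the weighting factor referred to here is the interpolation parameter β from the paper: the final weights are w' = (1 − β)·w̄ + β·w, where w̄ is the mean magnitude of the learned weights. Below is a minimal NumPy sketch of that step (the function name is mine; note that the demo above effectively uses β = 1 and relies on LR's own regularization instead):

    import numpy as np

    def interpolate_weights(w, beta):
        """NBSVM weight interpolation: w' = (1 - beta) * w_bar + beta * w.

        beta = 1 keeps the classifier's weights unchanged; smaller beta
        shrinks every weight toward the mean magnitude w_bar, i.e.
        'trust NB unless the SVM is very confident'.
        """
        w_bar = np.abs(w).sum() / w.size  # mean magnitude of w
        return (1 - beta) * w_bar + beta * w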

The author has since applied NBSVM to a larger dataset (100,000+ samples) and found that NBSVM's performance clearly pulls ahead of both NB and SVM, and is better than most shallow deep learning models. NBSVM can therefore serve as a good baseline model for text classification.

NBSVM complete code

import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression
from sklearn.utils.validation import check_X_y, check_is_fitted


class NBSVMClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, C=1.0, dual=False, n_jobs=1):
        self.C = C
        self.dual = dual
        self.n_jobs = n_jobs

    def predict(self, x):
        # verify that the model has been fit
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict(x.multiply(self._r))

    def predict_proba(self, x):
        # verify that the model has been fit
        check_is_fitted(self, ['_r', '_clf'])
        return self._clf.predict_proba(x.multiply(self._r))

    def fit(self, x, y):
        # check that x and y have the correct shape
        x, y = check_X_y(x, y, accept_sparse=True)

        def pr(x, y_i, y):
            # smoothed per-feature counts for the class y == y_i
            p = x[y == y_i].sum(0)
            return (p + 1) / ((y == y_i).sum() + 1)

        # log-count ratio: pr(x, 1, y) from positive samples, pr(x, 0, y) from negative samples
        self._r = sparse.csr_matrix(np.log(pr(x, 1, y) / pr(x, 0, y)))
        x_nb = x.multiply(self._r)
        self._clf = LogisticRegression(C=self.C, dual=self.dual,
                                       n_jobs=self.n_jobs).fit(x_nb, y)
        return self
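A quick usage sketch (the one-vs-rest wrapping is my addition; the post's repo may organize this differently). scikit-learn's OneVsRestClassifier binarizes the 11 labels to 0/1 per class, which matches the binary assumption inside fit:

    from sklearn.multiclass import OneVsRestClassifier

    model = OneVsRestClassifier(NBSVMClassifier(C=1, dual=False, n_jobs=1))
    model.fit(train_x, train_y)   # train_x: the TF-IDF matrix from step 1
    pred = model.predict(test_x)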

Full code address: https://github.com/sanshibayuan/NBSVM

Reference:

https://github.com/sidaw/nbsvm

https://nlp.stanford.edu/~sidaw/home/projects:nbsvm

https://zhuanlan.zhihu.com/p/26729228
