1. First experiment: Naive Bayes, using word-count features to process the data with CountVectorizer()
(1) Training set: 12,695 samples
Positive: 8,274
Negative: 4,221
HIT stopword list: df=3, accuracy 0.899; df=1, accuracy 0.9015
Sichuan University stopword list: df=1, accuracy 0.90035
(2) Training set: 19,106 samples
Positive: 11,747
Negative: 7,359
HIT stopword list: df=1, accuracy 0.90153
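The setup described above can be sketched as follows. The texts and labels here are toy stand-ins (the real experiments used roughly 19k segmented Chinese reviews and an external stopword list), and `min_df` is assumed to be the "df" threshold mentioned in the notes:

```python
# Sketch of the experiment-1 pipeline: bag-of-words counts + multinomial
# Naive Bayes. Toy English tokens stand in for segmented Chinese reviews.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["good great tasty", "bad awful cold", "great service", "awful bad food"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# min_df is assumed to correspond to the "df" threshold in the notes:
# tokens appearing in fewer than min_df documents are dropped. The real
# runs also passed a stopword list (HIT / Sichuan University) here.
vec = CountVectorizer(min_df=1, stop_words=None)
X_train = vec.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

X_test = vec.transform(["great tasty food", "cold awful"])
pred = clf.predict(X_test)
print(list(pred))  # → [1, 0]
```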
2. Second experiment: Naive Bayes, using TF-IDF features to process the review data with TfidfVectorizer(); an earlier attempt with TfidfTransformer() raised an error.
(1) Training set: 19,106 samples
Positive: 11,747
Negative: 7,359
HIT stopword list: df=1, accuracy 0.899568
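One point worth noting about the two classes mentioned above: TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, so once both paths are wired up correctly they should produce identical features. A minimal check, with toy texts:

```python
# TfidfVectorizer == CountVectorizer + TfidfTransformer (with matching
# defaults), verified on toy data.
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

texts = ["good great tasty", "bad awful cold", "great service again"]

# One-step path.
tfidf_direct = TfidfVectorizer(min_df=1).fit_transform(texts)

# Two-step path: raw counts, then TF-IDF weighting.
counts = CountVectorizer(min_df=1).fit_transform(texts)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

print(np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray()))  # → True
```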
3. Third experiment: Naive Bayes, using word-count features to process the data with CountVectorizer()
Training set: 19,106 samples
Positive: 11,747
Negative: 7,359
HIT stopword list: df=1
(1) When vectorizing the data with a bigram model, CountVectorizer(ngram_range=...) raised a MemoryError, i.e. insufficient memory. Searching for the cause suggested it was a limitation of the local machine, so the run was retried on a server.
Reverting to the unigram model with the training set unchanged, the misclassified samples in the test set changed; after deleting the ambiguously worded texts, accuracy rose slightly.
Accuracy: 0.9030612244897959
(2) Switching the word segmenter from full mode to precise mode improved accuracy by about 0.006.
Accuracy: 0.908948194662
4. Fourth experiment
(1) With the training and test sets unchanged, CountVectorizer was modified to also count words of length 1, and two runs were compared. Naive Bayes on word-count features gave accuracy 0.905, while Naive Bayes on TF-IDF features dropped to about 0.76, a significant decline. A likely cause: there are too many single-character words relative to the amount of training data, so their IDF values are relatively small while their term frequencies are very large; single-character words therefore receive large TF-IDF values, which severely distorts the weight assigned to each word and degrades the classification results.
(2) Training and test sets unchanged: Naive Bayes with word-count features (CountVectorizer()) gave accuracy 0.9116954474097331.
Training and test sets unchanged: Naive Bayes with TF-IDF features gave accuracy 0.9030612244897959.
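The "count words of length 1" change in (1) is likely a tweak to CountVectorizer's `token_pattern`: the default pattern, `r"(?u)\b\w\w+\b"`, discards single-character tokens, and widening it keeps them, which matters for Chinese text where many words are one character. A minimal illustration:

```python
# Default token_pattern drops one-character tokens; a widened pattern
# keeps them. This is the assumed mechanism behind experiment 4 (1).
from sklearn.feature_extraction.text import CountVectorizer

texts = ["a big cat", "a sad dog"]

default_vec = CountVectorizer()  # token_pattern=r"(?u)\b\w\w+\b"
default_vec.fit(texts)

single_char_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_vec.fit(texts)

print(sorted(default_vec.vocabulary_))      # → ['big', 'cat', 'dog', 'sad']
print(sorted(single_char_vec.vocabulary_))  # → ['a', 'big', 'cat', 'dog', 'sad']
```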
5. Fifth experiment
(1) Logistic regression with sklearn.linear_model.LogisticRegression(), using the default regularization penalty='l2' and solver='liblinear': accuracy 0.9072691552062868.
(2) Same setup with L1 regularization, penalty='l1' and solver='liblinear': accuracy 0.9084479371316306, slightly higher than the above.
(3) Default penalty='l2' with solver='lbfgs': accuracy 0.9072691552062868.
(4) Default penalty='l2' with solver='newton-cg': accuracy 0.9072691552062868.
(5) Default penalty='l2' with solver='sag': accuracy 0.906483300589391, the worst of the five, probably because this solver needs a large amount of training data to work well, generally more than 100,000 samples.
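The five runs above vary only the solver (and, in one case, the penalty). A compact sketch of that sweep, with synthetic data standing in for the real TF-IDF features:

```python
# Sketch of the experiment-5 solver sweep. Synthetic data replaces the
# real review features; scores here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

scores = {}
for solver in ["liblinear", "lbfgs", "newton-cg", "sag"]:
    clf = LogisticRegression(penalty="l2", solver=solver, max_iter=1000)
    clf.fit(X, y)
    scores[solver] = clf.score(X, y)
    print(solver, round(scores[solver], 3))

# L1 regularization (run 2) is only supported by some solvers,
# e.g. liblinear and saga.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print("l1/liblinear", round(l1_clf.score(X, y), 3))
```

Note that not every penalty/solver pair is valid: 'lbfgs', 'newton-cg', and 'sag' support only L2 (or no) regularization, which is why the L1 run pairs with 'liblinear'.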
6. Sixth experiment
(1) An SVM was trained on the same training and test data as before, but because training took too long, only the first 10,000 rows of the training set were used (the test set was unchanged); accuracy was 0.6742632612966601.
Training time: 1170.129 seconds
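The notes do not say which SVM class was used; the sketch below assumes kernel SVC, whose training cost grows roughly quadratically (or worse) in the number of samples, which would explain both the long runtime and the decision to train on a subset:

```python
# Sketch of the experiment-6 setup: train an SVM on a prefix of the
# training data and time it. Synthetic data and a small subset size
# stand in for the real corpus and the "first 10,000 rows".
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

subset = 500  # stand-in for the first 10,000 training rows
start = time.perf_counter()
clf = SVC(kernel="rbf").fit(X[:subset], y[:subset])
elapsed = time.perf_counter() - start

acc = clf.score(X[subset:], y[subset:])
print(f"accuracy={acc:.3f}, time={elapsed:.3f}s")
```

For high-dimensional sparse text features, sklearn.svm.LinearSVC is the usual faster alternative to kernel SVC.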
Natural Language Analysis: Experimental Records