Natural Language Analysis -- Experimental Records

Source: Internet

Author: User

Tags: idf

1. First experiment: naive Bayes, processing the data with word counts via CountVectorizer()

(1) Training set: 12,695 comments

Positive: 8,274

Negative: 4,221

HIT stopword list: df=3, accuracy 0.899; df=1, accuracy 0.9015

Sichuan University stopword list: df=1, accuracy 0.90035

(2) Training set: 19,106 comments

Positive: 11,747

Negative: 7,359

HIT stopword list: df=1, accuracy 0.90153
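The count-based pipeline described above can be sketched as follows. The toy corpus, labels, and the two-word stopword list are placeholders for the real comment data and the stopword lists used in the notes; `min_df` stands in for the df threshold.

```python
# Minimal sketch of experiment 1: word counts + naive Bayes.
# The corpus, labels, and stopword list are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["good product works well", "bad quality broke fast",
               "great value very happy", "terrible waste of money"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# min_df plays the role of the df threshold in the notes (df=1 keeps all terms)
vectorizer = CountVectorizer(min_df=1, stop_words=["of", "very"])
X_train = vectorizer.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Vectorize a test comment with the SAME fitted vectorizer, then predict
X_test = vectorizer.transform(["good quality very happy"])
pred = clf.predict(X_test)
```

With real data, accuracy would be measured with `clf.score()` on the held-out test matrix rather than inspected prediction by prediction.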

2. Second experiment: naive Bayes, processing the comment data with TF-IDF via TfidfVectorizer(); an attempt with TfidfTransformer() raised an error.

(1) Training set: 19,106 comments

Positive: 11,747

Negative: 7,359

HIT stopword list: df=1, accuracy 0.899568
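The TfidfTransformer error from the notes is not reproduced here, but for reference, TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer when their defaults match; the corpus below is a placeholder.

```python
# Sketch: TfidfVectorizer == CountVectorizer -> TfidfTransformer
# (same defaults: l2 norm, smooth_idf=True). Toy corpus only.
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

corpus = ["the food was great", "the service was slow", "great food great price"]

# One-step: tokenize, count, and apply TF-IDF weighting in a single object
tfidf_direct = TfidfVectorizer(min_df=1).fit_transform(corpus)

# Two-step: raw counts first, then TF-IDF weighting on the count matrix
counts = CountVectorizer(min_df=1).fit_transform(corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# Both routes produce the same weighted matrix
assert np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray())
```

A common source of TfidfTransformer errors is passing it raw text instead of a count matrix; it only accepts the output of a vectorizer.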

3. Third experiment: naive Bayes, processing the data with word counts via CountVectorizer()

Training set: 19,106 comments

Positive: 11,747

Negative: 7,359

HIT stopword list: df=1

(1) When vectorizing the data with a bigram model, CountVectorizer(ngram_range=...), a MemoryError occurred, meaning memory was insufficient. The cause appears to be the local machine, so the next step is to try running it on a server.

Sticking with the unigram model and the same training set, the misclassified items in the test set were inspected and texts with ambiguous wording were deleted, which raised the accuracy slightly.

Accuracy: 0.9030612244897959
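The bigram setting that triggered the MemoryError looks like the sketch below, run on a placeholder corpus; `max_features` is one way, not mentioned in the original notes, to cap the vocabulary and bound memory.

```python
# Sketch of the bigram configuration from the notes; max_features is an
# illustrative cap on vocabulary size (an assumption, not from the notes).
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the screen is bright", "the battery is weak"]

# unigrams + bigrams; the vocabulary grows roughly with the number of
# distinct word pairs, which is what exhausts memory on large corpora
vec = CountVectorizer(ngram_range=(1, 2), max_features=50000)
X = vec.fit_transform(corpus)

print("is bright" in vec.vocabulary_)  # True: bigram features are present
```

Using a sparse matrix end to end (never calling `.toarray()` on the full corpus) is usually the other half of keeping bigram models within memory.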

(2) Switching word segmentation from full mode to precise mode improved accuracy by about 0.006.

Accuracy: 0.908948194662

4. Fourth experiment

(1) With the training and test sets unchanged, CountVectorizer was modified to also count words of length 1. With count-based features, naive Bayes accuracy was 0.905; with TF-IDF features, accuracy dropped sharply to about 0.76. A likely reason is that single-character words appear very often while the total amount of training data is limited, so their IDF values are relatively small but their term frequencies are very large; the resulting TF-IDF values of single characters are large and badly distort the importance assigned to each word, so classification quality suffers.
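The length-1 change above hinges on CountVectorizer's token pattern: the default regex drops single-character tokens, and widening it is one common way (an assumption about how the notes did it) to keep them.

```python
# Sketch: keeping length-1 tokens in CountVectorizer.
# The default token_pattern r"(?u)\b\w\w+\b" requires 2+ characters.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a b ab", "ab b a"]

default_vec = CountVectorizer()  # default pattern drops 1-char tokens
default_vec.fit(docs)

# \w+ instead of \w\w+ keeps single-character words as features
single_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_vec.fit(docs)

print(sorted(default_vec.vocabulary_))  # ['ab']
print(sorted(single_vec.vocabulary_))   # ['a', 'ab', 'b']
```

This matters especially for Chinese text, where many meaningful words are a single character.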

(2) Training and test sets unchanged, naive Bayes with word-count features (CountVectorizer()): accuracy 0.9116954474097331.

Training and test sets unchanged, naive Bayes with TF-IDF features: accuracy 0.9030612244897959.

5. Fifth experiment

(1) Logistic regression via sklearn.linear_model.LogisticRegression() with the default regularization penalty='l2' and solver='liblinear': accuracy 0.9072691552062868.

(2) With L1 regularization penalty='l1' and solver='liblinear': accuracy 0.9084479371316306, slightly higher than the above.

(3) With the default penalty='l2' and solver='lbfgs': accuracy 0.9072691552062868.

(4) With the default penalty='l2' and solver='newton-cg': accuracy 0.9072691552062868.

(5) With the default penalty='l2' and solver='sag': accuracy 0.906483300589391, the worst of the five, probably because this solver needs a large amount of training data, generally more than 100,000 samples.
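The five runs above can be sketched as one loop over (solver, penalty) pairs; `make_classification` stands in for the comment features here, so the printed scores are illustrative, not the accuracies reported in the notes.

```python
# Hedged sketch of experiment 5 on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The same five (solver, penalty) combinations tried in the notes;
# liblinear is the only one of these that also supports penalty='l1'
for solver, penalty in [("liblinear", "l2"), ("liblinear", "l1"),
                        ("lbfgs", "l2"), ("newton-cg", "l2"), ("sag", "l2")]:
    clf = LogisticRegression(penalty=penalty, solver=solver, max_iter=1000)
    clf.fit(X_tr, y_tr)
    print(solver, penalty, round(clf.score(X_te, y_te), 4))
```

On well-conditioned data all five solvers optimize the same objective, which is consistent with the nearly identical accuracies observed for (1), (3), and (4).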

6. Sixth experiment

(1) An SVM was trained with the same training and test data as before, but because training takes a long time, only the first 10,000 rows of the training set were used; the test set was unchanged. Accuracy: 0.6742632612966601.

Training time: 1170.129 seconds.
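The subsetting step can be sketched as below on synthetic placeholder data; the kernel choice and the smaller subset size are assumptions for illustration, since the notes do not say which SVM variant was used.

```python
# Sketch of experiment 6: kernel SVM training cost grows superlinearly
# in the number of rows, hence training on a prefix of the data.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

subset = 1000                    # stands in for "first 10,000 rows"
clf = SVC(kernel="rbf")          # kernel choice is an assumption
clf.fit(X[:subset], y[:subset])

score = clf.score(X[subset:], y[subset:])
print(round(score, 4))
```

For text-sized sparse feature matrices, LinearSVC (not tried in the original notes) is a common faster alternative to a kernel SVC.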

