1. First experiment: Naive Bayes, using word-count features to process the data with CountVectorizer()
(1) Training set: 12,695 samples
Positive: 8,274
Negative: 4,221
HIT stopword list: df=3, accuracy 0.899; df=1, accuracy 0.9015
Sichuan University stopword list: df=1, accuracy 0.90035
(2) Training set: 19,106 samples
Positive: 11,747
Negative: 7,359
HIT stopword list: df=1, accuracy 0.90153
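The setup described above can be sketched as follows. The texts and labels here are toy stand-ins (the real experiments used roughly 19k segmented Chinese reviews and an external stopword list), and `min_df` is assumed to be the "df" threshold mentioned in the notes:

```python
# Sketch of the experiment-1 pipeline: bag-of-words counts + multinomial
# Naive Bayes. Toy English tokens stand in for segmented Chinese reviews.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["good great tasty", "bad awful cold", "great service", "awful bad food"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# min_df is assumed to correspond to the "df" threshold in the notes:
# tokens appearing in fewer than min_df documents are dropped. The real
# runs also passed a stopword list (HIT / Sichuan University) here.
vec = CountVectorizer(min_df=1, stop_words=None)
X_train = vec.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

X_test = vec.transform(["great tasty food", "cold awful"])
pred = clf.predict(X_test)
print(list(pred))  # → [1, 0]
```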
2. Second experiment: Naive Bayes, using TF-IDF features to process the review data with TfidfVectorizer(); an earlier attempt with TfidfTransformer() raised an error.
(1) Training set: 19,106 samples
Positive: 11,747
Negative: 7,359
HIT stopword list: df=1, accuracy 0.899568
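One point worth noting about the two classes mentioned above: TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer, so once both paths are wired up correctly they should produce identical features. A minimal check, with toy texts:

```python
# TfidfVectorizer == CountVectorizer + TfidfTransformer (with matching
# defaults), verified on toy data.
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

texts = ["good great tasty", "bad awful cold", "great service again"]

# One-step path.
tfidf_direct = TfidfVectorizer(min_df=1).fit_transform(texts)

# Two-step path: raw counts, then TF-IDF weighting.
counts = CountVectorizer(min_df=1).fit_transform(texts)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

print(np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray()))  # → True
```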
3. Third experiment: Naive Bayes, using word-count features to process the data with CountVectorizer()
Training set: 19,106 samples
Positive: 11,747
Negative: 7,359
HIT stopword list: df=1
(1) When vectorizing the data with a bigram model, CountVectorizer(ngram_range=...) raised a MemoryError, i.e. insufficient memory. Searching for the cause suggested it was a limitation of the local machine, so the run was retried on a server.
Reverting to the unigram model with the training set unchanged, the misclassified samples in the test set changed; after deleting the ambiguously worded texts, accuracy rose slightly.
Accuracy: 0.9030612244897959
(2) Switching the word segmenter from full mode to precise mode improved accuracy by about 0.006.
Accuracy: 0.908948194662
4. Fourth experiment
(1) With the training and test sets unchanged, CountVectorizer was modified to also count words of length 1, and two runs were compared. Naive Bayes on word-count features gave accuracy 0.905, while Naive Bayes on TF-IDF features dropped to about 0.76, a significant decline. A likely cause: there are too many single-character words relative to the amount of training data, so their IDF values are relatively small while their term frequencies are very large; single-character words therefore receive large TF-IDF values, which severely distorts the weight assigned to each word and degrades the classification results.
(2) Training and test sets unchanged: Naive Bayes with word-count features (CountVectorizer()) gave accuracy 0.9116954474097331.
Training and test sets unchanged: Naive Bayes with TF-IDF features gave accuracy 0.9030612244897959.
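The "count words of length 1" change in (1) is likely a tweak to CountVectorizer's `token_pattern`: the default pattern, `r"(?u)\b\w\w+\b"`, discards single-character tokens, and widening it keeps them, which matters for Chinese text where many words are one character. A minimal illustration:

```python
# Default token_pattern drops one-character tokens; a widened pattern
# keeps them. This is the assumed mechanism behind experiment 4 (1).
from sklearn.feature_extraction.text import CountVectorizer

texts = ["a big cat", "a sad dog"]

default_vec = CountVectorizer()  # token_pattern=r"(?u)\b\w\w+\b"
default_vec.fit(texts)

single_char_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_vec.fit(texts)

print(sorted(default_vec.vocabulary_))      # → ['big', 'cat', 'dog', 'sad']
print(sorted(single_char_vec.vocabulary_))  # → ['a', 'big', 'cat', 'dog', 'sad']
```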
5. Fifth experiment
(1) Logistic regression with sklearn.linear_model.LogisticRegression(), using the default regularization penalty='l2' and solver='liblinear': accuracy 0.9072691552062868.
(2) Same setup with L1 regularization, penalty='l1' and solver='liblinear': accuracy 0.9084479371316306, slightly higher than the above.
(3) Default penalty='l2' with solver='lbfgs': accuracy 0.9072691552062868.
(4) Default penalty='l2' with solver='newton-cg': accuracy 0.9072691552062868.
(5) Default penalty='l2' with solver='sag': accuracy 0.906483300589391, the worst of the five, probably because this solver needs a large amount of training data to work well, generally more than 100,000 samples.
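The five runs above vary only the solver (and, in one case, the penalty). A compact sketch of that sweep, with synthetic data standing in for the real TF-IDF features:

```python
# Sketch of the experiment-5 solver sweep. Synthetic data replaces the
# real review features; scores here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

scores = {}
for solver in ["liblinear", "lbfgs", "newton-cg", "sag"]:
    clf = LogisticRegression(penalty="l2", solver=solver, max_iter=1000)
    clf.fit(X, y)
    scores[solver] = clf.score(X, y)
    print(solver, round(scores[solver], 3))

# L1 regularization (run 2) is only supported by some solvers,
# e.g. liblinear and saga.
l1_clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print("l1/liblinear", round(l1_clf.score(X, y), 3))
```

Note that not every penalty/solver pair is valid: 'lbfgs', 'newton-cg', and 'sag' support only L2 (or no) regularization, which is why the L1 run pairs with 'liblinear'.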
6. Sixth experiment
(1) An SVM was trained on the same training and test data as before, but because training took too long, only the first 10,000 rows of the training set were used (the test set was unchanged); accuracy was 0.6742632612966601.
Training time: 1170.129 seconds
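The notes do not say which SVM class was used; the sketch below assumes kernel SVC, whose training cost grows roughly quadratically (or worse) in the number of samples, which would explain both the long runtime and the decision to train on a subset:

```python
# Sketch of the experiment-6 setup: train an SVM on a prefix of the
# training data and time it. Synthetic data and a small subset size
# stand in for the real corpus and the "first 10,000 rows".
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

subset = 500  # stand-in for the first 10,000 training rows
start = time.perf_counter()
clf = SVC(kernel="rbf").fit(X[:subset], y[:subset])
elapsed = time.perf_counter() - start

acc = clf.score(X[subset:], y[subset:])
print(f"accuracy={acc:.3f}, time={elapsed:.3f}s")
```

For high-dimensional sparse text features, sklearn.svm.LinearSVC is the usual faster alternative to kernel SVC.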
Natural Language Analysis: Experimental Records