Sentiment Analysis in Natural Language Processing (NLP) in Practice


A very important research direction in natural language processing (NLP) is sentiment analysis. For example, IMDB hosts a large number of movie reviews, so we can gauge a film's reputation through sentiment analysis shortly after its release, and even predict whether it will be a box-office hit. Similarly, Douban (a Chinese review site) carries many reviews of films, TV series, and books that can serve as a corpus for sentiment analysis. For e-commerce sites, the comment section of any given product contains a large number of evaluations; sentiment analysis of these reviews may tell us which products in the same category are most favored by consumers.

In previous articles, I introduced some basic methods of natural language processing with NLTK in Python: Natural Language Processing with NLTK in Python, and Python Natural Language Processing: Stemming, Lemmatization, and the MaxMatch Algorithm.

I have also introduced machine learning with scikit-learn in Python, in particular the basics of classification with LogisticRegression: Python Machine Learning: Logistic Regression.

Building on those articles, this one attempts to combine machine learning with natural language processing to illustrate the basic approach to sentiment analysis, using tweets as the example. Three points to note first:

1. The following example still relies mainly on two Python libraries: NLTK and scikit-learn.

2. SemEval is an annual competitive event in the NLP field, similar to KDD Cup. SemEval was founded in 1998; this year's (2016) activity homepage is http://alt.qcri.org/semeval2016/. The data used in the program below comes from a SemEval 2016 task (we have already carried out the basic preprocessing; since that is not the focus of this article, it is omitted here).


3. The main purpose of this demonstration is to familiarize you with the basic content of sentiment analysis and to deepen your command of the scikit-learn library. The data we analyze comes from a real dataset, not a simulated one, so the final results do not guarantee a very high accuracy. Achieving higher accuracy requires deeper thought about model construction and feature selection, which goes beyond the scope of this article.


Our raw data consists of individual tweets; for example, here are two:

Top 5 most searched for Back-to-school topics - the list may surprise you http://t.co/Xj21uMVo0p @bing @MSFTnews #backtoschool

@Microsoft @taehongmin1 We have a IOT workshop by @Microsoft in 11PM on the Friday - definitely worth going for inspiration! #HackThePlanet

Of course, we also have a list of labels, one per tweet, giving its polarity: +1 means positive, -1 negative, and 0 neutral.


In the preprocessing phase, I split each tweet into sentences and tokenized it, and then: 1) excluded content such as @***; 2) treated each #-led topic (hashtag) as a separate sentence; 3) deleted web addresses beginning with http; 4) unified the case. The two tweets above then yield the following two results:
[['top', '5', 'most', 'searched', 'for', 'back', '-', 'to', '-', 'school', 'topics', '--', 'the', 'list', 'may', 'surprise', 'you', '.'], ['back', 'to', 'school', '.']]
[['we', 'have', 'a', 'iot', 'workshop', 'by', 'at', '11pm', 'on', 'the', 'friday', '-', 'definitely', 'worth', 'going', 'for', 'inspiration', '!'], ['.'], ['hack', 'the', 'planet', '.']]
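
The article does not show the preprocessing code itself. As a rough sketch only, the steps above could be approximated with regular expressions and NLTK's tokenizers; the function name preprocess_tweet and the exact cleanup rules here are assumptions, and the word-level segmentation of hashtags is omitted:

import re
from nltk.tokenize import sent_tokenize, word_tokenize  # requires nltk.download('punkt')

def preprocess_tweet(tweet):
    # 3) delete web addresses beginning with http
    tweet = re.sub(r'http\S+', '', tweet)
    # 1) exclude @*** content
    tweet = re.sub(r'@\w+', '', tweet)
    # 2) pull out #-led topics so they can be treated as separate sentences
    hashtags = re.findall(r'#(\w+)', tweet)
    tweet = re.sub(r'#\w+', '', tweet)
    # 4) unify the case
    tweet = tweet.lower()
    sentences = [word_tokenize(s) for s in sent_tokenize(tweet)]
    # hashtags become their own "sentences" (splitting them into separate words is omitted here)
    sentences += [word_tokenize(h.lower()) + ['.'] for h in hashtags]
    return sentences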

Then, based on the training dataset, we create a bag of words (BoW, bag-of-words): a dictionary that stores every word appearing in the training set together with its frequency over the whole corpus. The aim is to eliminate words that appear only rarely across the training set (uncommon words) as well as words that appear frequently but carry little meaning (usually called stop words, such as "the", "a", etc.).
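
As an illustration only (not the exact code used for the experiment), such a bag of words could be built with collections.Counter; the min_count threshold and the use of NLTK's English stop-word list are assumptions:

from collections import Counter
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

def build_bow(tokenized_tweets, min_count=2):
    # count every token over all sentences of all training tweets
    counts = Counter(token
                     for tweet in tokenized_tweets
                     for sentence in tweet
                     for token in sentence)
    stop_words = set(stopwords.words('english'))
    # keep words that are neither rare nor stop words
    return {word: freq for word, freq in counts.items()
            if freq >= min_count and word not in stop_words}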


On top of the BoW, we can create a feature dictionary for each tweet. A feature dictionary maps each word of the tweet that appears in the BoW (that is, excluding rare words and stop words) to the frequency with which it appears in that tweet. For the two tweets above:

{'-': 2, '--': 1, '.': 2, '5': 1, 'back': 2, 'list': 1, 'may': 1, 'school': 2, 'searched': 1, 'surprise': 1, 'topics': 1}
{'!': 1, '-': 1, '.': 2, '11pm': 1, 'definitely': 1, 'friday': 1, 'going': 1, 'hack': 1, 'inspiration': 1, 'iot': 1, 'planet': 1, 'workshop': 1, 'worth': 1}
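
Continuing the sketch above (again only an approximation of the code actually used), a per-tweet feature dictionary can be derived from the bag of words like this:

from collections import Counter

def feature_dict(tokenized_tweet, bow):
    # per-tweet frequencies, restricted to words kept in the bag of words
    counts = Counter(token
                     for sentence in tokenized_tweet
                     for token in sentence)
    return {word: freq for word, freq in counts.items() if word in bow}

A list comprehension over all training tweets would then produce the list of dicts (the feature_dicts_tra variable used in the code further below).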


So far, all the preprocessing work has been done. We have a training dataset (with its corresponding list of labels) in the form of a list of dicts, and a test dataset (with its corresponding list of labels) in the same form. The problem is that data in this form obviously cannot be fed to a classifier directly. Recall the iris dataset used in the earlier article on logistic regression, and it is easy to see how far our current data is from that format. This is where the feature extraction module provided by scikit-learn comes in.

The sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and image.

More specifically, the class we will rely on is DictVectorizer: it can be used to convert feature arrays represented as lists of standard Python dict objects to the NumPy/SciPy representation used by scikit-learn estimators.


If the description in the scikit-learn documentation seems confusing, the following example should make its role easy to understand. First, its definition prototype:

class sklearn.feature_extraction.DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sparse=True, sort=True)

Here sparse is a Boolean parameter indicating whether to return the result as a scipy.sparse matrix; it defaults to True.


Let's look at an example. measurements is a list of dicts; we convert it to a matrix representation in which, whenever a city name appears in a dict, the corresponding column of that row is set to 1 and otherwise to 0, while the numeric temperature values are kept as they are.

>>> from sklearn.feature_extraction import DictVectorizer
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Fransisco', 'temperature': 18.},
... ]
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']

One more example:

>>> measurements = [
...     {'city=Dubai': True, 'city=London': True, 'temperature': 33.},
...     {'city=London': True, 'city=San Fransisco': True, 'temperature': 12.},
...     {'city': 'San Fransisco', 'temperature': 18.},
... ]
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   1.,   0.,  33.],
       [  0.,   1.,   1.,  12.],
       [  0.,   0.,   1.,  18.]])

Another common problem is that the dictionaries of the training dataset and the test dataset differ in size. In that case we want the test data mapped onto the same feature space as the training data, with missing features padded with zeros. That is exactly what transform is for. Let's look at an example:

>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> v = DictVectorizer(sparse=False)
>>> X = v.fit_transform(D)
>>> X
array([[ 2.,  0.,  1.],
       [ 0.,  1.,  3.]])
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[ 0.,  0.,  4.]])
>>> v.transform({'foo': 4})
array([[ 0.,  0.,  4.]])

As you can see, when transform is used, the later data always ends up with the same dimensionality as the data the vectorizer was fitted on. This alignment may pad (zero-fill features that are missing) or cut (drop features that were never seen during fitting), so in practice it is how we keep the dimensions consistent. If, instead of transform, you call fit_transform again, you get the following result, which obviously does not meet our requirements:

>>> v.fit_transform({'foo': 4, 'unseen_feature': 3})
array([[ 4.,  3.]])

With this understanding, we can build the sparse matrices needed for the subsequent logistic regression as follows:

vec = DictVectorizer()
sparse_matrix_tra = vec.fit_transform(feature_dicts_tra)
sparse_matrix_dev = vec.transform(feature_dicts_dev)
Of course, you can also use the following code to check whether their dimensions are as expected:


print(sparse_matrix_dev.shape)
print(sparse_matrix_tra.shape)


We can then use logistic regression, as in the earlier article, to build the classification model:


from sklearn import linear_model
from sklearn.metrics import accuracy_score, classification_report

logreg = linear_model.LogisticRegression(C=1)
logreg.fit(sparse_matrix_tra, labels_t)
prediction = logreg.predict(sparse_matrix_dev)

print(logreg)
print("accuracy score: ")
print(accuracy_score(labels_d, prediction))
print(classification_report(labels_d, prediction))

Take a look at the model's prediction results on the test set:

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
accuracy score: 
0.512848551121
             precision    recall  f1-score   support

         -1       0.41      0.28      0.33       360
          0       0.46      0.69      0.55       700
          1       0.68      0.46      0.55       769

avg / total       0.54      0.51      0.51      1829

The accuracy of this sentiment classification model is 51.28%. Of course, as noted earlier, there is obviously plenty of room for improvement: you can raise the accuracy by introducing new features, by using other machine learning models, or by tuning the model parameters. But the purpose of this article is to demonstrate the basic steps and strategies of sentiment analysis in NLP, and to further illustrate broader uses of scikit-learn, such as dictionary-based feature extraction and working with sparse matrices. Interested readers can continue to optimize the model from here toward more accurate classification.
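
As one simple example of parameter tuning (not part of the original experiment), the regularization strength C could be selected by cross-validation; a minimal sketch, assuming a scikit-learn version that provides sklearn.model_selection and reusing the variables built above:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# illustrative grid; the best value of C depends on the data
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(solver='liblinear'),
                      param_grid, cv=5, scoring='accuracy')
search.fit(sparse_matrix_tra, labels_t)
print(search.best_params_, search.best_score_)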


