This series uses movie reviews for sentiment analysis and mainly covers the following content:
1. Prepare text data
2. Build feature vectors based on text documents
3. Train machine learning models to distinguish positive reviews from negative reviews (the same works on your goddess's messages)
4. Use out-of-core learning and online learning algorithms to process big data
In this article, we mainly introduce the preparation of the movie review data.
I. Sentiment Analysis
Sentiment analysis, also called opinion mining, is a very popular branch of machine learning in the field of natural language processing (NLP). It mainly analyzes the emotional inclination of documents.
II. Downloading the Data
Please prepare the movie review data (or directly use your chat history with your goddess).
The reviews come from IMDB: 50,000 in total, evenly split between positive and negative. A positive review means the reviewer gave the film more than 6 stars, and a negative review means fewer than 5 stars. The 50,000 reviews are spread across four folders, train/neg, train/pos, test/neg, and test/pos, each containing 12,500 txt review files, where pos stands for positive reviews and neg for negative ones. We therefore need to consolidate these 50,000 txt files into a single table with two columns: the first column holds the review text, and the second column indicates whether the review is positive (1) or negative (0).
III. Creating a Table File for the Movie Reviews
Consolidating 50,000 txt files into a single table file takes about 10 minutes. We can visualize the whole process with the Python pyprind library, which estimates the remaining processing time from the computer's current running state and, once processing completes, reports the total elapsed time. We then save the movie reviews as a CSV file using pandas, Python's data analysis library.
1. Estimating the total processing time
2. Reporting the total elapsed time after processing
3. The Python implementation code (a sketch is given below)
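Here is a minimal sketch of that consolidation script, assuming the IMDB archive has been unpacked into a local aclImdb/ folder (the path and column names are placeholders to adapt):

import os
import pandas as pd
import pyprind

basepath = 'aclImdb'  # assumed location of the unpacked IMDB dataset
labels = {'pos': 1, 'neg': 0}

pbar = pyprind.ProgBar(50000)  # progress bar over all 50,000 files
rows = []
for split in ('train', 'test'):
    for sentiment in ('pos', 'neg'):
        path = os.path.join(basepath, split, sentiment)
        for fname in sorted(os.listdir(path)):
            with open(os.path.join(path, fname), encoding='utf-8') as f:
                rows.append([f.read(), labels[sentiment]])
            pbar.update()  # pyprind updates the estimated remaining time

data = pd.DataFrame(rows, columns=['review', 'sentiment'])
data.to_csv('movie_data.csv', index=False, encoding='utf-8')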
Before converting text into feature vectors, we also need some preparatory work, including the following:
1. Cleaning the text data
2. Tokenizing documents
3. The bag-of-words model
I. Cleaning the Text Data
Cleaning the text means removing the unnecessary characters it contains.
1. Deleting unnecessary characters
print(data['review'][0][-50:])
# output: 'is seven.<br /><br />Title (Brazil): Not Available'
The last 50 characters show that reviews contain HTML tags, punctuation, and other non-letter characters. HTML tags are of no use for our sentiment analysis. Punctuation can affect sentence semantics, but to simplify processing we delete it, keeping only emoticons (such as ":)"), since emoticons are helpful for analyzing the sentiment of movie reviews. Below, we remove these unnecessary characters with Python regular expressions.
Python's regular expressions provide a convenient and efficient way to search a string for specific patterns. Regular expressions have many tricks and techniques of their own; if you are interested, you can explore them yourself. Here we only use them in a simple way, so we will not describe them in detail.
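A minimal sketch of such a preprocessor (the regular expressions follow the cleaning rules just described; details such as the emoticon pattern are one possible choice):

import re

def preprocessor(text):
    # remove HTML tags such as <br />
    text = re.sub(r'<[^>]*>', '', text)
    # temporarily collect emoticons such as :), :(, :-), ;)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # remove non-word characters, lowercase, and re-append the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower())
    return text + ' ' + ' '.join(emoticons).replace('-', '')

print(preprocessor('</a>This :) is :( a test :-)!'))
# expected output: 'this is a test :) :( :)'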
2. Tokenizing documents
For English documents we can use whitespace as the natural word delimiter; for Chinese, a tokenizer such as jieba can be used. Within a sentence we may encounter different forms of the same word, such as "runners", "run", and "running", so we use word stemming to reduce each form to its root. The original stemming algorithm was proposed by Martin F. Porter in 1979 and is known as the Porter stemming algorithm. We can install the Python Natural Language Toolkit, NLTK (installation instructions: http://www.nltk.org/install.html), which implements the Porter stemming algorithm as well as the newer Snowball stemmer and Lancaster stemmer, which are faster than the Porter stemmer. NLTK can be installed with pip:

pip install nltk
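As a small example, here is a tokenizer that splits on whitespace and applies NLTK's Porter stemmer (the function name tokenizer_porter is my own):

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    # split on whitespace, then reduce each word to its stem
    return [porter.stem(word) for word in text.split()]

print(tokenizer_porter('runners like running and thus they run'))
# expected output: ['runner', 'like', 'run', 'and', 'thu', 'they', 'run']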
3. Removing stop words
Stop words fall broadly into two categories. One is function words, which occur very frequently and, compared with other words, carry no real meaning, such as "the", "is", "at", and "which". The other category is lexical words, such as "want". Stop words carry no information for the sentiment classification of movie reviews, so we delete them. We use the nltk.download function to fetch the stop words provided by NLTK and then use them to remove the stop words from the movie reviews. The NLTK library provides a total of 179 stop words; a few of them appear in the sketch below.
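A short sketch of fetching the stop words and removing them from a stemmed review (the count of 179 depends on the NLTK version; tokenizer_porter is the tokenizer sketched earlier):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # fetch NLTK's stop-word lists once
stop = stopwords.words('english')
print(len(stop))  # 179 in the NLTK version assumed here

# filter stop words out of a stemmed, tokenized review
tokens = tokenizer_porter('a runner likes running and runs a lot')
print([w for w in tokens if w not in stop])
# expected output: ['runner', 'like', 'run', 'run', 'lot']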
What else do we need to do?
1. Convert words into feature vectors
2. Use TF-IDF to compute word relevance
Earlier, we covered text preprocessing and tokenization. In this article, we mainly describe how to turn categorical data such as words into a numeric format so that we can later use machine learning to train a model.
I. Converting Words into Feature Vectors
The bag-of-words model represents text as a vector of numeric features. Implementing a bag-of-words model takes two steps:
1. Create a unique token for each word across the entire document set, which contains many documents.
2. Build a feature vector for each document, consisting mainly of the number of times each word occurs in that document.
Note: since each document uses only a small fraction of the words in the entire document set, most words do not appear in a given document and are marked 0. Most elements of the feature vectors are therefore 0, which produces a sparse matrix.
The following uses sklearn's CountVectorizer to implement a bag-of-words model and convert documents into feature vectors.
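A sketch of that code (the three example sentences match the ones analyzed below):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining and the weather is sweet'])
bag = count.fit_transform(docs)

print(count.vocabulary_)
# {'and': 0, 'is': 1, 'shining': 2, 'sun': 3, 'sweet': 4, 'the': 5, 'weather': 6}
print(bag.toarray())
# [[0 1 1 1 0 1 0]
#  [0 1 0 0 1 1 1]
#  [1 2 1 1 1 2 1]]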
Through count.vocabulary_ we can see the index position of each word; each sentence is represented by a 7-dimensional feature vector. The first column, index 0, corresponds to the word "and": "and" does not appear in the first or second sentence, so those entries are 0; it appears once in the third sentence, so that entry is 1. The values in the feature vectors are also called the raw term frequency, written tf(t, d), which denotes the number of times the term t occurs in document d.
Note: in the bag-of-words model above, we used individual words to build the word vectors; this is called the 1-gram (unigram) model. Besides unigrams, we can also build n-grams. The choice of n in an n-gram model depends on the specific scenario; for example, in anti-spam filtering, n-grams with n of 3 or 4 achieve better results. As an example, take the phrase "the weather is sweet": the 1-grams are "the", "weather", "is", "sweet"; the 2-grams are "the weather", "weather is", "is sweet". In sklearn, you can set the ngram_range parameter of CountVectorizer to build different n-gram models; the default is ngram_range=(1, 1). Building 2-grams with CountVectorizer is sketched below.
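A minimal sketch of that 2-gram construction (ngram_range=(2, 2) is the only non-default setting):

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) builds features from pairs of adjacent words only
count_2gram = CountVectorizer(ngram_range=(2, 2))
count_2gram.fit(['The weather is sweet'])
print(count_2gram.vocabulary_)
# {'the weather': 1, 'weather is': 2, 'is sweet': 0}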
II. Using TF-IDF to Compute Word Relevance
With the word vectors constructed above you may run into a problem: some words occur in documents of every type, and words like that carry no power to distinguish between document types. We use the TF-IDF algorithm to construct word vectors that overcome this problem.
Term frequency-inverse document frequency (tf-idf): tf-idf can be defined as the term frequency multiplied by the inverse document frequency:

tf-idf(t, d) = tf(t, d) × idf(t, d)

where tf(t, d) denotes the number of times the term t occurs in document d, and idf(t, d) is the inverse document frequency, calculated as follows:

idf(t, d) = log( n_d / (1 + df(d, t)) )

where n_d is the total number of documents and df(d, t) is the number of documents that contain the term t. The constant 1 is added to the denominator to prevent df(d, t) = 0 from making the denominator zero, and the log ensures that a very small df(d, t) does not make idf(t, d) excessively large.
We can compute tf-idf with sklearn's TfidfTransformer combined with CountVectorizer.
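A sketch of that calculation, reusing the docs array from the bag-of-words example above (TfidfTransformer's defaults, smooth_idf=True and norm='l2', are assumed):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining and the weather is sweet'])
count = CountVectorizer()
tfidf = TfidfTransformer()

np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
# first row (the first sentence): [0.   0.43 0.56 0.56 0.   0.43 0.  ]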
Notice that "is" (the second column) and "the" (the sixth column) appear in all three sentences; they provide little information for classifying the documents, so their tf-idf values are relatively small.
Note: the tf-idf computed by sklearn's TfidfTransformer differs from the formula defined above. sklearn calculates tf-idf as:

idf(t, d) = log( (1 + n_d) / (1 + df(d, t)) )

tf-idf(t, d) = tf(t, d) × (idf(t, d) + 1)

Usually the raw term frequency tf(t, d) is normalized before tf-idf is computed, but TfidfTransformer normalizes the computed tf-idf values directly. By default TfidfTransformer uses L2 normalization, dividing the un-normalized feature vector by its L2 norm so that the returned vector has length 1:

v_norm = v / ||v||_2 = v / (v_1^2 + v_2^2 + ... + v_n^2)^(1/2)
The following walks through TfidfTransformer's tf-idf calculation step by step, using the first sentence above, "The sun is shining", as the example.
1. Calculate the raw term frequency
a. The index position corresponding to each word
b. The raw term frequency tf(t, d) of the first sentence
c. Calculate the inverse document frequency idf(t, d)
Note: the tf-idf of all the other words is 0 because their raw term frequency is 0, so there is no need to compute their idf. The log here uses the natural number e as its base.
d. Calculate tf-idf
So the tf-idf feature vector of the first sentence is [0, 1, 1.29, 1.29, 0, 1, 0].
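Putting steps a through d together as a sketch in code (the vocabulary indices and document frequencies are taken from the CountVectorizer example above):

import numpy as np

n_d = 3  # total number of documents in the example

# vocabulary order: and, is, shining, sun, sweet, the, weather
tf = np.array([0., 1., 1., 1., 0., 1., 0.])  # raw term frequencies of 'The sun is shining'
df = np.array([1., 3., 2., 2., 2., 3., 2.])  # document frequencies over the three sentences

idf = np.log((1 + n_d) / (1 + df))  # natural log, as noted above
tfidf_raw = tf * (idf + 1)
print(tfidf_raw.round(2))
# [0.   1.   1.29 1.29 0.   1.   0.  ]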
e. L2 normalization of the tf-idf vector
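Continuing the sketch, the L2 normalization of that vector:

import numpy as np

tfidf_raw = np.array([0., 1., 1.29, 1.29, 0., 1., 0.])

# divide by the Euclidean (L2) norm so the vector has length 1
tfidf_l2 = tfidf_raw / np.sqrt(np.sum(tfidf_raw ** 2))
print(tfidf_l2.round(2))
# [0.   0.43 0.56 0.56 0.   0.43 0.  ]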
In the next article, we will describe how to use these sentence feature vectors to build a model for sentiment classification. Although the walkthrough uses movie reviews, once trained the model can just as well be applied to your chat history~
Want to gauge your goddess's mood as you chat? May you hold hands successfully!