Text Classification: Feature Representation with the Vector Space Model (VSM) and Bag-of-Words
When we try to use statistical machine learning to solve text-related problems, the first issue to address is how a text sample is represented on a computer. A classic and widely used text representation method is the vector space model (VSM), commonly known as the bag-of-words model.
First, let's take a look at how the vector space model represents a text:
The vector space model requires a "dictionary": a set of feature words drawn from the text sample set. This dictionary can be generated from the sample set itself or imported from outside. Suppose, for example, that the dictionary is [baseball, specs, graphics, ..., space, quicktime, computer].
With this dictionary, a text can be represented as follows. First, define a vector with the same length as the dictionary; each position in the vector corresponds to the word at the same position in the dictionary (for example, the first word in the dictionary, baseball, corresponds to the first position in the vector). Then traverse the text and, for each word found, fill in "a value" at the corresponding position in the vector.
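As a minimal sketch of this procedure (the shortened dictionary and the sample sentence below are made up for illustration, and the value filled in is the raw count):

# Build a bag-of-words vector by hand; dictionary and text are illustrative.
dictionary = ['baseball', 'specs', 'graphics', 'space', 'quicktime', 'computer']
text = 'the graphics on this computer make space games look great'

vector = [0] * len(dictionary)                # one slot per dictionary word
for word in text.split():
    if word in dictionary:
        vector[dictionary.index(word)] += 1   # here "a value" is the raw count

print(vector)   # [0, 0, 1, 1, 0, 1]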
In fact, "a value" is the term weight of that feature. Four kinds of feature weight are in common use:
- Boolean (BOOL) weight: indicates whether a word appears in the document; 1 if it appears, 0 otherwise.
- Term frequency (TF) weight: the number of times a word appears in the text (the weight used in the example above). The more often a feature word appears in a text, the greater its contribution to the sample.
- Inverse document frequency (IDF): document frequency (DF) is the proportion of documents in the dataset in which a feature word appears. The lower a word's document frequency, the better that word distinguishes the documents that do contain it.
- TF-IDF: combines the properties of the two weights above.
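As a hand-rolled sketch of these four weights on a made-up three-document corpus (IDF is written here in the common log(N/df) form; real libraries typically use smoothed variants):

import math

docs = ['the students of the university',
        'the contestants of the competition',
        'the university competition']

def weights(word, doc_index):
    words = docs[doc_index].split()
    tf = words.count(word)                          # term frequency (TF)
    bool_w = 1 if tf > 0 else 0                     # BOOL weight
    df = sum(1 for d in docs if word in d.split())  # document frequency (DF)
    idf = math.log(float(len(docs)) / df)           # inverse document frequency
    return bool_w, tf, idf, tf * idf                # TF-IDF = TF * IDF

print(weights('the', 0))        # appears everywhere: IDF = log(3/3) = 0, so TF-IDF = 0
print(weights('students', 0))   # rare in the corpus: IDF = log(3/1) > 0, so TF-IDF > 0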
In documents about "education", words such as "university" and "student" appear frequently; in documents about "sports", words such as "competition" and "contestant" appear frequently. Under TF weighting, it is reasonable that these feature words receive a high weight (high term frequency). However, words such as "these", "is", and "of" also have high term frequency, yet they are clearly less important than "university", "student", "competition", and "contestant". Fortunately, such words usually have a low IDF, which compensates for this defect of TF. The TF-IDF weight is therefore widely used in traditional text classification and information retrieval.
Although the TF-IDF weight is very widely applied, not every text task performs better with TF-IDF. In sentiment classification, for example, BOOL weights often perform well (many papers on sentiment classification use BOOL weights).
Now let's return to the vector space model raised at the beginning of the article. Under the VSM representation, feature words are treated as mutually independent. This simple representation drove early research on text classification, but over time the traditional vector space model has come to limit progress in certain areas (such as sentiment classification): because it discards word order, syntax, and part of the semantic information, it becomes a performance bottleneck. Current remedies include:
- Use N-gram features (a small sketch follows this list)
- Incorporate syntactic and semantic information into the classification task
- Improve the model itself...
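As a small sketch of the first item (the two sample sentences are made up), CountVectorizer's ngram_range parameter adds word-pair features that retain some local word order:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) extracts unigrams plus bigrams.
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(['this movie is good', 'this movie is not good'])
print(vec.get_feature_names())   # includes 'not good', which unigrams alone would miss

(In recent sklearn versions the method is get_feature_names_out.)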
Finally, we introduce the text representation tools in sklearn and use them to implement a simple text classifier.
The dataset we use is the movie_reviews corpus (a sentiment classification task). The dataset stores one sample per text file, with files sharing the same label placed in the same folder. The dataset structure is as follows:
movie_reviews\
    pos\
        cv000_29590.txt, cv001_18431.txt, ..., cv999_13106.txt
    neg\
        cv000_29416.txt, cv001_19502.txt, ..., cv999_14636.txt
In sklearn, sklearn.datasets.load_files can be used to load a dataset with this structure. After the data is loaded, the VSM described earlier can be used to represent the text samples.
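A minimal loading sketch under this layout (the path string is illustrative; pass the actual dataset root):

from sklearn.datasets import load_files

movie_reviews = load_files('movie_reviews')   # folder names become class labels
print(movie_reviews.target_names)             # ['neg', 'pos']
print(len(movie_reviews.data))                # number of loaded samples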
sklearn provides a dedicated text feature extraction module, sklearn.feature_extraction.text, which converts text samples into bag-of-words vectors. CountVectorizer implements the vector space model with term frequency weights or BOOL weights (controlled by the binary parameter). TfidfVectorizer implements the vector space model with TF-IDF weights. sklearn exposes a large number of parameters for both (all with sensible defaults), making them flexible and practical.
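For instance, the two vectorizers could be compared on the same toy samples (made up here):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

samples = ['a good movie', 'a bad movie']

bool_vec = CountVectorizer(binary=True)       # binary=True gives BOOL weights
print(bool_vec.fit_transform(samples).toarray())

tfidf_vec = TfidfVectorizer()                 # smoothed, normalized TF-IDF weights
print(tfidf_vec.fit_transform(samples).toarray())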
We now perform sentiment classification on the movie_reviews corpus using the sklearn text representation and a multinomial naive Bayes classifier. The code is as follows:
#!/usr/bin/env python
# coding=gbk
import os
import sys
import numpy as np
from sklearn.datasets import load_files
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def text_classifly(dataset_dir_name):
    # Load the dataset and split it: 80% for training, 20% for testing
    movie_reviews = load_files(dataset_dir_name)
    doc_terms_train, doc_terms_test, doc_class_train, doc_class_test = train_test_split(movie_reviews.data, movie_reviews.target, test_size=0.2)

    # Vector space model with BOOL weights; note that the test samples call the transform interface
    count_vec = CountVectorizer(binary=True)
    doc_train_bool = count_vec.fit_transform(doc_terms_train)
    doc_test_bool = count_vec.transform(doc_terms_test)

    # Train the MultinomialNB classifier and predict on the test set
    clf = MultinomialNB().fit(doc_train_bool, doc_class_train)
    doc_class_predicted = clf.predict(doc_test_bool)
    print 'accuracy: ', np.mean(doc_class_predicted == doc_class_test)

if __name__ == '__main__':
    dataset_dir_name = sys.argv[1]
    text_classifly(dataset_dir_name)
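Assuming the script is saved as text_classifly.py, it could be run as python text_classifly.py movie_reviews, where the argument is the path to the dataset root shown above. Note that the code targets the sklearn API of its time: in current versions, train_test_split has moved from sklearn.cross_validation to sklearn.model_selection, and the print statement would need Python 3 parentheses.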