Text Classification Feature Representation: VSM and Bag-of-Words (BOW)



When we try to use statistical machine learning to solve text-related problems, the first problem to solve is how to represent a text sample on a computer. A classic and widely used text representation method is the vector space model (VSM), commonly known as the "bag-of-words" model.

First, let's take a look at how the vector space model represents a text:

The vector space model requires a "dictionary": the set of feature words drawn from the text sample set. This dictionary can be generated from the sample set itself or imported from an external source. For example, suppose the dictionary is [baseball, specs, graphics, ..., space, quicktime, computer].

A text can then be expressed using this dictionary. First, define a vector with the same length as the dictionary; each position in the vector corresponds to the word at the same position in the dictionary. For example, the first word in the dictionary, baseball, corresponds to the first position in the vector. Then traverse the text and, whenever a dictionary word is found, fill in "a value" at the corresponding position in the vector.
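A minimal sketch of this procedure in plain Python (the dictionary and the text below are invented for illustration):

# A toy dictionary and a toy text, both invented for illustration
dictionary = ['baseball', 'specs', 'graphics', 'space', 'quicktime', 'computer']
text = 'the baseball game was shown on my computer in full graphics'

# One vector position per dictionary word, initialized to zero
vector = [0] * len(dictionary)

# Traverse the text; when a dictionary word is found, fill in "a value"
# (here: the Bool weight, i.e. 1 for presence)
for word in text.split():
    if word in dictionary:
        vector[dictionary.index(word)] = 1

print(vector)  # [1, 0, 1, 0, 0, 1]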

In fact, "a value" is the current Term Weight. Currently, there are four types of feature Weight:

  • Bool (presence)

Indicates whether a word appears in a document: recorded as 1 if it appears, 0 if not.

  • Term frequency (TF)

Indicates the number of times a word appears in the text. The more often a feature word appears in a text, the greater its contribution to that sample.

  • Inverse document frequency (IDF)

Document frequency (DF) is the frequency with which a feature word appears across the documents of the dataset; IDF is its inverse. The lower a word's document frequency, the better that word distinguishes the documents that contain it.

  • TF-IDF

TF-IDF integrates the properties of the above two feature weights.
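A minimal worked sketch of how these weights relate, computed by hand on an invented three-document corpus (the IDF formula used here, log(N/df), is one common variant; real libraries typically add smoothing):

import math

# Invented toy corpus: three documents, already tokenized
docs = [['university', 'students', 'of', 'education'],
        ['students', 'of', 'students'],
        ['competitions', 'of', 'contestants']]

def bool_weight(word, doc):
    return 1 if word in doc else 0

def tf(word, doc):
    return doc.count(word)

def idf(word, docs):
    # log(N / df): words that are rarer across the corpus get higher IDF
    df = sum(1 for doc in docs if word in doc)
    return math.log(float(len(docs)) / df)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# 'students' is frequent in doc 1 but appears in only 2 of 3 documents;
# 'of' appears in every document, so its IDF (and hence TF-IDF) is zero
print(tf('students', docs[1]))            # 2
print(idf('students', docs))              # log(3/2) ~= 0.405
print(tf_idf('students', docs[1], docs))  # ~0.811
print(tf_idf('of', docs[1], docs))        # 0.0 -- IDF wipes out stop words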

In documents about "education", words such as "universities" and "students" appear frequently; in documents about "sports", words such as "competitions" and "contestants" appear frequently. Under TF weighting, it is reasonable for these feature words to receive high weights. However, function words such as "these", "yes", and "of" also have high term frequencies, even though they are clearly less important than "universities", "students", "competitions", or "contestants". Such words tend to have a low IDF, which compensates for this defect of TF. This is why the TF-IDF weight is widely used in traditional text classification and information retrieval.

Although the TF-IDF weight has a very wide range of applications, not every text task performs best with TF-IDF. In sentiment classification, for example, Bool weights often perform well (many papers on sentiment classification use Bool weights).

Now, let's go back to the vector space model raised at the beginning of the article. Under this representation, each feature word is treated as independent of the others. This simple representation drove early research on text classification forward, but over time the traditional vector space model has become a performance bottleneck in certain fields (such as sentiment classification), because it discards word order, syntax, and part of the semantic information. Current remedies include:

  • Use N-gram features (see the sketch after this list)
  • Take syntax and semantic information into account in the classification task
  • Model improvement...
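For example, the first remedy keeps the bag-of-words machinery but treats short word sequences as features. A minimal sketch with sklearn's CountVectorizer and its ngram_range parameter (the texts are invented):

from sklearn.feature_extraction.text import CountVectorizer

# Invented toy texts that unigrams alone cannot tell apart on 'good'
texts = ['not good movie', 'good movie']

# ngram_range=(1, 2) extracts both unigrams and bigrams, so features
# like 'not good' partially preserve word order
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(texts)
print(sorted(vec.vocabulary_))
# ['good', 'good movie', 'movie', 'not', 'not good']

The bigram 'not good' now distinguishes the two texts, which is exactly the kind of information plain bag-of-words discards in sentiment classification.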

Finally, we will introduce the text representation tools in sklearn and use them to implement a simple text classification.

The dataset we use is the movie_reviews corpus (a sentiment classification task). The dataset is organized as one text file per document, with files sharing the same label stored in the same folder. The structure is as follows:

movie_reviews\
    pos\
        cv000_29590.txt, cv001_18431.txt, ..., cv999_13106.txt
    neg\
        cv000_29416.txt, cv001_19502.txt, ..., cv999_14636.txt

In sklearn, sklearn.datasets.load_files can be used to load a dataset with this structure. After the data is loaded, the VSM described earlier can be used to represent the text samples.
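A minimal sketch of loading such a folder layout (the path 'movie_reviews' is an assumption; point it at wherever the corpus actually lives):

from sklearn.datasets import load_files

# Each subfolder name (pos, neg) becomes a class label
movie_reviews = load_files('movie_reviews')  # assumed path

print(movie_reviews.target_names)  # ['neg', 'pos']
print(len(movie_reviews.data))     # number of text files loaded
print(movie_reviews.target[:5])    # integer class label per document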

Sklearn provides a dedicated text feature extraction module, sklearn.feature_extraction.text, which converts a text sample into a bag-of-words vector. CountVectorizer implements the vector space model under term frequency weights or Bool weights (switched by the binary parameter). TfidfVectorizer implements the vector space model under TF-IDF weights. Both expose a large number of parameters (all with default values), which makes them highly flexible and practical.
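A minimal sketch of the three weight variants on invented toy texts:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ['good good movie', 'bad movie']

# Bool weights: binary=True clips every count to 0/1
bool_vec = CountVectorizer(binary=True)
print(bool_vec.fit_transform(texts).toarray())
# [[0 1 1]
#  [1 0 1]]   columns: bad, good, movie

# TF weights: raw counts
tf_vec = CountVectorizer()
print(tf_vec.fit_transform(texts).toarray())
# [[0 2 1]
#  [1 0 1]]

# TF-IDF weights (sklearn applies smoothing and L2 normalization by default)
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(texts).toarray())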

The code below performs sentiment classification on the movie_reviews corpus using sklearn's text representation and the Multinomial Naive Bayes classifier:

#!/usr/bin/env python
# coding=gbk
import os
import sys
import numpy as np
from sklearn.datasets import load_files
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def text_classifly(dataset_dir_name):
    # Load the dataset and split it: 80% for training, 20% for testing
    movie_reviews = load_files(dataset_dir_name)
    doc_terms_train, doc_terms_test, doc_class_train, doc_class_test = train_test_split(
        movie_reviews.data, movie_reviews.target, test_size=0.2)

    # Vector space model under Bool weights. Note that the test samples
    # must call the transform interface (not fit_transform)
    count_vec = CountVectorizer(binary=True)
    doc_train_bool = count_vec.fit_transform(doc_terms_train)
    doc_test_bool = count_vec.transform(doc_terms_test)

    # Train the MultinomialNB classifier and predict on the test set
    clf = MultinomialNB().fit(doc_train_bool, doc_class_train)
    doc_class_predicted = clf.predict(doc_test_bool)
    print 'Accuracy: ', np.mean(doc_class_predicted == doc_class_test)

if __name__ == '__main__':
    dataset_dir_name = sys.argv[1]
    text_classifly(dataset_dir_name)
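To run it, pass the corpus folder as the first command-line argument (the script filename below is hypothetical):

python text_classifly.py movie_reviews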
