I have recently been reading through a text categorization program, so I am writing down its concrete implementation process. Sometimes when I read an algorithm it seems very clear, but I still do not know how to implement it. Working through an actual implementation this time may help me understand it better.
First come the training data set and the test data set. Each line holds one document, made up mainly of fields such as <class>1</class><title>asdfgh</title><content>asdfghjkl</content>. The program reads the file line by line, one document per line.
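As a rough illustration of reading one such line, here is a minimal C++ sketch. The names Document, extractField and parseDocument are hypothetical and not taken from the program; the sketch also assumes the tags are never nested and that each field appears at most once per line.

```cpp
#include <cstddef>
#include <string>

// Hypothetical container for the fields of one document (one input line).
struct Document {
    std::string label;    // text inside <class>...</class>
    std::string title;    // text inside <title>...</title>
    std::string content;  // text inside <content>...</content>
};

// Returns the text between <tag> and the matching closing tag, or an empty
// string if either delimiter is missing. The closing tag is searched for as
// "</tag" (without '>') so that stray whitespace before '>' is tolerated.
std::string extractField(const std::string& line, const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag;
    const std::size_t begin = line.find(open);
    if (begin == std::string::npos) return "";
    const std::size_t start = begin + open.size();
    const std::size_t end = line.find(close, start);
    if (end == std::string::npos) return "";
    return line.substr(start, end - start);
}

// Splits one input line into its three fields.
Document parseDocument(const std::string& line) {
    Document doc;
    doc.label = extractField(line, "class");
    doc.title = extractField(line, "title");
    doc.content = extractField(line, "content");
    return doc;
}
```

Called inside a std::getline loop over the data file, parseDocument would yield one Document per line; for the sample line above the label is 1, the title is asdfgh and the content is asdfghjkl.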
Feature selection is divided into the following steps:
1. Set the dimension of the feature vector: 3000 is generally a good choice;
2. Process the training data;
3. Load the stop words;
4. Apply the different feature selection algorithms;
5. Decide whether to check certain words;
6. Output the selected feature items;
Most of the work is in the second step, the preprocessing of the training data:
1. Read the file line by line; each line is one document, doc. Note that the following steps process a single document and are then repeated for every document until the end of the file;
2. Process each document and extract each field: the label inside the class tag, the title, the content, and so on;
3. For a Chinese data set, call the word segmentation module; for an English data set, process the text directly as a character stream. The segmentation result is kept in a vector;
4. The next step, generating the global dictionary, is the crucial one:
(1) Save the segmentation result in a map variable, map<string, uint> wordsmap; the key is the word and the value is its number of occurrences. At the same time, define a variable wordsnum to record the total number of words appearing in the document;
(2) Save the total number of documents and the total number of words in each category. Define a structure that contains three variables: the class label, the total number of documents, and the total number of words. These structures are kept in a global vector;
(3) This is the important step: from the information processed so far, generate the global dictionary. The key of this map is the term, and the value is a vector of structs, each of which holds the class name, the number of documents, and the number of times the word appears. Because a term may appear in multiple classes, its entries are stored in a vector. The length of the vector is the number of classes the term appears in.
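Steps 4 (1) to 4 (3) can be sketched in C++ as follows. Only wordsmap and wordsnum come from the description above; ClassStat, TermStat, Corpus, findOrAdd and addDocument are hypothetical names. The sketch assumes that uint means unsigned int, that the tokens have already been segmented and stop-word filtered, and that the per-term "number of documents" means the number of documents of that class containing the term.

```cpp
#include <map>
#include <string>
#include <vector>

using Count = unsigned int;  // stands in for the uint used in the description

// Step 4 (2): per-class totals, kept in a global vector.
struct ClassStat {
    std::string label;    // class label
    Count docCount = 0;   // total number of documents in this class
    Count wordCount = 0;  // total number of words in this class
};

// Step 4 (3): per-term statistics for one class.
struct TermStat {
    std::string label;    // class name
    Count docCount = 0;   // documents of this class containing the term (assumed meaning)
    Count termCount = 0;  // occurrences of the term in this class
};

struct Corpus {
    std::vector<ClassStat> classStats;
    // Global dictionary: term -> one TermStat per class the term appears in.
    std::map<std::string, std::vector<TermStat>> dictionary;
};

// Returns the entry with the given label, appending a new one if needed.
template <typename Stat>
Stat& findOrAdd(std::vector<Stat>& stats, const std::string& label) {
    for (Stat& s : stats) {
        if (s.label == label) return s;
    }
    stats.push_back(Stat());
    stats.back().label = label;
    return stats.back();
}

// Folds one segmented document into the corpus statistics.
void addDocument(Corpus& corpus, const std::string& label,
                 const std::vector<std::string>& tokens) {
    // Step 4 (1): per-document word counts and total word count.
    std::map<std::string, Count> wordsmap;
    Count wordsnum = 0;
    for (const std::string& token : tokens) {
        ++wordsmap[token];
        ++wordsnum;
    }

    // Step 4 (2): update the totals of the document's class.
    ClassStat& classStat = findOrAdd(corpus.classStats, label);
    classStat.docCount += 1;
    classStat.wordCount += wordsnum;

    // Step 4 (3): merge the document's counts into the global dictionary.
    for (const auto& entry : wordsmap) {
        TermStat& termStat = findOrAdd(corpus.dictionary[entry.first], label);
        termStat.docCount += 1;
        termStat.termCount += entry.second;
    }
}
```

After addDocument has been called once per line of the training file, the length of corpus.dictionary[term] is the number of classes in which the term occurs. The linear search in findOrAdd is a simplification that is reasonable when the number of classes is small.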
The global dictionary generated in 4 (3) is very important; all of the algorithms that follow use this information. To be continued......