Overview
Automatic text categorization, or simply text categorization, is the process by which a computer assigns a document to one or more given classes or categories.
Text categorization assigns a category to each document in a document collection according to a predefined set of topic categories. Text categorization is an important part of text mining.
Text classification, in other words, means assigning one or more predefined category labels to a given text accurately and efficiently. It is an important component of many data management tasks.
Text categorization classifies documents according to pre-specified criteria, so that users can not only browse the documents easily but also use the categories to query for the documents they need.
Text classification is the process of automatically determining a text's category from its content under a given classification system. In statistical semantic methods, the semantic element is the atomic unit of text, the smallest unit into which the text is currently segmented; in text categorization, the semantic element is the word.
Text categorization refers to the process of automatically determining text categories from text content under a given classification system. Before the 1990s, the dominant text classification methodology was based on knowledge engineering: documents were categorized manually by professionals. Manual sorting is very time-consuming and inefficient. Since the 1990s, many statistical and machine learning methods have been applied to automatic text classification, and research on text categorization has aroused great interest among researchers. Chinese text classification is now actively researched in China, and it has been applied in many fields such as information retrieval, automatic classification of Web documents, digital libraries, automatic summarization, newsgroup classification, text filtering, word sense discrimination, and the organization and management of documents.
History
The study of text classification can be traced back to the 1960s. Early text classification was mainly based on knowledge engineering: texts were classified by manually defined rules, a method that is time-consuming and requires enough knowledge of a given field to write the right rules. In the 1990s, with the emergence of online text and machine learning, the categorization and retrieval of large-scale text collections (including Web pages) aroused the interest of researchers. A machine-learned text classification system first trains on pre-categorized text sets, establishing a discrimination rule or classifier, and then automatically classifies new samples of unknown category. A large number of results show that its classification accuracy is comparable to that of expert manual classification, and since its learning requires no expert intervention, it can be applied to any domain, making it the current mainstream approach to text classification.
In 1971, Rocchio proposed building a simple linear classifier by correcting class weight vectors using user feedback on queries. Other researchers later gave alternative ways of modifying the weights. In 1979, van Rijsbergen produced a systematic summary of research in the field of information retrieval, including concepts such as the vector space model and evaluation criteria such as precision and recall, which were later carried over into text classification. He placed particular emphasis on the probabilistic model of information retrieval, on which much later text classification research was based.
In 1992, Lewis systematically described the implementation details of a text classification system in his doctoral dissertation, "Representation and Learning in Information Retrieval", and evaluated it on the Reuters-22173 dataset (later revised, with some duplicate texts removed, into the Reuters-21578 dataset). This dissertation is a classic in the field of text categorization. Later researchers did a great deal of work on feature dimensionality reduction and classifier design. Yiming Yang analyzed and experimentally compared various feature selection methods, including information gain, mutual information, and the chi-square statistic. In 1997 she also surveyed nearly all text classification methods reported in the literature, comparing the performance of each classifier on the public datasets Reuters-21578 and OHSUMED, which played an important role in later research.
In 1995, Vapnik proposed the support vector machine (SVM) method based on statistical learning theory; its basic idea is to find the optimal separating hyperplane in a high-dimensional space. Because it rests on a mature theory of small-sample statistics, it received wide attention in the machine learning field. Thorsten Joachims was the first to apply support vector machines with linear kernel functions to text classification; the SVM greatly improved classification performance over traditional algorithms and showed robustness across different datasets. To this day, the theory and application of support vector machines remain a hot research topic.
Around the same time support vector machines appeared, from 1995 onward, the publication of Yoav Freund and Robert E. Schapire's paper on AdaBoost marked another peak in research on machine learning algorithms. Schapire established the soundness of the AdaBoost framework in both theory and experiment. Researchers then proposed many similar boosting algorithms within this framework, with Real AdaBoost, Gentle Boost, and LogitBoost among the representatives. These boosting algorithms have been applied to text classification research and have achieved results as good as those of support vector machines.
In short, although machine learning theory has played an important role in text classification research, and the field has at times been at a low ebb, the practical applications of text classification and its inherent characteristics continue to pose new challenges to machine learning, so that text classification remains an open and important research direction in the field of information processing.
Chinese text classification
Compared with English text classification, one important difference in Chinese text classification lies in the preprocessing stage: Chinese text must first be segmented into words, since unlike English text there are no spaces to separate them. From simple dictionary-lookup methods to later word segmentation based on statistical language models, Chinese word segmentation technology has matured. An influential system is the Chinese Lexical Analysis System (ICTCLAS), developed by the Chinese Academy of Sciences and now publicly available, which can be used for Chinese text classification.
For a long time, research on Chinese text categorization had no public datasets, making classification algorithms difficult to compare. Commonly used Chinese test sets now include the People's Daily corpus built by Peking University and the modern Chinese corpus built by Tsinghua University.
In fact, once Chinese text has been preprocessed into a sample-vector data matrix, the subsequent text classification process is the same as for English text; that is, it is independent of the language. Current research on Chinese text classification therefore mainly focuses on how to use characteristics of Chinese itself to better represent text samples.
Key technologies and methods
Word segmentation technology
For Chinese text, because there are no explicit separators between words, the text must first be segmented. Although there are now many word segmentation methods, they fall into two broad classes. One is the mechanical word segmentation method, generally based on a segmentation dictionary: the character strings in the document are matched word by word against the vocabulary to complete segmentation. The other is the understanding-based word segmentation method, which applies Chinese grammatical, semantic, and even psychological knowledge to segmentation, and requires a segmentation database, a knowledge base, and an inference base. The latter is the ideal method, but until syntactic analysis, semantic analysis, and even text understanding are solved, segmentation systems mainly use the mechanical word segmentation method, or something between the two.
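The mechanical (dictionary-based) approach above can be sketched with forward maximum matching: at each position, greedily take the longest string that appears in the dictionary, falling back to a single character. This is a minimal illustration only; the tiny dictionary is a hypothetical stand-in for a real segmentation lexicon such as the one ICTCLAS uses.

```python
# Minimal sketch of mechanical (dictionary-based) word segmentation
# via forward maximum matching. The dictionary here is hypothetical.

def forward_max_match(text, dictionary, max_len=4):
    """Greedily match the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character
        for j in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + j]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens

dictionary = {"文本", "分类", "自动", "文本分类"}
print(forward_max_match("自动文本分类", dictionary))  # ['自动', '文本分类']
```

Note that greedy longest-match can segment incorrectly when a shorter split would be right, which is one reason statistical language-model segmenters eventually displaced purely mechanical ones.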
Text representation
Computers lack human intelligence and cannot read text directly, so the text must be transformed into a form the computer can process; that is, a text representation must be constructed. The main text representation model currently in use is the vector space model (VSM), developed by Gerard Salton and colleagues and presented systematically by Salton and McGill. The basic idea of the vector space model is to reduce a document to a vector of feature-term weights: (w1, w2, ..., wn), where wi is the weight of the i-th feature term. Terms are generally chosen as the feature items, and the weight is expressed by word frequency. Word frequency can be absolute or relative. Absolute word frequency uses the raw frequency with which a word appears in the text; relative word frequency is a normalized frequency, most commonly computed with the TF-IDF formula.
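The TF-IDF weighting mentioned above can be sketched as follows: each document becomes a sparse vector (here a dict) whose weights combine the term's relative frequency in the document with its inverse document frequency across the collection. This is a minimal sketch using the plain tf * log(N/df) form; real systems vary the smoothing and normalization.

```python
# Minimal sketch of vector-space representation with TF-IDF weights.
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Represent each tokenized document as {term: tf-idf weight}."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # relative term frequency times inverse document frequency
        vectors.append({t: (tf[t] / total) * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "barks"]]
vecs = tf_idf_vectors(docs)
# "cat" appears in 2 of 3 documents, so its idf is log(3/2)
```

A term that occurs in every document gets weight zero under this scheme, which matches the intuition that such a term cannot discriminate between categories.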
In addition to the vector space model, there are probabilistic models. A probabilistic model also considers the correlations between words and between documents, dividing the documents in the collection into relevant and irrelevant ones. Based on probability theory, the probability of each word appearing in relevant versus irrelevant documents is estimated and expressed as a probability value; the probability that a document is relevant is then computed, and the system makes its decision according to that probability.
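In the classification setting, the most common instance of this probabilistic view is the multinomial naive Bayes model: each class assigns probabilities to words, and a document is labelled with the class that makes it most probable. The following is a minimal sketch, not any particular system's implementation; the training data and Laplace smoothing choice are illustrative assumptions.

```python
# Minimal multinomial naive Bayes sketch illustrating the probabilistic
# view of text classification. Training data below is hypothetical.
import math
from collections import Counter, defaultdict

def train_nb(labelled_docs):
    """labelled_docs: list of (label, token_list) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, doc in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(doc)
        vocab.update(doc)
    return class_counts, word_counts, vocab

def classify_nb(doc, class_counts, word_counts, vocab):
    """Return the class with the highest log-probability for doc."""
    n_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / n_docs)  # class prior
        total = sum(word_counts[label].values())
        for w in doc:
            # Laplace smoothing so unseen words keep nonzero probability
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [("sports", ["ball", "goal", "team"]), ("sports", ["goal", "match"]),
         ("tech", ["code", "compile"]), ("tech", ["code", "bug"])]
model = train_nb(train)
print(classify_nb(["goal", "team"], *model))  # sports
```

Working in log space avoids numerical underflow when documents contain many words.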
Feature selection and feature extraction
Because text data is semi-structured or even unstructured, when a feature vector is used to represent a document, the vector usually reaches tens of thousands or even hundreds of thousands of dimensions. Finding effective methods to reduce the dimensionality of the feature space is therefore very important for improving the efficiency and precision of classification and for making automatic text classification practical. Dimensionality reduction techniques fall into two categories: feature selection and feature extraction.
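As a concrete example of feature selection, the chi-square statistic (one of the criteria Yiming Yang compared) scores how strongly a term's presence is associated with a class; terms are then ranked and only the top-scoring ones kept. Below is a minimal sketch of the standard 2x2 chi-square computation; the variable names are illustrative.

```python
# Minimal sketch of chi-square feature scoring for feature selection.
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/class 2x2 contingency table.
    n11: in-class docs containing the term
    n10: in-class docs without the term
    n01: out-of-class docs containing the term
    n00: out-of-class docs without the term
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

# A term that appears in all 5 in-class docs and no out-of-class docs
# is maximally informative; a term spread evenly scores zero.
print(chi_square(5, 0, 0, 5))  # 10.0
print(chi_square(2, 2, 2, 2))  # 0.0
```

In practice each term is scored per class, the per-class scores are combined (e.g. by maximum or weighted average), and the feature space is truncated to the top few thousand terms.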
Text classification algorithm
The core problem of automatic text classification is how to construct the classification function (classifier), which must be learned by some algorithm. Classification is an important data mining method, and text classification has almost as many methods as classification in general. Among the many text classification algorithms, the Rocchio algorithm, the naive Bayes classification algorithm, the k-nearest-neighbor algorithm, decision tree algorithms, neural network algorithms, and support vector machine algorithms are the most prominent.
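To make one of the listed algorithms concrete, here is a minimal sketch of k-nearest-neighbor classification over the sparse term-weight vectors produced by the vector space model, using cosine similarity and a majority vote. The training vectors are hypothetical; a real system would use TF-IDF weights from a labelled corpus.

```python
# Minimal sketch of k-nearest-neighbor text classification with
# cosine similarity over sparse term-weight vectors (dicts).
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, training, k=3):
    """training: list of (label, vector). Majority vote among k nearest."""
    nearest = sorted(training, key=lambda lv: cosine(query, lv[1]),
                     reverse=True)[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]

training = [("sports", {"goal": 1.0, "team": 1.0}),
            ("sports", {"match": 1.0, "goal": 1.0}),
            ("tech", {"code": 1.0, "bug": 1.0}),
            ("tech", {"compile": 1.0, "code": 1.0})]
print(knn_classify({"goal": 1.0, "match": 1.0}, training))  # sports
```

kNN needs no training phase at all, but every query must be compared against the whole training set, which is why it is often paired with aggressive feature selection.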
Source: http://wiki.52nlp.cn/