A Supplement to the KNN Algorithm

Source: Internet
Author: User

Automatic text categorization underpins large-scale document management. Fast, accurate automatic classification saves considerable manpower and cost, improves work efficiency, and lets users quickly find the resources they need, improving the user experience. This paper introduces the KNN text classification algorithm and proposes an improvement to it.

I. Related Theory

Text categorization has been researched for a long time; the field has produced many encouraging results and a complete, well-established pipeline for automatic text classification.

(1) Text classification

Text classification trains on a set of labeled training samples to discover classification rules and regularities, then applies those rules to judge texts that need to be classified and assign their categories automatically.

(2) Text representation

To classify text automatically by content, the text must first be expressed in a form the machine can "understand". The most mature representation at present is the vector space model (VSM). Each text is converted into a vector, and classification decisions are made by computing the similarity between vectors.

(3) Word segmentation

Using rule-based methods, Chinese text is segmented into words carrying basic semantic meaning, from which feature terms are then extracted.

(4) Stop-word removal

When extracting feature terms from text, words with very high frequency but no discriminative value for classification (for example, prepositions and conjunctions) must be filtered out, since they only degrade the classification result; removing them improves classification accuracy.
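This filtering step might be sketched as follows (the stop-word list here is a tiny illustrative sample, not a real lexicon, and the function name is an assumption):

```python
# Stop-word removal: drop high-frequency function words (prepositions,
# conjunctions, etc.) that carry no discriminative value for classification.
# This stop-word list is a tiny illustrative sample, not a real lexicon.
STOP_WORDS = {"the", "of", "and", "in", "to", "a", "is"}

def remove_stop_words(tokens):
    """Filter a tokenized text, keeping only candidate feature terms."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "sat", "in", "the", "garden"]))
# -> ['cat', 'sat', 'garden']
```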

(5) Feature selection and weight calculation

Feature selection picks out, from the segmented words, those that are indicative of class membership and uses them as feature terms to construct the text vector space. Each text is then represented as a vector in this space: an appropriate weighting algorithm is chosen, the weight of each feature term is computed, and the text is thereby quantified.
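As one concrete possibility for the weighting step, here is a minimal TF-IDF sketch; the paper does not name a specific weighting algorithm, so the choice of TF-IDF, the function name, and the vocabulary handling are all illustrative assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Weight each tokenized document over `vocab` using TF-IDF,
    producing the text vectors of the vector space model."""
    n = len(docs)
    # document frequency: number of documents containing each feature term
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)  # raw term frequency within this document
        vectors.append([tf[t] * math.log(n / df[t]) if df[t] else 0.0
                        for t in vocab])
    return vectors
```

A term appearing in every document (such as "a" below) receives weight zero, which is the weighting analogue of stop-word removal.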

(6) Classification algorithm

The classification algorithm applies learned rules to judge the vectorized text and classify it automatically. The process divides into two stages: the training stage and the classification stage.

In the training stage, classification rules and regularities are learned from a given set of labeled training samples, and a classifier is established. In the classification stage, the classifier built during training judges each text to be classified, realizing automatic classification.

II. Analysis of the KNN Algorithm

The KNN algorithm dates back half a century: it was proposed by Cover and Hart in 1967. It is a mature algorithm with a wide range of applications.

The KNN classification process can be described simply as follows:

(1) Text preprocessing: word segmentation, stop-word removal, feature selection, and weight calculation. Every text in the training set is represented as a text vector and stored.

(2) Compute the similarity between the vector of the text to be classified and every text vector in the training set, and select from the training set the k texts with the highest similarity.

(3) Classify the text according to these k nearest neighbors: whichever category occurs most often among the k texts is the category assigned to the text to be classified.
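The three steps above might be sketched as follows. Cosine similarity and simple majority voting are common choices for VSM-based KNN, but the paper does not fix a similarity measure, so both are assumptions here:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two text vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, samples, k=3):
    """samples: list of (vector, label) pairs from the training set.
    Rank by similarity to `query`, then majority-vote over the top k."""
    ranked = sorted(samples, key=lambda s: cosine(query, s[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Note that all the similarity computation happens at classification time, which is exactly the weakness discussed below.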

The advantages of the KNN algorithm fall into two main areas. First, the algorithm is mature, stable, and easy to implement. Second, the training stage is very fast, since it amounts to little more than processing and storing the sample vectors; newly added texts can be incorporated without retraining.

The disadvantages of the KNN algorithm are equally obvious, the main one being its large time overhead. Because KNN builds no actual classifier, almost all of its work happens in the classification stage. During training, KNN merely preprocesses the sample-set texts without any real learning; that learning is deferred to classification time, when the similarity between the incoming text and every text in the training set must be computed. This makes the computational load of the classification stage very large and slows classification down.

As the volume of data on the network keeps growing, the demands on classification time keep increasing, so reducing the time cost by improving the KNN algorithm is well worth doing.

III. Algorithm Improvement

The KNN algorithm can be improved in several directions to reduce the time overhead:

(1) Sample-set pruning. Screening the initial sample set is the simplest and most effective way to reduce KNN's space and time costs. At the same time, the screening must not come at the expense of classification accuracy: the texts retained should be those with strong features that represent their class well, while texts with weak, unrepresentative features are discarded. This reduces the number of texts involved in classification and shortens classification time.

(2) Improving how the k nearest neighbors are found. To obtain the final k nearest-neighbor texts, the similarity between the text to be classified and every text in the training set must normally be computed and the results compared. If the k texts could be found directly, without computing similarity against all texts, classification speed would improve greatly.

This paper adopts the first method to improve KNN.

The KNN algorithm spends most of its time in the classification stage because it has no actual classifier. To give KNN better classification performance, a classifier should be built during the training stage, shifting the bulk of the computation into learning and reducing the time cost of classification. Since the class-centre vector method classifies very quickly, it is combined here with the KNN algorithm, aiming for classification accuracy approaching that of KNN with classification speed approaching that of the class-centre vector method. The combined algorithm can be described simply as follows:

(1) Preprocess all texts in the sample set and save the resulting text vectors.

(2) Using the class-centre vector method, compute the centre vector of each class.

(3) When classifying, first compute and sort the similarity between the text to be classified and each class's centre vector. Then, given a threshold m, take the texts of the top m classes as the sample set for the KNN algorithm.

(4) Run the KNN classification computation on this reduced sample set to determine the final class.
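The combined procedure might be sketched as follows. Cosine similarity, the helper names, and taking the centroid (mean vector) as the class-centre vector are all illustrative assumptions, not details fixed by the paper:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two text vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, samples, k=3):
    """Majority vote over the k samples most similar to `query`."""
    ranked = sorted(samples, key=lambda s: cosine(query, s[0]), reverse=True)
    return Counter(lbl for _, lbl in ranked[:k]).most_common(1)[0][0]

def centroid(vectors):
    """Class-centre vector: the mean of a class's training-text vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def hybrid_classify(query, by_class, k=3, m=2):
    """Steps (2)-(4): rank classes by centre-vector similarity, keep the
    top m classes, then run plain KNN over only those classes' texts."""
    centroids = {c: centroid(vs) for c, vs in by_class.items()}
    top_m = sorted(centroids, key=lambda c: cosine(query, centroids[c]),
                   reverse=True)[:m]
    pruned = [(v, c) for c in top_m for v in by_class[c]]
    return knn_classify(query, pruned, k)
```

With m set to half the number of classes, the pruned sample set roughly halves the similarity computations of the final KNN step, matching the analysis below.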

IV. Analysis of the Improvement

By combining the KNN algorithm with the class-centre vector method, KNN gains a simple classifier built in the training stage, and the benefits are obvious:

When a text needs to be classified, its similarity to each class-centre vector obtained during training can be computed quickly, determining the m classes over which the KNN algorithm must actually run. The sample set that actually participates in the KNN computation is then just the texts of these m classes, which effectively shrinks the set of samples involved in similarity calculation and greatly reduces the computational load. If m is half the number of categories, the computation is roughly halved outright. The choice of the threshold m is decisive for the final speed-up: too small and it directly harms classification accuracy; too large and the speed gain is negligible. Its value must be determined by experiment; in general, half the number of classes is a reasonable choice for m.

Finally, comparing the improved KNN algorithm with the traditional one, classification time is shortened by more than half while classification accuracy is maintained. The improvement yields good results.
