From: http://blog.csdn.net/lyflower/article/details/1728642
The idea of the KNN algorithm in text classification is simple and intuitive: if most of the K samples most similar to a given sample in feature space (that is, its nearest neighbors) belong to a certain category, then the sample belongs to that category as well. The method determines the category of a sample to be classified based only on the classes of its one or more nearest neighbors.
In principle, the KNN method also relies on the limit theorem, but in making a classification decision it depends only on a small number of neighboring samples, so it is less affected by the overall distribution of the training set. Moreover, because KNN determines the category mainly from a limited set of surrounding neighbors rather than from a discriminant over class domains, it is better suited than other methods to sample sets whose class domains cross or overlap heavily.
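The majority-vote rule described above can be sketched without any library. The following is a minimal brute-force 2-D KNN classifier; the struct and function names are illustrative and not from the original post:

```cpp
#include <vector>
#include <map>
#include <algorithm>
#include <cstddef>

struct Sample { double x, y; int label; };

// Classify a query point (qx, qy) by majority vote among its K nearest
// training samples, using squared Euclidean distance.
int knn_classify(const std::vector<Sample>& train, double qx, double qy, int K) {
    // distance from the query to every training sample: (distance, label)
    std::vector<std::pair<double, int>> dist;
    for (const Sample& s : train) {
        double dx = s.x - qx, dy = s.y - qy;
        dist.push_back({dx * dx + dy * dy, s.label});
    }
    // move the K smallest distances to the front
    std::size_t k = std::min<std::size_t>(K, dist.size());
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    // majority vote over the K nearest labels
    std::map<int, int> votes;
    for (std::size_t i = 0; i < k; ++i) votes[dist[i].second]++;
    int best = -1, bestCount = -1;
    for (const auto& v : votes)
        if (v.second > bestCount) { bestCount = v.second; best = v.first; }
    return best;
}
```

Note that this is O(n) per query, which is exactly the computational burden discussed below.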
The main disadvantage of this method is its heavy computational cost: the distance from each text to be classified to every known sample must be computed before its K nearest neighbors can be found. A common remedy is to edit the known sample set in advance, removing samples that contribute little to classification. There is also a Reverse KNN method, which reduces the computational complexity of the KNN algorithm and improves classification efficiency.
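One classic instance of such sample editing is Hart's condensed nearest neighbour rule; the post does not name a specific algorithm, so treating it as this rule is an assumption. The idea: keep only the subset of samples needed so that 1-NN over the retained subset still classifies every training sample correctly.

```cpp
#include <vector>

struct Pt { double x, y; int label; };

// Label of the nearest stored point (1-NN by squared Euclidean distance).
int nearest_label(const std::vector<Pt>& store, const Pt& q) {
    double best = 1e300;
    int lab = -1;
    for (const Pt& s : store) {
        double dx = s.x - q.x, dy = s.y - q.y, d = dx * dx + dy * dy;
        if (d < best) { best = d; lab = s.label; }
    }
    return lab;
}

// Hart's condensing: retain only the samples needed so that every
// training sample is correctly classified by 1-NN over the retained set.
std::vector<Pt> condense(const std::vector<Pt>& train) {
    std::vector<Pt> store;
    if (train.empty()) return store;
    store.push_back(train[0]);      // seed with one sample
    bool changed = true;
    while (changed) {               // repeat until a full pass adds nothing
        changed = false;
        for (const Pt& p : train) {
            if (nearest_label(store, p) != p.label) {
                store.push_back(p); // misclassified -> must be kept
                changed = true;
            }
        }
    }
    return store;
}
```

For well-separated classes the retained set can be far smaller than the original, which directly cuts the per-query distance computations.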
The algorithm is well suited to automatic classification of class domains with large sample sizes; class domains with small sample sizes are easily misclassified under it.
The K-nearest-neighbor classifier performs well on text classification. Statistical analysis of simulation results shows that, as a text classifier, KNN is second only to support vector machines and clearly better than linear least-squares fitting, Naive Bayes, and neural networks.
Important points:
1. Feature dimensionality reduction (generally using the chi-square statistic)
2. Tail-truncation algorithms (three kinds)
3. Reducing the computational workload
DEMO code:
#include "ml.h"
#include "highgui.h"

int main( int argc, char** argv )
{
    const int K = 10;
    int i, j, k, accuracy;
    float response;
    int train_sample_count = 100;
    CvRNG rng_state = cvRNG(-1);  // initialize the RNG
    CvMat* trainData = cvCreateMat( train_sample_count, 2, CV_32FC1 );
    CvMat* trainClasses = cvCreateMat( train_sample_count, 1, CV_32FC1 );
    IplImage* img = cvCreateImage( cvSize( 500, 500 ), 8, 3 );
    float _sample[2];
    CvMat sample = cvMat( 1, 2, CV_32FC1, _sample );
    cvZero( img );

    CvMat trainData1, trainData2, trainClasses1, trainClasses2;

    // form the training samples
    cvGetRows( trainData, &trainData1, 0, train_sample_count/2 );  // view of the first half
    // fill with normally distributed random numbers, updating the RNG state
    cvRandArr( &rng_state, &trainData1, CV_RAND_NORMAL, cvScalar(200,200), cvScalar(50,50) );

    cvGetRows( trainData, &trainData2, train_sample_count/2, train_sample_count );
    cvRandArr( &rng_state, &trainData2, CV_RAND_NORMAL, cvScalar(300,300), cvScalar(50,50) );

    cvGetRows( trainClasses, &trainClasses1, 0, train_sample_count/2 );
    cvSet( &trainClasses1, cvScalar(1) );

    cvGetRows( trainClasses, &trainClasses2, train_sample_count/2, train_sample_count );
    cvSet( &trainClasses2, cvScalar(2) );

    // learn classifier
    CvKNearest knn( trainData, trainClasses, 0, false, K );
    CvMat* nearests = cvCreateMat( 1, K, CV_32FC1 );

    for( i = 0; i < img->height; i++ )
    {
        for( j = 0; j < img->width; j++ )
        {
            sample.data.fl[0] = (float)j;
            sample.data.fl[1] = (float)i;

            // estimate the response and get the neighbors' labels
            response = knn.find_nearest( &sample, K, 0, 0, nearests, 0 );

            // compute the number of neighbors representing the majority
            for( k = 0, accuracy = 0; k < K; k++ )
            {
                if( nearests->data.fl[k] == response )
                    accuracy++;
            }
            // highlight the pixel depending on the accuracy (or confidence)
            cvSet2D( img, i, j, response == 1 ?
                (accuracy > 5 ? CV_RGB(180,0,0)   : CV_RGB(180,120,0)) :
                (accuracy > 5 ? CV_RGB(0,180,0)   : CV_RGB(120,120,0)) );
        }
    }

    // display the original training samples
    for( i = 0; i < train_sample_count/2; i++ )
    {
        CvPoint pt;
        pt.x = cvRound( trainData1.data.fl[i*2] );
        pt.y = cvRound( trainData1.data.fl[i*2+1] );
        cvCircle( img, pt, 2, CV_RGB(255,0,0), CV_FILLED );
        pt.x = cvRound( trainData2.data.fl[i*2] );
        pt.y = cvRound( trainData2.data.fl[i*2+1] );
        cvCircle( img, pt, 2, CV_RGB(0,255,0), CV_FILLED );
    }

    cvNamedWindow( "classifier result", 1 );
    cvShowImage( "classifier result", img );
    cvWaitKey(0);

    cvReleaseMat( &trainClasses );
    cvReleaseMat( &trainData );
    return 0;
}
http://www.cnblogs.com/xiangshancuizhu/archive/2011/08/06/2129355.html
Improved KNN: http://www.cnblogs.com/xiangshancuizhu/archive/2011/11/11/2245373.html