Transferred from: http://blog.csdn.net/lyflower/article/details/1728642
The idea of the KNN algorithm in text classification is simple and intuitive: if the majority of the K samples most similar to a given sample in feature space (that is, its nearest neighbors) belong to a certain category, then the sample also belongs to that category. In making the classification decision, the method assigns the sample to a category based only on the category of the nearest one or few samples.
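As a rough illustration of this voting rule (not part of the original post; `Sample` and `knnClassify` are hypothetical names), a plain C++ sketch might look like:

```cpp
// Minimal k-NN sketch: classify a 2-D point by majority vote among its
// k nearest training samples (squared Euclidean distance).
#include <vector>
#include <algorithm>

struct Sample { float x, y; int label; };

// Returns the majority label among the k training samples closest to (px, py).
// Labels are assumed to be small non-negative integers.
int knnClassify(const std::vector<Sample>& train, float px, float py, int k)
{
    // Pair each training sample's squared distance to the query with its label.
    std::vector<std::pair<float, int> > dist;
    for (size_t i = 0; i < train.size(); ++i) {
        float dx = train[i].x - px, dy = train[i].y - py;
        dist.push_back(std::make_pair(dx * dx + dy * dy, train[i].label));
    }
    // Partially sort so the k nearest neighbors come first.
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    // Majority vote over the k nearest labels.
    std::vector<int> votes;
    for (int i = 0; i < k; ++i) {
        if ((int)votes.size() <= dist[i].second) votes.resize(dist[i].second + 1, 0);
        ++votes[dist[i].second];
    }
    return (int)(std::max_element(votes.begin(), votes.end()) - votes.begin());
}
```

Note that k-NN has no training phase at all: the "model" is simply the stored sample set, and all the work happens at query time.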
Although the KNN method depends on the limit theorem in principle, the classification decision involves only a small number of neighboring samples, so the method copes relatively well with sample imbalance. Moreover, because KNN relies mainly on a limited number of surrounding samples rather than on discriminating class domains to determine the category, it is better suited than other methods to sample sets whose class domains cross or overlap heavily.
The main disadvantage of the method is its large computational cost: for each text to be classified, the distance to every known sample must be computed in order to find its K nearest neighbors. A common remedy is to edit the known sample points in advance, removing samples that contribute little to classification. There is also a reverse-KNN method, which reduces the computational complexity of the KNN algorithm and improves classification efficiency.
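One classic instance of such sample editing is Hart's condensed nearest-neighbor rule, which keeps only the prototypes needed for 1-NN to classify the whole training set correctly. The post gives no code for this, so the sketch below (with the illustrative names `Point`, `nearestLabel`, `condense`) is only a hedged example of the idea:

```cpp
// Sketch of Hart's condensed nearest-neighbor (CNN) editing: prune the
// training set to a smaller prototype set that still classifies it correctly.
#include <vector>
#include <cstddef>

struct Point { float x, y; int label; };

// 1-NN label of (px, py) among the current prototypes.
static int nearestLabel(const std::vector<Point>& protos, float px, float py)
{
    int best = 0;
    float bestD = 1e30f;
    for (size_t i = 0; i < protos.size(); ++i) {
        float dx = protos[i].x - px, dy = protos[i].y - py;
        float d = dx * dx + dy * dy;
        if (d < bestD) { bestD = d; best = protos[i].label; }
    }
    return best;
}

// Repeatedly add any sample the current prototypes misclassify, until every
// training sample is classified correctly by its nearest prototype.
std::vector<Point> condense(const std::vector<Point>& train)
{
    std::vector<Point> protos;
    if (train.empty()) return protos;
    protos.push_back(train[0]);
    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < train.size(); ++i) {
            if (nearestLabel(protos, train[i].x, train[i].y) != train[i].label) {
                protos.push_back(train[i]);
                changed = true;
            }
        }
    }
    return protos;
}
```

On well-separated classes this can shrink the sample set dramatically, which directly reduces the per-query distance computations the paragraph above complains about.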
The algorithm is well suited to automatic classification of class domains with large sample sizes; class domains with small sample sizes are more prone to misclassification.
The K-nearest-neighbor classifier performs well in text classification. Statistical analysis of simulation results shows that, as a text classifier, KNN is second only to the support vector machine and clearly better than linear least-squares fitting, naive Bayes, and neural networks.
Key points:
1. Feature dimensionality reduction (commonly the chi-square (CHI) method)
2. Clipping/editing algorithms (three truncation algorithms)
3. Reducing the amount of computation
Demo Code:
#include "ml.h"
#include "highgui.h"

int main( int argc, char** argv )
{
    const int K = 10;
    int i, j, k, accuracy;
    float response;
    int train_sample_count = 100;
    CvRNG rng_state = cvRNG(-1);  // initialize the random number generator state
    CvMat* trainData = cvCreateMat( train_sample_count, 2, CV_32FC1 );
    CvMat* trainClasses = cvCreateMat( train_sample_count, 1, CV_32FC1 );
    IplImage* img = cvCreateImage( cvSize( 500, 500 ), 8, 3 );
    float _sample[2];
    CvMat sample = cvMat( 1, 2, CV_32FC1, _sample );
    cvZero( img );

    CvMat trainData1, trainData2, trainClasses1, trainClasses2;

    // form the training samples
    cvGetRows( trainData, &trainData1, 0, train_sample_count/2 );  // returns a span of rows of the array
    cvRandArr( &rng_state, &trainData1, CV_RAND_NORMAL,
               cvScalar(200,200), cvScalar(50,50) );  // fills the array with random numbers and updates the RNG state
    cvGetRows( trainData, &trainData2, train_sample_count/2, train_sample_count );
    cvRandArr( &rng_state, &trainData2, CV_RAND_NORMAL,
               cvScalar(300,300), cvScalar(50,50) );

    cvGetRows( trainClasses, &trainClasses1, 0, train_sample_count/2 );
    cvSet( &trainClasses1, cvScalar(1) );
    cvGetRows( trainClasses, &trainClasses2, train_sample_count/2, train_sample_count );
    cvSet( &trainClasses2, cvScalar(2) );

    // learn the classifier
    CvKNearest knn( trainData, trainClasses, 0, false, K );
    CvMat* nearests = cvCreateMat( 1, K, CV_32FC1 );

    for( i = 0; i < img->height; i++ )
    {
        for( j = 0; j < img->width; j++ )
        {
            sample.data.fl[0] = (float)j;
            sample.data.fl[1] = (float)i;

            // estimate the response and get the neighbors' labels
            response = knn.find_nearest( &sample, K, 0, 0, nearests, 0 );

            // compute the number of neighbors representing the majority
            for( k = 0, accuracy = 0; k < K; k++ )
            {
                if( nearests->data.fl[k] == response )
                    accuracy++;
            }
            // highlight the pixel depending on the accuracy (or confidence)
            cvSet2D( img, i, j, response == 1 ?
                (accuracy > 5 ? CV_RGB(180,0,0) : CV_RGB(180,120,0)) :
                (accuracy > 5 ? CV_RGB(0,180,0) : CV_RGB(120,120,0)) );
        }
    }

    // display the original training samples
    for( i = 0; i < train_sample_count/2; i++ )
    {
        CvPoint pt;
        pt.x = cvRound( trainData1.data.fl[i*2] );
        pt.y = cvRound( trainData1.data.fl[i*2+1] );
        cvCircle( img, pt, 2, CV_RGB(255,0,0), CV_FILLED );
        pt.x = cvRound( trainData2.data.fl[i*2] );
        pt.y = cvRound( trainData2.data.fl[i*2+1] );
        cvCircle( img, pt, 2, CV_RGB(0,255,0), CV_FILLED );
    }

    cvNamedWindow( "classifier result", 1 );
    cvShowImage( "classifier result", img );
    cvWaitKey(0);

    cvReleaseMat( &trainClasses );
    cvReleaseMat( &trainData );
    return 0;
}
Detailed description: http://www.cnblogs.com/xiangshancuizhu/archive/2011/08/06/2129355.html
Improved KNN: http://www.cnblogs.com/xiangshancuizhu/archive/2011/11/11/2245373.html
From: http://blog.csdn.net/yangtrees/article/details/7482890
Learning OpenCV -- the KNN algorithm