This article does not spend much time on the theory behind the KNN algorithm; it focuses on concrete problems, the design of the algorithm, and annotated code.
KNN algorithm:
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational complexity and high space complexity.
Applicable data types: numeric values and nominal values.
How it works: there is a sample data set, also called the training set, in which every sample carries a label; that is, we know which category each sample in the set belongs to. When new data without a label is entered, we compare each feature of the new data with the features of the samples in the set and extract the classification labels of the most similar samples (the nearest neighbors). In general we select only the top k most similar samples, which is where the name k-nearest neighbor algorithm comes from; k is usually an integer no larger than 20. Finally, the new data is assigned to the category that appears most frequently among those k most similar samples.
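"Most similar" is measured by the distance between feature vectors; the code below uses the Euclidean distance. For two points (x1, y1) and (x2, y2), as in the coordinate example that follows, it is:
distance = sqrt((x1 - x2)^2 + (y1 - y2)^2)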
General process of the k-nearest neighbor algorithm:
(1) Collect data: any method can be used.
(2) Prepare data: the distance calculation requires numeric values, preferably in a structured data format.
(3) Analyze data: any method can be used.
(4) Train the algorithm: this step does not apply to the k-nearest neighbor algorithm.
(5) Test the algorithm: calculate the error rate.
(6) Use the algorithm: first input sample data and structured output results, then run the k-nearest neighbor algorithm to determine which category the input data belongs to, and finally carry out whatever subsequent processing the classification calls for.
Question one: suppose the scenario is to classify points in the coordinate plane, as shown in the following illustration:
There are 12 points in total, each with coordinates (x, y) and a category A or B that it belongs to. Given a new point with coordinates (x1, y1), the task is to determine whether it belongs to category A or B.
All coordinate points in the data.txt file:
0.0 1.1 A
1.0 1.0 A
2.0 1.0 B
0.5 0.5 A
2.5 0.5 B
0.0 0.0 A
1.0 0.0 A
2.0 0.0 B
3.0 0.0 B
0.0 -1.0 A
1.0 -1.0 A
2.0 -1.0 B
Step 1: initialize the training data set dataSet and the test data testData in the class constructor.
Step 2: use get_distance() to calculate the distance between the test data testData and each training sample dataSet[index], saving the key-value pairs <index, distance> in map_index_dis, where index identifies a training sample and distance is the distance between that sample and the test data.
Step 3: sort map_index_dis by value (that is, by distance) from small to large, take the first k entries with the smallest distances, and use map_label_freq to record how often each class label appears among them.
Step 4: traverse the values in map_label_freq and return the key with the largest value; that key is the class the test data belongs to.
Look at the code knn_0.cc:

#include <iostream>
#include <map>
#include <vector>
#include <stdio.h>
#include <cmath>
#include <cstdlib>
#include <algorithm>
#include <fstream>
using namespace std;

typedef char tLabel;
typedef double tData;
typedef pair<int,double> PAIR;
const int colLen = 2;
const int rowLen = 12;
ifstream fin;
ofstream fout;

class KNN
{
private:
    tData dataSet[rowLen][colLen];
    tLabel labels[rowLen];
    tData testData[colLen];
    int k;
    map<int,double> map_index_dis;
    map<tLabel,int> map_label_freq;
    double get_distance(tData *d1, tData *d2);
public:
    KNN(int k);
    void get_all_distance();
    void get_max_freq_label();
    struct CmpByValue
    {
        bool operator()(const PAIR& lhs, const PAIR& rhs)
        {
            return lhs.second < rhs.second;
        }
    };
};

KNN::KNN(int k)
{
    this->k = k;
    fin.open("data.txt");
    if (!fin)
    {
        cout << "can not open the file data.txt" << endl;
        exit(1);
    }
    /* input the dataSet */
    for (int i = 0; i < rowLen; i++)
    {
        for (int j = 0; j < colLen; j++)
        {
            fin >> dataSet[i][j];
        }
        fin >> labels[i];
    }
    cout << "please input the test data :" << endl;
    /* input the test data */
    for (int i = 0; i < colLen; i++)
        cin >> testData[i];
}

/* calculate the distance between the test data and dataSet[i] */
double KNN::get_distance(tData *d1, tData *d2)
{
    double sum = 0;
    for (int i = 0; i < colLen; i++)
    {
        sum += pow((d1[i] - d2[i]), 2);
    }
    // cout << "the sum is = " << sum << endl;
    return sqrt(sum);
}

/* calculate the distance between the test data and each training sample */
void KNN::get_all_distance()
{
    double distance;
    int i;
    for (i = 0; i < rowLen; i++)
    {
        distance = get_distance(dataSet[i], testData);
        // <key,value> => <i,distance>
        map_index_dis[i] = distance;
    }
    // traverse the map to print the index and distance
    map<int,double>::const_iterator it = map_index_dis.begin();
    while (it != map_index_dis.end())
    {
        cout << "index = " << it->first << " distance = " << it->second << endl;
        it++;
    }
}

/* check which label the test data belongs to, i.e. classify the test data */
void KNN::get_max_freq_label()
{
    // transform map_index_dis into vec_index_dis
    vector<PAIR> vec_index_dis(map_index_dis.begin(), map_index_dis.end());
    // sort vec_index_dis by distance from low to high to get the nearest data
    sort(vec_index_dis.begin(), vec_index_dis.end(), CmpByValue());
    for (int i = 0; i < k; i++)
    {
        cout << "the index = " << vec_index_dis[i].first
             << " the distance = " << vec_index_dis[i].second
             << " the label = " << labels[vec_index_dis[i].first]
             << " the coordinate ( " << dataSet[vec_index_dis[i].first][0]
             << "," << dataSet[vec_index_dis[i].first][1] << " )" << endl;
        // count the occurrences of each label
        map_label_freq[labels[vec_index_dis[i].first]]++;
    }
    map<tLabel,int>::const_iterator map_it = map_label_freq.begin();
    tLabel label;
    int max_freq = 0;
    // find the most frequent label
    while (map_it != map_label_freq.end())
    {
        if (map_it->second > max_freq)
        {
            max_freq = map_it->second;
            label = map_it->first;
        }
        map_it++;
    }
    cout << "the test data belongs to the " << label << " label" << endl;
}

int main()
{
    int k;
    cout << "please input the k value : " << endl;
    cin >> k;
    KNN knn(k);
    knn.get_all_distance();
    knn.get_max_freq_label();
    system("pause");
    return 0;
}
Let's test this classifier (k=5):
testData (5.0, 5.0):
testData (-5.0, -5.0):
testData (1.6, 0.5):
The correctness of each classification can be checked against the coordinate plane, and the results are indeed correct.
Question two: using the k-nearest neighbor algorithm to improve the matching results of a dating site
The scenario is as follows: my friend Helen has been using online dating sites to find a suitable date. Although the sites recommend different candidates, she has not found anyone she likes. After summing up her experience, she realized that she has dated three types of people:
> People she doesn't like
> Moderately attractive people
> Very attractive people
Despite discovering these patterns, Helen still cannot sort the candidates recommended by the dating site into the appropriate categories. She feels she could date moderately attractive people on weekdays, while on weekends she prefers the company of very attractive people. Helen hopes our classification software can help her place candidates into the exact categories. She has also collected some data not recorded by the dating site, which she believes will help in classifying the candidates.
Helen has been collecting data for some time. She keeps the data in a text file, DatingTestSet.txt (file link: http://yunpan.cn/QUL6SxtiJFPfN), with one sample per line, 1000 lines in total. Each of Helen's samples consists of 3 features:
> Number of frequent flyer miles earned per year
> Percentage of time spent playing video games
> Liters of ice cream consumed per week
Data preprocessing: normalizing the data
We can see that the number of frequent flyer miles per year influences the result far more than the other two features, solely because its values are much larger than the other feature values. But the three features are equally important, so as one of three equally weighted features, the frequent flyer miles should not affect the calculated result so heavily.
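For example (illustrative numbers, not taken from Helen's file): for two samples with 67,000 and 8,000 frequent flyer miles and 10% and 12% game time, the squared differences in the Euclidean distance are (67000 - 8000)^2 = 3,481,000,000 versus (10 - 12)^2 = 4, so the miles term completely swamps the other features.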
When features fall in such different ranges of values, we usually normalize them numerically, for example mapping each value into the range 0 to 1 or -1 to 1.
The formula is: newValue = (oldValue - min) / (max - min)
Here min and max are the smallest and largest values of the feature in the data set, respectively. We add an auto_norm_data function to normalize the data.
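The body of auto_norm_data is not shown in this excerpt; the following is a minimal sketch of what it might look like, assuming the dataSet, rowLen and colLen members declared in the KNN class below (the body itself is an assumption, not the original implementation):

/* hypothetical sketch of auto_norm_data: applies
   newValue = (oldValue - min) / (max - min) to every feature column */
void KNN::auto_norm_data()
{
    for (int j = 0; j < colLen; j++)
    {
        // find the min and max of feature column j
        tData minv = dataSet[0][j];
        tData maxv = dataSet[0][j];
        for (int i = 1; i < rowLen; i++)
        {
            if (dataSet[i][j] < minv) minv = dataSet[i][j];
            if (dataSet[i][j] > maxv) maxv = dataSet[i][j];
        }
        // rescale every value in column j into [0,1]
        for (int i = 0; i < rowLen; i++)
            dataSet[i][j] = (dataSet[i][j] - minv) / (maxv - minv);
    }
}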
At the same time we design a get_error_rate function to calculate the classification error rate, using 10% of the overall data as test data and 90% as training data; of course, you can adjust the percentage.
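get_error_rate is likewise only named here; the following is a rough sketch under the same assumptions, where the first test_data_num rows serve as test data and the rest as training data, and classify_row is a hypothetical helper that runs the k-nearest-neighbor vote of get_max_freq_label for a single row and returns the predicted label:

/* hypothetical sketch of get_error_rate: the real implementation may differ */
double KNN::get_error_rate()
{
    int error_count = 0;
    test_data_num = rowLen / 10;              // 10% test data, 90% training data
    for (int i = 0; i < test_data_num; i++)
    {
        tLabel predicted = classify_row(i);   // hypothetical helper, see lead-in
        if (predicted != labels[i])           // compare with the true label
            error_count++;
    }
    return (double)error_count / test_data_num;
}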
The rest of the algorithm design is similar to that of question one.
The code is as follows, knn_2.cc (k=7):
/* add the get_error_rate function */
#include <iostream>
#include <map>
#include <vector>
#include <stdio.h>
#include <cmath>
#include <cstdlib>
#include <algorithm>
#include <fstream>
using namespace std;

typedef string tLabel;
typedef double tData;
typedef pair<int,double> PAIR;
const int MaxColLen = 10;
const int MaxRowLen = 10000;
ifstream fin;
ofstream fout;

class KNN
{
private:
    tData dataSet[MaxRowLen][MaxColLen];
    tLabel labels[MaxRowLen];
    tData testData[MaxColLen];
    int rowLen;
    int colLen;
    int k;
    int test_data_num;
    map<int,double> map_index_dis;
    map<tLabel,int> map_label_freq;
    double get_distance(tData *d1, tData *d2);