"One of machine learning combat": C + + implementation of K-nearest neighbor algorithm KNN


This article does not dwell on the theory behind the KNN algorithm; it focuses on the problems themselves, the design of the algorithm, and commented code.

KNN algorithm:

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.

Disadvantages: high computational complexity and high space complexity.

Applicable data range: numeric and nominal values.

How it works: there is a sample data set, also called the training sample set, and every sample in it carries a label, so we know which class each sample belongs to. When new, unlabeled data arrives, we compare each of its features with the corresponding features of the samples in the training set and extract the class labels of the most similar (nearest) samples. In general we look only at the k most similar samples, which is where the k in k-nearest neighbors comes from; k is usually an integer no larger than 20. Finally, the new data is assigned the class that occurs most frequently among those k most similar samples.
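Before the full program, here is a minimal, self-contained sketch of that rule for two-dimensional points with char labels. The names (knn_classify and so on) are illustrative only; the article's own class-based version follows below.

#include <algorithm>
#include <cmath>
#include <map>
#include <utility>
#include <vector>

char knn_classify(const std::vector<std::pair<double, double> > &points,
                  const std::vector<char> &labels,
                  std::pair<double, double> query, int k)
{
    // 1. distance from the query point to every training point
    std::vector<std::pair<double, int> > dist; // <distance, index>
    for (int i = 0; i < (int)points.size(); i++) {
        double dx = points[i].first - query.first;
        double dy = points[i].second - query.second;
        dist.push_back(std::make_pair(std::sqrt(dx * dx + dy * dy), i));
    }
    // 2. sort so the nearest points come first
    std::sort(dist.begin(), dist.end());
    // 3. majority vote among the labels of the k nearest points
    std::map<char, int> freq;
    for (int i = 0; i < k && i < (int)dist.size(); i++)
        freq[labels[dist[i].second]]++;
    char best = 0;
    int best_count = 0;
    for (std::map<char, int>::iterator it = freq.begin(); it != freq.end(); ++it)
        if (it->second > best_count) { best_count = it->second; best = it->first; }
    return best;
}

Everything that follows is an elaboration of these three steps: compute distances, sort them, vote.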

General process of the k-nearest neighbor algorithm:

(1) Collect data: any method will do.

(2) Prepare data: the numeric values needed for the distance calculation, preferably in a structured data format.

(3) Analyze data: any method will do.

(4) Train the algorithm: this step does not apply to k-nearest neighbors.

(5) Test the algorithm: compute the error rate.

(6) Use the algorithm: first input sample data and structured output, then run the k-nearest neighbor algorithm to decide which class the input data belongs to, and finally act on the computed classification.


Question one: suppose we need to classify points in the coordinate plane, as in the following illustration:

[Figure: scatter plot of the 12 labeled training points]

The figure shows 12 points in total; each point has coordinates (x, y) and a class label, A or B. The task: given a new point (x1, y1), decide whether it belongs to class A or class B.

All the coordinate points, in the file data.txt:

0.0 1.1 A
1.0 1.0 A
2.0 1.0 B
0.5 0.5 A
2.5 0.5 B
0.0 0.0 A
1.0 0.0 A
2.0 0.0 B
3.0 0.0 B
0.0 -1.0 a
1.0 -1.0 a
2.0 -1.0 b


Step 1: initialize the training data set dataSet and the test data testData in the class constructor.

Step 2: use get_distance() to compute the distance between testData and each training sample dataSet[index], saving the key-value pair <index, distance> in map_index_dis, where index identifies a training sample and distance is the distance between that sample and the test data.

Step 3: sort map_index_dis by value (that is, by distance) from smallest to largest, take the first k entries, and use map_label_freq to record how often each class label occurs among them.

Step 4: traverse map_label_freq and return the key whose value is largest; that label is the class the test data belongs to.
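Two details from steps 2 and 3 are worth isolating. The distance is the ordinary Euclidean distance, and since a std::map cannot be reordered by its values, the <index, distance> pairs are copied into a vector and sorted with a comparator on the second field. A minimal sketch (free-standing functions here; the full code below folds them into the KNN class):

#include <cmath>
#include <utility>

typedef std::pair<int, double> PAIR; // <sample index, distance>

// Euclidean distance between two feature vectors of length len
double get_distance(const double *d1, const double *d2, int len)
{
    double sum = 0;
    for (int i = 0; i < len; i++)
        sum += std::pow(d1[i] - d2[i], 2);
    return std::sqrt(sum);
}

// comparator for std::sort: order <index, distance> pairs by
// distance, smallest first
struct CmpByValue {
    bool operator()(const PAIR &lhs, const PAIR &rhs) const {
        return lhs.second < rhs.second;
    }
};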


The code, knn_0.cc:

#include <iostream>
#include <map>
#include <vector>
#include <stdio.h>
#include <cmath>
#include <cstdlib>
#include <algorithm>
#include <fstream>

using namespace std;

typedef char tLabel;
typedef double tData;
typedef pair<int, double> PAIR;
const int colLen = 2;   // number of features per point
const int rowLen = 12;  // number of training samples
ifstream fin;

class KNN {
private:
    tData dataSet[rowLen][colLen];   // training set
    tLabel labels[rowLen];           // label of each training sample
    tData testData[colLen];          // the point to classify
    int k;
    map<int, double> map_index_dis;  // <sample index, distance to testData>
    map<tLabel, int> map_label_freq; // <label, frequency among the k nearest>
    double get_distance(tData *d1, tData *d2);

public:
    KNN(int k);
    void get_all_distance();
    void get_max_freq_label();

    // comparator: order <index, distance> pairs by distance, smallest first
    struct CmpByValue {
        bool operator()(const PAIR &lhs, const PAIR &rhs) {
            return lhs.second < rhs.second;
        }
    };
};

KNN::KNN(int k) {
    this->k = k;
    fin.open("data.txt");
    if (!fin) {
        cout << "can not open the file data.txt" << endl;
        exit(1);
    }
    /* input the data set */
    for (int i = 0; i < rowLen; i++) {
        for (int j = 0; j < colLen; j++)
            fin >> dataSet[i][j];
        fin >> labels[i];
    }
    cout << "please input the test data:" << endl;
    /* input the test data */
    for (int i = 0; i < colLen; i++)
        cin >> testData[i];
}

/* calculate the distance between the test data and dataSet[i] */
double KNN::get_distance(tData *d1, tData *d2) {
    double sum = 0;
    for (int i = 0; i < colLen; i++)
        sum += pow(d1[i] - d2[i], 2);
    return sqrt(sum);
}

/* calculate the distance between the test data and each training sample */
void KNN::get_all_distance() {
    for (int i = 0; i < rowLen; i++) {
        // <key, value> => <i, distance>
        map_index_dis[i] = get_distance(dataSet[i], testData);
    }
    // traverse the map to print each index and distance
    map<int, double>::const_iterator it = map_index_dis.begin();
    while (it != map_index_dis.end()) {
        cout << "index = " << it->first << " distance = " << it->second << endl;
        it++;
    }
}

/* check which label the test data belongs to, i.e. classify the test data */
void KNN::get_max_freq_label() {
    // copy map_index_dis into vec_index_dis and sort it by distance,
    // from low to high, to get the nearest samples first
    vector<PAIR> vec_index_dis(map_index_dis.begin(), map_index_dis.end());
    sort(vec_index_dis.begin(), vec_index_dis.end(), CmpByValue());

    for (int i = 0; i < k; i++) {
        cout << "the index = " << vec_index_dis[i].first
             << " the distance = " << vec_index_dis[i].second
             << " the label = " << labels[vec_index_dis[i].first]
             << " the coordinate ( " << dataSet[vec_index_dis[i].first][0]
             << ", " << dataSet[vec_index_dis[i].first][1] << " )" << endl;
        // count each label among the k nearest
        map_label_freq[labels[vec_index_dis[i].first]]++;
    }

    // find the most frequent label
    map<tLabel, int>::const_iterator map_it = map_label_freq.begin();
    tLabel label = 0;
    int max_freq = 0;
    while (map_it != map_label_freq.end()) {
        if (map_it->second > max_freq) {
            max_freq = map_it->second;
            label = map_it->first;
        }
        map_it++;
    }
    cout << "The test data belongs to the " << label << " label" << endl;
}

int main() {
    int k;
    cout << "please input the k value:" << endl;
    cin >> k;
    KNN knn(k);
    knn.get_all_distance();
    knn.get_max_freq_label();
    return 0;
}
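For reference, one way to build and run it, assuming g++ is available and data.txt sits in the working directory (the later examples use a makefile instead):

g++ knn_0.cc
./a.out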


Let's test this classifier (k = 5). The output screenshots are omitted here, but the votes are easy to check by hand:

testData (5.0, 5.0): classified as B.

testData (-5.0, -5.0): classified as A.

testData (1.6, 0.5): classified as B.

Plotting the points in the coordinate system confirms that each result is correct.


Problem two: using the k-nearest neighbor algorithm to improve the matching results of a dating site

The scenario: my friend Helen has been using online dating sites to look for a partner. Although the sites keep recommending candidates, she has not found anyone she likes. Summing up her experience, she realized that the people she has dated fall into three types:

> People she didn't like

> Moderately attractive people

> Very attractive people

Despite discovering these patterns, Helen still cannot sort the candidates recommended by the dating sites into the right category. She feels she can date moderately attractive people on weekdays, while on weekends she prefers the company of very attractive people. Helen hopes our classification software can help her put candidates into the exact categories. She has also collected some data that the dating sites do not record, which she believes will help with the matching.

Helen has been collecting this data for some time. She keeps it in a text file, datingTestSet.txt (file link: http://yunpan.cn/QUL6SxtiJFPfN, extraction code: f246); each sample occupies one line, 1000 lines in all. Each of Helen's samples consists of 3 features:

> Number of frequent flyer miles earned per year

> Percentage of time spent playing video games

> Liters of ice cream consumed per week


Data preprocessing: normalizing the data

Looking at the data, the number of frequent flyer miles earned per year affects the computed distance far more than the other two features, purely because its values are numerically much larger than the other feature values. But Helen considers the three features equally important, so a feature of equal weight should not dominate the result just because of its scale.

When handling features with such different ranges of values, we usually normalize them numerically, for example mapping every value into the range 0 to 1 or -1 to 1.

The formula is: newValue = (oldValue - min) / (max - min)

Here min and max are the smallest and largest values of that feature over the entire data set. We add an auto_norm_data function to normalize the data, sketched below.
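As a sketch of what auto_norm_data has to do, written here as a free function over a plain feature matrix (the name and signature are illustrative; the member-function version appears in knn_2.cc below):

// min-max normalization: for each feature column, find its min and max,
// then rescale every value into [0, 1] via (value - min) / (max - min)
void min_max_normalize(double data[][3], int rows, int cols)
{
    for (int j = 0; j < cols; j++) {
        double mn = data[0][j], mx = data[0][j];
        for (int i = 1; i < rows; i++) {
            if (data[i][j] < mn) mn = data[i][j];
            if (data[i][j] > mx) mx = data[i][j];
        }
        double range = mx - mn; // assumed nonzero: the column is not constant
        for (int i = 0; i < rows; i++)
            data[i][j] = (data[i][j] - mn) / range;
    }
}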

At the same time we design a get_error_rate function to compute the classification error rate: 10% of the data is held out as test data and 90% is used as training data; the percentage is of course configurable. The idea is sketched below.
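Sketch of the hold-out evaluation behind get_error_rate: the first test_num rows act as the test set, each one is classified against the remaining rows, and the fraction of mismatches is reported. The classifier is abstracted into a callback here so the fragment stays self-contained; names are illustrative, and the real member function is in knn_2.cc below.

#include <string>
#include <vector>

// fraction of held-out samples whose predicted label differs from the
// truth; classify(i) is any function returning the predicted label of
// test row i
double error_rate(const std::vector<std::string> &true_labels, int test_num,
                  std::string (*classify)(int))
{
    int wrong = 0;
    for (int i = 0; i < test_num; i++)
        if (classify(i) != true_labels[i])
            wrong++;
    return (double)wrong / test_num;
}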

The rest of the algorithm design is the same as in question one.


The code, knn_2.cc (k = 7):

/* add the get_error_rate function */
#include <iostream>
#include <map>
#include <vector>
#include <stdio.h>
#include <cmath>
#include <cstdlib>
#include <algorithm>
#include <fstream>

using namespace std;

typedef string tLabel;
typedef double tData;
typedef pair<int, double> PAIR;
const int MaxColLen = 10;
const int MaxRowLen = 10000;
ifstream fin;
ofstream fout;

class KNN {
private:
    tData dataSet[MaxRowLen][MaxColLen]; // training set
    tLabel labels[MaxRowLen];            // label of each training sample
    tData testData[MaxColLen];           // current test sample
    int rowLen;
    int colLen;
    int k;
    int test_data_num;                   // number of held-out test rows
    map<int, double> map_index_dis;
    map<tLabel, int> map_label_freq;
    double get_distance(tData *d1, tData *d2);

public:
    KNN(int k, int rowLen, int colLen, char *filename);
    void get_all_distance();
    tLabel get_max_freq_label();
    void auto_norm_data();
    void get_error_rate();
    struct CmpByValue {
        bool operator()(const PAIR &lhs, const PAIR &rhs) {
            return lhs.second < rhs.second;
        }
    };
    ~KNN();
};

KNN::~KNN() {
    fin.close();
    fout.close();
    map_index_dis.clear();
    map_label_freq.clear();
}

KNN::KNN(int k, int row, int col, char *filename) {
    this->rowLen = row;
    this->colLen = col;
    this->k = k;
    test_data_num = 0;
    fin.open(filename);
    fout.open("result.txt");
    if (!fin || !fout) {
        cout << "can not open the file" << endl;
        exit(0);
    }
    // input the training data set
    for (int i = 0; i < rowLen; i++) {
        for (int j = 0; j < colLen; j++) {
            fin >> dataSet[i][j];
            fout << dataSet[i][j] << " ";
        }
        fin >> labels[i];
        fout << labels[i] << endl;
    }
}

void KNN::get_error_rate() {
    int i, j, count = 0;
    tLabel label;
    cout << "please input the number of test data:" << endl;
    cin >> test_data_num;
    for (i = 0; i < test_data_num; i++) {
        for (j = 0; j < colLen; j++)
            testData[j] = dataSet[i][j];
        get_all_distance();
        label = get_max_freq_label();
        if (label != labels[i])
            count++;
        map_index_dis.clear();
        map_label_freq.clear();
    }
    cout << "the error rate is = " << (double)count / (double)test_data_num << endl;
}

double KNN::get_distance(tData *d1, tData *d2) {
    double sum = 0;
    for (int i = 0; i < colLen; i++)
        sum += pow(d1[i] - d2[i], 2);
    return sqrt(sum);
}

void KNN::get_all_distance() {
    // the first test_data_num rows are the test set, so training starts after them
    for (int i = test_data_num; i < rowLen; i++)
        map_index_dis[i] = get_distance(dataSet[i], testData);
}

tLabel KNN::get_max_freq_label() {
    vector<PAIR> vec_index_dis(map_index_dis.begin(), map_index_dis.end());
    sort(vec_index_dis.begin(), vec_index_dis.end(), CmpByValue());
    for (int i = 0; i < k; i++) {
        cout << "the index = " << vec_index_dis[i].first
             << " the distance = " << vec_index_dis[i].second
             << " the label = " << labels[vec_index_dis[i].first]
             << " the coordinate ( ";
        int j;
        for (j = 0; j < colLen - 1; j++)
            cout << dataSet[vec_index_dis[i].first][j] << ",";
        cout << dataSet[vec_index_dis[i].first][j] << " )" << endl;
        map_label_freq[labels[vec_index_dis[i].first]]++;
    }
    map<tLabel, int>::const_iterator map_it = map_label_freq.begin();
    tLabel label;
    int max_freq = 0;
    while (map_it != map_label_freq.end()) {
        if (map_it->second > max_freq) {
            max_freq = map_it->second;
            label = map_it->first;
        }
        map_it++;
    }
    cout << "The test data belongs to the " << label << " label" << endl;
    return label;
}

void KNN::auto_norm_data() {
    tData maxa[MaxColLen];
    tData mina[MaxColLen];
    tData range[MaxColLen];
    int i, j;
    for (i = 0; i < colLen; i++) {
        maxa[i] = max(dataSet[0][i], dataSet[1][i]);
        mina[i] = min(dataSet[0][i], dataSet[1][i]);
    }
    for (i = 2; i < rowLen; i++) {
        for (j = 0; j < colLen; j++) {
            if (dataSet[i][j] > maxa[j])
                maxa[j] = dataSet[i][j];
            else if (dataSet[i][j] < mina[j])
                mina[j] = dataSet[i][j];
        }
    }
    for (i = 0; i < colLen; i++) {
        range[i] = maxa[i] - mina[i];
        // normalize the test data
        testData[i] = (testData[i] - mina[i]) / range[i];
    }
    // normalize the training data set
    for (i = 0; i < rowLen; i++)
        for (j = 0; j < colLen; j++)
            dataSet[i][j] = (dataSet[i][j] - mina[j]) / range[j];
}

int main(int argc, char **argv) {
    int k, row, col;
    char *filename;
    if (argc != 5) {
        cout << "the input should be like this: ./a.out k row col filename" << endl;
        exit(1);
    }
    k = atoi(argv[1]);
    row = atoi(argv[2]);
    col = atoi(argv[3]);
    filename = argv[4];
    KNN knn(k, row, col, filename);
    knn.auto_norm_data();
    knn.get_error_rate();
    return 0;
}
Makefile (run with k = 7, 1000 rows, and 3 feature columns):

target:
	g++ knn_2.cc
	./a.out 7 1000 3 datingTestSet.txt


Results:

With 10% of the data used for testing and 90% for training, the error rate comes out at 4%, which is reasonably accurate.


Building a fully usable system:

Having tested the classifier on held-out data, we can now use it to classify candidates for Helen.

The code, knn_1.cc (k = 7):

/* add the auto_norm_data function */
#include <iostream>
#include <map>
#include <vector>
#include <stdio.h>
#include <cmath>
#include <cstdlib>
#include <algorithm>
#include <fstream>

using namespace std;

typedef string tLabel;
typedef double tData;
typedef pair<int, double> PAIR;
const int MaxColLen = 10;
const int MaxRowLen = 10000;
ifstream fin;
ofstream fout;

class KNN {
private:
    tData dataSet[MaxRowLen][MaxColLen]; // training set
    tLabel labels[MaxRowLen];            // label of each training sample
    tData testData[MaxColLen];           // the person to classify
    int rowLen;
    int colLen;
    int k;
    map<int, double> map_index_dis;
    map<tLabel, int> map_label_freq;
    double get_distance(tData *d1, tData *d2);

public:
    KNN(int k, int rowLen, int colLen, char *filename);
    void get_all_distance();
    tLabel get_max_freq_label();
    void auto_norm_data();
    struct CmpByValue {
        bool operator()(const PAIR &lhs, const PAIR &rhs) {
            return lhs.second < rhs.second;
        }
    };
    ~KNN();
};

KNN::~KNN() {
    fin.close();
    fout.close();
    map_index_dis.clear();
    map_label_freq.clear();
}

KNN::KNN(int k, int row, int col, char *filename) {
    this->rowLen = row;
    this->colLen = col;
    this->k = k;
    fin.open(filename);
    fout.open("result.txt");
    if (!fin || !fout) {
        cout << "can not open the file" << endl;
        exit(0);
    }
    // input the training data set
    for (int i = 0; i < rowLen; i++) {
        for (int j = 0; j < colLen; j++) {
            fin >> dataSet[i][j];
            fout << dataSet[i][j] << " ";
        }
        fin >> labels[i];
        fout << labels[i] << endl;
    }
    // input the test data interactively
    cout << "frequent flier miles earned per year?";
    cin >> testData[0];
    cout << "percentage of time spent playing video games?";
    cin >> testData[1];
    cout << "liters of ice cream consumed per year?";
    cin >> testData[2];
}

double KNN::get_distance(tData *d1, tData *d2) {
    double sum = 0;
    for (int i = 0; i < colLen; i++)
        sum += pow(d1[i] - d2[i], 2);
    return sqrt(sum);
}

void KNN::get_all_distance() {
    for (int i = 0; i < rowLen; i++)
        map_index_dis[i] = get_distance(dataSet[i], testData);
}

tLabel KNN::get_max_freq_label() {
    vector<PAIR> vec_index_dis(map_index_dis.begin(), map_index_dis.end());
    sort(vec_index_dis.begin(), vec_index_dis.end(), CmpByValue());
    for (int i = 0; i < k; i++)
        map_label_freq[labels[vec_index_dis[i].first]]++;
    /* traverse map_label_freq to get the most frequent label */
    map<tLabel, int>::const_iterator map_it = map_label_freq.begin();
    tLabel label;
    int max_freq = 0;
    while (map_it != map_label_freq.end()) {
        if (map_it->second > max_freq) {
            max_freq = map_it->second;
            label = map_it->first;
        }
        map_it++;
    }
    return label;
}

/* normalize the training data set and the test data */
void KNN::auto_norm_data() {
    tData maxa[MaxColLen];
    tData mina[MaxColLen];
    tData range[MaxColLen];
    int i, j;
    for (i = 0; i < colLen; i++) {
        maxa[i] = max(dataSet[0][i], dataSet[1][i]);
        mina[i] = min(dataSet[0][i], dataSet[1][i]);
    }
    for (i = 2; i < rowLen; i++) {
        for (j = 0; j < colLen; j++) {
            if (dataSet[i][j] > maxa[j])
                maxa[j] = dataSet[i][j];
            else if (dataSet[i][j] < mina[j])
                mina[j] = dataSet[i][j];
        }
    }
    for (i = 0; i < colLen; i++) {
        range[i] = maxa[i] - mina[i];
        // normalize the test data
        testData[i] = (testData[i] - mina[i]) / range[i];
    }
    // normalize the training data set
    for (i = 0; i < rowLen; i++)
        for (j = 0; j < colLen; j++)
            dataSet[i][j] = (dataSet[i][j] - mina[j]) / range[j];
}

int main(int argc, char **argv) {
    int k, row, col;
    char *filename;
    if (argc != 5) {
        cout << "the input should be like this: ./a.out k row col filename" << endl;
        exit(1);
    }
    k = atoi(argv[1]);
    row = atoi(argv[2]);
    col = atoi(argv[3]);
    filename = argv[4];
    KNN knn(k, row, col, filename);
    knn.auto_norm_data();
    knn.get_all_distance();
    cout << "you will probably like this person: " << knn.get_max_freq_label() << endl;
    return 0;
}


Makefile:

target:
	g++ knn_1.cc
	./a.out 7 1000 3 datingTestSet.txt
Results: (the program's output screenshot is omitted)



The difference between knn_1.cc and knn_2.cc is that the latter measures the classifier's performance (the classification error rate), while the former directly classifies actual new data.


Original article: http://blog.csdn.net/lavorange/article/details/16924705

