The KNN algorithm implemented in Python

Source: Internet
Author: User
Tags: pow, square root

Keywords: KNN, k-nearest neighbors (KNN) algorithm, Euclidean distance, Manhattan distance

KNN classifies by measuring the distance between different feature values. The idea is that if most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors in feature space) belong to a certain category, then the sample belongs to that category too. k is usually an integer no greater than 20. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified; the method decides which category a sample belongs to based only on the categories of the nearest one or few samples.

KNN computes the distance between objects as a measure of their dissimilarity, which avoids having to match objects against each other directly; the distance used is generally the Euclidean distance or the Manhattan distance. At the same time, KNN makes its decision from the categories of the k surrounding samples rather than from a single object's category. These two points are the main advantages of the KNN algorithm.
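
As a quick illustration of the two metrics (a minimal sketch; the vectors p and q are made-up values, not taken from the article's data set):

# Euclidean and Manhattan distance between two feature vectors
p = [1, 20]
q = [2, 99]

euclidean = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
manhattan = sum(abs(a - b) for a, b in zip(p, q))

print(euclidean)  # ~79.0063
print(manhattan)  # 80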

The idea of the KNN algorithm can be summed up as follows: given a training set of data with labels, when test data is input, its features are compared against the features of every training sample in order to find the k most similar training samples; the category assigned to the test data is the one that occurs most often among those k samples. The algorithm proceeds as follows (a compact sketch follows the list):

1) Calculate the distance between the test data and each training sample;
2) Sort by increasing distance;
3) Select the k points with the smallest distances;
4) Determine the frequency of each category among those k points;
5) Return the category with the highest frequency among the k points as the predicted classification for the test data.
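
The five steps map almost directly onto a compact NumPy version. The sketch below is a minimal illustration under assumed names (knn_classify and the use of collections.Counter are my own choices, not part of the article's implementation, which follows):

import numpy as np
from collections import Counter

def knn_classify(test_point, train_data, train_labels, k):
    # 1) Euclidean distance between the test point and every training point
    diffs = np.asarray(train_data, dtype=float) - np.asarray(test_point, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # 2) + 3) Sort by increasing distance and keep the k nearest indices
    nearest = np.argsort(distances)[:k]
    # 4) Count how often each category occurs among those k points
    votes = Counter(train_labels[i] for i in nearest)
    # 5) Return the most frequent category as the prediction
    return votes.most_common(1)[0][0]

For example, knn_classify([1, 20], [[1, 100], [2, 99], [100, 1]], ['a', 'a', 'b'], k=2) returns 'a'.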

# coding: utf-8
import os
import numpy as np


# Read a text file and build a two-dimensional list
def readdatafile(filename, format=','):
    if not format:
        format = ','
    result = []
    # Remove leading and trailing whitespace from the file name
    filename = filename.strip()
    # Check whether the data file exists
    if os.path.isfile(filename):
        with open(filename, 'r') as file_object:
            for line in file_object:
                tmp = []
                line = line.strip()
                # Every column except the last is a numeric feature
                for value in line.split(format)[:-1]:
                    tmp.append(float(value))
                # The last column is the label
                tmp.append(line.split(format)[-1])
                result.append(tmp)
    else:
        print("%s does not exist" % filename)
    return result


# Read the text data and split it into feature values and label values
def createdata(filename, format=','):
    data_label = readdatafile(filename, format)
    if len(data_label) > 0:
        label = []
        data = []
        # e.g. data_label = [[1, 100, 123, 'a'], [2, 99, 123, 'a'],
        #                    [100, 1, 12, 'b'], [99, 2, 23, 'b']]
        for each in data_label:
            label.append(each[-1])
            data.append(each[:-1])
        return data, label


# Classify the input sample based on the training data and labels
def calculatedistance(input, data, label, k):
    classes = 'error'
    if len(data[0]) == 0 or len(label) == 0:
        print('data or label is empty')
    elif k > len(data):
        print("k: %s is out of bounds" % k)
    elif len(input) != len(data[0]):
        print("not enough feature values: the input has %s features, "
              "the training set has %s" % (len(input), len(data[0])))
    else:
        result = []
        length = len(input)
        for i in range(len(data)):
            total = 0
            for j in range(length):
                # pow(5, 2) is 5 squared (25): accumulate the squared
                # differences between the two points
                total = total + pow(input[j] - data[i][j], 2)
            # Take the square root to get the Euclidean distance
            total = pow(total, 0.5)
            result.append(total)
        result = np.array(result)
        # argsort() sorts the values from small to large and returns the indices
        sorteddistindex = np.argsort(result)
        # Count the labels among the k nearest neighbours
        classcount = {}
        for i in range(k):
            votelabel = label[sorteddistindex[i]]
            # Count how many of the selected k samples belong to each category;
            # dict.get(key, default=None) returns the value for the key, or the
            # default if the key is not in the dictionary
            classcount[votelabel] = classcount.get(votelabel, 0) + 1
        # Select the category that occurs most often
        maxcount = 0
        for key, value in classcount.items():
            if value > maxcount:
                maxcount = value
                classes = key
    return classes


filename = '/home/shutong/jim/crawl/data.csv'
data, label = createdata(filename)
input = [1, 20]
k = 4
result = calculatedistance(input, data, label, k)
print(input, result)
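
The script reads its training set from data.csv, whose contents the article never shows. The five rows below are therefore only a hypothetical example of the expected format (comma-separated numeric features, label in the last column), with two features per row so that they match the test input input = [1, 20]:

1,100,a
2,99,a
3,98,a
100,1,b
99,2,b

With these rows, three of the four nearest neighbours of [1, 20] carry the label a, so the script would print [1, 20] a, consistent with the prediction reported below.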

Test data: the input is input = [1, 20]. Should its label be predicted as a or b?

The final prediction is: a
