The KNN algorithm implemented in Python

Source: Internet
Author: User
Tags: pow, square root

Keywords: KNN, k-nearest neighbors (KNN) algorithm, Euclidean distance, Manhattan distance

KNN classifies by measuring the distance between different feature values. The idea is that if most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors in feature space) belong to a certain category, then the sample belongs to that category too. k is usually an integer no greater than 20. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified; the method decides which category a sample belongs to based only on the categories of the nearest one or few samples.

KNN computes the distance between objects as a measure of their dissimilarity, which avoids having to match objects against each other directly; the distance used is generally the Euclidean distance or the Manhattan distance. At the same time, KNN makes its decision from the categories of the k surrounding samples rather than from a single object's category. These two points are the main advantages of the KNN algorithm.
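
As a quick illustration of the two metrics (a minimal sketch; the vectors p and q are made-up values, not taken from the article's data set):

# Euclidean and Manhattan distance between two feature vectors
p = [1, 20]
q = [2, 99]

euclidean = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
manhattan = sum(abs(a - b) for a, b in zip(p, q))

print(euclidean)  # ~79.0063
print(manhattan)  # 80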

The idea of the KNN algorithm can be summed up as follows: given a training set of data with labels, when test data is input, its features are compared against the features of every training sample in order to find the k most similar training samples; the category assigned to the test data is the one that occurs most often among those k samples. The algorithm proceeds as follows (a compact sketch follows the list):

1) Calculate the distance between the test data and each training sample;
2) Sort by increasing distance;
3) Select the k points with the smallest distances;
4) Determine the frequency of each category among those k points;
5) Return the category with the highest frequency among the k points as the predicted classification for the test data.
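
The five steps map almost directly onto a compact NumPy version. The sketch below is a minimal illustration under assumed names (knn_classify and the use of collections.Counter are my own choices, not part of the article's implementation, which follows):

import numpy as np
from collections import Counter

def knn_classify(test_point, train_data, train_labels, k):
    # 1) Euclidean distance between the test point and every training point
    diffs = np.asarray(train_data, dtype=float) - np.asarray(test_point, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # 2) + 3) Sort by increasing distance and keep the k nearest indices
    nearest = np.argsort(distances)[:k]
    # 4) Count how often each category occurs among those k points
    votes = Counter(train_labels[i] for i in nearest)
    # 5) Return the most frequent category as the prediction
    return votes.most_common(1)[0][0]

For example, knn_classify([1, 20], [[1, 100], [2, 99], [100, 1]], ['a', 'a', 'b'], k=2) returns 'a'.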

# coding: utf-8
import os
import numpy as np


# Read a text file and build a two-dimensional list
def readdatafile(filename, format=','):
    if not format:
        format = ','
    result = []
    # Remove leading and trailing whitespace from the file name
    filename = filename.strip()
    # Check whether the data file exists
    if os.path.isfile(filename):
        with open(filename, 'r') as file_object:
            for line in file_object:
                tmp = []
                line = line.strip()
                # Every column except the last is a numeric feature
                for value in line.split(format)[:-1]:
                    tmp.append(float(value))
                # The last column is the label
                tmp.append(line.split(format)[-1])
                result.append(tmp)
    else:
        print("%s does not exist" % filename)
    return result


# Read the text data and split it into feature values and label values
def createdata(filename, format=','):
    data_label = readdatafile(filename, format)
    if len(data_label) > 0:
        label = []
        data = []
        # e.g. data_label = [[1, 100, 123, 'a'], [2, 99, 123, 'a'],
        #                    [100, 1, 12, 'b'], [99, 2, 23, 'b']]
        for each in data_label:
            label.append(each[-1])
            data.append(each[:-1])
        return data, label


# Classify the input sample based on the training data and labels
def calculatedistance(input, data, label, k):
    classes = 'error'
    if len(data[0]) == 0 or len(label) == 0:
        print('data or label is empty')
    elif k > len(data):
        print("k: %s is out of bounds" % k)
    elif len(input) != len(data[0]):
        print("not enough feature values: the input has %s features, "
              "the training set has %s" % (len(input), len(data[0])))
    else:
        result = []
        length = len(input)
        for i in range(len(data)):
            total = 0
            for j in range(length):
                # pow(5, 2) is 5 squared (25): accumulate the squared
                # differences between the two points
                total = total + pow(input[j] - data[i][j], 2)
            # Take the square root to get the Euclidean distance
            total = pow(total, 0.5)
            result.append(total)
        result = np.array(result)
        # argsort() sorts the values from small to large and returns the indices
        sorteddistindex = np.argsort(result)
        # Count the labels among the k nearest neighbours
        classcount = {}
        for i in range(k):
            votelabel = label[sorteddistindex[i]]
            # Count how many of the selected k samples belong to each category;
            # dict.get(key, default=None) returns the value for the key, or the
            # default if the key is not in the dictionary
            classcount[votelabel] = classcount.get(votelabel, 0) + 1
        # Select the category that occurs most often
        maxcount = 0
        for key, value in classcount.items():
            if value > maxcount:
                maxcount = value
                classes = key
    return classes


filename = '/home/shutong/jim/crawl/data.csv'
data, label = createdata(filename)
input = [1, 20]
k = 4
result = calculatedistance(input, data, label, k)
print(input, result)
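
The script reads its training set from data.csv, whose contents the article never shows. The five rows below are therefore only a hypothetical example of the expected format (comma-separated numeric features, label in the last column), with two features per row so that they match the test input input = [1, 20]:

1,100,a
2,99,a
3,98,a
100,1,b
99,2,b

With these rows, three of the four nearest neighbours of [1, 20] carry the label a, so the script would print [1, 20] a, consistent with the prediction reported below.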

Test data: the input is input = [1, 20]. Should its label be predicted as a or b?

The final prediction is: a
