Introduction to the k-nearest-neighbor (kNN) algorithm:
The kNN algorithm computes the distance between the data point to be classified and every sample point, takes the k samples (usually no more than 20) most similar to that point, counts the category labels of those k samples, and assigns the point to the category that occurs most often among them.
Two points to note:
1. Sometimes the features should be weighted according to how much each one contributes to the classification.
2. If the features contribute equally to the classification but their numeric ranges differ greatly, the features with large values will dominate the distance and distort the result, so the data should be normalized first. Normalization is a common preprocessing step and there are several methods; for a fuller discussion, see the blog post "Re-discussion on normalization in machine learning (normalization method)".
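As a sketch of point 1, one simple way to weight features is to scale each per-feature difference before computing the distance. The function and the weight values below are hypothetical, chosen only for illustration:

```python
import numpy as np

def weighted_distance(a, b, weights):
    """Euclidean distance where each feature difference is scaled by its weight
    (a larger weight gives that feature more influence on the distance)."""
    diff = (np.asarray(a) - np.asarray(b)) * np.asarray(weights)
    return np.sqrt((diff ** 2).sum())

# Made-up example: the second feature counts twice as much as the first.
d = weighted_distance([1.0, 2.0], [0.0, 0.0], [1.0, 2.0])
print(d)  # sqrt(1^2 + 4^2) = sqrt(17)
```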
A linear (min-max) normalization can be implemented in Python as follows:
'''
Min-max normalization
'''
def autonorm(dataset):
    minvalues = dataset.min(0)            # column-wise minimum
    maxvalues = dataset.max(0)            # column-wise maximum
    ranges = maxvalues - minvalues
    normdataset = zeros(shape(dataset))
    m = dataset.shape[0]
    normdataset = dataset - tile(minvalues, (m, 1))
    normdataset = normdataset / tile(ranges, (m, 1))
    return normdataset, ranges, minvalues
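To see what min-max normalization does on a concrete (made-up) dataset, here is an equivalent self-contained version; it uses NumPy broadcasting, which makes the tile() calls unnecessary:

```python
import numpy as np

# Small made-up dataset: two features with very different ranges.
data = np.array([[1.0, 200.0],
                 [3.0, 400.0],
                 [5.0, 600.0]])

min_values = data.min(0)            # column-wise minimum
ranges = data.max(0) - min_values   # column-wise range
norm = (data - min_values) / ranges # broadcasting replaces tile()
print(norm)                         # each column now spans [0, 1]
```

After normalization both columns span [0, 1], so neither feature dominates the distance simply because of its scale.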
The simplest kNN algorithm, implemented in Python 3:
from numpy import *
import operator

'''
group:  sample data
labels: category label of each sample
'''
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

'''
Define the kNN algorithm
'''
def classify0(inX, dataSet, labels, k):
    '''
    :param inX:     data to be classified
    :param dataSet: sample data
    :param labels:  sample data labels
    :param k:       number of nearest points to consider
    :return:        category of the data to be classified
    '''
    # Compute the Euclidean distances
    # shape: NumPy attribute giving the dimensions of an array or tuple
    dataSetSize = dataSet.shape[0]
    # tile: NumPy function that repeats an array to build a larger one
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # argsort: NumPy function that returns the indices that would sort the array
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Count the categories of the first k points and return the most frequent one
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # items(): Python 3 syntax; Python 2 uses iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

if __name__ == "__main__":
    group, labels = createDataSet()
    print(classify0([0, 0], group, labels, 3))
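The same algorithm can be written more compactly with NumPy broadcasting and collections.Counter instead of tile() and a hand-rolled vote dictionary; this is a self-contained sketch of that alternative, not the author's original code:

```python
from collections import Counter
import numpy as np

def classify_knn(inx, dataset, labels, k):
    """Classify inx by majority vote among its k nearest sample points."""
    # Broadcasting subtracts inx from every row, replacing tile().
    distances = np.sqrt(((dataset - inx) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]            # indices of the k closest samples
    votes = Counter(labels[i] for i in nearest)  # count labels among the k neighbors
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_knn(np.array([0, 0]), group, labels, 3))  # → B
```

Counter.most_common(1) returns the single most frequent label, which removes the need for operator.itemgetter and the explicit sorted() call.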
Sample data stored in a text file can be read with the following code:
'''
Read data from a text file
'''
def file2matrix(filename, num):
    '''
    Note: the last column of the file holds the label data
    :param filename: file name
    :param num:      number of features per sample
    :return:         sample array and labels
    '''
    fr = open(filename)
    arrayOfLines = fr.readlines()        # read the file into a list of lines; note the difference from readline()
    numberOfLines = len(arrayOfLines)    # number of lines in the file
    returnMat = zeros((numberOfLines, num))
    classLabelVector = []
    index = 0
    for line in arrayOfLines:
        line = line.strip()              # strip the trailing newline
        listFromLine = line.split('\t')  # split the line on tab characters into a list
        returnMat[index, :] = listFromLine[0:num]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
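To check the parser end to end, the self-contained sketch below writes a small tab-separated file (the file name and contents are made up for this example) and reads it back with an equivalent file2matrix:

```python
import numpy as np

def file2matrix(filename, num):
    """Parse a tab-separated file: first `num` columns are features, last column is the label."""
    with open(filename) as fr:
        lines = fr.readlines()
    return_mat = np.zeros((len(lines), num))
    class_labels = []
    for index, line in enumerate(lines):
        parts = line.strip().split('\t')
        return_mat[index, :] = parts[0:num]       # feature columns
        class_labels.append(int(parts[-1]))       # last column is the label
    return return_mat, class_labels

# Hypothetical sample file: two features per line plus an integer label.
with open('knn_sample.txt', 'w') as f:
    f.write('1.0\t1.1\t1\n0.0\t0.1\t2\n')

mat, labels = file2matrix('knn_sample.txt', 2)
print(mat)
print(labels)  # → [1, 2]
```

The returned feature matrix can then be fed to autonorm and classify0 from the sections above.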