(If you repost this article, please credit the source: http://blog.csdn.net/buptgshengod)
1. Background
From now on I will post a machine learning algorithm and its Python implementation every week. Today's algorithm is k-nearest neighbors (KNN), a supervised learning algorithm used for classification.
What are supervised and unsupervised learning? Supervised learning is what we use when the target variable of the training data is known. Unsupervised learning is what we use when there is no known target variable. Supervised learning is further divided into classification algorithms and regression algorithms, depending on whether the target variable is discrete or continuous.
In K-Nearest Neighbor, K is a parameter of the algorithm: the number of neighbors that are consulted. The general idea is fairly simple: treat the feature values of each sample in the dataset as a vector. We give the program a set of feature values to classify. If there are three features, we can think of this input as a vector (x1, x2, x3). Each sample already stored in the system can likewise be seen as a vector (y1, y2, y3). By computing the distance between x and every y, we find the k vectors y that are closest to x. The target values of those k samples decide the class of x: the class that occurs most often among them wins.
Formula (the Euclidean distance, which is what the classify function in section 4 computes): d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}
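The distance-and-vote idea described above can be sketched in a few lines of plain Python. This is only an illustration: the sample vectors, the labels 'A' and 'B', the query point, and the helper name euclidean are made up for the example and are not part of the original post.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical stored samples: (feature vector y, known class label).
samples = [
    ((1.0, 1.0, 1.0), 'A'),
    ((1.0, 2.0, 1.0), 'A'),
    ((5.0, 5.0, 5.0), 'B'),
    ((6.0, 5.0, 5.0), 'B'),
]

query = (1.5, 1.5, 1.0)  # hypothetical vector x to classify
k = 3

# Sort the stored samples by distance to the query and keep the k closest.
nearest = sorted(samples, key=lambda s: euclidean(query, s[0]))[:k]

# Majority vote among the labels of the k nearest samples.
votes = Counter(label for _, label in nearest)
predicted = votes.most_common(1)[0][0]
print(predicted)  # A
```

Two of the three nearest samples carry the label 'A', so the query is classified as 'A'.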
2. Python basics
numpy is a numerical computing library for Python, used mainly for array and matrix operations. We will use it a lot here. This section describes the numpy functions that appear in the code.
array: the array type provided by numpy. For example, the four rows and two columns of numbers used in this example (the same data as in createDataset in section 4) can be entered as follows:
group = array([[9, 400], [200, 5], [100, 77], [40, 300]])
shape: returns the dimensions of an array as (rows, columns). Example: shape(group) = (4, 2)
zeros: creates an array of the given shape filled with zeros. Example: zeros(shape(group)) = [[0, 0], [0, 0], [0, 0], [0, 0]]
tile: a numpy function that repeats an array. For example, tile(A, n) repeats array A n times to form a new array, and tile(A, (m, 1)) stacks m copies of A as rows.
sum(axis=1): adds up the elements of each row of a matrix and returns one sum per row.
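The functions listed above can be tried out directly. The snippet below is a small sketch that uses the dataset from createDataset. It imports numpy as np instead of the `from numpy import *` style used in the post's code, which is my choice for the example, not the author's.

```python
import numpy as np

# The four-row, two-column dataset used in this post.
group = np.array([[9, 400], [200, 5], [100, 77], [40, 300]])

# shape: (rows, columns)
print(group.shape)  # (4, 2)

# zeros: an array of the same shape, filled with 0.0
empty = np.zeros(group.shape)

# tile: repeat a 1-D array; (4, 1) means 4 copies stacked as rows
tiled = np.tile([1, 2], (4, 1))
print(tiled.shape)  # (4, 2)

# sum(axis=1): one sum per row
row_sums = group.sum(axis=1)
print(row_sums)  # [409 205 177 340]
```

The tile call with (m, 1) is the form used in autoNorm and classify, where one vector has to be lined up against every row of the dataset.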
3. Dataset
4. The code is divided into three functions:
Create a dataset:
CreateDataset
from __future__ import division
from numpy import *
import operator

def createDataset():
    # Four samples with two features each, and the class label of each sample.
    group = array([[9, 400], [200, 5], [100, 77], [40, 300]])
    labels = ['1', '2', '3', '1']
    return group, labels
Data normalization:
AutoNorm
def autoNorm(dataSet):
    minVals = dataSet.min(0)    # column-wise minimum
    maxVals = dataSet.max(0)    # column-wise maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]        # number of rows
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))    # element-wise divide
    return normDataSet, ranges, minVals
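To see what this normalization does to the sample data, here is a small self-contained sketch. It computes the same (value - min) / range scaling per column, but relies on numpy broadcasting instead of tile. The function name min_max_normalize is hypothetical and used only for this example.

```python
import numpy as np

def min_max_normalize(data):
    """Scale every column to the range [0, 1] (hypothetical helper name).

    Assumes no column is constant; a constant column would have a range
    of 0 and cause a division by zero.
    """
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Broadcasting applies min_vals and ranges to every row.
    return (data - min_vals) / ranges, ranges, min_vals

group = np.array([[9, 400], [200, 5], [100, 77], [40, 300]], dtype=float)
norm, ranges, min_vals = min_max_normalize(group)

print(min_vals)  # [9. 5.]
print(ranges)    # [191. 395.]
print(norm[0])   # [0. 1.]
print(norm[1])   # [1. 0.]
```

Without this step the second feature (values up to 400) would dominate the distance simply because its numbers are larger. After scaling, both features contribute on the same 0 to 1 scale.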
Classification function:
Classify
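def classify(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # Indices of the samples, ordered from nearest to farthest
    sortedDistIndicies = distances.argsort()
    # Count the labels of the k nearest samples
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() works in both Python 2 and Python 3; iteritems() exists only in Python 2
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # Return the label with the most votes
    return sortedClassCount[0][0]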
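The three steps can be put together as follows. This is a compact, self-contained sketch and not the code from the post: knn_classify is a hypothetical name, it uses broadcasting and collections.Counter in place of tile and the dictionary count, and the query point [30, 350] is made up for the example. Note that the query has to be scaled with the same minimum and range values as the training data before it is classified.

```python
import numpy as np
from collections import Counter

def knn_classify(query, data, labels, k):
    """Return the majority label among the k rows of data nearest to query."""
    distances = np.sqrt(((data - query) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Dataset and labels from createDataset.
group = np.array([[9, 400], [200, 5], [100, 77], [40, 300]], dtype=float)
labels = ['1', '2', '3', '1']

# Normalize the dataset, then the query with the same min and range.
min_vals = group.min(axis=0)
ranges = group.max(axis=0) - min_vals
norm_group = (group - min_vals) / ranges

query = np.array([30, 350], dtype=float)  # hypothetical point to classify
norm_query = (query - min_vals) / ranges

result = knn_classify(norm_query, norm_group, labels, k=3)
print(result)  # 1
```

The two samples closest to the query are [40, 300] and [9, 400], both labelled '1', so with k=3 the vote is two to one in favour of class '1'.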
5. Display Results
6. Download Code