Lazy learning: simply store the data, wait until a test tuple is given, and classify it according to its similarity to the stored tuples. The KNN (k-Nearest Neighbor) classification method was proposed in the 1950s; because the algorithm is computationally intensive, it only came into wide use after the 1960s as computing power increased.
KNN is based on learning by analogy: a given test tuple is represented as a point in an n-dimensional space, where n is the number of attributes. A distance metric is then used to find the k training tuples closest to the test tuple, the class labels of those k tuples are counted, and the majority class is returned as the class of the unknown test tuple.
The most commonly used distance measure is the Euclidean distance, also known as the L2 norm. To reduce the influence that attributes with different value ranges have on the distance computation, min-max normalization is used to map attribute values onto the [0, 1] interval.
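Written out, for two tuples $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$, and an attribute $A$ with observed minimum $\min_A$ and maximum $\max_A$, these two standard formulas are:

$$\mathrm{dist}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad v' = \frac{v - \min_A}{\max_A - \min_A}$$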
Given these characteristics, the KNN algorithm is best suited to numeric attributes. Ordinal attributes can be converted to numeric values, and nominal attributes also work reasonably well after encoding and normalization, but binary attributes may not be handled very well. Main advantages and disadvantages:
Advantages: high accuracy, insensitive to noisy data, no assumptions about the input data required
Disadvantages: high time and space complexity; the value of k must be chosen, which may require considerable experience
Below, the KNN implementation from the book "Machine Learning in Action" is applied to real spam data. The data set contains 3,065 training samples and 1,536 test samples. Each sample has 57 features, a mix of numeric and binary attributes, plus a class label in {0, 1}, where 0 means not spam and 1 means spam.
First, the data is read from the file:
import os
from numpy import zeros

TRAINSET_NUM = 3065    # number of training samples (from the data description above)
TESTSET_NUM = 1536     # number of test samples
FEATURES_NUM = 57      # features per sample

def loaddataset(fp):
    if os.path.exists(fp):
        fh = open(fp, 'r')
        try:
            rtntrainset = zeros((TRAINSET_NUM, FEATURES_NUM))
            trainlabel = []
            rtntestset = zeros((TESTSET_NUM, FEATURES_NUM))
            testlabel = []
            i = 0
            for line in fh:
                terms = line.strip().split(',')
                if i < TRAINSET_NUM:
                    # the first TRAINSET_NUM rows form the training set
                    rtntrainset[i, :] = terms[0:FEATURES_NUM]
                    trainlabel.append(int(terms[FEATURES_NUM]))
                else:
                    # the remaining rows form the test set
                    rtntestset[i - TRAINSET_NUM, :] = terms[0:FEATURES_NUM]
                    testlabel.append(int(terms[FEATURES_NUM]))
                i += 1
        except Exception as msg:
            print('An unexpected error occurred:', msg)
        finally:
            fh.close()
        return rtntrainset, trainlabel, rtntestset, testlabel
    else:
        print("The data file does not exist!")
        return None
Each row of the file represents one sample: the features are separated by commas and the last field is the class label. NumPy, a third-party library, is used here to store the data in a matrix.
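For example, assuming all samples sit in one comma-separated file (the filename below is only an assumption; the original post does not give one), the loader would be called as:

trainset, trainlabel, testset, testlabel = loaddataset('spambase.data')  # filename assumed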
Next, the data is normalized onto the [0, 1] interval:
from numpy import tile

def normalize(ds):
    # per-column minimum and maximum (axis 0)
    minvals = ds.min(0)
    maxvals = ds.max(0)
    ranges = maxvals - minvals
    n = ds.shape[0]
    # scale every attribute value into the [0, 1] interval
    normds = ds - tile(minvals, (n, 1))
    normds = normds / tile(ranges, (n, 1))
    return normds
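A minimal usage sketch, assuming the variables returned by the loader above (note that a constant feature column would make its range zero and cause a division by zero, so the data is assumed to have none):

normtrain = normalize(trainset)
normtest = normalize(testset)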
Finally, the classifier itself: for each test sample, the k nearest neighbors among the training samples are computed, and the most frequent class among them is returned.
import operator
from numpy import tile

def knnclassify(ds, labels, k, inputx):
    dssize = ds.shape[0]
    # Euclidean distance from inputx to every training tuple
    diff = tile(inputx, (dssize, 1)) - ds
    sqdiff = diff ** 2
    sqdist = sqdiff.sum(axis=1)
    dist = sqdist ** 0.5
    # indices of the training tuples sorted by increasing distance
    sorteddist = dist.argsort()
    classcount = {}
    for i in range(k):
        votedlabel = labels[sorteddist[i]]
        classcount[votedlabel] = classcount.get(votedlabel, 0) + 1
    # majority vote among the k nearest neighbors
    sortedclasscount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedclasscount[0][0]
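Putting the pieces together, the evaluation loop looks roughly like the following; this is a minimal sketch built from the functions and variables above, not the original post's exact test harness:

k = 10
correct = 0
for i in range(normtest.shape[0]):
    # classify each test sample against the normalized training set
    if knnclassify(normtrain, trainlabel, k, normtest[i, :]) == testlabel[i]:
        correct += 1
print('Test accuracy: %.4f' % (correct / float(normtest.shape[0])))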
All 1,536 test samples were classified with several different values of k, the accuracy for each k was recorded, and the results were plotted (figure: test accuracy versus k).
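The original plot is not reproduced here, but a curve like it could be generated with matplotlib as in the following sketch (the exact range of k values tried is an assumption, and the variables come from the snippets above):

import matplotlib.pyplot as plt

ks = list(range(1, 26))    # assumed range of k values
accuracies = []
for k in ks:
    correct = 0
    for i in range(normtest.shape[0]):
        if knnclassify(normtrain, trainlabel, k, normtest[i, :]) == testlabel[i]:
            correct += 1
    accuracies.append(correct / float(normtest.shape[0]))

plt.plot(ks, accuracies)
plt.xlabel('k')
plt.ylabel('test accuracy')
plt.show()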
The plot shows that accuracy peaks at about 75% when k = 10. This result is not particularly good; the main reason is that some of the 57 spam features are binary attributes, which have a large distorting effect on the distance computation when KNN is applied this way.
Finally, classifying the training set itself with k = 10 gives the following result:
Train accuracy: 0.9282
Machine Learning in Action: kNN classifier