K-Nearest Neighbor algorithm

Source: Internet
Author: User

Starting today, I will share my notes and comments on the book "Machine Learning in Action". I will annotate the source code in detail; this is my own learning process, but I also hope it offers a way for others to learn along with me.

K-Nearest Neighbor algorithm definition

The k-nearest neighbor (kNN) algorithm classifies a sample by measuring the distance between feature values. The idea is this: if, among the k samples most similar to a given sample in feature space (that is, its nearest neighbors in feature space), the majority belong to one category, then the sample belongs to that category.

Put more formally: given a training dataset and a new input instance, find the k instances in the training set nearest to the input (the k neighbors above); whichever class the majority of those k instances belong to, the input instance is classified into that class.

Advantages and disadvantages of K-Nearest neighbor algorithm

This is quoted from the original English edition of Machine Learning in Action:
Pros: high accuracy, insensitive to outliers, no assumptions about data
Cons: computationally expensive, requires a lot of memory
Works with: numeric values, nominal values

K-Nearest Neighbor algorithm flow

For each point in the dataset with an unknown category, do the following:
(1) Calculate the distance between each point in the known-category dataset and the current point;
(2) Sort by distance in ascending order;
(3) Select the k points with the smallest distance to the current point;
(4) Determine the frequency of each category among those k points;
(5) Return the most frequent category among the k points as the predicted classification of the current point.
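The five steps above can be sketched directly in NumPy. This is a minimal illustration on a toy dataset (the points, labels, and query value are made up for illustration, not from the book's data):

```python
import numpy as np
from collections import Counter

dataset = np.array([[1.0, 1.1],
                    [1.0, 1.0],
                    [0.0, 0.0],
                    [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
inX = np.array([1.0, 1.2])   # the unknown point to classify
k = 3

# (1) distance from every known point to the current point
distances = np.sqrt(((dataset - inX) ** 2).sum(axis=1))
# (2)(3) sort ascending and keep the indices of the k closest points
nearest = distances.argsort()[:k]
# (4) count the categories among the k nearest points
votes = Counter(labels[i] for i in nearest)
# (5) the most frequent category is the prediction
result = votes.most_common(1)[0][0]
print(result)   # 'A'
```

The full classify0 implementation later in this article follows exactly this structure, only spelled out step by step.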

Implementation of K-nearest neighbor algorithm

The KNN algorithm is implemented according to the above algorithm flow.

Python prerequisite knowledge

Let's first go over some of the NumPy functions used in the program.
1. tile
tile(A, reps)
Construct an array by repeating A the number of times given by reps.
That is, the array A is expanded reps times. A bit abstract; let's look at examples:

>>> a = np.array([0, 1, 2])
>>> np.tile(a, 2)
array([0, 1, 2, 0, 1, 2])
>>> np.tile(a, (2, 2))
array([[0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2]])
>>> np.tile(a, (2, 1, 2))
array([[[0, 1, 2, 0, 1, 2]],
       [[0, 1, 2, 0, 1, 2]]])
>>> b = np.array([[1, 2], [3, 4]])
>>> np.tile(b, 2)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
>>> np.tile(b, (2, 1))
array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])

In fact, if you treat A as a single element, the resulting array has the shape given by reps, and each element of that array is a copy of A.
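This is exactly how tile is used in the distance step of classify0 below: the single input point is expanded to one row per dataset row, then subtracted element-wise. A small sketch with illustrative toy values (note that NumPy broadcasting would also allow `dataSet - inX` directly, without tile):

```python
import numpy as np

inX = np.array([0, 0])           # the point to classify
dataSet = np.array([[1, 2],
                    [3, 4],
                    [5, 6]])     # three known points
# expand inX to dataSet.shape[0] rows, then subtract element-wise
diffMat = np.tile(inX, (dataSet.shape[0], 1)) - dataSet
print(diffMat)
# [[-1 -2]
#  [-3 -4]
#  [-5 -6]]
```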
2. sum
sum(a, axis=None, dtype=None, out=None, keepdims=False)
Sum of array elements over a given axis.
axis: axis or axes along which a sum is performed. The default (axis=None) is to sum over all dimensions of the input array. axis may also be negative, in which case it counts from the last axis to the first.

>>> np.sum([0.5, 1.5])
2.0
>>> np.sum([0.5, 0.7, 0.2, 1.5], dtype=np.int32)
1
>>> np.sum([[0, 1], [0, 5]])
6
>>> np.sum([[0, 1], [0, 5]], axis=0)
array([0, 6])
>>> np.sum([[0, 1], [0, 5]], axis=1)
array([1, 5])

For a two-dimensional array, axis=0 adds down the columns and axis=1 adds across the rows.
3. The difference between sort, sorted, and argsort
sort is a method of the list type only and does not apply to other, non-list sequences. It sorts in place and does not return the sorted object (it returns None).
sorted is a built-in function that works on any iterable; it returns a new sorted object without changing the original.
argsort is a NumPy function that returns the indices of the sorted elements in the original array.
For example:

>>> x = np.array([3, 6, 0, 1, 5])
>>> x.argsort()
array([2, 3, 0, 4, 1])
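The in-place versus copy distinction above can also be shown directly (toy list values for illustration):

```python
lst = [3, 6, 0, 1, 5]
returned = lst.sort()            # sorts lst in place...
print(lst)                       # [0, 1, 3, 5, 6]
print(returned)                  # ...and returns None

copy = sorted((3, 6, 0, 1, 5))   # works on any iterable, returns a new list
print(copy)                      # [0, 1, 3, 5, 6]
```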
Experimental data

"Machine Learning in action" experimental data and source code download
Here the experimental data is a text file with 1000 lines; the four fields on each line mean: 1. number of frequent flyer miles earned per year; 2. percentage of time spent playing video games; 3. liters of ice cream consumed per week; 4. how much she likes the person (one of: didntLike, smallDoses, largeDoses).
The data is broadly as follows:

40920   8.326976    0.953952    largeDoses
14488   7.153469    1.673904    smallDoses
26052   1.441871    0.805124    didntLike
75136   13.147394   0.428964    didntLike
38344   1.669788    0.134296    didntLike
72993   10.141740   1.032955    didntLike
35948   6.830792    1.213192    largeDoses
42666   13.276369   0.543880    largeDoses
67497   8.631577    0.749278    didntLike
35483   12.273169   1.508053    largeDoses
50242   3.723498    0.831917    didntLike
63275   8.385879    1.669485    didntLike
5569    4.875435    0.728658    smallDoses
...
Python Source code

The classify0 function is the source-code implementation of the k-nearest neighbor algorithm; the file2matrix function reads the data from the file into arrays that are then passed to classify0 for processing.

import operator
from numpy import tile, zeros

# @param inX      data to classify (a one-dimensional array)
# @param dataSet  dataset of known classes (a two-dimensional array)
# @param labels   category labels for the known dataset (a one-dimensional array)
# @param k        the parameter k of the k-nearest neighbor algorithm
# @return         the predicted label for inX
def classify0(inX, dataSet, labels, k):
    # ndarray.shape is a tuple of integers giving the size of the array in each
    # dimension; for a matrix with n rows and m columns, shape is (n, m)
    dataSetSize = dataSet.shape[0]            # number of rows in the dataset
    # numpy.tile(A, reps) constructs an array by repeating A reps times:
    # expand inX to dataSetSize rows, then subtract the dataset element-wise
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)       # sum over each matrix row
    distances = sqDistances ** 0.5            # square root: the computed distances
    sortedDistIndicies = distances.argsort()  # indices after sorting from small to large
    classCount = {}
    for i in range(k):
        # label of the data point with the i-th smallest distance
        voteIlabel = labels[sortedDistIndicies[i]]
        # count the occurrences of each label (0 in get() is the default value)
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # both list.sort() and sorted() accept a Boolean reverse parameter to flag a
    # descending sort; the key sorts the (label, count) pairs by their count,
    # from large to small (the book uses iteritems(), which is Python 2 only)
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]             # the label with the most occurrences

# @param filename  name of the file the data is saved in
# @return          the data matrix and the label list
def file2matrix(filename):
    fr = open(filename)
    contents = fr.readlines()
    numberOfLines = len(contents)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in contents:
        # strip() returns a copy of the string with leading and trailing whitespace removed
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(listFromLine[-1])
        index += 1
    return returnMat, classLabelVector
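As a quick sanity check of classify0, the following sketch repeats the function so it runs standalone and classifies a point against a four-point toy dataset (the concrete values are illustrative, in the style of the book's small example, not the date file above):

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    # same algorithm as above, repeated so this snippet is self-contained
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
prediction = classify0(np.array([0.0, 0.0]), group, labels, 3)
print(prediction)   # 'B' -- two of the three nearest points are labeled B
```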

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission.

