There are many introductions to the K-nearest neighbor (KNN) algorithm, and most Python implementations of it trace back to the introductory machine learning book "Machine Learning in Action". The algorithm itself is very simple, but many beginners do not fully understand the Python version of the source code, so this article analyzes that source.
What is the K-nearest neighbor algorithm?
Simply put, the K-nearest neighbor algorithm classifies a sample by measuring the distance between its feature values and those of labeled samples. It is therefore a classification algorithm.
Pros: makes no assumptions about the input data; insensitive to outliers
Cons: high computational and space complexity
Here is the code first, followed by the analysis (the code comes from "Machine Learning in Action"):

    def classify0(inX, dataSet, labels, k):
        # Number of samples (rows) in the sample set
        dataSetSize = dataSet.shape[0]
        # Difference between inX and every sample, row by row
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet
        sqDiffMat = diffMat ** 2
        sqDistance = sqDiffMat.sum(axis=1)
        # Euclidean distance from inX to each sample
        distances = sqDistance ** 0.5
        # Indices of the samples, nearest first
        sortedDistances = distances.argsort()
        classCount = {}
        for i in range(k):
            label = labels[sortedDistances[i]]
            classCount[label] = classCount.get(label, 0) + 1
        # The book calls iteritems(), which exists only in Python 2; items() works in Python 2 and 3
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
The function works on the following principle:
We have a collection of sample data, also known as a training set, in which every sample has a label. When new, unlabeled data is entered, each of its features is compared with the corresponding feature of the samples in the set, and the class labels of the most similar samples (the nearest neighbors) are extracted. In general, only the k most similar samples in the data set are considered. Finally, the class that occurs most often among those k samples is taken as the class of the new data.
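The steps just described can be sketched in plain Python, without NumPy. This is only an illustration under stated assumptions: the function name knn_classify and the variable names are hypothetical and do not come from the book, and the sample values are the ones used later in this article.

```python
import math
from collections import Counter


def knn_classify(query, samples, labels, k):
    """Return the majority label among the k samples closest to `query`."""
    # 1. Euclidean distance from the query to every labeled sample.
    distances = [
        math.sqrt(sum((q - s) ** 2 for q, s in zip(query, sample)))
        for sample in samples
    ]
    # 2. Sample indices ordered nearest first; keep only the first k.
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # 3. Majority vote among the labels of those k neighbors.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


samples = [[1.0, 1.1], [1.0, 1.1], [0.0, 0.0], [0.0, 0.1]]
labels = ["A", "A", "B", "B"]
print(knn_classify([0, 0], samples, labels, 3))  # B
```

With k = 3 the three nearest samples carry the labels B, B and A, so the vote goes to B.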
The parameters of the classify0 function have the following meanings:
inX: the new, unlabeled input data, expressed as a vector.
dataSet: the sample set, expressed as a 2-D array with one row per sample.
labels: the labels of the corresponding samples in the sample set.
k: the number of nearest neighbors to select.
A simple function that produces some sample data:

    def create_dataset():
        group = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
        labels = ['A', 'A', 'B', 'B']
        return group, labels
Note that array comes from NumPy, so we need to import it first:

    from numpy import *
    import operator
When we call it,

    group, labels = create_dataset()
    result = classify0([0, 0], group, labels, 3)
    print(result)
Clearly, the feature vector [0, 0] belongs to class B, and the code above prints B.
Even knowing all this, beginners may still find the actual code unfamiliar. No need to hurry; the main part starts now!
Source Analysis

    dataSetSize = dataSet.shape[0]
shape is an attribute of an array that describes its "shape", that is, its size along each dimension. For example:

    In [2]: dataSet = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
    In [3]: print(dataSet.shape)
    (4, 2)

So dataSet.shape[0] is the number of samples in the sample set.
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet

The tile(A, reps) function builds a new array by repeating array A; the second parameter specifies how many times A is repeated along each dimension. Its API description is a bit convoluted, but the basic usage is easy to grasp from a few examples.
Let's look at the result of tile(inX, (4, 1)), with x = [0, 0] standing in for inX:

    In [5]: tile(x, (4, 1))
    Out[5]:
    array([[0, 0],
           [0, 0],
           [0, 0],
           [0, 0]])
As you can see, the 4 multiplies the number of rows (originally 1, now 4), and the 1 multiplies the number of elements per row (originally 2, still 2).
To confirm this conclusion,

    In [6]: tile(x, (4, 2))
    Out[6]:
    array([[0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]])

and

    In [7]: tile(x, (2, 2))
    Out[7]:
    array([[0, 0, 0, 0],
           [0, 0, 0, 0]])
For more details on tile, please consult the NumPy API documentation.
After building the tiled array, subtract dataSet from it. This works like matrix subtraction, element by element, and the result is still a 4 x 2 array.

    In [8]: tile(x, (4, 1)) - dataSet
    Out[8]:
    array([[-1. , -1.1],
           [-1. , -1.1],
           [ 0. ,  0. ],
           [ 0. , -0.1]])
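The same difference can also be obtained without tile. As a minimal sketch, NumPy broadcasting lets a length-2 vector be subtracted from a 4 x 2 array directly; x and dataSet below are the example values used in this article, and the names tiled and broadcast are only illustrative.

```python
import numpy as np

x = np.array([0, 0])
dataSet = np.array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])

# Explicit copies of x, one per sample, as in classify0
tiled = np.tile(x, (4, 1)) - dataSet
# Broadcasting: NumPy stretches x across the 4 rows on its own
broadcast = x - dataSet

print(tiled.shape)                       # (4, 2)
print(np.array_equal(tiled, broadcast))  # True
```

The book's code uses tile, so the rest of this analysis sticks with it; the broadcast form is shown only for comparison.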
With the Euclidean distance formula in mind, the lines that follow are clear: square each difference, sum the squares, and take the square root.
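As a quick check of those three steps, the sketch below computes the distances stepwise and compares them with np.linalg.norm, which computes the Euclidean norm of each row when axis=1 is given. The variable names stepwise and direct are illustrative, and the data are the example values from this article.

```python
import numpy as np

x = np.array([0, 0])
dataSet = np.array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])

diffMat = x - dataSet
# Square, sum along each row, then take the square root
stepwise = ((diffMat ** 2).sum(axis=1)) ** 0.5
# Euclidean norm of each row, computed in one call
direct = np.linalg.norm(diffMat, axis=1)

print(np.allclose(stepwise, direct))  # True
```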
Let's look at the summation step,

    sqDiffMat.sum(axis=1)

where

    In [14]: sqDiffMat
    Out[14]:
    array([[1.  , 1.21],
           [1.  , 1.21],
           [0.  , 0.  ],
           [0.  , 0.01]])

The sum is taken along each row, which gives one value per sample, that is, an array of n values.
If you want to sum the columns instead,

    sqDiffMat.sum(axis=0)
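The difference between the two axes is easiest to see from the shapes of the results. A small sketch using the sqDiffMat values shown above (row_sums and col_sums are illustrative names):

```python
import numpy as np

sqDiffMat = np.array([[1.0, 1.21], [1.0, 1.21], [0.0, 0.0], [0.0, 0.01]])

row_sums = sqDiffMat.sum(axis=1)  # one value per row, i.e. per sample
col_sums = sqDiffMat.sum(axis=0)  # one value per column, i.e. per feature

print(row_sums.shape)  # (4,)
print(col_sums.shape)  # (2,)
```

classify0 needs one distance per sample, which is why it sums with axis=1.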
argsort() returns the indices that would sort the array in ascending order; it does not return the sorted values themselves.
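A short sketch of what argsort returns, using rounded distances similar to the ones in this example. The kind="stable" argument is an assumption added here only so that the two tied distances keep their original order; classify0 itself calls argsort() with the default settings.

```python
import numpy as np

distances = np.array([1.49, 1.49, 0.0, 0.1])

# Positions of the elements, from the smallest distance to the largest
order = distances.argsort(kind="stable")

print(order.tolist())             # [2, 3, 0, 1]
print(distances[order].tolist())  # [0.0, 0.1, 1.49, 1.49]
```

The first k entries of order are therefore the row numbers of the k nearest samples, which is what the loop in classify0 uses to look up labels.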
classCount is a dictionary whose keys are labels and whose values are the number of times each label appears among the k nearest samples.
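The voting step can be tried in isolation. In the sketch below, neighbor_labels is a hypothetical list holding the labels of the k = 3 nearest samples from the example; the last line shows collections.Counter from the standard library as an alternative that is not used in the book.

```python
import operator
from collections import Counter

neighbor_labels = ["B", "B", "A"]  # labels of the 3 nearest samples (illustrative)

# Same counting pattern as in classify0
classCount = {}
for label in neighbor_labels:
    classCount[label] = classCount.get(label, 0) + 1

# Sort the (label, count) pairs by count, largest first
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(sortedClassCount)        # [('B', 2), ('A', 1)]
print(sortedClassCount[0][0])  # B

# Standard-library alternative for the same vote
print(Counter(neighbor_labels).most_common(1)[0][0])  # B
```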
With that, the specific code details of the algorithm should be clear.
Python implementation of K-nearest neighbor algorithm: source code Analysis