K-Nearest Neighbor Algorithm for machine learning in Python
Preface
I recently started learning machine learning and found a book on the Internet called "Machine Learning in Action". Conveniently, the algorithms in the book are implemented in Python, and I had already learned some Python basics, so the book was easy for me to follow. Now, on to the actual content.
What is the K-Nearest Neighbor Algorithm?
In short, the K-Nearest Neighbor algorithm classifies data by measuring the distance between feature values. It works as follows: we have a sample data set, also called the training set, in which every item has a label, so we know which category each item belongs to. When new, unlabeled data comes in, we compare each of its features with the corresponding features of the items in the sample set, and the algorithm extracts the class labels of the most similar items (the nearest neighbors). In general, only the k most similar items in the sample set are considered, which is where the name K-Nearest Neighbor comes from. The new data is assigned the category that occurs most often among those k items.
Q: Is the K-Nearest Neighbor algorithm a supervised or an unsupervised learning method?
Use Python to import data
As the working principle shows, classifying data with the K-Nearest Neighbor algorithm requires sample data on hand; without it, there is nothing to build a classification function from. The first step is therefore to import the sample data set.
Create a module named kNN.py and write the following code:
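    from numpy import *
    import operator

    def createDataSet():
        # Four sample points with two features each
        group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
        # Class label of each sample, in the same order as the rows of group
        labels = ['A','A','B','B']
        return group, labels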
The code imports two Python modules: the scientific computing package NumPy and the operator module. NumPy is a third-party library that is not part of the Python standard library and is not installed by default in most Python distributions, so it has to be installed separately.
Download link: NumPy
There are many builds to choose from. Here I chose numpy-1.7.0-win32-superpack-python2.7.exe.
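A quick way to confirm that the installation worked is to import NumPy and print its version. This is only a minimal check and not part of the kNN module; the version string you see depends on the build you installed.

```python
# Minimal installation check: the import raises ImportError if NumPy is missing.
import numpy

# The exact version string depends on the installed build.
print(numpy.__version__)
```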
Implementing the K-Nearest Neighbor Algorithm
The steps of the K-Nearest Neighbor algorithm are as follows:
(1) Calculate the distance between the current point and every point in the data set of known classes.
(2) Sort the points by distance in ascending order.
(3) Select the k points with the smallest distance to the current point.
(4) Count how often each category occurs among these k points.
(5) Return the category with the highest frequency among these k points as the predicted category of the current point.
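To make steps (1) to (3) concrete, here is a small sketch in plain Python that computes the Euclidean distance from a query point to each sample, one pair at a time. The helper name euclidean_distance and the variable names are my own, chosen for illustration; the sample points and labels are the ones from createDataSet.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Sample points and labels copied from createDataSet()
points = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']
query = [0, 0]

# Step 1: distance from the query to every sample
distances = [euclidean_distance(query, p) for p in points]

# Steps 2 and 3: sort the sample indices by distance and keep the k closest
k = 3
nearest = sorted(range(len(points)), key=lambda i: distances[i])[:k]

for i in nearest:
    print(points[i], labels[i], round(distances[i], 3))
```

For this query the three closest samples are the two B points followed by one A point, so steps (4) and (5) would pick B.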
The code for implementing the K-Nearest Neighbor Algorithm in Python is as follows:
    # coding: utf-8

    from numpy import *
    import operator
    import kNN

    group, labels = kNN.createDataSet()

    def classify(inX, dataSet, labels, k):
        # Number of training samples (rows)
        dataSetSize = dataSet.shape[0]
        # Euclidean distance from inX to every sample
        diffMat = tile(inX, (dataSetSize,1)) - dataSet
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances**0.5
        # Indices of the samples, ordered from nearest to farthest
        sortedDistances = distances.argsort()
        # Count the labels of the k nearest samples
        classCount = {}
        for i in range(k):
            numOflabel = labels[sortedDistances[i]]
            classCount[numOflabel] = classCount.get(numOflabel,0) + 1
        # Sort the labels by count, most frequent first;
        # items() works on both Python 2 and Python 3
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),reverse=True)
        return sortedClassCount[0][0]

    my = classify([0,0], group, labels, 3)
    print(my)
Running the script prints B, indicating that our new data point ([0, 0]) belongs to class B.
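For comparison, the same procedure can be written more compactly with NumPy broadcasting and collections.Counter. This is only a sketch of an alternative, not the code from the book: the function name classify_compact is hypothetical, and I use an explicit `import numpy as np` instead of `from numpy import *`.

```python
from collections import Counter

import numpy as np

def classify_compact(inX, dataSet, labels, k):
    """Hypothetical variant of classify(): same steps, fewer intermediate variables."""
    # Broadcasting subtracts inX from every row, so tile() is not needed
    distances = np.sqrt(((dataSet - np.asarray(inX)) ** 2).sum(axis=1))
    # Indices of the k samples closest to inX
    nearest = distances.argsort()[:k]
    # Count the labels of those k samples and return the most frequent one
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Same sample data as createDataSet()
group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(classify_compact([0, 0], group, labels, 3))
print(classify_compact([1.0, 1.2], group, labels, 3))
```

The first call repeats the example above and gives B. The second call uses a point close to the two A samples, so it gives A.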
Code details
Many readers may find this code hard to follow at first. Below I go through the key points of the function, both to help readers and to review the algorithm myself.
Parameters of the classify function:
- inX: the input vector to classify
- dataSet: the training sample set
- labels: the label vector, one label per row of dataSet
- k: the number of nearest neighbors to use
shape: an attribute of a NumPy array that describes its dimensions as a tuple. dataSet.shape[0] is the number of rows, that is, the number of training samples.
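As a small illustration of the shape attribute, using the sample array from createDataSet (the `np` prefix is my own choice; the original code uses `from numpy import *`):

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])

print(group.shape)     # (4, 2): 4 rows (samples), 2 columns (features)
print(group.shape[0])  # 4, the value stored in dataSetSize
```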
tile(inX, (dataSetSize, 1)): repeats inX to build a two-dimensional array. dataSetSize is the number of times inX is repeated along the rows, and 1 means it is repeated once along the columns, so the result has the same shape as dataSet. The whole line then subtracts dataSet from this array element by element, which gives the differences between inX and every sample in a single step. Simple and admirable!
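The sketch below shows what the tile call produces for a made-up input vector (the value [0.5, 0.5] is only for illustration). It also checks that NumPy broadcasting gives the same differences without building the repeated copy, which is an alternative I am pointing out, not something the original code does.

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
inX = [0.5, 0.5]  # illustrative input vector

# Repeat inX once per training sample so both arrays have the same shape
tiled = np.tile(inX, (group.shape[0], 1))
print(tiled.shape)  # (4, 2), same shape as group

diffMat = tiled - group

# Broadcasting stretches inX across the rows automatically
print(np.array_equal(diffMat, np.asarray(inX) - group))  # True
```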
axis=1: with axis=1, sum() adds up the values within each row and returns one total per row. With axis=0, it adds up the values within each column instead.
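A tiny example with a made-up 2 x 3 array shows the difference between the two axes:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.sum(axis=1).tolist())  # [6, 15]: one total per row
print(m.sum(axis=0).tolist())  # [5, 7, 9]: one total per column
```

In classify, each row of sqDiffMat holds the squared differences for one sample, so axis=1 yields one squared distance per sample.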
argsort(): returns the indices that would sort the array in ascending order. The array itself is left unchanged.
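The following sketch, with made-up distance values, shows that argsort gives back positions rather than sorted values:

```python
import numpy as np

distances = np.array([1.49, 1.41, 0.0, 0.1])
order = distances.argsort()

print(order.tolist())             # [2, 3, 1, 0]: positions from nearest to farthest
print(distances[order].tolist())  # [0.0, 0.1, 1.41, 1.49]
print(distances.tolist())         # [1.49, 1.41, 0.0, 0.1]: original order is kept
```

This is why classify can use the result directly as an index into labels.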
classCount.get(numOflabel, 0) + 1: this line is really neat. get() looks up the key numOflabel in the dictionary and returns 0 if the key is not there yet. The result is then incremented by 1 and stored back under the same key. Python needs only one line for this operation, which is simple and efficient.
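Here is the counting idiom on its own, followed by the same sort that classify uses to pick the winner. The label list ['B', 'B', 'A'] is a made-up stand-in for the labels of the k nearest samples.

```python
import operator

classCount = {}
for label in ['B', 'B', 'A']:
    # get() returns 0 the first time a label is seen, so no KeyError is raised
    classCount[label] = classCount.get(label, 0) + 1

print(classCount == {'B': 2, 'A': 1})  # True

# Sort (label, count) pairs by count, largest first
ranked = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(ranked[0][0])  # B
```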
Remarks
That more or less covers the principle and the code implementation of the K-Nearest Neighbor algorithm (kNN). The next task is to become more familiar with it, to the point of being able to write it from scratch without looking anything up.
Keep going!