K-nearest neighbor algorithm for machine learning in Python

Source: Internet
Author: User
Today's algorithm is the K-nearest neighbor (KNN) algorithm, a supervised learning algorithm used for classification. We discuss it in detail below.

Preface

I recently started learning machine learning and found a book on the Internet called "machine learning practice". Conveniently, the algorithms in the book are implemented in Python, and I had already learned some Python basics, so the book was easy for me to follow. Now, on to the practical content.

What is the K-nearest neighbor algorithm?

In short, the K-nearest neighbor algorithm classifies data by measuring the distance between feature values. It works as follows: we have a sample data set, also called the training set, in which every item carries a label, so we know which category each item belongs to. When new, unlabeled data arrives, each of its features is compared with the corresponding features of the items in the sample set, and the algorithm extracts the class labels of the most similar items (the nearest neighbors). In general, only the k most similar items in the sample set are considered, which is where the "K" in the algorithm's name comes from. The new data is then assigned the category that occurs most often among those k items.
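As a small illustration of "measuring the distance between feature values", the sketch below computes the straight-line (Euclidean) distance between two feature vectors. Euclidean distance is what the classify function later in this article uses; the helper name euclidean_distance is made up for this example.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))

new_point = [0, 0]      # unlabeled data
sample = [1.0, 1.0]     # one labeled sample
d = euclidean_distance(new_point, sample)
print(d)                # the square root of 2, roughly 1.414
```

The smaller this distance, the more similar the two items are considered to be.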

Q: Is the K-nearest neighbor algorithm a supervised or an unsupervised learning method?

Use Python to import data

From the working principle above, we can see that classifying data with this algorithm requires sample data on hand; without sample data, there is nothing to build a classification function from. The first step is therefore to import the sample data set.

Create a module named kNN.py and write the following code:

    from numpy import *
    import operator

    def createDataSet():
        group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
        labels = ['A','A','B','B']
        return group, labels

The code imports two Python modules: the scientific computing package NumPy and the operator module. NumPy is a separate package that is not part of the Python standard library, and most Python distributions do not install it by default, so we need to install it separately.

Download address: http://sourceforge.net/projects/numpy/files/

Many versions are available there. Here I chose numpy-1.7.0-win32-superpack-python2.7.exe.
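Once the installer has run, a quick way to confirm that NumPy is usable is to import it and build a small array. This is only a sanity-check sketch; the array values are arbitrary.

```python
import numpy

# If the import above fails, NumPy is not installed for this interpreter
print(numpy.__version__)

a = numpy.array([[1.0, 1.1], [0, 0.1]])
print(a.shape)  # (2, 2): two rows, two columns
```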

Implement the K-nearest neighbor algorithm

The K-nearest neighbor algorithm works as follows:

(1) Calculate the distance between the current point and every point in the data set of known classes.

(2) Sort the points by distance in ascending order.

(3) Select the k points closest to the current point.

(4) Count how often each category occurs among those k points.

(5) Return the most frequent category among those k points as the predicted category of the current point.
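The five steps can be traced by hand on the four-point sample data set from kNN.py. The sketch below is only an illustration: it uses NumPy broadcasting and collections.Counter as conveniences, which differ from the article's own implementation, and the variable names are chosen for this example.

```python
import numpy as np
from collections import Counter

# Same sample data as createDataSet() in kNN.py
group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
point = np.array([0, 0])
k = 3

# (1) distance from the current point to every known point
distances = np.sqrt(((group - point) ** 2).sum(axis=1))
# (2) indices that put the distances in ascending order
order = distances.argsort()
# (3) labels of the k closest points
nearest = [labels[i] for i in order[:k]]
# (4) how often each category occurs among them
votes = Counter(nearest)
# (5) the most frequent category is the prediction
prediction = votes.most_common(1)[0][0]

print(order.tolist())  # [2, 3, 1, 0]
print(nearest)         # ['B', 'B', 'A']
print(prediction)      # B
```

Two of the three nearest points are labeled B, so the point [0, 0] is predicted to be in class B.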

The code for implementing the K-nearest neighbor algorithm in Python is as follows:

    # coding: utf-8
    from numpy import *
    import operator

    import kNN

    group, labels = kNN.createDataSet()

    def classify(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        # Euclidean distance from inX to every row of dataSet
        diffMat = tile(inX, (dataSetSize,1)) - dataSet
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances**0.5
        # Indices of the samples, ordered from nearest to farthest
        sortedDistances = distances.argsort()
        # Count the labels of the k nearest samples
        classCount = {}
        for i in range(k):
            numOflabel = labels[sortedDistances[i]]
            classCount[numOflabel] = classCount.get(numOflabel,0) + 1
        # items() and print() work on both Python 2 and Python 3
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),reverse=True)
        return sortedClassCount[0][0]

    my = classify([0,0], group, labels, 3)
    print(my)

Running the code prints B, indicating that our new data point ([0, 0]) belongs to class B.

Code details

Some readers may find this code hard to follow. Below I go through several key points of the function, both to help readers and to review the algorithm myself.

Parameters of the classify function:

inX: the input vector to classify
dataSet: the set of training samples
labels: the label vector
k: the number of nearest neighbors to use, the K in K-nearest neighbor

shape: an attribute of a NumPy array that describes its dimensions; dataSet.shape[0] is the number of rows, that is, the number of training samples.

tile(inX, (dataSetSize, 1)): repeats inX to build a two-dimensional array. dataSetSize is the number of times inX is repeated along the rows, and 1 is the number of times it is repeated along the columns, so the result has one copy of inX per training sample. The whole line then subtracts dataSet from this tiled array element by element, which gives the difference between the input vector and every training sample in a single matrix subtraction. Simple and elegant!
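To see what tile produces, here is a short sketch using the article's sample data. The final check, that NumPy broadcasting gives the same difference without the tiled copy, is an observation added here and is not part of the original code.

```python
import numpy as np

inX = [0, 0]
dataSet = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])

# One copy of inX per row of dataSet
tiled = np.tile(inX, (dataSet.shape[0], 1))
print(tiled.shape)  # (4, 2)

diffMat = tiled - dataSet

# Broadcasting stretches inX across the rows automatically
same = np.array_equal(diffMat, np.asarray(inX) - dataSet)
print(same)  # True
```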

axis=1: with axis=1, sum() adds up the values within each row, giving one sum per row; with axis=0, it adds up the values within each column, giving one sum per column.
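A small example makes the difference between the two axes concrete. The matrix m is arbitrary and chosen only for illustration.

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4],
              [5, 6]])

row_sums = m.sum(axis=1)  # one value per row
col_sums = m.sum(axis=0)  # one value per column

print(row_sums.tolist())  # [3, 7, 11]
print(col_sums.tolist())  # [9, 12]
```

In classify, each row holds the squared differences for one training sample, so axis=1 yields one squared distance per sample.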

argsort(): returns the indices that would sort the array in ascending order. It does not sort the array itself.
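The sketch below shows argsort on a few made-up distances. Note that the result holds positions in the original array, which is why classify can use those positions to look up labels.

```python
import numpy as np

distances = np.array([1.5, 1.4, 0.0, 0.1])  # example values only
order = distances.argsort()

print(order.tolist())             # [2, 3, 1, 0]
print(distances[order].tolist())  # [0.0, 0.1, 1.4, 1.5]
print(distances.tolist())         # [1.5, 1.4, 0.0, 0.1], unchanged
```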

classCount.get(numOflabel, 0) + 1: this line is quite neat. get() looks up the key numOflabel in the dictionary and returns 0 if the key is not present yet; 1 is then added to that value and the result is stored back. Python needs only one line for this count-or-initialize operation, which is simple and efficient.
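The counting idiom can be tried on its own. In this sketch the list nearest_labels stands in for the labels of the k nearest samples, and the max call is an alternative way to pick the winner, not the sorted call used in the article.

```python
nearest_labels = ['B', 'B', 'A']  # stand-in for the k nearest labels

classCount = {}
for label in nearest_labels:
    # get() returns 0 the first time a label is seen
    classCount[label] = classCount.get(label, 0) + 1

print(classCount == {'B': 2, 'A': 1})  # True

# Alternative to sorting: take the key with the largest count
winner = max(classCount, key=classCount.get)
print(winner)  # B
```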

Remarks

That more or less covers the principle and code implementation of the K-nearest neighbor (KNN) algorithm. The next task is to become more familiar with it, with the aim of being able to write it from memory without any reference.

The above is all the content of this article. I hope you will like it.
