"Machine learning Combat" study notes: K-Nearest neighbor algorithm implementation

My main study and research topics last semester were pattern recognition, signal theory, and image processing, all of which overlap to some extent with machine learning. So I have continued reading about machine learning in depth and watching Stanford's machine learning course. Along the way, an upcoming group project requires me to pick up Python, so I chose the book "Machine Learning in Action" and am studying it together with the reference materials and videos. The book's theoretical treatment is actually not very deep; it serves mainly as Python practice and as a reference for verifying some well-known machine learning algorithms.

Before introducing the k-nearest neighbor algorithm, here is a brief classification of machine learning algorithms. In short, machine learning divides into two major categories: supervised learning and unsupervised learning. Supervised learning can in turn be divided into two categories, classification and regression. The task of classification is to assign a sample to a known category, with the class of every training sample given in advance; face recognition, behavior recognition, and target detection all belong to classification. The task of regression is to predict a value, for example predicting price movements from housing market data (area, location, and so on). Unsupervised learning can likewise be divided into two categories, clustering and density estimation. Clustering groups a set of data into several groups with no category information; density estimation estimates the statistical parameters that describe the data, as in the RBMs used in deep learning.

As the opening chapter, the author introduces the simple and easy-to-understand k-nearest neighbor (kNN) algorithm. This algorithm is a nonparametric estimation method, grouped together with Parzen window estimation in pattern recognition textbooks. Methods of this kind can handle arbitrary probability distributions without assuming a parametric form for the probability density in advance.

For more information on nonparametric estimation methods: http://blog.csdn.net/liyuefeilong/article/details/45274325

Before using the k-nearest neighbor algorithm, you need to understand the pros and cons of the algorithm:
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
Disadvantages: high time and space complexity
Applicable data types: numeric values and nominal values

The basic idea of the k-nearest neighbor algorithm is as follows. First you need a set of training samples, also known as the training sample set. The classification category of each sample in the set is known; that is, we know the correspondence between each data point and its category. Given a new data point to be predicted, compare each of its features with the features of every data point in the training sample set, and, based on some comparison measure (such as Euclidean distance), find the k training samples most similar to the data to be predicted. Then look at which classes those k samples belong to and apply the majority-vote principle: since the data to be predicted is close to many class-A training samples and not very similar to the other classes, the data can be predicted to be class A.

In the description above, Euclidean distance is used to compare the data to be predicted against the training samples. Suppose the test sample is $x$, the $i$-th sample in the training set is $x_i$, and each sample has $n$ feature attributes. The Euclidean distance between the test sample and a training sample is then defined as:

$$d(x, x_i) = \sqrt{\sum_{j=1}^{n} \left( x^{(j)} - x_i^{(j)} \right)^2}$$
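To make the formula concrete, here is a minimal NumPy sketch (the sample values are my own, purely for illustration):

    import numpy as np

    # hypothetical values: a test sample x and one training sample x_i, n = 2 features
    x = np.array([1.0, 0.5])
    x_i = np.array([0.0, 0.1])

    # square the per-feature differences, sum them, take the square root
    d = np.sqrt(np.sum((x - x_i) ** 2))
    print(d)  # about 1.077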

Typically, the algorithm's k is a positive integer no greater than 20. When k = 7, for example, the Euclidean distance formula above is used to find the 7 training samples nearest to the data to be predicted; among these 7 samples, whichever class occurs most often is the class the prediction is assigned to, as the sketch below shows.
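To make the majority-vote step concrete, here is a minimal sketch using Python's collections.Counter (the neighbor labels are hypothetical):

    from collections import Counter

    # hypothetical labels of the 7 nearest training samples
    nearest_labels = ['A', 'B', 'A', 'A', 'B', 'A', 'A']

    # the most frequent class among the k neighbors wins the vote
    predicted = Counter(nearest_labels).most_common(1)[0][0]
    print(predicted)  # 'A' (5 votes to 2)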

The following code, written while following "Machine Learning in Action", implements the k-nearest neighbor algorithm directly. Since I am not yet very familiar with Python, the kNN implementation mainly exercises NumPy, but this is, after all, a good opportunity to learn Python. Here is the implementation of the algorithm:

    # -*- coding: utf-8 -*-
    """
    Created on Sat 14:36:02 2015

    input:  data:   vector of the test sample (1xN)
            sample: size m data set of known vectors (NxM)
            label:  labels of the sample (1xM vector)
            k:      number of neighbors
    output: the class label

    @author: peng__000
    """
    from numpy import array, tile
    import operator

    # training samples
    sample = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    # the labels of the samples
    label = ['A', 'A', 'B', 'B']

    def classify(data, sample, label, k):
        sampleSize = sample.shape[0]           # number of rows in the training sample set
        dataMat = tile(data, (sampleSize, 1))  # expand data to as many rows as the training set
        delta = (dataMat - sample) ** 2
        distance = (delta.sum(axis=1)) ** 0.5  # the three steps above compute the Euclidean distances
        sortedDist = distance.argsort()        # indices that sort the distance vector
        classCount = {}
        # the following loop collects the labels of the k nearest samples
        for i in range(k):
            votedLabel = label[sortedDist[i]]
            classCount[votedLabel] = classCount.get(votedLabel, 0) + 1
        result = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return result[0][0]

    print(classify([10, 0], sample, label, 3))  # test: prints 'A'
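One design note: the tile call copies the test vector once per training row before subtracting. NumPy broadcasting gives the same result without the explicit copy; here is an equivalent sketch (my own variation, not from the book):

    from numpy import array

    def classify_broadcast(data, sample, label, k):
        # broadcasting subtracts the 1xN test vector from every row of the training set
        distance = (((array(data) - sample) ** 2).sum(axis=1)) ** 0.5
        votes = {}
        for i in distance.argsort()[:k]:
            votes[label[i]] = votes.get(label[i], 0) + 1
        return max(votes, key=votes.get)

    print(classify_broadcast([10, 0], sample, label, 3))  # 'A', same as above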

Apart from some matrix operations and a simple sort, this short piece of code involves nothing complicated.

After this simple implementation of the k-nearest neighbor algorithm, the next step is to apply the algorithm to other scenarios. Following the tutorials in "Machine Learning in Action", the main tests are classifying matches for a dating site and a handwriting recognition system.
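One caveat before those experiments: when feature ranges differ wildly, the Euclidean distance is dominated by the large-range feature, so features are usually scaled to the [0, 1] range first (the book does this for the dating-site data). A minimal sketch of that min-max scaling, with toy data of my own:

    from numpy import array

    def auto_norm(data_set):
        # min-max scaling: map each feature column to the [0, 1] range
        min_vals = data_set.min(axis=0)
        ranges = data_set.max(axis=0) - min_vals
        return (data_set - min_vals) / ranges

    # hypothetical data: one large-range feature, one small-range feature
    data = array([[40000.0, 0.8], [14000.0, 1.6], [75000.0, 0.5]])
    print(auto_norm(data))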

In general, the k-nearest neighbor algorithm is one of the simplest and most effective classification algorithms in pattern recognition and machine learning. Its drawback is that the computation speed leaves much to be desired, and the entire data set must be kept available while the algorithm runs, so performance degrades greatly when the training set is very large. Another drawback of the k-nearest neighbor algorithm is that it provides no information about the underlying structure of the data, so it cannot tell you what an average instance or a typical instance of each class looks like. The experiments also show that parameter selection and tuning should be guided by several trial runs on the actual data to reach a reasonably good result.
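As an aside going beyond this chapter (my addition, not from the book): the speed problem is commonly mitigated with tree-based neighbor search. A minimal sketch, assuming scikit-learn is installed:

    from sklearn.neighbors import KNeighborsClassifier

    # the same toy training set as above
    X = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
    y = ['A', 'A', 'B', 'B']

    # a k-d tree index avoids comparing the query against every training sample
    clf = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
    clf.fit(X, y)
    print(clf.predict([[10, 0]]))  # ['A'], matching the implementation above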


"Machine learning Combat" study notes: K-Nearest neighbor algorithm implementation

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.