Machine Learning in Action (1): The k-Nearest Neighbor Algorithm

When reprinting, please credit the source: http://www.cnblogs.com/lighten/p/7593656.html

1. Principle

This chapter introduces the first machine learning algorithm in this series: the k-nearest neighbor algorithm (k Nearest Neighbor), also known as KNN. Machine learning is often assumed to be complex and advanced, but the barrier to entry is not high: basic university-level mathematics (including linear algebra and probability theory) is enough, and some of the algorithms can be understood even by high school students. KNN is one of the algorithms whose principle is easy to grasp and does not demand strong mathematical skills. It is a classification algorithm (the other major family is regression) and belongs to supervised learning (as opposed to unsupervised learning); supervised learning requires a training set whose samples are already labeled.

First, classification, as the name implies, means dividing things of the same kind into different categories: people into men and women, books into reference books, textbooks, comic books, and so on. To classify something you need a basis for the division. Sometimes a single criterion decides the category exactly, as with gender, but more often the category is determined by several factors together, and the values of any single factor may overlap between categories. In machine learning these factors are called features, and the choice of features affects the accuracy of the algorithm. Describing each individual by a set of feature values makes it possible for a computer to process it, and the task of a classification algorithm is to determine the category of an individual from its given features. There are many ways to do this. KNN uses a simple one: compute the difference between the input and every individual in the training set, whose classifications are known, take the k training individuals with the smallest difference, and assign the input to the category that most of them belong to.

This principle is easy to understand. Take telling men from women using only two features, height and weight. Men are usually taller and heavier than women, although some women are taller and heavier than some men. So for an input individual whose height and weight are known, compute its difference in height and weight from each training sample and find the k individuals in the training set with the smallest difference. If the majority of those k individuals are men, the input sample is classified as a man; otherwise, as a woman. Choosing k individuals rather than only the single closest one guards against a few unusual samples: since men being taller and heavier than women holds only in the majority of cases, a vote among k individuals is more reliable. The difference is generally computed as the Euclidean distance: subtract the corresponding features, sum the squares, and take the square root:

  

Is the definition of the difference, so that the selection of K minimum training set, known as the classification of these training sets, the choice of K training set most of the classification is the new input individual classification.
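The distance computation and the selection of the k closest training individuals can be sketched in a few lines of plain Python. This is only an illustration: the function names `euclidean_distance` and `k_nearest` and the height/weight samples are made up for this sketch and do not come from the original post or the book.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared per-feature differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_nearest(query, samples, k):
    """Indices of the k samples with the smallest distance to query."""
    order = sorted(range(len(samples)),
                   key=lambda i: euclidean_distance(query, samples[i]))
    return order[:k]


# Made-up (height in cm, weight in kg) samples, for illustration only.
samples = [(180, 80), (175, 72), (160, 50), (165, 55)]
genders = ["man", "man", "woman", "woman"]

query = (178, 75)
nearest = k_nearest(query, samples, k=3)
print([genders[i] for i in nearest])  # two of the three neighbors are men
```

With k = 3 the vote is two to one, so this input would be classified as a man even though one of its neighbors is a woman.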

2. Problems and pros and cons

The principle of the KNN algorithm is simple and easy to understand, but some problems must be solved when implementing it. First, the calculation of d deserves attention: KNN picks the k training individuals with the smallest d, so d is critical to whether the result is reasonable. It is clear from the formula, however, that the size of d can be dominated by a single feature. Suppose x ranges over 1~10 and y over 1000~10000: d is then determined almost entirely by the y feature, and the effect of x all but disappears. The solution is to normalize the values, that is, to rescale both x and y in a reasonable way so that they fall in the same range, usually 0~1. Such a scaling is not hard to derive; the formula is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the smallest and largest values of that feature in the training set. The same scaling is applied to y and to every other feature.
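The effect of scaling every feature into the 0~1 range can be seen in a small sketch. It assumes min-max scaling, (value - min) / (max - min), computed per feature; the helper names `min_max_normalize` and `distance` and the data are hypothetical and chosen only to make the effect visible.

```python
import math


def min_max_normalize(rows):
    """Scale every feature column into [0, 1] via (value - min) / (max - min)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    return [
        # A constant column has range 0; map it to 0.0 to avoid dividing by zero.
        tuple((v - lo) / rng if rng else 0.0
              for v, lo, rng in zip(row, mins, ranges))
        for row in rows
    ]


def distance(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Made-up data: x lies in 1~10, y lies in 1000~10000.
rows = [(1, 1000), (10, 1500), (2, 3000), (5, 10000)]
scaled = min_max_normalize(rows)

# Raw features: y dominates, so rows[1] looks closer to rows[0] than rows[2] does.
print(distance(rows[0], rows[1]), distance(rows[0], rows[2]))
# Scaled features: x counts again, and rows[2] is now the closer one.
print(distance(scaled[0], scaled[1]), distance(scaled[0], scaled[2]))
```

In this example the nearest neighbor of the first row changes once the features are scaled, which is exactly the kind of distortion normalization is meant to remove.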

Pros: high accuracy, insensitive to outliers, no assumptions about the input data

Cons: high computational complexity, high space complexity

As the implementation of KNN shows, its computational cost is high: every input individual must be compared with every individual in the training set. Moreover, KNN does not derive any general characteristics of a given category (it builds no model of what a typical member looks like), so it is not well suited to very large training sets.

3. Code

The following code comes from the book "Machine Learning in Action"; all of the original book's code examples can be downloaded from the site: here.

import operator
from numpy import array, tile


def classify0(inX, dataSet, labels, k):
    # Number of training samples (rows of dataSet)
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Indices of the training samples, ordered from nearest to farthest
    sortedDistIndicies = distances.argsort()
    # Count the labels of the k nearest samples
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Python 3: items() replaces the Python 2 iteritems()
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
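To try the classifier without NumPy, the same steps (distance to every sample, pick the k nearest, majority vote) can be sketched with only the standard library. The name `classify_knn` is hypothetical and this is not the book's code; it swaps the array operations for list comprehensions and the dictionary count for `collections.Counter`, while the four sample points and labels are the ones from the listing above.

```python
import math
from collections import Counter


def classify_knn(in_x, data_set, labels, k):
    """Label held by most of the k training samples nearest to in_x."""
    # Euclidean distance from in_x to every training sample
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(in_x, row)))
        for row in data_set
    ]
    # Indices of the k samples with the smallest distance
    nearest = sorted(range(len(data_set)), key=distances.__getitem__)[:k]
    # Majority vote among the labels of those k samples
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


group = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']

print(classify_knn([0, 0], group, labels, 3))      # B
print(classify_knn([1.2, 1.0], group, labels, 3))  # A
```

The point [0, 0] has two B samples and one A sample among its three nearest neighbors, so it is classified as B; the point [1.2, 1.0] sits next to the two A samples and is classified as A.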
