Classification Algorithms: K-Nearest Neighbors (KNN)

Source: Internet
Author: User
Tags: python

I have recently been reading the book Machine Learning in Action, because I really want to learn more about machine learning algorithms and also want to learn Python; on a friend's recommendation I chose this book to study from.

One. Overview of the K-nearest neighbor algorithm (KNN)

The simplest possible classifier records the class of every training sample; a test object can then be classified whenever its attributes exactly match those of a training object. But it is unlikely that every test object will find an exact match in the training set, and a test object may also match more than one training object at the same time, which would assign it to multiple classes. KNN arose to address these problems.

KNN classifies by measuring the distance between feature vectors. The idea is this: if the majority of a sample's k most similar samples in feature space (that is, its nearest neighbors) belong to a given category, then the sample belongs to that category, where k is usually an integer no greater than 20. In the KNN algorithm, the selected neighbors are objects that have already been correctly classified. The method decides which category a sample belongs to based only on the category of the nearest one or several samples.

Here's a simple example: should a green circle be assigned to the class of the red triangles or the class of the blue squares? If k = 3, the red triangles account for 2/3 of the neighbors, so the green circle is assigned to the red triangle class; if k = 5, the blue squares account for 3/5, so the green circle is assigned to the blue square class.

This also shows that the result of the KNN algorithm depends largely on the choice of k.

In KNN, the distance between objects is used as a measure of their dissimilarity, which avoids having to match objects attribute by attribute; the distance used is generally the Euclidean distance or the Manhattan distance.
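Both distance measures are easy to write out: for two points with the same number of features, the Euclidean distance is the square root of the sum of squared coordinate differences, and the Manhattan distance is the sum of absolute coordinate differences. A minimal NumPy sketch (function names here are illustrative):

```python
import numpy as np

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

x = np.array([1.0, 2.0])
y = np.array([1.2, 0.1])
print(euclidean(x, y))  # ≈ 1.9105
print(manhattan(x, y))  # ≈ 2.1
```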

At the same time, KNN makes its decision based on the dominant category among the k nearest objects, rather than on the category of a single object. These two points are advantages of the KNN algorithm.

To summarize the KNN algorithm: given training data with known labels, take an input test sample, compare its features against the corresponding features of the training set, and find the k most similar training samples. The category assigned to the test sample is the one that occurs most often among those k samples. The algorithm can be described as:

1) Calculate the distance between the test sample and each training sample;

2) Sort by increasing distance;

3) Select the k points with the smallest distances;

4) Determine the frequency of each category among those k points;

5) Return the most frequent category among the k points as the predicted classification of the test sample.
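The five steps above can be sketched compactly with NumPy before the full implementation below; the function and variable names here are illustrative:

```python
import numpy as np

def knn_predict(test_point, train_data, train_labels, k=3):
    # 1) Euclidean distance from the test point to every training point
    dists = np.sqrt(((train_data - test_point) ** 2).sum(axis=1))
    # 2)-3) indices of the k nearest training points
    nearest = np.argsort(dists)[:k]
    # 4)-5) most frequent label among those k points
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

train = np.array([[1.0, 2.0], [1.2, 0.1], [0.1, 1.4], [0.3, 3.5]])
labels = ['A', 'A', 'B', 'B']
print(knn_predict(np.array([1.1, 0.3]), train, labels, k=3))  # → A
```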

Two. Python implementation

First of all, I should explain that I am using Python 3.4.3; some usage differs somewhat from 2.7.

Create a knn.py file to verify the feasibility of the algorithm, as follows:

# coding: utf-8
from numpy import *
import operator

# Training data and the corresponding labels
def createdataset():
    group = array([[1.0, 2.0], [1.2, 0.1], [0.1, 1.4], [0.3, 3.5]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# Classify with KNN
def classify(input, dataset, label, k):
    datasize = dataset.shape[0]
    # Compute the Euclidean distance
    diff = tile(input, (datasize, 1)) - dataset
    sqdiff = diff ** 2
    squaredist = sum(sqdiff, axis=1)  # sum each row to get a new row vector
    dist = squaredist ** 0.5

    # Sort the distances
    sorteddistindex = argsort(dist)  # argsort() returns the indices that sort the distances in ascending order

    classcount = {}
    for i in range(k):
        votelabel = label[sorteddistindex[i]]
        # Count how many of the selected k samples belong to each category
        classcount[votelabel] = classcount.get(votelabel, 0) + 1
    # Pick the category that occurs most often
    maxcount = 0
    classes = None
    for key, value in classcount.items():
        if value > maxcount:
            maxcount = value
            classes = key

    return classes

Next, enter the following code in the command-line window:

# -*- coding: utf-8 -*-
import sys
sys.path.append("... file path ...")
import knn
from numpy import *

dataset, labels = knn.createdataset()
input = array([1.1, 0.3])
k = 3
output = knn.classify(input, dataset, labels, k)
print("Test data:", input, "classification result:", output)

After pressing Enter, the result is:

Test data: [1.1 0.3] classification result: A

The answer matches our expectations. To really establish the algorithm's accuracy, it still needs to be validated on more complex problems, which will be covered separately later.
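One quick way to sanity-check the hand-written classifier is to run the same data through a library implementation. Assuming scikit-learn is installed, KNeighborsClassifier with its default Euclidean metric should agree:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

group = np.array([[1.0, 2.0], [1.2, 0.1], [0.1, 1.4], [0.3, 3.5]])
labels = ['A', 'A', 'B', 'B']

# Majority vote among the 3 nearest neighbors, Euclidean distance by default
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(group, labels)
print(clf.predict([[1.1, 0.3]]))  # → ['A']
```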

This was my first small program in Python, so I was bound to run into all kinds of problems. During programming and debugging I encountered the following issues:

1) There was a problem with the import path of the .py file, so you need to add the following at the beginning: import sys and sys.path.append("file path"); then there is no wrong-path problem;

2) When Python flags a problem in the code, be sure to correct it promptly, and re-run the command line only after saving. This is different from MATLAB, so in Python it is best to validate code in the command line periodically as you write it;

3) The function name must be written correctly when calling it from the file, otherwise you will see: 'module' object has no attribute 'creatdataset';

4) 'int' object has no attribute 'classify': this problem arose because I had originally saved the file as k.py. When executing the line output = k.classify(input, dataset, labels, k), the call fails: the assignment k = 3 rebinds the name k to an integer, so k no longer refers to the module k.py, and the attribute lookup on the int fails.
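The pitfall in point 4 is easy to reproduce without any files: once the name k is rebound to an integer, it no longer refers to the module, and the attribute lookup fails. A minimal demonstration (the module object here just stands in for a hypothetical file k.py):

```python
import types

k = types.ModuleType("k")        # stands in for a module imported from k.py
k.classify = lambda *args: "A"   # the module's classify function
print(k.classify())              # works, prints: A

k = 3                            # the script later rebinds k to an int
try:
    k.classify()
except AttributeError as err:
    print(err)                   # 'int' object has no attribute 'classify'
```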


Three. MATLAB implementation

I have been using MATLAB for some optimization and clustering algorithms, and for number-crunching with its common modules, but I never really got started on other algorithms. The foundations are still there, and so are the ideas, so of course I want to implement this by hand; besides, I don't want to gradually grow unfamiliar with MATLAB while learning Python. Learning in fits and starts, the pauses matter too.

First, create the knn.m file, as follows:
%% KNN
clear all
clc

%% data
traindata = [1.0,2.0;1.2,0.1;0.1,1.4;0.3,3.5];
trainclass = [1,1,2,2];
testdata = [0.5,2.3];
k = 3;

%% distance
row = size(traindata,1);
col = size(traindata,2);
test = repmat(testdata,row,1);
dis = zeros(1,row);
for i = 1:row
    diff = 0;
    for j = 1:col
        diff = diff + (test(i,j) - traindata(i,j)).^2;
    end
    dis(1,i) = diff.^0.5;
end

%% sort
jointdis = [dis; trainclass];
sortdis = sortrows(jointdis');
sortdisclass = sortdis';

%% find
class = sortdisclass(2,1:k);   % classes of the k nearest neighbors
member = unique(class);
num = size(member,2);

maxcount = 0;
for i = 1:num
    count = length(find(class == member(i)));
    if count > maxcount
        maxcount = count;
        label = member(i);
    end
end

disp('The final classification result is:');
label

After running, the output is: The final classification result is: 2. This matches the expected result.

Four. KNN in action

Earlier we did a simple verification of KNN; today we use KNN to improve the matching results of a dating site. As I understand it, this problem can also be translated to other settings, such as websites making recommendations that cater to customers' preferences; of course, the function in today's example is quite limited.

In this example, based on dating data someone has collected, we use the main sample features and their resulting classifications to classify data of unknown category.

I am using Python 3.4.3. First create a file, such as date.py.
