Introduction and Implementation of the KNN Method in Machine Learning (Dating Satisfaction Statistics)

Experimental purposes

Recently I decided to start learning machine learning systematically, so I bought a few books and gathered plenty of material to practice on. This series records my learning process, starting from the most basic algorithm: KNN.

Experiment Introduction

Language: Python

GitHub Address: LUUUYI/KNN
Experiment Steps

1) Principle Introduction

The K-Nearest Neighbor (KNN) algorithm is a basic classification and regression method. Given a training dataset and a new input instance, it finds the K instances in the training dataset that are nearest to the new instance, and assigns the new instance to the class held by the majority of those K instances.

For example, let's look at the following figure:

If k=3, the 3 nearest neighbors of the green dot are 2 red triangles and 1 blue square. The minority yields to the majority, so by this statistical method the green point is classified as a red triangle.

If k=5, the 5 nearest neighbors of the green dot are 2 red triangles and 3 blue squares. Again the minority yields to the majority, so the green point is classified as a blue square. This is the basic idea of majority voting, but it also shows that different values of k can produce different results for the same green dot. In practical applications, therefore, choosing the value of k and deciding which distance metric to use are the core of the K-nearest neighbor algorithm.

2) Simple implementation

Look at the following two functions:

# imports used by the snippets in this post
from numpy import array, tile
import operator

def createtmpdata():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def classify0(in_data, datas, labels, k):
    datas_height = datas.shape[0]
    # Euclidean distance from the input point to every sample point
    diff_array = tile(in_data, [datas_height, 1]) - datas
    diff_array_power2 = diff_array ** 2
    distance_array = diff_array_power2.sum(axis=1)
    distance_array = distance_array ** 0.5
    sorted_index = distance_array.argsort()
    # count the labels of the k nearest neighbors
    labels_dict = {}
    for i in range(k):
        label = labels[sorted_index[i]]
        labels_dict[label] = labels_dict.get(label, 0) + 1
    # sort by vote count in descending order and return the majority label
    sorted_datas = sorted(labels_dict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sorted_datas[0][0]
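
As a quick sanity check, here is a minimal usage sketch of the two functions above (the test point [0.2, 0.1] is my own choice, not from the original post):

group, labels = createtmpdata()
# the point lies near the two 'B' samples, so with k=3 the vote returns 'B'
print classify0(array([0.2, 0.1]), group, labels, 3)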

The first function simply generates a temporary test sample set. The second function is the core of a streamlined KNN algorithm: it takes the data to be classified, the sample dataset with its labels, and, most importantly, the value of k. The distance used here is the Euclidean distance, which in the two-dimensional case is simply sqrt((x1 - x2)^2 + (y1 - y2)^2). To simplify understanding and implementation, the samples here have only two features, but objects in real applications may have far more than two.

3) Experimental Application

With the simple implementation in place, we move on to a dating dataset. Each record consists of three features (annual flight distance, percentage of time spent on entertainment per day, amount of ice cream consumed each year) and a label for the person (worth dating, average, not worth dating):

The data can be preprocessed after it is read into memory. For example, in the data above you can see that the annual flight distance feature is orders of magnitude larger than the other two, so its effect on the distance calculation would drown out theirs, which does not fit the idea of the KNN algorithm. After reading the data we therefore normalize it, mapping each feature into [0, 1] with newValue = (value - min) / (max - min):

def autonormal(datas):
    height = datas.shape[0]
    max_array = datas.max(0)  # column-wise maximum of each feature
    min_array = datas.min(0)  # column-wise minimum of each feature
    range_array = max_array - min_array
    # scale every feature into [0, 1]: (value - min) / (max - min)
    normaled_data = (datas - tile(min_array, (height, 1))) / tile(range_array, (height, 1))
    return normaled_data, range_array, min_array
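
As a quick, illustrative check (not from the original post), running autonormal on the toy samples from createtmpdata() maps every feature into [0, 1]:

group, labels = createtmpdata()
normaled, ranges, mins = autonormal(group)
print normaled      # every value now lies between 0 and 1
print ranges, mins  # per-feature range and minimum, kept for scaling new inputs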

After processing, thanks to Python's rich ecosystem we can visualize the data using the Matplotlib library (two of the three features are plotted):

import matplotlib.pyplot as plt

def drawImage(datas, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111)  # a single subplot filling the figure
    # size and color each point by its numeric label so the three classes stand apart
    ax.scatter(datas[:, 1], datas[:, 2], 5.0 * array(labels), 15.0 * array(labels))
    plt.show()
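
For example, assuming loaddatafromfile() returns the labels as the numbers 1/2/3 (which the size and color arithmetic above requires), the plot is produced with:

datas, labels = loaddatafromfile('datingTestSet2.txt')
drawImage(datas, labels)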

The three colors represent dislike, average, and like; the horizontal and vertical axes are two of the three features.

Combining this with the core algorithm implemented above, we use 90% of the data as the reference set and the remaining 10% as the test set, then measure how well the model classifies, i.e. its error rate. The evaluation method chosen here is 0-1 loss, so the test error is simply the misclassification rate.

def datingclasstest():
    datas, labels = loaddatafromfile('datingTestSet2.txt')
    height = datas.shape[0]
    ratio = 0.1
    test_nums = int(ratio * height)  # hold out 10% of the samples for testing
    normaled_datas, range_array, min_array = autonormal(datas)
    test_datas = normaled_datas[0:test_nums, :]
    error_count = 0.0
    # classify each held-out sample against the remaining 90% of the data
    for i in range(test_nums):
        result_label = classify0(test_datas[i], normaled_datas[test_nums:, :], labels[test_nums:], 3)
        print "The class result is: %d, the real answer is: %d" % (result_label, labels[i])
        if result_label != labels[i]:
            error_count += 1.0
    print "The total error rate is: %f" % (error_count / test_nums)

After execution, you can look at the output:

With an error rate of 5%, the classifier is correct more than 90% of the time, which is acceptable. By repeatedly adjusting the parameter k you can observe how the error rate changes and finally choose the most suitable value.

After that, you can enter feature values manually and have the model classify a person:

def datingclassperson():
    result_labels = ['not at all', 'a little like', 'like']
    percents_of_play = float(raw_input('Enter your percents_of_play: '))
    fly_distance = float(raw_input('Enter your fly_distance: '))
    ice = float(raw_input('Enter your ice: '))
    datas, labels = loaddatafromfile('datingTestSet2.txt')
    normaled_datas, range_array, min_array = autonormal(datas)
    test_data = array([fly_distance, percents_of_play, ice])
    # normalize the new sample with the same min/range as the training data
    classed_label = classify0((test_data - min_array) / range_array, normaled_datas, labels, 3)
    print "The result is: %s" % (result_labels[classed_label - 1])

The results are:

4) Matters needing attention

How should we choose the value of k for K-nearest neighbor?

Take a look at the picture below:

There are two classes in the figure above: black dots and blue rectangles. The point we now want to classify is the red pentagon.

If we choose k=1, the sample nearest to the red point is a black one, so by the KNN rule the point is classified directly into the black class. When k is very small, then, the model easily learns noise: a single noisy neighbor is enough to decide the category. When k takes a middle value, most of the samples within range are blue, and the point is classified as blue. But increasing k does not keep making the model more reliable; consider the extreme case where k equals N, the size of the whole dataset:

At that point the classification result is simply the majority class of the entire sample set, no matter what the input is, so the classification tells us nothing.

Therefore, in the KNN algorithm, the value of k is usually chosen by cross-validation: the sample data is split into training and validation parts and reused repeatedly to verify which k performs best.
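
As an illustration only (not from the original post), here is a minimal sketch that reuses autonormal and classify0 to compare candidate values of k on a fixed hold-out split; true cross-validation would rotate the split across folds rather than fix it:

def choosebestk(datas, labels, k_candidates=range(1, 21)):
    # hypothetical helper: score each candidate k on a 10% hold-out split
    normaled_datas, range_array, min_array = autonormal(datas)
    test_nums = int(0.1 * normaled_datas.shape[0])
    best_k, best_error = None, float('inf')
    for k in k_candidates:
        error_count = 0.0
        for i in range(test_nums):
            result_label = classify0(normaled_datas[i], normaled_datas[test_nums:, :], labels[test_nums:], k)
            if result_label != labels[i]:
                error_count += 1.0
        error_rate = error_count / test_nums
        if error_rate < best_error:
            best_k, best_error = k, error_rate
    return best_k, best_error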


