A Beginner Learns Machine Learning: KNN

Source: Internet
Author: User
Tags: sorting, classification

A simple k-nearest neighbor algorithm

This article starts from the idea behind the k-nearest neighbor algorithm and builds up working Python 3 code step by step. The corresponding data sets are provided, and the code is commented in detail. The article also explains how to implement the k-nearest neighbor algorithm with sklearn. Practical examples: movie genre classification, dating-site match prediction, and handwritten digit recognition.

If the code alone is hard to follow, you can read this article together with the free video recorded by Deep MoU, a master's student of aeronautics and astronautics: http://pan.baidu.com/s/1i55VVM1 (password: QLLK).

1.1 Introduction to the K-Nearest Neighbor Method

The k-nearest neighbor method (k-nearest neighbor, k-NN) is a basic classification and regression method proposed by Cover, T. and Hart, P. in 1967. It works as follows: we have a collection of sample data, called the training set, and a label for each sample in the set, so we know which category every sample belongs to. When new, unlabeled data arrives, each feature of the new data is compared with the features of the samples, and the algorithm extracts the category labels of the most similar samples (the nearest neighbors). In general, we select only the k most similar samples from the data set, which is where the k in k-nearest neighbor comes from; k is usually an integer no greater than 20. Finally, the most frequent category among these k samples is taken as the category of the new data.

As a simple example, we can use the k-nearest neighbor algorithm to classify a movie as either a love movie or an action movie.

Movie name    Fight shots    Kiss shots    Movie type
Movie 1       1              101           love movie
Movie 2       5              89            love movie
Movie 3       108            5             action movie
Movie 4       115            8             action movie

Table 1.1 Number of fight shots, number of kiss shots, and type of each movie

Table 1.1 is the data we already have: the training set. It has two features, the number of fight shots and the number of kiss shots, and we know each movie's type, that is, its category label. By eye, a movie with many kiss shots is a love movie, and one with many fight shots is an action movie; years of movie-watching experience tells us this classification is reasonable. If you now give me a movie and tell me only its number of fight shots and kiss shots, without revealing its type, I can judge from that information alone whether it is a love movie or an action movie. The k-nearest neighbor algorithm can do the same thing we do; the difference is that we rely on experience, while the algorithm relies on the existing data. For example, if you tell me a movie has 2 fight shots and 102 kiss shots, my experience says it is a love movie, and the k-nearest neighbor algorithm will say the same. Suppose you tell me another movie has 49 fight shots and 51 kiss shots. My mischievous experience might say this could be a "love action movie" (if you don't know what a "love action movie" is, leave a comment to contact me; I need pure-hearted friends like you). But the k-nearest neighbor algorithm will never say that: in its eyes there are only love movies and action movies, so it extracts the category labels of the most similar samples (the nearest neighbors) in the data set, and the result will be love movie or action movie, never "love action movie". Of course, this depends on factors such as the size of the data set and the criterion used to judge nearness.

1.2 Distance Measurement

We already know that the k-nearest neighbor algorithm compares features and then extracts the category labels of the most similar samples (the nearest neighbors). So how is the comparison done? Take the example of Table 1.1 again: how do we judge the category of the movie marked with a red dot? See Figure 1.1.

Figure 1.1 Movie classification

From the scatter plot we can roughly infer that the red-dot movie is probably an action movie, because it lies closer to the dots of the two known action movies. What method does the k-nearest neighbor algorithm use to make this judgment? That's right: distance measurement. This movie classification example has two features, so the points live in a 2-dimensional real vector space, and we can use the two-point distance formula we learned in high school, shown in Figure 1.2.

Figure 1.2 Two-point distance formula
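The figure itself is not reproduced here; the formula it shows is the ordinary Euclidean two-point distance between points $(x_1, y_1)$ and $(x_2, y_2)$:

$$
|AB| = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}
$$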

By calculation, we get the following results:

  • distance from (101,20) to action movie (108,5): about 16.55
  • distance from (101,20) to action movie (115,8): about 18.44
  • distance from (101,20) to love movie (5,89): about 118.22
  • distance from (101,20) to love movie (1,101): about 128.69
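These distances can be checked in a few lines of NumPy (a quick verification sketch, not part of the original article's code):

```python
import numpy as np

# feature vectors: [fight shots, kiss shots]
new_movie = np.array([101, 20])
# samples in order: action (108,5), action (115,8), love (5,89), love (1,101)
samples = np.array([[108, 5], [115, 8], [5, 89], [1, 101]])

# Euclidean distance from the new movie to each sample
distances = np.sqrt(((samples - new_movie) ** 2).sum(axis=1))
# approximately 16.55, 18.44, 118.22, 128.69
print(distances.round(2))
```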

The calculation shows that the distance from the red-dot movie to the action movie (108,5) is the smallest, about 16.55. If the algorithm judged directly from this single result and classified the red-dot movie as an action movie, that would be the nearest neighbor algorithm, not the k-nearest neighbor algorithm. So what is the k-nearest neighbor algorithm? Its steps are as follows:

1. Calculate the distance between every point in the labeled data set and the current point.
2. Sort the points in ascending order of distance.
3. Select the k points closest to the current point.
4. Count the frequency of each category among these k points.
5. Return the most frequent category as the prediction for the current point.
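The final majority-vote step can be sketched with the standard library's `collections.Counter` (a minimal illustration; the labels below are the three nearest neighbors from the worked movie example):

```python
from collections import Counter

# labels of the k = 3 nearest neighbors, ordered by ascending distance
nearest_labels = ['action movie', 'action movie', 'love movie']

# most_common(1) returns the (label, count) pair with the highest count
winner, count = Counter(nearest_labels).most_common(1)[0]
print(winner)  # action movie
```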

For example, with k = 3, the three closest points in the movie example are, in order of distance, action movie (108,5), action movie (115,8), and love movie (5,89). Among these three points, the action movie category appears with frequency two-thirds and the love movie category with frequency one-third, so the red-dot movie is classified as an action movie. This decision process is the k-nearest neighbor algorithm.

1.3 Python3 Code Implementation

Now that we know the principle of the k-nearest neighbor algorithm, the next step is to implement it in Python 3, again taking movie classification as the example.

1.3.1 Preparing the Data Set

For the data in table 1.1, we can create it directly using NumPy, with the following code:

# -*- coding: utf-8 -*-
import numpy as np

"""
Function description: create the data set

Parameters:
    none
Returns:
    group - data set
    labels - category labels
Modify:
    2017-07-13
"""
def createDataSet():
    # four samples with two features each
    group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
    # labels of the four samples
    labels = ['love movie', 'love movie', 'action movie', 'action movie']
    return group, labels

if __name__ == '__main__':
    # create the data set
    group, labels = createDataSet()
    # print the data set
    print(group)
    print(labels)

The result of running the code is shown in Figure 1.3:

Figure 1.3 Running results

1.3.2 K-Nearest neighbor algorithm

Using the two-point distance formula, we calculate the distances, select the k points with the smallest distances, and return the majority category among them.

# -*- coding: utf-8 -*-
import numpy as np
import operator

"""
Function description: kNN algorithm, the classifier

Parameters:
    inX - data to classify (test set)
    dataSet - data used for training (training set)
    labels - category labels
    k - kNN parameter: the number of nearest neighbors to select
Returns:
    sortedClassCount[0][0] - classification result
Modify:
    2017-07-13
"""
def classify0(inX, dataSet, labels, k):
    # numpy's shape[0] returns the number of rows of dataSet
    dataSetSize = dataSet.shape[0]
    # tile inX dataSetSize times along the rows (once along the columns),
    # then subtract dataSet element-wise
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # square the per-feature differences
    sqDiffMat = diffMat ** 2
    # sum() adds all elements; sum(axis=0) sums columns; sum(axis=1) sums rows
    sqDistances = sqDiffMat.sum(axis=1)
    # take the square root to get the distances
    distances = sqDistances ** 0.5
    # argsort() returns the indices that sort distances in ascending order
    sortedDistIndices = distances.argsort()
    # dictionary recording how often each category appears
    classCount = {}
    for i in range(k):
        # category of the i-th nearest neighbor
        voteILabel = labels[sortedDistIndices[i]]
        # dict.get(key, default=None) returns the value for key,
        # or the default if key is not in the dictionary
        # count the occurrences of each category
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # in Python 3, items() replaces Python 2's iteritems()
    # key=operator.itemgetter(1) sorts by the dictionary's values
    # (itemgetter(0) would sort by the keys); reverse=True sorts in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the most frequent category, i.e. the predicted class
    return sortedClassCount[0][0]
1.3.3 Overall Code

Now we predict the category of the red-dot movie (101,20), using k = 3. Create a file knn_test01.py and write the following code:

# -*- coding: utf-8 -*-
import numpy as np
import operator

"""
Function description: create the data set

Parameters:
    none
Returns:
    group - data set
    labels - category labels
Modify:
    2017-07-13
"""
def createDataSet():
    # four samples with two features each
    group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
    # labels of the four samples
    labels = ['love movie', 'love movie', 'action movie', 'action movie']
    return group, labels

"""
Function description: kNN algorithm, the classifier

Parameters:
    inX - data to classify (test set)
    dataSet - data used for training (training set)
    labels - category labels
    k - kNN parameter: the number of nearest neighbors to select
Returns:
    sortedClassCount[0][0] - classification result
Modify:
    2017-07-13
"""
def classify0(inX, dataSet, labels, k):
    # number of rows of dataSet
    dataSetSize = dataSet.shape[0]
    # tile inX and subtract dataSet element-wise
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # square the per-feature differences
    sqDiffMat = diffMat ** 2
    # sum along each row
    sqDistances = sqDiffMat.sum(axis=1)
    # take the square root to get the distances
    distances = sqDistances ** 0.5
    # indices that sort distances in ascending order
    sortedDistIndices = distances.argsort()
    # dictionary recording how often each category appears
    classCount = {}
    for i in range(k):
        # category of the i-th nearest neighbor
        voteILabel = labels[sortedDistIndices[i]]
        # count the occurrences of each category
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # sort by count in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the most frequent category, i.e. the predicted class
    return sortedClassCount[0][0]

if __name__ == '__main__':
    # create the data set
    group, labels = createDataSet()
    # test sample
    test = [101, 20]
    # kNN classification
    test_class = classify0(test, group, labels, 3)
    # print the classification result
    print(test_class)
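The introduction also mentions an sklearn implementation. For this toy data set, a minimal sketch (assuming scikit-learn is installed; the class is `sklearn.neighbors.KNeighborsClassifier`) looks like this:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# the same four training samples and labels as above
group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
labels = ['love movie', 'love movie', 'action movie', 'action movie']

# n_neighbors=3 matches the k = 3 used in classify0
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(group, labels)

# predict the category of the red-dot movie (101, 20)
print(clf.predict([[101, 20]])[0])  # action movie
```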
