Machine Learning Algorithm One: K-Nearest Neighbor Algorithm

I have recently been studying some basic algorithms from "Machine Learning in Action". As a complete beginner, I also looked up material that others have written online; what follows is a summary of what I read in the book plus content from other blogs. For the original blog post, please refer to http://www.cnblogs.com/Baiyishaonian/p/4567446.html

K-Nearest Neighbor Algorithm

The k-nearest neighbor algorithm classifies by measuring the distance between different feature values.

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.

Disadvantages: high computational complexity and high space complexity.

Applicable data range: numeric and nominal values.

Working principle:

There is a collection of sample data, also called a training sample set, and every entry in it carries a label; in other words, we know which class each sample in the set belongs to. When new data with no label is entered, each feature of the new data is compared with the corresponding features of the samples in the set, and the algorithm extracts the class label of the most similar samples. In general we only consider the k most similar entries in the sample data set, which is where the name "k-nearest neighbor" comes from.
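As an aside, the same idea is available off the shelf. Here is a minimal sketch using the third-party library scikit-learn, which is my own addition and is assumed to be installed; the rest of this article builds the algorithm by hand with NumPy instead:

from sklearn.neighbors import KNeighborsClassifier

# Toy training data of the same shape as the data set created later in this article.
group = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']

clf = KNeighborsClassifier(n_neighbors=3)   # k = 3 neighbors
clf.fit(group, labels)
print clf.predict([[0, 0.2]])               # its 3 nearest neighbors vote B, B, A, so this prints ['B']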

Importing data using Python

From the working principle of the k-nearest neighbor algorithm, we can see that in order to classify data we need sample data on hand; without sample data there is no way to set up a classification function. So our first step is to import a sample data set.

Create a module named knn.py and write the code:

from numpy import *
import operator

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])   # four training points in 2-D
    labels = ['A', 'A', 'B', 'B']                                # the class label of each point
    return group, labels

In this code we need to import two Python modules: the scientific computing package NumPy and the operator module. The NumPy function library is a separate module in the Python environment, and most Python distributions do not install it by default, so we may need to install it separately.
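A quick way to check that NumPy is available is to try it in the interpreter; if the import fails, it can usually be installed with pip install numpy. The random matrix below will of course differ on every run:

>>> from numpy import *
>>> random.rand(4, 4)          # produces a 4x4 array of random numbers if NumPy is installed correctly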

In the createDataSet() function we create the data set: the matrix group and the label list labels form the training samples. As described in the working principle, every entry in the data set has a label, so labels contains exactly as many elements as group has rows. Here the data points (1.0, 1.1) and (1.0, 1.0) are defined as class A, and the points (0, 0) and (0, 0.1) as class B. The data in this example is chosen arbitrarily, and no axis coordinates are given.
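To confirm that the module works, load it in the Python interpreter from the directory that contains knn.py (the exact formatting of the array output depends on your NumPy version):

>>> import knn
>>> group, labels = knn.createDataSet()
>>> group
array([[ 1. ,  1.1],
       [ 1. ,  1. ],
       [ 0. ,  0. ],
       [ 0. ,  0.1]])
>>> labels
['A', 'A', 'B', 'B']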

Implementation of K-nearest neighbor algorithm

The specific idea of the K-nearest neighbor algorithm is as follows:

(1) Calculate the distance between each point in the known-category data set and the current point (a short worked example of this distance follows this list);

(2) Sort the points in ascending order of distance;

(3) Select the k points with the smallest distance to the current point;

(4) Determine the frequency with which each category occurs among these first k points;

(5) Return the category with the highest frequency among the first k points as the predicted classification of the current point.
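The distance in step (1) is the ordinary Euclidean distance, which is what the implementation in the next section computes. As a small worked example with two of the training points defined above:

>>> from numpy import array, sqrt
>>> a = array([1.0, 1.1])                     # a class A training point
>>> b = array([0.0, 0.1])                     # a class B training point
>>> sqrt(((a - b) ** 2).sum())                # sqrt((1.0 - 0.0)**2 + (1.1 - 0.1)**2) = sqrt(2)
1.4142135623730951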

The Python implementation of the k-nearest neighbor algorithm is as follows:

# coding: utf-8

from numpy import *
import operator
import knn                                               # the module created above, containing createDataSet()

group, labels = knn.createDataSet()

def classify(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                       # number of training samples
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet      # difference between inX and every sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)                  # squared Euclidean distance to each sample
    distances = sqDistances ** 0.5
    sortedDistIndices = distances.argsort()              # indices sorted by ascending distance
    classCount = {}
    for i in range(k):                                   # vote with the labels of the k nearest samples
        numOfLabel = labels[sortedDistIndices[i]]
        classCount[numOfLabel] = classCount.get(numOfLabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]                        # the most frequent label

my = classify([0, 0], group, labels, 3)
print my

Running this script prints B, which shows that our new data point ([0, 0]) belongs to class B.
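As a further sanity check, worked out by hand from the training set rather than taken from the original article, classifying a point that lies close to the class A samples should return 'A':

>>> classify([1.0, 1.2], group, labels, 3)    # nearest neighbors are (1.0, 1.1), (1.0, 1.0) and (0, 0.1): votes A, A, B
'A'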

Code explanation

I suspect many readers will still have questions about this code, so below I will go over some of the key points of this function, to make it easier for readers, and for myself, to review the algorithm.

Parameters of the classify function:

    • inX: the input vector to be classified
    • dataSet: the training sample set
    • labels: the label vector
    • k: the k in the k-nearest neighbor algorithm, i.e. the number of neighbors to consider

shape: an attribute of a NumPy array that describes the dimensions of a multidimensional array; shape[0] is the number of rows.
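Continuing the interpreter session from the import section, the training set created earlier illustrates this:

>>> group.shape                       # 4 rows (samples) and 2 columns (features)
(4, 2)
>>> group.shape[0]
4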

tile(inX, (dataSetSize, 1)): replicates inX into a two-dimensional array with dataSetSize rows; the 1 means the columns are repeated only once. The whole line then subtracts the training set dataSet element by element from this tiled array, which implements the subtraction between the two matrices.
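For example, continuing the same interpreter session (with from numpy import * and group in effect), tiling the input vector [0, 0] to match the four training samples gives:

>>> tile([0, 0], (4, 1))
array([[0, 0],
       [0, 0],
       [0, 0],
       [0, 0]])
>>> tile([0, 0], (4, 1)) - group      # subtracts every training point element by element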

axis=1: with this parameter, sum() adds up the numbers within each row of the matrix, producing one sum per row; with axis=0 it adds up the numbers within each column instead.
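A small example (again with from numpy import * in effect) makes the difference clear:

>>> m = array([[1, 2], [3, 4]])
>>> m.sum(axis=1)                     # sums within each row
array([3, 7])
>>> m.sum(axis=0)                     # sums within each column
array([4, 6])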

argsort(): returns the indices that would sort the array in ascending (non-descending) order; the array values themselves are not moved.
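For example, applying argsort() to four distances:

>>> d = array([1.48, 1.56, 0.1, 0.2])
>>> d.argsort()                       # indices in ascending order of the values: the smallest is at index 2
array([2, 3, 0, 1])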

classCount.get(numOfLabel, 0) + 1: get() is a dictionary method that looks up the item whose key is numOfLabel; if the key is not present, it returns the default value 0. We then add 1 to that value and store it back, so each label's votes are counted in a single, simple line of Python.
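The voting therefore proceeds like this in an interactive session:

>>> classCount = {}
>>> classCount['B'] = classCount.get('B', 0) + 1    # 'B' is not present yet, so get() returns the default 0
>>> classCount
{'B': 1}
>>> classCount['B'] = classCount.get('B', 0) + 1    # now get() returns 1, so the count becomes 2
>>> classCount
{'B': 2}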
