"Machine learning" K-Nearest neighbor algorithm and algorithm example

Source: Internet
Author: User

In machine learning, classification algorithms are used frequently, and among the many classification algorithms there is one named K-nearest neighbor, also known as the KNN algorithm.

First, the working principle of the KNN algorithm

Second, applicable situations

Third, an algorithm example and explanation

---1. Collect the data

---2. Prepare the data

---3. Design the algorithm and analyze the data

---4. Test the algorithm

First, the working principle of the KNN algorithm

Official explanation: there is a sample data set, also called the training sample set, in which every data point carries a label; that is, we know the relationship between each data point in the sample set and the category it belongs to. When new data without a label is entered, each feature of the new data is compared against the features of the data in the sample set, and the algorithm extracts the classification labels of the most similar (nearest-neighbor) data. In general, we select only the first k most similar data points in the sample set, which is where the "k" in K-nearest neighbor comes from; usually k is an integer no greater than 20. Finally, the classification that occurs most often among those k most similar data points is taken as the classification of the new data.

My understanding: the K-nearest neighbor algorithm rests on the idea that "the classification of new data depends on its neighbors". For example, if most of a person's neighbors are veterans, then this person is also most likely a veteran. The goal of the algorithm is to find the new data point's neighbors, and then take the classification held by the majority of those neighbors as the most likely classification of the point itself.
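The neighbor vote at the heart of this idea can be sketched in a few lines (the labels below are made-up values, purely for illustration):

```python
from collections import Counter

# Suppose the k = 5 nearest neighbors of a new data point
# carry these (hypothetical) labels
neighbor_labels = ["veteran", "veteran", "civilian", "veteran", "civilian"]

# The prediction is simply the label that occurs most often among them
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # veteran
```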

Second, applicable situations

Advantages: high precision; insensitive to outliers (your category is determined by the majority of your neighbors, so one anomalous neighbor does not change much); no assumptions about the input data.

Disadvantages: high computational complexity (the "distance" between the new data point and every data point in the sample set must be computed to determine the k nearest neighbors); high space complexity (a huge matrix must be held in memory).

Applicable data ranges: numeric values (the target variable can take values from an infinite set) and nominal values (the target variable takes values only from a finite set).

Third, an algorithm example and explanation

The example comes from the book Machine Learning in Action; the code samples are written in Python (the NumPy library is required), but as long as the algorithm itself is clear, it can be written in any other language:

Helen has been using online dating sites to find a suitable date. Although the sites recommend different people, she has not found anyone she likes. After summing up her experience, she found that the people she has dated fall into three types: 1. people she did not like; 2. somewhat charming people; 3. very charming people.

Despite discovering the pattern above, Helen is still unable to sort the matches recommended by the dating site into the appropriate categories. She feels she can date the somewhat charming people on weekdays, while on weekends she prefers the company of the very charming ones. Helen hopes our classification software can do a better job of sorting matches into the exact categories. In addition, Helen has collected data that the dating sites do not record, and she believes this data is more useful for classifying matches.

Let's look at the objective of this case: to classify a given person (as 1, 2, or 3) based on some data about them. What information do we need to achieve this goal with the KNN algorithm? As mentioned earlier, we need sample data, and here the sample data is the data Helen collected herself that the dating sites have not recorded. OK, let's get started!

---1. Collect the data

Helen collected data on three characteristics of each person: the number of frequent flyer miles earned per year, the percentage of time spent playing video games, and the number of liters of ice cream consumed per week. The data is a txt-format file in which the first three columns are the three features in turn and the fourth column is the category (1: disliked, 2: somewhat charming, 3: very charming); each line represents one person.

The download link for the data document is: http://pan.baidu.com/s/1jG7n4hS


---2. Prepare the data

What does preparing the data mean? The collected data sits in a txt-format document, which looks regular enough, but the computer cannot use it directly. The computer needs to read the data out of the txt document and format it, that is, load it into matrices, so that it can be processed.

Two matrices are required: one to hold the three feature columns and one to hold the corresponding classifications. So we define a function whose input is the data document (txt format) and whose output is the two matrices.

The code is as follows:

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

A brief reading of the code: first open the file and read its number of lines, then initialize the two matrices to be returned (returnMat and classLabelVector), then enter the loop and assign each row's data to returnMat and classLabelVector.
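To see what this parsing step produces, here is a self-contained sketch that writes a tiny tab-separated file (three made-up rows in the same format as Helen's data, using a hypothetical filename) and parses it with the same logic as file2matrix:

```python
import numpy as np

# Three made-up rows in the same format as Helen's file:
# three tab-separated feature columns plus a class label
sample = ("40920\t8.326976\t0.953952\t3\n"
          "14488\t7.153469\t1.673904\t2\n"
          "26052\t1.441871\t0.805124\t1\n")
with open("datingDemo.txt", "w") as f:
    f.write(sample)

# Same parsing logic as file2matrix
with open("datingDemo.txt") as fr:
    lines = fr.readlines()
returnMat = np.zeros((len(lines), 3))
classLabelVector = []
for index, line in enumerate(lines):
    listFromLine = line.strip().split("\t")
    returnMat[index, :] = listFromLine[0:3]   # NumPy casts the strings to floats
    classLabelVector.append(int(listFromLine[-1]))

print(returnMat.shape)   # (3, 3)
print(classLabelVector)  # [3, 2, 1]
```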

---3. Design the algorithm and analyze the data

The purpose of the K-nearest neighbor algorithm is to find the k nearest neighbors of the new data, and then determine the classification of the new data according to the neighbors' classifications.

The first problem to solve is: what counts as a neighbor? Naturally, a neighbor is whoever is "close" in distance, but how do we measure the distance between different people? This sounds abstract, but we have three feature values for each person, so each person can be represented by those three features, that is, as a point in three dimensions. For example, the first person in the sample can be represented as (40920, 8.326976, 0.953952), and his classification is 3. The distance between people is then just the distance between points:

For point A (x1, x2, x3) and point B (y1, y2, y3), the distance between the two points is the square root of (x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2. Compute the distance between the new data point and every point in the sample, sort from smallest to largest, take the first k as the nearest neighbors, then see which classification appears most often among those k neighbors, and that gives the final answer.
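A quick worked example of the formula, using the first sample point quoted above and a second point with made-up values. Note how the frequent-flyer-miles feature completely dominates the result, which is exactly the problem the normalization step deals with later:

```python
import numpy as np

# Point A is the first sample point quoted above; point B uses made-up values
a = np.array([40920.0, 8.326976, 0.953952])
b = np.array([14488.0, 7.153469, 1.673904])

# sqrt((x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2)
dist = np.sqrt(((a - b) ** 2).sum())
print(dist)  # ~26432.0 -- almost entirely the frequent-flyer-miles difference
```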

This process is also wrapped in a function; the code is as follows:

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

A brief explanation of the code: the function takes 4 parameters: inX, the three features of the new data; the sample feature set (the first return value of the previous function); the sample classifications (the second return value of the previous function); and k. The function returns the classification of the new data. The second line, dataSetSize, gets the number of rows of the feature-set matrix; the third line takes the difference between the new data and each sample row; the following lines square the differences, sum them, and take the square root. The sorting functions used in the code are built into Python.
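To see classify0 in action without Helen's file, here is a self-contained demo on a toy two-dimensional sample set (the points and labels are made up; the function body repeats the logic above so the snippet runs on its own):

```python
import numpy as np
import operator

# Compact restatement of classify0, so this demo is self-contained
def classify0(inX, dataSet, labels, k):
    diffMat = np.tile(inX, (dataSet.shape[0], 1)) - dataSet
    distances = (diffMat ** 2).sum(axis=1) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    return sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)[0][0]

# Toy sample set: four 2-D points with made-up labels
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]

print(classify0(np.array([0.0, 0.0]), group, labels, 3))  # B
print(classify0(np.array([1.0, 1.2]), group, labels, 3))  # A
```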

Well, now we can analyze the data, but there is one catch. If we go back to the data set, the values in the first column are far larger than those of the other two features, so that column takes up a huge proportion of the distance formula: the distance between two points would depend almost entirely on this one feature, which is of course unfair. We need all three features to carry equal weight in determining the distance, so we have to process the data, in a way that does not affect relative ordering within a feature but makes the features fairly comparable:

This method is called normalization; with it, the range of values in each column can be scaled into 0~1 or -1~1. The formula is:

newValue = (oldValue - min) / (max - min)

The code for the normalization function is:

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
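As a quick check of the newValue formula, here it is applied to a single made-up feature column, the same way autoNorm treats each column of the matrix:

```python
import numpy as np

# Made-up frequent-flyer-mile values for four people
miles = np.array([0.0, 20000.0, 40000.0, 80000.0])

# newValue = (oldValue - min) / (max - min)
normed = (miles - miles.min()) / (miles.max() - miles.min())
print(normed)  # [0.   0.25 0.5  1.  ]
```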

---4. Test the algorithm

With the data formatted, the values normalized, and the KNN core algorithm function complete, we can now run a test. The test code is:

def datingClassTest():
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

Through the test code we can review the whole process of this example:

    • Read the txt file and extract the data into datingDataMat and datingLabels;
    • Normalize the data to get the normalized data matrix;
    • There is more than one test record, so a loop is needed to classify each test record in turn.

People may not quite understand what hoRatio in the code is. Note that the test data here is not a separate batch of data but a portion of the data set itself, so we can compare the algorithm's results with the original classifications to measure its accuracy. Helen provided a data set of 1000 rows; we take the first 100 rows as test data and the remaining 900 rows as the sample data set. Now you should be able to see what hoRatio is.
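The arithmetic behind hoRatio, spelled out for Helen's 1000-row data set:

```python
# hoRatio decides what fraction of the data set is held out for testing
hoRatio = 0.10
m = 1000                    # rows in Helen's data set
numTestVecs = int(m * hoRatio)

print(numTestVecs)          # 100 rows used as test data
print(m - numTestVecs)      # 900 rows kept as the sample (training) set
```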

The overall code:

from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

def datingClassTest():
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

Run the code; here I'm using IPython:

The final error rate is 0.05.
