KNN (K-Nearest Neighbor) recognition of the MNIST dataset

KNN Algorithm Introduction

The KNN (K-Nearest Neighbors) algorithm is the simplest and easiest to understand of all machine learning algorithms. KNN is an instance-based method: it computes the distance between new data and the feature values of the training data, then selects the K (K >= 1) nearest neighbors to classify (by voting) or to regress. If K = 1, the new data point is simply assigned the class of its nearest neighbor.

Is KNN supervised or unsupervised learning? Consider the definitions first. In supervised learning the data carries explicit labels (classification for discrete distributions, regression for continuous ones), and the learned model assigns new data to a definite class or produces a definite predicted value. In unsupervised learning the data has no labels, and what the machine learns is a pattern extracted from the data (deterministic features, clusters, and so on); clustering, for example, learns which of the original samples a new data point is "most like". In KNN classification every training sample has an explicit label and the label of new data is determined unambiguously, and KNN regression likewise predicts a definite value from the neighbors' values, so KNN belongs to supervised learning.

KNN Algorithm Flow

1. Choose a distance metric and compute, over all features of the dataset, the distance between the new data point and every point in the labeled training data.
2. Sort by distance in ascending order and select the K points closest to the new data.
3. For classification, the K points vote and the category occurring most frequently is returned as the predicted class; for regression, a weighted value of the K points is returned as the prediction. This flow is implemented in the kNNClassify function below.

Key Points of the KNN Algorithm

All features of the data must be comparable and quantified.
If a data feature is non-numeric, it must be quantified numerically. For example, if a sample feature is a color (red / black / blue), there is no distance between colors as such, but a distance can be computed by converting each color to a grayscale value. In addition, a sample usually has multiple features, each with its own domain and value range, and they influence the distance calculation unevenly: a feature with larger values will drown out features with smaller values. To be fair, the sample features have to be scaled, and the simplest approach is to normalize all feature values, as in the sketch below. Finally, a distance function is needed to compute the distance between two samples.
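As an illustration, here is a minimal min-max normalization sketch in NumPy (the autoNorm function and the sample data are hypothetical, not part of the original code):

from numpy import array

def autoNorm(dataSet):
    # scale every feature column into [0, 1]: (x - min) / (max - min)
    minVals = dataSet.min(axis=0)
    maxVals = dataSet.max(axis=0)
    return (dataSet - minVals) / (maxVals - minVals)

# example: two features with very different value ranges
data = array([[1000.0, 2.0], [2000.0, 4.0], [1500.0, 3.0]])
print(autoNorm(data))  # both columns now lie in [0, 1]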
There are many definitions of distance, such as the Euclidean distance, cosine distance, Hamming distance, Manhattan distance, and so on. In general the Euclidean distance is chosen as the metric, but it only applies to continuous variables. Learning a specialized metric can significantly improve the accuracy of K-nearest-neighbor classification, for example with the large margin nearest neighbor (LMNN) method or neighborhood components analysis (NCA).
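For concreteness, these metrics can be computed directly with NumPy; a minimal sketch (the vectors a and b are made-up examples):

from numpy import array, dot
from numpy.linalg import norm

a = array([1.0, 2.0, 3.0])
b = array([4.0, 0.0, 3.0])

euclidean = norm(a - b)                        # sqrt of the sum of squared differences
manhattan = abs(a - b).sum()                   # sum of absolute differences
cosine = 1 - dot(a, b) / (norm(a) * norm(b))   # 1 minus the cosine similarity

# Hamming distance: the number of positions where two equal-length sequences differ
hamming = (array([1, 0, 1, 1]) != array([1, 1, 0, 1])).sum()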
Determine the Value of K

K is a user-defined constant, and its value directly affects the final estimate. One way to choose K is the cross-validation error statistic method. Cross-validation, mentioned earlier, takes part of the data as training samples and part as test samples, for example 95% as training and the remainder for testing; a model is trained on the training data and its error rate is measured on the test data. The cross-validation error statistic method compares the average cross-validation error rate under different values of K and selects the K with the lowest error rate. For example, try K = 1, 2, 3, ...; for each K = i run 100 rounds of cross-validation and compute the average error, then pick the K whose error is smallest, as sketched below.
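A minimal sketch of this selection procedure using scikit-learn (an assumption on my part; the original post does not use scikit-learn, and load_digits stands in for the MNIST data):

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

bestK, bestAcc = 1, 0.0
for k in range(1, 11):
    # 10-fold cross-validation: average accuracy over the folds
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    if acc > bestAcc:
        bestK, bestAcc = k, acc
print(bestK, bestAcc)  # the K with the lowest average error rate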

Advantages and Disadvantages of the KNN Algorithm

1. Simple and effective.
2. Low retraining cost (essentially no training phase is required).
3. Computation time and space are linear in the size of the training set (acceptable in some settings), so recognition becomes very slow when the sample is too large.
4. The value of K is hard to determine.

MNIST Handwriting Data Recognition

MNIST is a handwritten digit database containing the digits 0-9; each image is 32*32 pixels. For a detailed introduction and the data download, see here.

This uses two Python libraries, PIL and NumPy; if you have not installed them, you can refer to my other blog post for configuration and installation, so I will not repeat that here.
The code is my modification of Daniel's original code (see the references below). I have also uploaded both to CSDN: one is Daniel's original code, the other my new version.

To recognize the MNIST handwritten digits with the KNN algorithm, the steps are as follows:
First, turn each handwritten digit into a string of 0s and 1s: black pixels in the original image become 1, white pixels become 0, and the result is written to a txt file.
Python code:

from PIL import Image
from numpy import zeros

def img2vector(impath, savepath):
    '''
    Convert the image to a numpy array.
    Black pixels are set to 1, white pixels to 0.
    '''
    im = Image.open(impath)
    im = im.transpose(Image.ROTATE_90)
    im = im.transpose(Image.FLIP_TOP_BOTTOM)

    rows = im.size[0]
    cols = im.size[1]
    imBinary = zeros((rows, cols))
    for row in range(0, rows):
        for col in range(0, cols):
            imPixel = im.getpixel((row, col))[0:3]
            if imPixel == (0, 0, 0):
                imBinary[row, col] = 1

    # save a temp txt like 1_5.txt, which means the class is 1 and the index is 5
    fp = open(savepath, 'w')
    for x in range(0, imBinary.shape[0]):
        for y in range(0, imBinary.shape[1]):
            fp.write(str(int(imBinary[x, y])))
        fp.write('\n')
    fp.close()
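It would be called like this, for instance (the file names here are hypothetical; the class_index naming convention comes from the comment in the code):

# convert the image of digit 1, sample 5, into its 0/1 txt representation
img2vector('digits/1_5.png', 'digits/1_5.txt')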

The resulting txt file is a 32*32 grid of 0s and 1s that traces the shape of the digit.

Second, convert the 0/1 strings in all the txt files into row vectors.
Python code:

from numpy import zeros

def vectorOneLine(filename):
    rows = 32
    cols = 32
    imgVector = zeros((1, rows * cols))
    fileIn = open(filename)
    for row in range(rows):
        lineStr = fileIn.readline()
        for col in range(cols):
            imgVector[0, row * cols + col] = int(lineStr[col])
    fileIn.close()
    return imgVector
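To build the full training set, every txt file can be stacked into one matrix, with the label parsed from the file name. A minimal sketch under the naming convention above (the loadDataSet helper and the 'trainingDigits' directory name are my assumptions, not from the original code):

from os import listdir
from numpy import zeros

def loadDataSet(dirName):
    fileList = listdir(dirName)
    dataSet = zeros((len(fileList), 32 * 32))
    labels = []
    for i in range(len(fileList)):
        # file names look like 1_5.txt: class 1, sample index 5
        labels.append(int(fileList[i].split('_')[0]))
        dataSet[i, :] = vectorOneLine(dirName + '/' + fileList[i])
    return dataSet, labels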

KNN Recognition
Python code:

from numpy import tile, argsort

def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] is the number of rows (samples)

    # calculate the Euclidean distance
    diff = tile(newInput, (numSamples, 1)) - dataSet  # subtract element-wise
    squaredDiff = diff ** 2                           # square the differences
    squaredDist = squaredDiff.sum(axis=1)             # sum over each row
    distance = squaredDist ** 0.5

    # sort the distance vector
    sortedDistIndices = argsort(distance)

    # choose the k nearest elements and count their labels
    classCount = {}  # a dictionary mapping label -> vote count
    for i in range(k):
        voteLabel = labels[sortedDistIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    # vote: the label with the most votes is the final result
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key
    return maxIndex
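Putting the pieces together, a test image could then be classified like this (a sketch under the assumptions above; the loadDataSet helper and the 'testDigits/8_12.txt' file name are hypothetical):

trainX, trainY = loadDataSet('trainingDigits')
testVector = vectorOneLine('testDigits/8_12.txt')
predicted = kNNClassify(testVector[0], trainX, trainY, 3)
print(predicted)  # should print 8 if the neighbors vote correctly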

Recognition Results

References:
[1] Daniel's blog: http://blog.csdn.net/zouxy09/article/details/16955347
[2] MATLAB implementation of KNN: http://blog.csdn.net/rk2900/article/details/9080821
[3] Advantages and disadvantages of classification algorithms: http://bbs.pinggu.org/thread-2604496-1-1.html
[4] Code downloads: http://download.csdn.net/detail/gavin__zhou/9208821 and http://download.csdn.net/detail/gavin__zhou/9208827
