Machine Learning in Action Study Notes 1: The KNN Algorithm

Source: Internet
Author: User
I. KNN algorithm overview

1. How the KNN algorithm works:

(1) A training sample set exists in which every sample carries a category label, so the correspondence between each sample and its class is known.
(2) When new, unlabeled data arrives, its features are compared with the features of every sample in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors).
(3) Finally, only the K most similar samples are considered (K is chosen freely to suit the problem and is usually an integer no greater than 20), and the category that appears most often among them becomes the classification of the new data.
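
The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration of the idea, not the implementation used later in these notes; the function name `knn_predict` and the toy points are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(query, samples, labels, k):
    """Hypothetical helper: classify `query` by majority vote of its k nearest samples."""
    # Distance from the query to every labeled sample
    distances = [math.dist(query, s) for s in samples]
    # Indices of the k samples closest to the query
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # The most frequent label among those k neighbors wins
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two points per class
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']
prediction = knn_predict((0.9, 0.8), samples, labels, k=3)
print(prediction)  # prints A
```

With k=3 the two nearest neighbors of (0.9, 0.8) both belong to class A, so A wins the vote even though the third neighbor is a B point.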

2. Advantages and disadvantages of the KNN algorithm:

(1) Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
(2) Disadvantages: high computational complexity and high space complexity.
Scope of application: numeric and nominal data.

II. KNN in practice 1: film classification

1. Data preprocessing

To keep testing simple, the raw data is labeled with only two categories. The Python code is as follows:

from numpy import *
import operator

def createDataSet():
    # Four 2-D sample points and their class labels
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

2. For each point of unknown category, perform the following steps:

(1) Compute the distance between every point in the labeled dataset and the current point;
(2) Sort the points in order of increasing distance;
(3) Select the K points closest to the current point;
(4) Count how often each category occurs among these K points;
(5) Return the most frequent category among the K points as the predicted classification of the current point.

3. KNN algorithm implementation:

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Step 1: compute the Euclidean distance from inX to every training point
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Step 2: indices of the training points, ordered by increasing distance
    sortedDistIndicies = distances.argsort()
    # Step 3: count the labels of the k nearest points
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Step 4: return the label with the most votes
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

Test:
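
A minimal sketch of such a test, assuming Python 3 and NumPy. To stay self-contained, the block restates the four-point dataset and uses a compact variant of the classifier (called `knn_classify` here, a hypothetical name) that relies on NumPy broadcasting instead of `tile`.

```python
import numpy as np
from collections import Counter

def create_dataset():
    # Same four points and labels as the example dataset in these notes
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

def knn_classify(in_x, data_set, labels, k):
    # Euclidean distance from in_x to every row (broadcasting replaces tile)
    distances = np.sqrt(((data_set - np.asarray(in_x)) ** 2).sum(axis=1))
    # Indices of the k nearest rows, then a majority vote over their labels
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group, labels = create_dataset()
result = knn_classify([0, 0], group, labels, 3)
print(result)  # prints B
```

The point (0, 0) is closest to the two B samples, so two of its three nearest neighbors vote for B.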

III. KNN in practice 2: handwritten digit recognition

1. Data preprocessing

Building a handwriting recognition system with the KNN algorithm requires two datasets, trainingDigits and testDigits. The trainingDigits dataset contains approximately 2000 samples for training the classifier, and the testDigits dataset contains approximately 900 samples for testing how well the classifier performs.

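The following function reads one image file and flattens it into a 1x1024 vector:

def img2vector(filename):
    # Flatten a 32x32 text image of digit characters into a 1x1024 vector
    returnVect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect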
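
To illustrate the conversion, the sketch below writes a synthetic 32x32 image of `0`/`1` characters to a temporary file and flattens it into a 1x1024 vector. It assumes each digit file consists of 32 lines of 32 characters, which is what the index `32*i+j` implies; the helper name `image_to_vector` and the synthetic image are made up for the example.

```python
import os
import tempfile
import numpy as np

def image_to_vector(filename, size=32):
    """Hypothetical variant of img2vector: flatten a size x size text image."""
    vect = np.zeros((1, size * size))
    with open(filename) as fr:
        for i in range(size):
            line = fr.readline()
            for j in range(size):
                # Row i, column j of the image goes to position size*i + j
                vect[0, size * i + j] = int(line[j])
    return vect

# Synthetic image (an assumption for the demo): only the first row is all ones
rows = ['1' * 32] + ['0' * 32] * 31
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(rows) + '\n')
    path = tmp.name

vector = image_to_vector(path)
os.remove(path)
print(vector.shape)       # (1, 1024)
print(int(vector.sum()))  # 32
```

The first 32 entries of the vector hold the first image row, the next 32 hold the second row, and so on, so every image becomes one row of the training matrix.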

2. Recognizing handwritten digits with the KNN algorithm

from os import listdir

def handwritingClassTest():
    lllabels = []
    # Load the training set: one row of trainingMat per image file
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        # The true digit is the part of the file name before the underscore
        classNumStr = int(fileStr.split('_')[0])
        lllabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    # Classify every test file and count the mistakes
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest,
                                     trainingMat, lllabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
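
The test routine rests on two small ideas: the true digit is encoded in the file name before the underscore, and the error rate is the fraction of misclassified test files. A minimal sketch of both, assuming file names of the form `digit_index.txt` (inferred from the `split` calls in the code); the file names and predictions below are made up for illustration.

```python
def label_from_filename(file_name):
    # '9_45.txt' -> '9_45' -> '9' -> 9
    stem = file_name.split('.')[0]
    return int(stem.split('_')[0])

def error_rate(predicted, actual):
    # Fraction of positions where the prediction differs from the true label
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / float(len(actual))

# Hypothetical file names and classifier outputs
test_files = ['0_1.txt', '3_12.txt', '9_45.txt', '7_8.txt']
true_labels = [label_from_filename(f) for f in test_files]
predictions = [0, 3, 4, 7]   # one mistake: the 9 was classified as 4

print(true_labels)                           # [0, 3, 9, 7]
print(error_rate(predictions, true_labels))  # 0.25
```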

Test results:

IV. Summary

KNN is one of the simplest and most effective algorithms for classifying data. It is an instance-based learner, so the training samples it uses must be close to the actual data it will be asked to classify.
Because the algorithm keeps the entire training set and compares each new sample against every stored sample, a large training dataset requires a large amount of storage space and computation time.
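
A rough back-of-the-envelope sketch of that cost for the handwriting example, assuming exactly 2000 training samples and 900 test samples (the text gives these only as approximate figures) and NumPy's default 8-byte floats:

```python
n_train = 2000        # assumed; the text says "approximately 2000"
n_test = 900          # assumed; the text says "approximately 900"
n_features = 1024     # length of each image vector
bytes_per_value = 8   # NumPy's default float64

# Memory needed just to hold the training matrix
training_bytes = n_train * n_features * bytes_per_value
# Each test sample is compared with every training sample
distance_computations = n_test * n_train
# Each distance takes one subtraction and one squaring per feature
per_feature_operations = distance_computations * n_features

print(training_bytes / 1e6)    # 16.384 (megabytes)
print(distance_computations)   # 1800000
print(per_feature_operations)  # 1843200000
```

Both figures grow linearly with the number of training samples, which is why KNN becomes expensive on large datasets.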
