Machine Learning in Action: KNN Classifier


Lazy learning: simply store the training data, wait until a test tuple is given, and classify it according to its similarity to the stored tuples. The KNN (k-nearest neighbors) classification method was proposed in the 1950s; because the algorithm is computationally intensive, it only came into wide use after the 1960s, as computing power increased.

KNN is based on learning by analogy: a given test tuple is represented as a point in n-dimensional space, where n is the number of attributes. Some distance metric is then used to find the k training tuples closest to the test tuple, the class labels of those k tuples are counted, and the majority class among them is returned as the class of the unknown test tuple.

The most commonly used distance measure is the Euclidean distance, also known as the L2 norm. At the same time, to reduce the influence that attributes with different value ranges have on the distance calculation, min-max normalization is used to map attribute values onto the [0, 1] interval.
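Written out explicitly (these are the standard definitions, not formulas appearing in the original post): with n attributes, the Euclidean distance between tuples x and y, the min-max normalization of a value v of attribute A, and the majority vote over the k nearest neighbors N_k(x) are

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad v' = \frac{v - \min_A}{\max_A - \min_A}, \qquad \hat{y}(x) = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}[y_i = c]$$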

Given these characteristics, KNN is best suited to numeric attributes. Ordinal attributes can be converted to numeric values, and nominal attributes can also be handled reasonably well after encoding and normalization, but binary attributes may not work as well; a sketch of these conversions follows.
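As an illustration of the conversions just mentioned (my own sketch; the attribute names are made up), an ordinal attribute can be mapped onto evenly spaced values in [0, 1], while a nominal attribute can be one-hot encoded into separate 0/1 features:

# Hypothetical attributes, for illustration only.
# Ordinal: order matters, so map levels to evenly spaced numeric values.
size_levels = {'small': 0.0, 'medium': 0.5, 'large': 1.0}

# Nominal: no order, so expand into one 0/1 feature per category.
def one_hot(value, categories):
    return [1.0 if value == c else 0.0 for c in categories]

print(size_levels['medium'])                       # 0.5
print(one_hot('green', ['red', 'green', 'blue']))  # [0.0, 1.0, 0.0]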

Main advantages and disadvantages:

Advantages: high accuracy, insensitivity to outliers, and no assumptions about the input data.

Disadvantages: high time and space complexity, and the value of k must be chosen, which may require considerable experience.

Below, the KNN algorithm from the book Machine Learning in Action is implemented and used to classify a spam email dataset. The data contains 3065 training samples and 1536 test samples. Each sample has 57 features, consisting of numeric and binary attributes, plus a class label in {0, 1}, where 0 means not spam and 1 means spam.

The first step is to read the data from the file:

import os
from numpy import zeros, tile

# Dataset dimensions, taken from the description above.
trainset_num = 3065   # number of training samples
testset_num = 1536    # number of test samples
features_num = 57     # number of features per sample

def loadDataSet(fp):
    """Read the comma-separated data file into training/test matrices and label lists."""
    if not os.path.exists(fp):
        print("The data file does not exist!")
        return None
    rtnTrainSet = zeros((trainset_num, features_num))
    trainLabel = []
    rtnTestSet = zeros((testset_num, features_num))
    testLabel = []
    with open(fp, 'r') as fh:
        for i, line in enumerate(fh):
            terms = line.strip().split(',')
            features = [float(t) for t in terms[0:features_num]]
            label = int(terms[features_num])
            if i < trainset_num:           # first 3065 rows form the training set
                rtnTrainSet[i, :] = features
                trainLabel.append(label)
            else:                          # remaining 1536 rows form the test set
                rtnTestSet[i - trainset_num, :] = features
                testLabel.append(label)
    return rtnTrainSet, trainLabel, rtnTestSet, testLabel
Each row of the file represents one sample, with the features separated by commas and the class label in the last field. NumPy, a third-party library, is used to store the data in matrices.

Next, the data is min-max normalized to the [0, 1] interval:

def normalize(ds):
    """Min-max normalize every column of ds to the [0, 1] interval."""
    minVals = ds.min(0)          # per-column minimum
    maxVals = ds.max(0)          # per-column maximum
    ranges = maxVals - minVals
    ranges[ranges == 0] = 1      # guard: a constant column would divide by zero
    n = ds.shape[0]
    normDS = ds - tile(minVals, (n, 1))    # plain NumPy broadcasting would also work
    normDS = normDS / tile(ranges, (n, 1))
    return normDS
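A quick sanity check of normalize on a toy array (my own example, assuming NumPy as imported above):

from numpy import array

demo = array([[1.0, 10.0],
              [3.0, 20.0],
              [2.0, 15.0]])
print(normalize(demo))
# each column is scaled to [0, 1], giving rows [0, 0], [1, 1], [0.5, 0.5]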

Finally, the classifier itself: for each test sample, compute its distance to every training sample, take the k nearest neighbors, and return the most frequent class among them.

import operator

def knnClassify(ds, labels, k, inputX):
    """Classify inputX by majority vote among its k nearest training samples."""
    dsSize = ds.shape[0]
    # Euclidean distance from inputX to every row of the training set.
    diff = tile(inputX, (dsSize, 1)) - ds
    sqDiff = diff ** 2
    sqDist = sqDiff.sum(axis=1)
    dist = sqDist ** 0.5
    sortedDist = dist.argsort()    # indices of training samples, nearest first
    classCount = {}
    for i in range(k):
        votedLabel = labels[sortedDist[i]]
        classCount[votedLabel] = classCount.get(votedLabel, 0) + 1
    # Sort the vote counts in descending order and return the winning label.
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
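
Putting the pieces together, here is a minimal evaluation sketch (my own glue code, not from the original post; the file name spambase.data is an assumption):

def evaluate(k, dataFile='spambase.data'):
    # Load, normalize, and classify every test sample with the functions above.
    trainSet, trainLabel, testSet, testLabel = loadDataSet(dataFile)
    trainSet = normalize(trainSet)
    testSet = normalize(testSet)
    correct = 0
    for i in range(testSet.shape[0]):
        if knnClassify(trainSet, trainLabel, k, testSet[i, :]) == testLabel[i]:
            correct += 1
    return correct / testSet.shape[0]

print('Test accuracy: %.4f' % evaluate(10))

One caveat: normalizing the test set with its own minima and maxima, as done here for simplicity, lets test statistics leak into preprocessing; reusing the ranges computed on the training set is the stricter approach.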
   
The 1536 test samples were classified with a range of k values, the accuracy was recorded for each k, and the following graph was plotted:
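The original post does not show the sweep code itself; a sketch, assuming evaluate() from above and matplotlib, and with the range of k values a guess:

import matplotlib.pyplot as plt

ks = list(range(1, 31))                  # assumed sweep range
accuracies = [evaluate(k) for k in ks]   # evaluate() defined above

plt.plot(ks, accuracies, marker='o')
plt.xlabel('k')
plt.ylabel('test accuracy')
plt.title('KNN accuracy on the spam data for different k')
plt.show()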

(Figure: classification accuracy on the 1536 test samples for different values of k.)

As the graph shows, the accuracy peaks at about 75% when k = 10. This result is not very good; the main reason is that some of the 57 spam features are binary attributes, which distort the distance calculation when KNN is used to classify the samples.

Finally, classifying the training set itself with k = 10 gives the following result:

Train accuracy: 0.9282

