Machine Learning in Action: KNN Classifier


Lazy learning: simply store the training data, wait until a test tuple is given, and classify it according to its similarity to the stored tuples. The KNN (k-nearest neighbors) classification method was proposed in the 1950s; because the algorithm is computationally intensive, it only came into wide use after the 1960s, as computing power increased.

KNN is based on learning by analogy: a given test tuple is represented as a point in n-dimensional space, where n is the number of attributes. Some distance metric is then used to find the k training tuples closest to the test tuple, the class labels of those k tuples are counted, and the majority class among them is returned as the class of the unknown test tuple.

The most commonly used distance measure is the Euclidean distance, also known as the L2 norm. At the same time, to reduce the influence that attributes with different value ranges have on the distance calculation, min-max normalization is used to map attribute values onto the [0, 1] interval.
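Written out explicitly (these are the standard definitions, not formulas appearing in the original post): with n attributes, the Euclidean distance between tuples x and y, the min-max normalization of a value v of attribute A, and the majority vote over the k nearest neighbors N_k(x) are

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad v' = \frac{v - \min_A}{\max_A - \min_A}, \qquad \hat{y}(x) = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}[y_i = c]$$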

Given these characteristics, KNN is best suited to numeric attributes. Ordinal attributes can be converted to numeric values, and nominal attributes can also be handled reasonably well after encoding and normalization, but binary attributes may not work as well; a sketch of these conversions follows.
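As an illustration of the conversions just mentioned (my own sketch; the attribute names are made up), an ordinal attribute can be mapped onto evenly spaced values in [0, 1], while a nominal attribute can be one-hot encoded into separate 0/1 features:

# Hypothetical attributes, for illustration only.
# Ordinal: order matters, so map levels to evenly spaced numeric values.
size_levels = {'small': 0.0, 'medium': 0.5, 'large': 1.0}

# Nominal: no order, so expand into one 0/1 feature per category.
def one_hot(value, categories):
    return [1.0 if value == c else 0.0 for c in categories]

print(size_levels['medium'])                       # 0.5
print(one_hot('green', ['red', 'green', 'blue']))  # [0.0, 1.0, 0.0]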

Main advantages and disadvantages:

Advantages: high accuracy, insensitivity to outliers, and no assumptions about the input data.

Disadvantages: high time and space complexity, and the value of k must be chosen, which may require considerable experience.

Below, the KNN algorithm from the book Machine Learning in Action is implemented and used to classify a spam email dataset. The data contains 3065 training samples and 1536 test samples. Each sample has 57 features, consisting of numeric and binary attributes, plus a class label in {0, 1}, where 0 means not spam and 1 means spam.

The first step is to read the data from the file:

import os
from numpy import zeros, tile

# Dataset dimensions, taken from the description above.
trainset_num = 3065   # number of training samples
testset_num = 1536    # number of test samples
features_num = 57     # number of features per sample

def loadDataSet(fp):
    """Read the comma-separated data file into training/test matrices and label lists."""
    if not os.path.exists(fp):
        print("The data file does not exist!")
        return None
    rtnTrainSet = zeros((trainset_num, features_num))
    trainLabel = []
    rtnTestSet = zeros((testset_num, features_num))
    testLabel = []
    with open(fp, 'r') as fh:
        for i, line in enumerate(fh):
            terms = line.strip().split(',')
            features = [float(t) for t in terms[0:features_num]]
            label = int(terms[features_num])
            if i < trainset_num:           # first 3065 rows form the training set
                rtnTrainSet[i, :] = features
                trainLabel.append(label)
            else:                          # remaining 1536 rows form the test set
                rtnTestSet[i - trainset_num, :] = features
                testLabel.append(label)
    return rtnTrainSet, trainLabel, rtnTestSet, testLabel
Each row of the file represents one sample, with the features separated by commas and the class label in the last field. NumPy, a third-party library, is used to store the data in matrices.

Next, the data is min-max normalized to the [0, 1] interval:

def normalize(ds):
    """Min-max normalize every column of ds to the [0, 1] interval."""
    minVals = ds.min(0)          # per-column minimum
    maxVals = ds.max(0)          # per-column maximum
    ranges = maxVals - minVals
    ranges[ranges == 0] = 1      # guard: a constant column would divide by zero
    n = ds.shape[0]
    normDS = ds - tile(minVals, (n, 1))    # plain NumPy broadcasting would also work
    normDS = normDS / tile(ranges, (n, 1))
    return normDS
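A quick sanity check of normalize on a toy array (my own example, assuming NumPy as imported above):

from numpy import array

demo = array([[1.0, 10.0],
              [3.0, 20.0],
              [2.0, 15.0]])
print(normalize(demo))
# each column is scaled to [0, 1], giving rows [0, 0], [1, 1], [0.5, 0.5]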

Finally, the classifier itself: for each test sample, compute its distance to every training sample, take the k nearest neighbors, and return the most frequent class among them.

import operator

def knnClassify(ds, labels, k, inputX):
    """Classify inputX by majority vote among its k nearest training samples."""
    dsSize = ds.shape[0]
    # Euclidean distance from inputX to every row of the training set.
    diff = tile(inputX, (dsSize, 1)) - ds
    sqDiff = diff ** 2
    sqDist = sqDiff.sum(axis=1)
    dist = sqDist ** 0.5
    sortedDist = dist.argsort()    # indices of training samples, nearest first
    classCount = {}
    for i in range(k):
        votedLabel = labels[sortedDist[i]]
        classCount[votedLabel] = classCount.get(votedLabel, 0) + 1
    # Sort the vote counts in descending order and return the winning label.
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
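
Putting the pieces together, here is a minimal evaluation sketch (my own glue code, not from the original post; the file name spambase.data is an assumption):

def evaluate(k, dataFile='spambase.data'):
    # Load, normalize, and classify every test sample with the functions above.
    trainSet, trainLabel, testSet, testLabel = loadDataSet(dataFile)
    trainSet = normalize(trainSet)
    testSet = normalize(testSet)
    correct = 0
    for i in range(testSet.shape[0]):
        if knnClassify(trainSet, trainLabel, k, testSet[i, :]) == testLabel[i]:
            correct += 1
    return correct / testSet.shape[0]

print('Test accuracy: %.4f' % evaluate(10))

One caveat: normalizing the test set with its own minima and maxima, as done here for simplicity, lets test statistics leak into preprocessing; reusing the ranges computed on the training set is the stricter approach.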
   
The 1536 test samples were classified with a range of k values, the accuracy was recorded for each k, and the following graph was plotted:
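The original post does not show the sweep code itself; a sketch, assuming evaluate() from above and matplotlib, and with the range of k values a guess:

import matplotlib.pyplot as plt

ks = list(range(1, 31))                  # assumed sweep range
accuracies = [evaluate(k) for k in ks]   # evaluate() defined above

plt.plot(ks, accuracies, marker='o')
plt.xlabel('k')
plt.ylabel('test accuracy')
plt.title('KNN accuracy on the spam data for different k')
plt.show()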

(Figure: classification accuracy on the 1536 test samples for different values of k.)

As the graph shows, the accuracy peaks at about 75% when k = 10. This result is not very good; the main reason is that some of the 57 spam features are binary attributes, which distort the distance calculation when KNN is used to classify the samples.

Finally, classifying the training set itself with k = 10 gives the following result:

Train accuracy: 0.9282

