"Play machine learning with Python" KNN * code * One

Source: Internet
Author: User

KNN is short for "K Nearest Neighbors", a nearest-neighbor classifier. The basic idea is this: for an unknown sample, compute the distance between it and every sample in the training set, pick the K nearest training samples, and let the class labels of those K samples vote. The majority class is the classification result for the unknown sample. Choosing which metric to use to measure the distance between samples is key.
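The voting procedure described above can be sketched in a few lines. This is a minimal illustration, not the code this article builds step by step: the function name `knnClassify`, the toy training data, and the choice of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knnClassify(sample, trainFeatures, trainLabels, k):
    """Classify one sample by majority vote among its k nearest neighbors.

    Hypothetical helper for illustration; Euclidean distance is assumed.
    """
    # Distance from the unknown sample to every training sample
    distances = np.sqrt(((trainFeatures - sample) ** 2).sum(axis=1))
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(trainLabels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): two small clusters labeled 'A' and 'B'
trainFeatures = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
trainLabels = ['A', 'A', 'B', 'B']
predicted = knnClassify(np.array([0.2, 0.1]), trainFeatures, trainLabels, k=3)
print(predicted)
```

With k=3, the three nearest neighbors of the query point are two 'A' samples and one 'B' sample, so the vote goes to 'A'.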


First, read the features and classification results of the samples from a text file.

    '''
    KNN: K Nearest Neighbors
    '''
    import numpy as np


    def loadFeatureMatrixAndLabels(fileInName):
        '''
        function: load the feature matrix and the target labels from txt file (datingTestSet.txt)
        input: the name of file to read
        return:
        1. the feature matrix
        2. the target label
        '''
        # load all the samples into memory
        with open(fileInName, 'r') as fileIn:
            lines = fileIn.readlines()

        # load the feature matrix and label vector
        featureMatrix = np.zeros((len(lines), 3), dtype=np.float64)
        labelList = list()
        index = 0
        for line in lines:
            items = line.strip().split('\t')
            # the first three numbers are the input features
            featureMatrix[index, :] = [float(item) for item in items[0:3]]
            # the last column is the label
            labelList.append(items[-1])
            index += 1
        return featureMatrix, labelList

Each sample is stored in the text file in the format: 3 feature values plus a classification result, separated by tabs. The code first loads the whole file into memory and then creates a floating-point matrix of size "number of samples * number of features", initialized to 0.0. After that, each line of data (one sample) is parsed and the matrix is filled with the parsed values. This line uses a Python list comprehension:

    featureMatrix[index, :] = [float(item) for item in items[0:3]]

A for loop is completed in a single statement, and it runs at least as efficiently as the ordinary way of writing the for loop. This is where you begin to appreciate what is good about Python.
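To make the comparison concrete, here is a small sketch that parses one sample line both ways. The sample line is made up for illustration; it only follows the file format described above (three feature values and a label, separated by tabs), and the variable names are hypothetical.

```python
# A made-up sample line: three feature values and a label, tab-separated
line = "40920\t8.326976\t0.953952\tlargeDoses\n"
items = line.strip().split('\t')

# Ordinary for loop: build the feature list element by element
featuresLoop = []
for item in items[0:3]:
    featuresLoop.append(float(item))

# List comprehension: the same result in a single statement
featuresComprehension = [float(item) for item in items[0:3]]

# The last column is the label and stays a string
label = items[-1]
print(featuresComprehension, label)
```

Both forms produce the same list of floats; the comprehension simply states the transformation in one expression.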


Second, feature value normalization

Feature value normalization is an essential step for most machine learning algorithms. Normalization is usually done by taking the maximum and minimum values of each feature dimension and then comparing the current feature value against them, which maps it to a number in [0,1]. If the feature values are noisy, the noise should be removed beforehand.

    import numpy as np


    def autoNormalizeFeatureMatrix(featureMatrix):
        '''
        function: auto-normalize the feature matrix
            the formula is: newValue = (oldValue - min) / (max - min)
        input: the feature matrix
        return: the normalized feature matrix
        '''
        # create the normalized feature matrix
        normFeatureMatrix = np.zeros(featureMatrix.shape)

        # normalize the matrix column by column
        lineNum = featureMatrix.shape[0]
        columnNum = featureMatrix.shape[1]
        for i in range(0, columnNum):
            minValue = featureMatrix[:, i].min()
            maxValue = featureMatrix[:, i].max()
            for j in range(0, lineNum):
                normFeatureMatrix[j, i] = (featureMatrix[j, i] - minValue) / (maxValue - minValue)
        return normFeatureMatrix

The basic data structure of numpy is the multidimensional array, and a matrix is a special case of a multidimensional array. Every numpy multidimensional array has a shape attribute. shape is a tuple giving the size of each dimension of the array; for a matrix, shape[0] is the number of rows and shape[1] is the number of columns. In numpy, a row of a matrix is accessed with "featureMatrix[i,:]" and a column with "featureMatrix[:,i]". This part of the code is an ordinary double loop, much the same as in C. The original code in the book did this with matrix operations, but when I wrote this I was not yet familiar with numpy and did not manage to debug the code in the book, so I wrote it directly in the C style.
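For comparison, the same normalization can be written with whole-array operations instead of a double loop. This is a sketch of one possible vectorized form, not the book's code: the function name `normalizeVectorized` and the sample matrix are assumptions for illustration. It relies on `min(axis=0)` and `max(axis=0)` returning one value per column, which numpy then applies to every row.

```python
import numpy as np


def normalizeVectorized(featureMatrix):
    """Min-max normalize each column to [0, 1] without explicit loops.

    Hypothetical helper; assumes no column is constant (max > min).
    """
    minValues = featureMatrix.min(axis=0)   # per-column minimum
    maxValues = featureMatrix.max(axis=0)   # per-column maximum
    return (featureMatrix - minValues) / (maxValues - minValues)


# Made-up 3 x 2 feature matrix
featureMatrix = np.array([[10.0, 1.0],
                          [20.0, 3.0],
                          [40.0, 2.0]])
print(featureMatrix.shape)      # a tuple: (rows, columns)
print(featureMatrix[0, :])      # first row
print(featureMatrix[:, 0])      # first column
normFeatureMatrix = normalizeVectorized(featureMatrix)
print(normFeatureMatrix)
```

Like the loop version, this sketch divides by (max - min), so a column whose values are all equal would need separate handling.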


Third, calculating the distance between samples

Distance can be measured with the Euclidean distance. Here we compute the distance between a given sample (feature vector) and every training sample.

    import numpy as np


    def calcEucDistance(featureVectorIn, featureMatrix):
        '''
        function: calculate the Euclidean distance between the feature vector of the input sample
            and the feature matrix of the samples in the training set
        input:
        1. the input feature vector
        2. the feature matrix
        return: the distance array
        '''
        # extend the input feature vector into a feature matrix
        lineNum = featureMatrix.shape[0]
        featureMatrixIn = np.tile(featureVectorIn, (lineNum, 1))

        # calculate the Euclidean distance between the two matrices
        diffMatrix = featureMatrixIn - featureMatrix
        sqDiffMatrix = diffMatrix ** 2
        distanceValueArray = sqDiffMatrix.sum(axis=1)
        distanceValueArray = distanceValueArray ** 0.5
        return distanceValueArray

This uses some of numpy's more distinctive features. The approach is to first extend the input feature vector into a feature matrix (this is what the tile function does: the first parameter is what to extend, and the second says how many times to repeat it along each dimension; here it is repeated lineNum times vertically and not repeated horizontally). The computation is then carried out between the extended matrix and the training-sample matrix. A problem that could be solved by computing between vectors is instead expanded into a matrix computation, and as for the efficiency of that... You can see that Python's inefficiency comes partly from the implementation and execution efficiency of the language itself, but even more from the mindset in which Python programs are written: programmers want to be lazy, so what is the CPU for?
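As a side note, numpy can subtract a vector from every row of a matrix directly through broadcasting, so the explicit `np.tile` copy is not strictly required. The sketch below compares the two forms on made-up data; both function names are hypothetical and exist only for this comparison.

```python
import numpy as np


def distanceWithTile(featureVectorIn, featureMatrix):
    """Euclidean distances, expanding the input vector with np.tile."""
    lineNum = featureMatrix.shape[0]
    featureMatrixIn = np.tile(featureVectorIn, (lineNum, 1))
    return (((featureMatrixIn - featureMatrix) ** 2).sum(axis=1)) ** 0.5


def distanceWithBroadcast(featureVectorIn, featureMatrix):
    """Same distances; broadcasting avoids building the tiled copy."""
    return (((featureMatrix - featureVectorIn) ** 2).sum(axis=1)) ** 0.5


# Made-up training matrix (3 samples, 3 features) and input vector
featureMatrix = np.array([[0.0, 0.0, 0.0],
                          [1.0, 2.0, 2.0],
                          [3.0, 4.0, 0.0]])
featureVectorIn = np.array([0.0, 0.0, 0.0])
tileResult = distanceWithTile(featureVectorIn, featureMatrix)
broadcastResult = distanceWithBroadcast(featureVectorIn, featureMatrix)
print(tileResult)
print(broadcastResult)
```

The two functions return the same distance array; the broadcast form just skips the intermediate matrix that `np.tile` allocates.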


Not finished, to be continued.


If you repost this article, please cite the source: http://blog.csdn.net/xceman1997/article/details/44994001

