[Playing with Machine Learning in Python] KNN * Code * Part 1

Last updated: 2015-04-11 Source: Internet


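KNN is short for "k Nearest Neighbors", a nearest-neighbor classifier. The basic idea: for an unknown sample, compute the distance between it and every sample in the training set, pick the k samples with the smallest distances, and let those k samples vote with their class labels. The class that receives the most votes is the classification result for the unknown sample. The key decision is which metric to use for measuring the distance between samples.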
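The idea above can be sketched in a few lines. This is only a minimal illustration, not the code from this series: the helper name `knn_classify`, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_classify(query, train_features, train_labels, k=3):
    """Hypothetical helper: classify `query` by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training sample
    distances = np.sqrt(((train_features - query) ** 2).sum(axis=1))
    # indices of the k samples with the smallest distances
    nearest = np.argsort(distances)[:k]
    # count the labels of those neighbors and return the most common one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# toy data, made up purely for illustration
train_features = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_labels = ["A", "A", "B", "B"]

predicted = knn_classify(np.array([0.2, 0.1]), train_features, train_labels, k=3)
print(predicted)
```

With k=3 the three nearest neighbors here carry the labels A, A and B, so the vote goes to A.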

1. Reading sample features and class labels from a text file

'''kNN: k Nearest Neighbors'''
import numpy as np

'''
function: load the feature matrix and the target labels from txt file (datingTestSet.txt)
input: the name of file to read
return:
1. the feature matrix
2. the target label
'''
def LoadFeatureMatrixAndLabels(fileInName):
    # load all the samples into memory
    fileIn = open(fileInName,'r')
    lines = fileIn.readlines()
    # load the feature matrix and label vector
    featureMatrix = np.zeros((len(lines),3),dtype=np.float64)
    labelList = list()
    index = 0
    for line in lines:
        items = line.strip().split('\t')
        # the first three numbers are the input features
        featureMatrix[index,:] = [float(item) for item in items[0:3]]
        # the last column is the label
        labelList.append(items[-1])
        index += 1
    fileIn.close()
    return featureMatrix, labelList

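Each sample is stored in the text file as three feature values followed by one class label, separated by tabs. The code first loads the whole file into memory, then creates a floating-point matrix of size "number of samples * number of features", initialized to 0.0. It then parses each line (one sample per line) and fills the matrix with the parsed values. This line uses a Python list comprehension: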

featureMatrix[index,:] = [float(item) for item in items[0:3]]

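A whole for loop is written in a single statement, and it usually runs no slower than, and often faster than, the same loop written out in the ordinary way. I am starting to appreciate what is nice about Python.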
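To make the equivalence concrete, here is a small sketch that parses one line both ways. The sample line and its values are made up for illustration; only the format (three features and a label, tab-separated) comes from the description above.

```python
# one sample line in the assumed format: three features and a label, tab-separated
line = "1000\t2.5\t0.5\tlabelA\n"
items = line.strip().split("\t")

# the explicit for loop
features_loop = []
for item in items[0:3]:
    features_loop.append(float(item))

# the same work as a list comprehension
features_comp = [float(item) for item in items[0:3]]

# the last column is the label
label = items[-1]
print(features_comp, label)
```

Both forms build the same list; the comprehension simply folds the loop and the `append` into one expression.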

2. Feature normalization

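Feature normalization is an essential step for most machine learning algorithms, and for distance-based ones such as KNN in particular, because a feature with a large numeric range would otherwise dominate the distance. The usual approach is to take the maximum and minimum of each feature dimension and rescale every feature value against them into a number in [0,1]. If the feature values are noisy, the noise has to be removed beforehand.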

'''
function: auto-normalizing the feature matrix
    the formula is: newValue = (oldValue - min)/(max - min)
input: the feature matrix
return: the normalized feature matrix
'''
def AutoNormalizeFeatureMatrix(featureMatrix):
    # create the normalized feature matrix
    normFeatureMatrix = np.zeros(featureMatrix.shape)
    # normalizing the matrix
    lineNum = featureMatrix.shape[0]
    columnNum = featureMatrix.shape[1]
    for i in range(0,columnNum):
        minValue = featureMatrix[:,i].min()
        maxValue = featureMatrix[:,i].max()
        for j in range(0,lineNum):
            normFeatureMatrix[j,i] = (featureMatrix[j,i] - minValue) / (maxValue-minValue)
    return normFeatureMatrix

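The basic data structure in numpy is the multidimensional array, and a matrix is just a special case of it. Every numpy array has a shape attribute. shape is a tuple giving the size of each dimension: for a matrix, shape[0] is the number of rows and shape[1] is the number of columns. A row of a numpy matrix is accessed with "featureMatrix[i,:]" and a column with "featureMatrix[:,i]". This code is a plain, by-the-book double loop, rather like C. The original code in the book does this computation with matrix operations, but when I wrote this I was not yet familiar with numpy and could not get the book's code to run, so I simply wrote it the C way.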
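For comparison, the same min-max formula can be written without explicit loops. The sketch below is not the book's code; the function name `normalize_min_max` is hypothetical, and the handling of constant columns is an assumption that differs from the double-loop version, which divides by zero in that case.

```python
import numpy as np


def normalize_min_max(feature_matrix):
    """Hypothetical vectorized variant of AutoNormalizeFeatureMatrix."""
    min_values = feature_matrix.min(axis=0)  # per-column minimum
    max_values = feature_matrix.max(axis=0)  # per-column maximum
    ranges = max_values - min_values
    # assumption: a constant column (max == min) is mapped to 0.0
    # instead of dividing by zero
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    # the per-column vectors are broadcast across every row
    return (feature_matrix - min_values) / safe_ranges


# toy matrix, made up for illustration; the third column is constant
demo = np.array([[1.0, 10.0, 5.0],
                 [2.0, 30.0, 5.0],
                 [3.0, 20.0, 5.0]])
normalized = normalize_min_max(demo)
print(normalized)
```

Taking `min(axis=0)` and `max(axis=0)` computes the column statistics once, and the subtraction and division then apply to the whole matrix at once rather than element by element.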

3. Computing the distance between samples

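Distance can be measured in many ways. This code computes the Euclidean distance between a given sample (its feature vector) and every sample in the training set.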

'''
function: calculate the euclidean distance between the feature vector of input sample and
the feature matrix of the samples in training set
input:
1. the input feature vector
2. the feature matrix
return: the distance array
'''
def CalcEucDistance(featureVectorIn, featureMatrix):
    # extend the input feature vector as a feature matrix
    lineNum = featureMatrix.shape[0]
    featureMatrixIn = np.tile(featureVectorIn,(lineNum,1))
    # calculate the Euclidean distance between two matrix
    diffMatrix = featureMatrixIn - featureMatrix
    sqDiffMatrix = diffMatrix ** 2
    distanceValueArray = sqDiffMatrix.sum(axis=1)
    distanceValueArray = distanceValueArray ** 0.5
    return distanceValueArray

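This uses some of the more distinctive features of numpy. The approach is to first expand the input feature vector into a feature matrix. That is the job of the tile function: its first argument is the thing to repeat, and its second argument says how many times to repeat it along each dimension, here lineNum times vertically and once (no repetition) horizontally. After that, everything is computed between the expanded matrix and the training-sample matrix. A problem that could have been solved with vector-to-vector computation is instead blown up into a matrix, and the efficiency suffers: np.tile materializes a full lineNum-row copy of the vector, which numpy broadcasting would avoid. So Python's slowness comes partly from the implementation and execution speed of the language itself, but even more from the mindset of writing Python programs: when the programmer wants to be lazy, what can the CPU do about it?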
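As a sketch of the alternative, numpy can subtract a vector from every row of a matrix directly through broadcasting, with no tiled copy. The function name `calc_euc_distance_broadcast` and the toy data are assumptions for this example, and the tile-based computation is repeated only to check that both give the same distances.

```python
import numpy as np


def calc_euc_distance_broadcast(feature_vector_in, feature_matrix):
    """Hypothetical variant of CalcEucDistance that uses broadcasting instead of np.tile."""
    # (rows, cols) - (cols,) broadcasts the vector across every row
    diff_matrix = feature_matrix - feature_vector_in
    return np.sqrt((diff_matrix ** 2).sum(axis=1))


# toy data, made up for illustration
train = np.array([[0.0, 0.0, 0.0],
                  [3.0, 4.0, 0.0],
                  [1.0, 2.0, 2.0]])
query = np.array([0.0, 0.0, 0.0])

# the tile-based computation from the article, for comparison
tiled = np.tile(query, (train.shape[0], 1))
distances_tiled = (((tiled - train) ** 2).sum(axis=1)) ** 0.5

distances = calc_euc_distance_broadcast(query, train)
print(distances)
```

The two results agree; the broadcasting version just skips building the intermediate matrix.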

To be continued.

If you repost this article, please credit the source: http://blog.csdn.net/xceman1997/article/details/44994001
