I am working closely through the textbook Machine Learning in Action and its code, to help myself improve.
# coding: utf-8
import operator  # provides itemgetter, used below to sort the votes
from os import listdir  # os.listdir() returns the names of the files and folders in the given folder, in arbitrary order; it never includes '.' or '..'

from numpy import array, shape, tile, zeros


# Create the data set and its labels
def createDataSet():
    # A Python list is a built-in type whose elements may have different
    # types, whereas every element of a NumPy array must have the same type.
    # A list stores references (pointers) to its elements rather than the
    # values themselves: list1 = [1, 2, 3, 'a'] needs four pointers plus four
    # objects, which costs extra memory and CPU time.
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])  # data set
    labels = ['A', 'B', 'C', 'D']  # labels
    return group, labels


# Implement the kNN algorithm
# Euclidean distance formula: the Euclidean metric (also called Euclidean
# distance) is a commonly used distance definition. It is the true distance
# between two points in m-dimensional space, or the natural length of a vector
# (the distance from the point to the origin). In two- and three-dimensional
# space it is the actual distance between the two points.
def classify0(inX, dataSet, labels, k):
    # inX: input vector to classify; dataSet: training sample set;
    # labels: label vector; k: number of nearest neighbors to use
    # shape holds the length of each dimension of a matrix; shape[0] is the
    # length of the first dimension, i.e. the number of rows
    dataSetSize = dataSet.shape[0]
    # tile(A, reps) repeats array A; here inX is repeated dataSetSize times
    # down the rows so that it can be subtracted from every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    # axis=1 sums the elements of each row; axis=0 would sum each column, and
    # the default (axis=None) sums every element
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # list.sort() is only defined for lists, while sorted() accepts any
    # iterable. NumPy's argsort() returns the indices that would sort the
    # array in ascending order.
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # key: a function applied to each element to obtain its sort key
    # (default None); reverse: True sorts in descending order (default False);
    # sorted() returns a new sorted list.
    # operator.itemgetter(1) fetches the item at position 1 of each
    # (label, count) pair, i.e. the vote count.
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


# Step 01: collect data. We use a ready-made data file directly, so nothing is
# collected in this step; a Python crawler could be used for large-scale data
# collection.

# Step 02: prepare data: parse a text file to get the values needed for the
# distance calculations
def file2matrix(filename):
    # .readlines() reads the entire file at once, like .read(), and splits it
    # into a list of lines that a for ... in ... loop can process. .readline()
    # reads only one line per call and is usually much slower; use it only
    # when there is not enough memory to read the whole file.
    with open(filename) as fr:  # open the file; the with block closes it again
        lines = fr.readlines()
    numberOfLines = len(lines)  # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))  # matrix initialized to 0; the second dimension is fixed at 3
    classLabelVector = []
    index = 0
    for line in lines:
        line = line.strip()  # strip the newline characters
        listFromLine = line.split('\t')  # split the row into a list on the tab character
        returnMat[index, :] = [float(x) for x in listFromLine[0:3]]  # first three elements go into the feature matrix
        classLabelVector.append(int(listFromLine[-1]))  # last column goes into classLabelVector
        index += 1
    return returnMat, classLabelVector


# Step 02: prepare data: normalize the values
# When features have very different value ranges, we usually normalize them:
# newValue = (oldValue - min) / (max - min)
# This converts feature values of any range to values in the interval 0 to 1.
def autoNorm(dataSet):
    minVals = dataSet.min(0)  # minimum of each column, not of each row
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals  # value range of each column
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals


# Step 03: analyze data: create scatter plots with matplotlib

# Step 04: test the algorithm: validate the classifier as a complete program
def datingClassTest():
    hoRatio = 0.50  # fraction of the data held out for testing (here 50%)
    datingDataMat, datingLabels = file2matrix('./datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)


# Step 05: use the algorithm: build a complete, usable system
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("Percentage of time spent playing video games? "))
    ffMiles = float(input("Frequent flier miles earned per year? "))
    iceCream = float(input("Liters of ice cream consumed per year? "))
    datingDataMat, datingLabels = file2matrix('./datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])
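To make the idea behind classify0 easier to follow, here is a minimal sketch of the same two steps (Euclidean distance, then a majority vote among the k nearest samples) written with only the standard library. The names euclidean_distance and knn_classify and the toy samples are my own assumptions for illustration; they are not from the book or its dating data set.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, samples, labels, k):
    """Return the most common label among the k samples nearest to query."""
    # Pair each distance with its label, then sort by distance (ascending).
    by_distance = sorted(
        (euclidean_distance(query, sample), label)
        for sample, label in zip(samples, labels)
    )
    # Count the labels of the k closest samples and take the winner.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, chosen only for illustration.
samples = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ["A", "A", "B", "B"]

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))    # 5.0
print(knn_classify([0.1, 0.2], samples, labels, 3))  # B
```

The query point (0.1, 0.2) lies closest to the two "B" samples, so with k = 3 the vote is two "B" against one "A".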
Key points:
01: The k-nearest neighbor algorithm relies on the Euclidean distance formula, which gives the true distance between two points in m-dimensional space, or the natural length of a vector (its distance from the origin).
02: Normalized values:
newValue = (oldValue - min) / (max - min) converts feature values of any range to values in the interval from 0 to 1.
This idea is very important.
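A small worked example may help show what the normalization formula does to each column. This is only a sketch in plain Python, not the book's NumPy version; the function name min_max_normalize and the sample rows are assumptions made for illustration. It also maps a constant column to 0.0 to avoid dividing by zero, which is my own addition.

```python
def min_max_normalize(rows):
    """Rescale every column of rows to the interval [0, 1]."""
    columns = list(zip(*rows))  # transpose: one tuple per feature column
    mins = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    normalized = [
        # newValue = (oldValue - min) / (max - min), applied per column.
        # A constant column (range 0) is mapped to 0.0 (an assumption).
        [(value - lo) / span if span else 0.0
         for value, lo, span in zip(row, mins, ranges)]
        for row in rows
    ]
    return normalized, ranges, mins


# Hypothetical rows with two features on very different scales.
rows = [[0.0, 10.0], [50.0, 20.0], [100.0, 40.0]]
normalized, ranges, mins = min_max_normalize(rows)

print(ranges)         # [100.0, 30.0]
print(normalized[0])  # [0.0, 0.0]
print(normalized[2])  # [1.0, 1.0]
```

After rescaling, both features lie between 0 and 1, so neither one dominates the distance calculation just because its raw numbers are larger.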
Takeaways: In my opinion, every stage of machine learning matters, from data collection through to the final program. The algorithm is the core, and normalization is how we keep features with large value ranges from interfering with the result.
This article is from the "Shangwei Super" blog; reprinting is not permitted.
-------- Machine Learning in Action: the k-nearest neighbor algorithm