Machine Learning in Action: the k-nearest neighbor algorithm

Source: Internet
Author: User

I am reading the Machine Learning in Action textbook and its code closely, to help myself make progress.

    # coding: utf-8
    import operator            # operator module, used below for itemgetter
    from os import listdir     # os.listdir() returns a list of the names of the
                               # files and folders in the given folder. The list is
                               # in arbitrary order and does not include '.' and
                               # '..', even if they are present in the folder.
    from numpy import array, shape, tile, zeros


    # Create the data set and labels
    def createDataSet():
        group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])   # data set
        # A list is a built-in Python data type whose items do not have to share
        # a type, whereas all elements of an array must have the same type. A list
        # stores the addresses of its items (pointers), not the data itself, so
        # list1 = [1, 2, 3, 'a'] needs 4 pointers and 4 data objects, which costs
        # extra storage and CPU time.
        labels = ['A', 'B', 'C', 'D']   # labels
        return group, labels


    # Implement the kNN algorithm
    # Euclidean distance (the Euclidean metric) is a commonly used definition of
    # distance: the true distance between two points in m-dimensional space, or
    # the natural length of a vector (the distance from the point to the origin).
    # In two and three dimensions it is the actual distance between two points.
    def classify0(inX, dataSet, labels, k):
        # inX: input vector to classify; dataSet: training sample set;
        # labels: label vector; k: number of nearest neighbors to use
        # shape gives the size of each dimension of a matrix; shape[0] is the
        # length of the first dimension (the number of rows). The function form
        # shape(a) takes the matrix itself as its argument.
        dataSetSize = dataSet.shape[0]
        # tile(A, n) repeats array A n times to form a new array
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet
        sqDiffMat = diffMat ** 2
        # With axis=1, sum() adds up the elements of each row; without an axis
        # argument it adds up every element of the array.
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5
        # list.sort() is defined only for lists, while sorted() works on any
        # iterable. argsort() is a NumPy function that returns the indices that
        # order the array values from smallest to largest.
        sortedDistIndicies = distances.argsort()
        classCount = {}
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        # key: a function applied to each element to produce its sort key
        # (optional). reverse: sort order, True or False (optional, default
        # False). sorted() returns a new sorted list.
        # operator.itemgetter(1) fetches the item at index 1 of each pair,
        # that is, the vote count.
        sortedClassCount = sorted(classCount.items(),
                                  key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]


    # Step 01: collect data. We use an existing data file directly, so no data is
    # collected in this step; a Python crawler could be used to collect data at
    # scale.

    # Step 02: prepare data: parse a text file into the values needed for the
    # distance calculation
    def file2matrix(filename):
        fr = open(filename)                     # open the file and assign it to fr
        numberOfLines = len(fr.readlines())     # get the number of lines in the file
        returnMat = zeros((numberOfLines, 3))   # matrix initialized to 0; the
                                                # second dimension is fixed at 3
        classLabelVector = []
        fr.close()                              # whatever is opened must be closed
        fr = open(filename)
        index = 0
        # .readlines() reads the whole file at once, like .read(), and splits it
        # into a list of lines that a for ... in ... loop can process.
        # .readline() reads only one line at a time and is usually much slower;
        # use it only when there is not enough memory to read the whole file.
        for line in fr.readlines():
            line = line.strip()                 # strip the newline and surrounding whitespace
            listFromLine = line.split('\t')     # split the line on the tab character into a list
            returnMat[index, :] = listFromLine[0:3]          # first three elements go into the feature matrix
            classLabelVector.append(int(listFromLine[-1]))   # last column goes into classLabelVector
            index += 1
        fr.close()
        return returnMat, classLabelVector


    # Step 02: prepare data: normalize values
    # When features have different value ranges, we usually normalize them:
    #   newValue = (oldValue - min) / (max - min)
    # which converts feature values of any range into values between 0 and 1.
    def autoNorm(dataSet):
        minVals = dataSet.min(0)     # minimum of each column, not of each row
        maxVals = dataSet.max(0)
        ranges = maxVals - minVals   # value range of each feature
        normDataSet = zeros(shape(dataSet))
        m = dataSet.shape[0]
        normDataSet = dataSet - tile(minVals, (m, 1))
        normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
        return normDataSet, ranges, minVals


    # Step 03: analyze data: create a scatter plot with matplotlib

    # Step 04: test the algorithm: validate the classifier as a complete program
    def datingClassTest():
        hoRatio = 0.50   # hold out 50% of the data as the test set
        datingDataMat, datingLabels = file2matrix('./datingTestSet2.txt')   # load data set from file
        normMat, ranges, minVals = autoNorm(datingDataMat)
        m = normMat.shape[0]
        numTestVecs = int(m * hoRatio)
        errorCount = 0.0
        for i in range(numTestVecs):
            classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                         datingLabels[numTestVecs:m], 3)
            print("the classifier came back with: %d, the real answer is: %d"
                  % (classifierResult, datingLabels[i]))
            if classifierResult != datingLabels[i]:
                errorCount += 1.0
        print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
        print(errorCount)


    # Step 05: use the algorithm: build a complete, usable system
    def classifyPerson():
        resultList = ['not at all', 'in small doses', 'in large doses']
        percentTats = float(input("percentage of time spent playing video games? "))
        ffMiles = float(input("frequent flier miles earned per year? "))
        iceCream = float(input("liters of ice cream consumed per year? "))
        datingDataMat, datingLabels = file2matrix('./datingTestSet2.txt')
        normMat, ranges, minVals = autoNorm(datingDataMat)
        inArr = array([ffMiles, percentTats, iceCream])
        classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
        print("you will probably like this person:", resultList[classifierResult - 1])
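To make the voting step of classify0 easier to follow, here is a minimal sketch of the same idea using only the Python standard library instead of NumPy. The name knn_classify and the toy labels (two 'A' points and two 'B' points, so that a majority vote is meaningful) are assumptions made for this illustration and are not part of the listing above. It ranks by squared distance, which gives the same ordering as the true Euclidean distance.

```python
from collections import Counter


def knn_classify(query, samples, labels, k):
    """Return the majority label among the k samples closest to query."""
    # Squared Euclidean distance from the query to every training sample.
    # The square root is skipped because it does not change the ordering.
    sq_distances = [
        sum((q - s) ** 2 for q, s in zip(query, sample))
        for sample in samples
    ]
    # Indices of the samples ordered from nearest to farthest, cut to k
    nearest = sorted(range(len(samples)), key=lambda i: sq_distances[i])[:k]
    # Count the labels of the k nearest samples and return the most common one
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters, one per label
samples = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ["A", "A", "B", "B"]

print(knn_classify([0.1, 0.2], samples, labels, k=3))  # B
print(knn_classify([0.9, 0.8], samples, labels, k=3))  # A
```

With k=3, the point near the origin gets two 'B' votes and one 'A' vote, and the point near (1, 1) gets two 'A' votes and one 'B' vote.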

Key points:

01: The k-nearest neighbor algorithm relies on the Euclidean distance formula, which gives the true distance between two points in m-dimensional space, or the natural length of a vector (its distance from the origin).
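As a small worked example of that formula, the sketch below computes the distance as the square root of the sum of squared coordinate differences. The function name euclidean_distance is an assumption for this illustration, and the code uses only the standard library.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Distance between two points in the plane: sqrt(3^2 + 4^2)
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
# The natural length of a vector is its distance from the origin
print(euclidean_distance([0, 0, 0], [1, 2, 2]))  # 3.0
```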

02: Normalizing values:

newValue = (oldValue - min) / (max - min) converts feature values of any range into values in the interval from 0 to 1.

This idea is very important.
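Normalization matters here because a feature with a large range (such as miles flown per year) would otherwise dominate the distance and drown out a feature with a small range. The sketch below applies the formula column by column with plain Python lists. The function name min_max_normalize, the sample values, and the choice to map a constant column to 0.0 are all assumptions made for this illustration.

```python
def min_max_normalize(rows):
    """Scale every column of rows into [0, 1] with (value - min) / (max - min)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    ranges = [max(col) - lo for col, lo in zip(columns, mins)]
    normalized = [
        # A column whose values are all equal has range 0; map it to 0.0
        # (an assumption of this sketch) to avoid dividing by zero.
        [(value - lo) / rng if rng else 0.0
         for value, lo, rng in zip(row, mins, ranges)]
        for row in rows
    ]
    return normalized, ranges, mins


# Hypothetical samples: [miles flown per year, liters of ice cream per year]
rows = [[40000, 0.5], [10000, 1.5], [20000, 1.0]]
normalized, ranges, mins = min_max_normalize(rows)
for row in normalized:
    print(row)
```

The returned ranges and mins are kept so that a new input can be scaled the same way before it is classified, as (value - min) / range.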


Takeaway: In my opinion, every stage of machine learning matters, from data collection through to the final program. The algorithm is the core, and we use normalization to deal with interfering terms.

This article is from the "Shangwei Super" blog. Reprinting is declined.
