"Machine Learning Algorithm Implementation" series of articles will record personal reading machine learning papers, books in the process of the algorithm encountered, each article describes a specific algorithm, algorithm programming implementation, the application of practical examples of the algorithm. Each algorithm is programmed to be implemented in multiple languages. All code shares to Github:https://github.com/wepe/machinelearning-demo Welcome to the Exchange!
(1) KNN algorithm: handwriting recognition example, based on Python and the NumPy library
1. KNN algorithm introduction
The KNN algorithm, i.e. the k-nearest-neighbor classification algorithm, is one of the simplest machine learning algorithms. The idea is simple: from the training sample set, select the k samples that are closest in "distance" to the test sample; the category that occurs most often among those k samples is the category assigned to the test sample. The following introduction is taken from Wikipedia: http://zh.wikipedia.org/wiki/%E6%9C%80%E8%BF%91%E9%84%B0%E5%B1%85%E6%B3%95
Method
- Objective: classify a case of unknown category.
- Input: the item of unknown category to classify, and a set D of cases with known categories, containing j cases.
- Output: the likely category of the item.
Steps
Example (figure omitted):
We consider a binary classification problem solved with the KNN method on two-dimensional samples. The triangles and squares in the figure are sample points of known category; we assume that triangles are the positive class and squares are the negative class. The circle in the figure is a data point of unknown category, and we want to classify it using the samples of known category.
The classification process is as follows:
1. First, choose the value of k in advance (in the k-nearest-neighbor method, k is the number of neighbors we look for around the data point to be classified). To illustrate, we take two values of k here, 3 and 5.
2. Using a predetermined distance metric (e.g. Euclidean distance), compute the distance between the data point to be classified and every sample point of known category, and take the k nearest samples.
3. Count how many of these k samples fall in each category. For example, with k = 3 there are 2 positive samples (triangles) and 1 negative sample (square), so the circle is classified as positive. With k = 5 there are 2 positive samples (triangles) and 3 negative samples (squares), so the circle is classified as negative. In other words, the data point is assigned to whichever category has the most samples among its k nearest neighbors.
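The three steps above can be sketched in a few lines of plain Python. This is only an illustration: the coordinates below are made up (they are not read from the figure) and were chosen so that the vote flips between k = 3 and k = 5, and the helper name `knn_vote` is hypothetical.

```python
import math
from collections import Counter

def knn_vote(query, samples, k):
    """Return the majority label among the k samples nearest to `query`."""
    # Step 2: sort the known samples by Euclidean distance to the query point
    nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
    # Step 3: count the labels of the k nearest samples and take the most common
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2D samples: (point, label)
samples = [
    ((0.5, 0.0), "positive"),   # triangle
    ((0.0, 0.8), "positive"),   # triangle
    ((3.0, 3.0), "positive"),   # triangle, far away
    ((-0.6, 0.0), "negative"),  # square
    ((1.0, 1.0), "negative"),   # square
    ((-1.0, -1.2), "negative"), # square
]
query = (0.0, 0.0)              # the circle

label_k3 = knn_vote(query, samples, 3)
label_k5 = knn_vote(query, samples, 5)
print(label_k3, label_k5)
```

With these made-up points the three nearest neighbors contain two triangles, while the five nearest contain three squares, which mirrors the situation described in step 3.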
Additional notes:
Advantages and disadvantages
(1) Advantages:
The algorithm is simple and easy to implement; it requires no parameter estimation and no training in advance.
(2) Disadvantages:
KNN is a lazy algorithm ("never studies during the term, only crams before the exam"): it does no training beforehand and only starts computing when a sample to classify arrives. As a result, the computation at classification time is heavy, and all training samples must be kept in storage, so the memory overhead is large.
Value of k:
The value of the parameter k is generally no greater than 20. -- "Machine Learning in Action"
2. Handwriting recognition example
The KNN algorithm is mainly applied to text classification and similarity-based recommendation. This article describes a classification example taken from the book "Machine Learning in Action", using the Python language and the numerical computing library NumPy. The Python and NumPy functions used in this example are briefly introduced below.
2.1 Python and NumPy functions
The NumPy library provides two basic data types, matrices and arrays, which are used in much the same way as in MATLAB. This example uses arrays (array).
shape()
shape is a function in the NumPy library (and an attribute of arrays) used to view the dimensions of a matrix or array.
>>> shape(array)  returns (m, n) if the array has m rows and n columns
>>> array.shape[0]  returns the number of rows m; with index 1 it returns the number of columns n
tile()
tile is a function in the NumPy library, used as follows:
>>> tile(A, (m, n))  builds a new array by repeating array A as a block, m times vertically and n times horizontally
sum()
sum() is a method of NumPy arrays.
>>> array.sum(axis=1)  sums along each row; axis=0 sums down each column
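As a small illustration of how shape, tile() and sum() work together, the sketch below computes the Euclidean distance from one point to every row of a tiny array. The data values are made up for the example.

```python
import numpy as np

# Three made-up 2D samples, one per row
data = np.array([[0.0, 0.0],
                 [3.0, 4.0],
                 [6.0, 8.0]])
x = np.array([0.0, 0.0])

n_rows = data.shape[0]                  # number of rows (samples)
tiled = np.tile(x, (n_rows, 1))         # repeat x once per row so shapes match
diff = tiled - data
distances = ((diff ** 2).sum(axis=1)) ** 0.5   # row-wise sum of squares, then square root
print(tiled.shape, distances)
```

Here `tiled` has the same shape as `data`, so the subtraction is element by element, and `sum(axis=1)` collapses each row to a single squared distance.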
argsort()
argsort() is a NumPy method that returns the indices that would sort the array in ascending order.
>>> a = array.argsort()  a[0] is the index, in the original array, of the smallest element
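A short sketch of argsort() on made-up distances: the result holds positions in the original array, which can then be used to look up the labels of the nearest samples. The label list is hypothetical.

```python
import numpy as np

distances = np.array([2.5, 0.3, 1.7, 0.9])   # made-up distances
labels = ["a", "b", "c", "d"]                # hypothetical labels, one per distance

order = distances.argsort()                  # indices from smallest to largest distance
nearest_two = [labels[i] for i in order[:2]] # labels of the two closest samples
print(order, nearest_two)
```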
dict.get(key, x)
A method of Python dictionaries: get(key, x) returns the value corresponding to key in the dictionary, or x if key is not in the dictionary.
sorted()
A built-in Python function that returns a new sorted list from an iterable; it accepts key and reverse arguments.
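The following sketch shows how dict.get() and sorted() can be combined to count votes and pick the most frequent label. The list of neighbor labels is made up for illustration.

```python
import operator

neighbor_labels = [3, 7, 3]      # made-up labels of the k nearest neighbors

class_count = {}
for label in neighbor_labels:
    # get(label, 0) returns the current count, or 0 the first time a label is seen
    class_count[label] = class_count.get(label, 0) + 1

# Sort (label, count) pairs by count, largest first
ranked = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
winner = ranked[0][0]
print(ranked, winner)
```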
min(), max()
NumPy arrays have min() and max() methods, used as follows:
>>> array.min(0)  returns an array in which each number is the minimum of the corresponding column
>>> array.min(1)  returns an array in which each number is the minimum of the corresponding row
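A tiny sketch of the axis argument for min() and max(), using a made-up 2*3 array:

```python
import numpy as np

a = np.array([[1, 8, 3],
              [4, 2, 9]])

col_min = a.min(0)   # one minimum per column
row_min = a.min(1)   # one minimum per row
col_max = a.max(0)   # one maximum per column
print(col_min, row_min, col_max)
```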
listdir('str')
A function in Python's os module.
>>> strlist = listdir('str')  reads all file names in the directory str and returns them as a list of strings
split()
A Python string method that splits a string.
>>> string.split('str')  splits the string using str as the delimiter and returns a list
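As a sketch of how listdir() and split() can be combined, the code below creates a temporary directory with two empty files named in the style of 3_3.txt and extracts the digit label from each file name. The temporary directory and the file names are assumptions made only so the example is self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two empty placeholder files named <digit>_<index>.txt
    for name in ("0_7.txt", "3_3.txt"):
        open(os.path.join(tmp_dir, name), "w").close()
    # listdir returns names in arbitrary order, so sort for a stable result
    file_names = sorted(os.listdir(tmp_dir))

labels = []
for file_name in file_names:
    stem = file_name.split(".")[0]           # "3_3.txt" -> "3_3"
    labels.append(int(stem.split("_")[0]))   # "3_3" -> 3
print(file_names, labels)
```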
For more NumPy functions, see the official documentation: http://docs.scipy.org/doc/
2.2 Programming "handwriting recognition"
The concept of handwriting recognition: converting the trajectory information generated when writing on a handwriting device into specific characters. A full handwriting recognition system is a large project that recognizes Chinese characters, English letters, digits, and other characters. This article is only a small demo whose focus is understanding KNN rather than handwriting recognition itself, so it only recognizes the single digits 0~9.
Input format: each handwritten digit has been preprocessed into a 32*32 binary image stored as text in a TXT file. Each digit 0~9 has 10 training samples and 5 test samples. The training sample set consists of files such as 3_3.txt.
Opening the file 3_3.txt shows a 32*32 grid of 0s and 1s.
With the background covered, we now turn to the implementation, which is roughly divided into three steps:
(1) Convert each picture (that is, its txt file; "picture" below always refers to the txt file) into a vector: the 32*32 array becomes a 1*1024 array. In machine learning terminology, this 1*1024 array is the feature vector.
(2) The training set has 10*10 pictures, which are combined into a 100*1024 matrix with one row per picture. (This is for ease of computation: many machine learning algorithms use matrix operations, which simplify the code and sometimes reduce the computational cost.)
(3) The test set has 10*5 pictures, and we want the program to determine automatically which digit each picture represents. For each test picture, convert it to a 1*1024 vector, compute its "distance" (the Euclidean distance between the two vectors) to every picture in the training set, sort the distances, and select the k smallest. Because these k samples come from the training set, the digits they represent are known, so the test picture is assigned the digit that occurs most often among these k samples.
Step 1: convert to a 1*1024 feature vector. The filename argument in the program is a file name such as 3_3.txt.

    from numpy import zeros

    # Each sample is a 32*32 binary image; convert it into a 1*1024 feature vector
    def img2vector(filename):
        returnVect = zeros((1, 1024))
        with open(filename) as fr:
            for i in range(32):
                lineStr = fr.readline()
                for j in range(32):
                    returnVect[0, 32*i + j] = int(lineStr[j])
        return returnVect
Steps 2 and 3: combine the training set pictures into one large 100*1024 matrix, and classify the samples of the test set.

    from os import listdir
    from numpy import zeros

    def handwritingClassTest():
        # Load the training set into the large matrix trainingMat
        hwLabels = []
        trainingFileList = listdir('trainingDigits')    # listdir('str') in the os module returns all file names under directory str as a list of strings
        m = len(trainingFileList)
        trainingMat = zeros((m, 1024))
        for i in range(m):
            fileNameStr = trainingFileList[i]           # naming format of the training samples: 1_120.txt
            fileStr = fileNameStr.split('.')[0]         # split on '.', take element 0 of the list to get e.g. 1_120
            classNumStr = int(fileStr.split('_')[0])    # split on '_' to get 1, i.e. the category
            hwLabels.append(classNumStr)
            trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)

        # Read the test pictures one by one and classify each of them
        testFileList = listdir('testDigits')
        errorCount = 0.0
        mTest = len(testFileList)
        for i in range(mTest):
            fileNameStr = testFileList[i]
            fileStr = fileNameStr.split('.')[0]
            classNumStr = int(fileStr.split('_')[0])
            vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
            classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
            print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
            if classifierResult != classNumStr:
                errorCount += 1.0
        print("\nthe total number of errors is: %d" % errorCount)
        print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
The function classify0() is the main body of the classifier: it computes the Euclidean distances and returns the category of the test picture:

    import operator
    from numpy import tile

    # Main classifier: computes the Euclidean distances, selects the k samples
    # with the smallest distance, and returns the most frequent category among them
    # inX is the vector to be classified
    # dataSet is the training sample set, one row per sample
    # labels is the label vector corresponding to dataSet
    # k is the number of nearest neighbors to use
    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]                     # shape[0] gives the number of rows of dataSet, i.e. the number of samples
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet    # tile repeats inX once per training row so the shapes match
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis=1)                # sum(axis=1) sums along rows; axis=0 would sum down columns
        distances = sqDistances**0.5
        sortedDistIndicies = distances.argsort()           # indices that sort the distances in ascending order
        classCount = {}                                    # sortedDistIndicies[0] is the index of the nearest training sample
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1    # get(key, x) returns the value for key, or x (here 0) if key is absent
        # Sort by the second element (the count) in descending order (reverse=True)
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
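For comparison, here is a minimal alternative sketch of the same idea that relies on NumPy broadcasting instead of tile() and on collections.Counter instead of a hand-built dictionary. It is not the code from the book or from the repository; the function name `knn_classify` and the toy data are assumptions made for illustration.

```python
from collections import Counter

import numpy as np

def knn_classify(x, train, labels, k):
    """Return the most common label among the k training rows nearest to x."""
    # Broadcasting subtracts x from every row of train, so tile() is not needed
    distances = np.sqrt(((train - x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]            # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)  # count the labels of those rows
    return votes.most_common(1)[0][0]

# Made-up toy data: two samples per class
train = np.array([[1.0, 1.1],
                  [1.0, 1.0],
                  [0.0, 0.0],
                  [0.0, 0.1]])
labels = ["A", "A", "B", "B"]

result_near_b = knn_classify(np.array([0.1, 0.2]), train, labels, 3)
result_near_a = knn_classify(np.array([0.9, 1.0]), train, labels, 3)
print(result_near_b, result_near_a)
```

Running the classifier on a couple of hand-made points like this is a cheap way to check the voting logic before pointing it at the digit files.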
3. Project files
The entire project, including source code, training set, and test set, can be obtained from the GitHub repository linked at the top of this article.
Enter the "use Python and NumPy" directory, open a Python development environment, import the KNN module, and call the handwriting recognition function handwritingClassTest().
Because the training set and the test set I used are small, there happened to be no recognition errors in my run.
"Machine Learning Algorithm Implementation" KNN algorithm __ Handwriting recognition--based on Python and numpy function library