Python uses the k-nearest neighbor (KNN) algorithm to classify the MNIST and Fashion-MNIST datasets




I. Introduction to the KNN algorithm



The k-nearest neighbor (KNN) classification algorithm is one of the simplest and most theoretically mature machine learning algorithms. KNN first represents the sample to be classified as a feature vector of the same form as the training samples, then computes the distance between the test sample and every training sample, selects the K samples with the smallest distances as the nearest neighbors, and finally judges the category of the test sample from those K neighbors. Choosing K correctly is one of the key factors for correct classification, and since the nearest neighbors are selected by computing the distance between the test sample and every training sample, defining a suitable distance is a precondition for KNN to classify correctly.



Building on the above, this paper treats all feature attribute values as equally important, and defines the sample distance as the relative distance between the pixel values of any two samples; this distance is then used for the nearest-neighbor calculation.



II. Principle of the algorithm



The k-nearest neighbor algorithm works as follows. There is a collection of sample data, also called the training sample set, in which every sample carries a label, so the correspondence between each sample and its category is known. When new, unlabeled data arrives, each of its features is compared with the features of the samples in the training set, and the algorithm extracts the classification labels of the most similar (nearest) samples. In general, only the first k most similar samples are used, which is where the K in "k-nearest neighbor" comes from; k is usually an integer no greater than 20. Finally, the most frequent classification among those k samples is assigned to the new data.
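
To make this concrete, here is a minimal, self-contained sketch of the neighbor-vote step on toy 2-D data (illustrative only; the full implementation for MNIST appears later in this article):

import numpy as np

def knn_predict(test_point, train_data, train_labels, k=3):
    # Euclidean distance from the test point to every training sample
    dists = np.sqrt(((train_data - test_point) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]      # indices of the k smallest distances
    votes = train_labels[nearest]        # labels of those k neighbors
    return np.bincount(votes).argmax()   # the most frequent label wins

# Toy data: two clusters labelled 0 and 1
train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
labels = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.2, 0.1]), train, labels, k=3))  # prints 0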



Collect and prepare the data, using the MNIST and Fashion-MNIST datasets; feed in the sample data and produce structured output; adjust the value of K; run the k-nearest neighbor algorithm to determine which class each input sample belongs to; and finally compute the error rate and accuracy.



The KNN (k-nearest neighbor) classification algorithm says that each sample can be represented by its K closest neighbors. The core idea: if the majority of a sample's k nearest neighbors in feature space belong to one category, then the sample belongs to that category and shares the characteristics of samples in that category. KNN can be used not only for classification but also for regression: by locating the k nearest neighbors of a sample and assigning the average of those neighbors' attribute values to it, the sample's attributes can be estimated. In KNN, the distance between objects is computed as a dissimilarity measure between them, which avoids the problem of matching objects feature by feature; the distance used here is the Euclidean distance.
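
For reference, the Euclidean distance between two flattened images a and b is d(a, b) = sqrt(sum_i (a_i - b_i)^2), which in NumPy (a small sketch, assuming two 1-D float arrays) is simply:

import numpy as np

a = np.array([0.0, 3.0, 4.0])
b = np.zeros(3)
d = np.sqrt(np.sum((a - b) ** 2))  # 5.0, same as np.linalg.norm(a - b)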



Detailed implementation: import the MNIST or Fashion-MNIST dataset (training set and validation set) into the project, compute the distance between each validation sample and every training sample, take the K nearest neighbors in order of increasing distance, vote to find the category with the most votes, and assign the validation image to that category. Then compare the predicted category with the validation label: a match counts as correct, otherwise as an error. Finally, output the computed error rate and accuracy.


Dataset introduction: the MNIST training set has 60,000 images with labels; the test set has 10,000 images with labels. After reading each 28*28 image, it is flattened into a 1*784 vector. KNN algorithm implementation and result analysis. Code implementation:
import struct

import numpy as np
import matplotlib.pyplot as plt  # only used by the optional preview lines below


# Read images from an IDX3-ubyte file
def read_image(file_name):
    # Read the whole file in binary mode
    file_handle = open(file_name, "rb")
    file_content = file_handle.read()  # read into a buffer

    offset = 0
    head = struct.unpack_from('>iiii', file_content, offset)  # first 4 big-endian integers: magic, count, rows, cols
    offset += struct.calcsize('>iiii')
    img_num = head[1]  # number of images
    rows = head[2]     # image height
    cols = head[3]     # image width

    images = np.empty((img_num, rows * cols))  # np.empty allocates without initialising, the fastest way to create the array
    image_size = rows * cols                   # size of a single image
    fmt = '>' + str(image_size) + 'B'          # unpack format of a single image

    for i in range(img_num):
        images[i] = np.array(struct.unpack_from(fmt, file_content, offset))
        offset += struct.calcsize(fmt)
    return images

    # Alternative: unpack all pixels at once
    # bits = img_num * rows * cols         # 60000 * 28 * 28 pixel values in total
    # bits_string = '>' + str(bits) + 'B'  # fmt string such as '>47040000B'
    # imgs = struct.unpack_from(bits_string, file_content, offset)  # returns a tuple
    # return np.array(imgs).reshape((img_num, rows * cols))  # reshape to (images, pixels per image)


# Read labels from an IDX1-ubyte file
def read_label(file_name):
    file_handle = open(file_name, "rb")  # open the file in binary
    file_content = file_handle.read()    # read into a buffer

    head = struct.unpack_from('>ii', file_content, 0)  # first 2 big-endian integers: magic, count
    offset = struct.calcsize('>ii')

    label_num = head[1]  # number of labels
    bits_string = '>' + str(label_num) + 'B'
    labels = struct.unpack_from(bits_string, file_content, offset)  # returns a tuple
    return np.array(labels)


# KNN algorithm
def knn(test_data, data_set, labels, k):
    data_set_size = data_set.shape[0]  # shape[0] is the number of rows, i.e. the number of training samples
    # Euclidean distance calculation: tile repeats the test sample data_set_size times along the rows
    distance1 = np.tile(test_data, (data_set_size, 1)) - data_set
    distance2 = distance1 ** 2         # square each element
    distance3 = distance2.sum(axis=1)  # sum each row
    distances4 = distance3 ** 0.5      # Euclidean distance calculation ends
    sorted_dist_indices = distances4.argsort()  # indices sorted from smallest to largest distance
    class_count = np.zeros(10, np.int32)        # 10 classes
    for i in range(k):  # count the classes of the first k neighbors
        vote_label = labels[sorted_dist_indices[i]]
        class_count[vote_label] += 1

    # Pick the class with the most votes
    max_count = 0
    class_id = 0
    for i in range(class_count.shape[0]):
        if class_count[i] >= max_count:
            max_count = class_count[i]
            class_id = i
    return class_id


def test_knn():
    # File paths
    # MNIST dataset
    # train_image = "F:\\mnist\\train-images-idx3-ubyte"
    # test_image = "F:\\mnist\\t10k-images-idx3-ubyte"
    # train_label = "F:\\mnist\\train-labels-idx1-ubyte"
    # test_label = "F:\\mnist\\t10k-labels-idx1-ubyte"
    # Fashion-MNIST dataset
    train_image = "train-images-idx3-ubyte"
    test_image = "t10k-images-idx3-ubyte"
    train_label = "train-labels-idx1-ubyte"
    test_label = "t10k-labels-idx1-ubyte"
    # Read the data
    train_x = read_image(train_image)  # training images
    test_x = read_image(test_image)    # test images
    train_y = read_label(train_label)  # training labels
    test_y = read_label(test_label)    # test labels

    # Optional preview of the first training image:
    # plt.imshow(train_x[0].reshape(28, 28), cmap='gray'); plt.show()

    test_ratio = 1  # fraction of the test set to evaluate; lower it (e.g. 0.1) for a quicker run
    test_row = test_x.shape[0]  # total number of test samples
    test_num = int(test_row * test_ratio)
    error_count = 0.0  # number of misclassifications
    for i in range(test_num):
        result = knn(test_x[i], train_x, train_y, 30)
        print(result, test_y[i])
        if result != test_y[i]:
            error_count += 1.0  # an error whenever the prediction differs from the true test label
    error_rate = error_count / float(test_num)
    acc = 1.0 - error_rate
    print("\nthe total number of errors is: %d" % error_count)
    print("\nthe total error rate is: %f" % error_rate)
    print("\nthe total accuracy rate is: %f" % acc)


if __name__ == "__main__":
    test_knn()  # reads the datasets, classifies the test samples, and prints the error rate and accuracy
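
A note on running this (my assumption, not stated in the original): the four IDX files must be downloaded and decompressed into the working directory first, and since each prediction scans all 60,000 training images, evaluating the full 10,000-image test set with this per-sample loop is slow; lowering test_ratio gives a quick sanity check like the smaller runs reported below.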





Results analysis:

Input: MNIST dataset or Fashion-MNIST dataset



Output: Error rate and accuracy



MNIST dataset:



With k=30 and a validation set of 50, the accuracy is 1;



With k=30 and a validation set of 500, the accuracy is 0.98;



With k=30 and a validation set of 10,000, the accuracy is 0.84.



Fashion-MNIST dataset:



With k=30 and a validation set of 10,000, the total number of errors is 1666 and the accuracy is 0.8334.



The KNN algorithm achieves high accuracy on these datasets, but this paper treats every feature attribute value as equally important to the category judgment. An improved algorithm should weigh how important each attribute value is to the class decision; the correlation distance between two samples can be used to measure that importance: the smaller the correlation-distance entropy, the greater the similarity between the two samples and the greater the confidence in the class. In addition, different values of K should be tested separately to find the K with the highest accuracy, and when experimenting with multiple values of K, the experiments can be run in multiple threads to shorten the running time.
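
As a sketch of the multi-K experiment suggested above (my own illustration, not code from this paper): because the neighbor ranking does not depend on K, the distances can be sorted once per test sample and every candidate K scored from the same ranking, which is much cheaper than rerunning the whole search for each K:

import numpy as np

def evaluate_many_k(test_x, test_y, train_x, train_y, k_values):
    # Score several values of K from a single distance sort per test sample
    correct = {k: 0 for k in k_values}
    for x, y in zip(test_x, test_y):
        dists = np.sqrt(((train_x - x) ** 2).sum(axis=1))
        ranked = train_y[np.argsort(dists)]          # neighbor labels, nearest first
        for k in k_values:
            pred = np.bincount(ranked[:k]).argmax()  # majority vote among the top k
            correct[k] += int(pred == y)
    return {k: correct[k] / len(test_y) for k in k_values}

# e.g. evaluate_many_k(test_x[:500], test_y[:500], train_x, train_y, [1, 5, 15, 30])

The outer loop over test samples is embarrassingly parallel, so it could also be split across workers (for example with concurrent.futures) to shorten the experiment further.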







