Machine Learning (1): Learning the K-Nearest Neighbor Algorithm and Putting It into Practice on Kaggle

This post takes the Kaggle handwritten digit recognition competition as its practical goal and uses learning the KNN algorithm as the guiding thread.

    1. Why I wrote this post
    2. What is KNN
    3. How KNN works
    4. Kaggle in Practice
    5. Advantages, disadvantages, and optimization methods
    6. Summary
    7. References
Why I wrote this post

Machine learning is a very hot topic in artificial intelligence, yet for various reasons many people find it hard to understand and learn. I believe that, in today's world, understanding and learning machine learning is well worth the effort. So why write yet another post when so many blogs already introduce machine learning in detail? Because many of those posts, classic as they are, read like textbooks, and for readers whose mathematical background is limited and whose programming skills are average, they are not ideal for getting started. So here I summarize a machine learning algorithm I studied recently, the K-nearest neighbor algorithm, together with a hands-on exercise, and I hope it offers you some inspiration and helps you get started.

So what is the K-nearest neighbor algorithm?

In pattern recognition and machine learning, the K-nearest neighbors algorithm (KNN) is a common classification method in supervised learning.

How KNN works

KNN is arguably the simplest algorithm in machine learning, and I hope it can lead you into the field, help you understand the most basic principles, and show how they apply to real problems. Its working mechanism is very simple: it is a non-parametric algorithm for classification and regression. In short, using some distance metric, it computes the distance between a test sample and every training sample, selects the k training samples with the smallest distances, and assigns the test sample the class that appears most often among those k samples.
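To make the mechanism concrete, here is a minimal, self-contained sketch of KNN on a toy 2-D data set; the points, labels, and the knn_predict helper are made up for illustration and are separate from the Kaggle pipeline built later in this post.

import numpy as np
from collections import Counter

def knn_predict(x, train_X, train_y, k=3):
    # Distance from x to every training point (Euclidean), then majority vote among the k closest
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(train_y[nearest]).most_common(1)[0][0]

train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_y = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(np.array([0.9, 0.9]), train_X, train_y, k=3))   # prints 'A'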

Terminology and a running example, using handwritten digit recognition:

Training set: a set of labeled digit images. Every image carries a label telling us which digit it shows. In this case, all the images in the data set are stored in one matrix.

Test set: a set of digit images without labels. We are given the pictures, but we are not told which category each one belongs to.

Classification: in handwritten digit recognition, for example, a person can easily read the digit written in a picture, but a computer cannot recognize it directly. One application of machine learning is to let the computer infer the category of unknown samples from samples whose categories are already known.

Regression: think of fitting a function. A function's graph is continuous and follows some regularity, so it can be used to compute values for unseen inputs. The computer takes the known data, fits a function to model it, and then uses that model to predict unknown cases.

Distance metrics: Euclidean distance, Manhattan distance, Chebyshev distance.

Sample: in this post, each sample is one digit image. The samples in the test set are unclassified; the samples in the training set have known classes.
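For reference, here is a minimal sketch of the three distance metrics listed above, computed with NumPy; the two example vectors are arbitrary.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())   # sqrt(9 + 4 + 0) ≈ 3.61
manhattan = np.abs(a - b).sum()             # 3 + 2 + 0 = 5
chebyshev = np.abs(a - b).max()             # max(3, 2, 0) = 3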

Enough talk; let's start writing code!

Kaggle in Practice

Kaggle hosts a knowledge-level (practice) competition for exactly this task, handwritten digit recognition. Alright, that's the one we'll take on!

First, download the training set and the test set from Kaggle. Opening the training set, you can see that it consists of 42000 digit images, which we can convert into a 42000×1 label matrix and a 42000×784 pixel matrix. (Note: the normaling and toInt functions format the returned data; both are described later.)

# Read the training data
def loadTrainData():
    fileName = 'train.csv'
    with open(fileName, 'r') as f_obj:
        f = [x for x in csv.reader(f_obj)]
        f.remove(f[0])                 # drop the header row
        f = array(f)
        labels = f[:, 0]               # first column: labels
        datas = f[:, 1:]               # remaining 784 columns: pixel values
        return normaling(toInt(datas)), toInt(labels)

Now open the test set. Because the test set is unlabeled, it has no label column, and we can convert it into a 28000×784 pixel matrix.

# Read the test data
def loadTestData():
    fileName = 'test.csv'
    with open(fileName, 'r') as f_obj:
        f = [x for x in csv.reader(f_obj)]
        f.remove(f[0])                 # drop the header row
        f = array(f)
        return normaling(toInt(f))

The normaling function mentioned above performs min-max normalization of the data set. Normalization makes the different features comparable and prevents features with large values from dominating the distance calculation and skewing the classification result.
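In symbols, this is the standard min-max scaling formula, applied to each pixel column separately (the notation here is mine, added for reference):

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

so every normalized value falls in the range [0, 1].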

# Min-max normalization of the data set
def normaling(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    denominator = tile(ranges, (m, 1))             # per-column range, repeated for each row
    molecular = dataSet - tile(minVals, (m, 1))    # numerator: subtract the per-column minimum
    normData = molecular / denominator
    return normData
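One caveat, which the code above does not handle: for columns whose value is identical across all training images (in the Kaggle digit data, border pixels are typically all 0), ranges becomes 0 and the division produces NaN values. A small guard like the following sketch avoids that; normalingSafe is a hypothetical variant, not part of the original post.

from numpy import tile

def normalingSafe(dataSet):
    minVals = dataSet.min(0)
    ranges = dataSet.max(0) - minVals
    ranges[ranges == 0] = 1            # constant columns: divide by 1 instead of 0
    m = dataSet.shape[0]
    return (dataSet - tile(minVals, (m, 1))) / tile(ranges, (m, 1))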

The toInt function is needed because the data we read from the CSV file are strings, while the distance computation works on numbers, so we have to convert the string type to a numeric type.

# Convert an array of strings to a numeric matrix
def toInt(array):
    array = mat(array)
    m, n = shape(array)
    newArray = zeros((m, n))
    for i in range(m):
        for j in range(n):
            newArray[i, j] = int(array[i, j])
    return newArray
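As a side note, the element-by-element loop above is easy to follow but slow for a 42000×784 matrix. NumPy can do the same conversion in one vectorized call; toIntFast below is a hypothetical alternative (it returns a plain array rather than the matrix returned by toInt).

from numpy import array

def toIntFast(strArray):
    # Parse every string cell to an integer in a single vectorized call
    return array(strArray).astype(int)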

So our goal is to compute, for each sample in the test set, its distance to every training sample, select the K training samples closest to it, and among those K samples choose the category that occurs most often as the predicted category. The distance calculation and the vote are shown in the following code:

# Core code: classify inX by majority vote among its k nearest neighbors
def K_nn(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # difference to every training sample
    sqDiffMat = diffMat ** 2
    sqDistance = sqDiffMat.sum(axis=1)
    distances = sqDistance ** 0.5                     # Euclidean distances
    sortDisN = argsort(distances)                     # indices sorted by ascending distance
    classCount = {}
    for i in range(k):
        vote = labels[sortDisN[i]]
        vote = ''.join(map(str, vote))                # turn the label row into a string key
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedD = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedD[0][0]

Putting the code above together, we can classify the whole test set.

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# @Date:   2018/6/24 19:35
# @Author: Syler
import csv
import operator
from numpy import *


# Core code: classify inX by majority vote among its k nearest neighbors
def K_nn(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # difference to every training sample
    sqDiffMat = diffMat ** 2
    sqDistance = sqDiffMat.sum(axis=1)
    distances = sqDistance ** 0.5                     # Euclidean distances
    sortDisN = argsort(distances)                     # indices sorted by ascending distance
    classCount = {}
    for i in range(k):
        vote = labels[sortDisN[i]]
        vote = ''.join(map(str, vote))                # turn the label row into a string key
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedD = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedD[0][0]


# Read the training data
def loadTrainData():
    fileName = 'train.csv'
    with open(fileName, 'r') as f_obj:
        f = [x for x in csv.reader(f_obj)]
        f.remove(f[0])                 # drop the header row
        f = array(f)
        labels = f[:, 0]               # first column: labels
        datas = f[:, 1:]               # remaining 784 columns: pixel values
        return normaling(toInt(datas)), toInt(labels)


# Read the test data
def loadTestData():
    fileName = 'test.csv'
    with open(fileName, 'r') as f_obj:
        f = [x for x in csv.reader(f_obj)]
        f.remove(f[0])                 # drop the header row
        f = array(f)
        return normaling(toInt(f))


# Min-max normalization of the data set
def normaling(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    denominator = tile(ranges, (m, 1))             # per-column range, repeated for each row
    molecular = dataSet - tile(minVals, (m, 1))    # numerator: subtract the per-column minimum
    normData = molecular / denominator
    return normData


# Convert an array of strings to a numeric matrix
def toInt(array):
    array = mat(array)
    m, n = shape(array)
    newArray = zeros((m, n))
    for i in range(m):
        for j in range(n):
            newArray[i, j] = int(array[i, j])
    return newArray


# Save the results
def saveResult(res):
    with open('res.csv', 'w', newline='') as fw:
        writer = csv.writer(fw)
        writer.writerows(res)


if __name__ == '__main__':
    dataSet, labels = loadTrainData()
    testSet = loadTestData()
    row = testSet.shape[0]
    labels = labels.reshape(labels.shape[1], 1)    # (1, 42000) -> (42000, 1)
    resList = []
    for i in range(row):
        res = K_nn(testSet[i], dataSet, labels, 4)
        resList.append(res)
        print(i)                                   # progress indicator
    saveResult(resList)
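One practical detail when submitting: the competition expects a CSV with an ImageId,Label header and 1-based image ids, whereas saveResult above writes bare rows. saveSubmission below is a hypothetical variant that produces a file in that format.

import csv

def saveSubmission(resList, fileName='res.csv'):
    with open(fileName, 'w', newline='') as fw:
        writer = csv.writer(fw)
        writer.writerow(['ImageId', 'Label'])                     # header expected by Kaggle
        writer.writerows((i + 1, label) for i, label in enumerate(resList))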

So how does this submission score on Kaggle? Let's take a look.
Overall, the result was satisfactory. After all, KNN is one of the most basic machine learning algorithms, so reaching this kind of ranking with it is already quite good.

Advantages:

Simple, easy to understand, easy to implement, no training required.
Suitable for classifying rare events.
For multi-class problems in particular, KNN can perform better than SVM.

Disadvantages:

The KNN algorithm is an instance-based, or "lazy", learning method. To use it, the training samples must be as close to the real data as possible, largely because there is no separate training step: the algorithm has to keep the entire data set, so a large data set requires a lot of storage space. In addition, classifying or regressing each new sample means computing its distance to every sample in the data set, which can be very time-consuming in practice. KNN is also strongly affected by noise, and when the classes are unbalanced the classification result can be heavily biased. Another drawback is that it gives no information about the underlying structure of the data; it cannot tell us what characterizes the relationship between the test set and the training set.

Optimization methods

Current improvements to the KNN algorithm focus on two aspects: classification efficiency and classification accuracy.
One popular approach is to use evolutionary algorithms to optimize feature scaling.
Another is choosing a suitable value of K through various heuristic methods.
For both classification and regression, neighbors can be weighted by distance so that closer neighbors contribute more to the prediction (see the sketch below).
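To make the last point concrete, here is a minimal sketch of distance-weighted voting: instead of counting each of the k neighbors equally, each neighbor's vote is weighted by the inverse of its distance. The weighting scheme and the small epsilon are my own illustrative choices, not part of the original post.

import numpy as np

def weighted_knn_predict(x, train_X, train_y, k=3, eps=1e-8):
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))   # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]
    votes = {}
    for idx in nearest:
        w = 1.0 / (dists[idx] + eps)                    # closer neighbors get larger weights
        votes[train_y[idx]] = votes.get(train_y[idx], 0.0) + w
    return max(votes, key=votes.get)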

Summary

The KNN algorithm is one of the simplest and most effective algorithms for classifying data. It helps us quickly understand the basic model of a classification algorithm in supervised learning, and it also helps beginners build confidence in studying machine learning. Most of all, I hope this post gives you some knowledge of, and interest in, machine learning.

References

"Machine learning Combat"
"Machine learning"
Wikipedia

GitHub Address: https://github.com/578534869/machine-learning
(You're welcome to follow; let's learn from each other and make progress together! :-))
