Machine Learning Series - K-Nearest Neighbor

Source: Internet
Author: User
Tags: iterable, ranges, repetition, Andrew Ng, machine learning

This is a record of my self-study. My current theoretical background is university-level advanced mathematics, linear algebra, and probability theory; my programming background is C/C++ and Python.
I am working through the book Machine Learning in Action and getting into the material gradually. I believe anyone who has taken the courses above can start learning machine learning; my own grasp of them is only general, so if you only recognize a formula or a term, look it up and study it as you go. As I write this I have not yet finished the book, so please point out any errors in the article. To repeat: this series only shares the learning process and its small findings, and I cannot guarantee the technical depth of the articles. As my skills improve, I will revisit these notes and correct the mistakes.
My current learning plan is as follows:
Theory: Stanford's machine learning course series by Andrew Ng.
Coding practice: the reference book Machine Learning in Action (Python).
Real-world practice: case studies, selecting an appropriate machine learning algorithm, and analyzing the data.
Evolution: re-implementing part of the machine learning pipeline in C++ on Linux, to be done in the middle stage of learning.

The concept of the k-nearest neighbor algorithm

The k-nearest neighbor algorithm (kNN) is a classification algorithm, in the same family as logistic regression. It works as follows: there is an existing set of sample data, each sample consisting of several features and a known label, and the label indicates which class that sample belongs to; in a simple binary classification, for example, the label could be 1 or 0. When new data is entered, the goal of the algorithm is to output a label for it. The new data is compared with every sample in the set (by computing the distance between the two points), the comparison results are sorted from smallest to largest, the first k results are taken, and the class that appears most frequently among them is the class assigned to the new data.

Advantages and disadvantages of the k-nearest neighbor algorithm:

1. Advantages:

High accuracy, insensitive to outliers, no assumptions about the input data.

2. Disadvantages:

High computational complexity and high space complexity.

3. Applicable data types:

Numeric and nominal values.

Second, a simple example of the k-nearest neighbor algorithm

Pseudo code:

1. Compute the distance between every point in the known sample set and the current point
2. Sort the distances in ascending order
3. Select the k sample points closest to the current point
4. Count how many of those k samples fall into each class and compute each class's frequency
5. Return the most frequent class among the k points as the predicted class of the current point

Take a look at the classification function below; it computes the Euclidean distance between the input point and every point in the sample set and then votes among the k nearest:

from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # subtract the input vector from every row of the data set
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5              # Euclidean distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):                          # vote among the k nearest neighbors
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
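
Before dissecting the function line by line, here is a minimal sketch of calling it. The tiny group/labels data set and the query point are made up for illustration, in the spirit of the book's createDataSet example, and assume classify0 above is already defined in the same session:

from numpy import array

group = array([[1.0, 1.1],
               [1.0, 1.0],
               [0.0, 0.0],
               [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

# a point near the origin should come back as 'B'
print(classify0([0.2, 0.1], group, labels, 3))   # prints: B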

In-depth analysis:

1. dataSetSize = dataSet.shape[0]: gets the number of rows of the dataSet matrix.
2. diffMat = tile(inX, (dataSetSize, 1)) - dataSet: this mainly subtracts the sample data from copies of the input data. Here is how I understand the tile function used in this line:

The tutorials about tile that I found online were a little hard to follow, so I studied it myself until it made sense. I share my understanding here; if anything is wrong, please point it out.
Summary of tile usage:

Definition: tile(A, R), where A and R can each be an array or a single number. The result repeats A according to R, with the following mechanism:

Assume:
tile([a1, a2, ..., an], [r1, r2, ..., rn]) = B
The shape of the result B is determined by R: the size of B's first dimension is r1, of its second dimension r2, and so on, and along the last dimension the array A is repeated rn times. When R is a single number or has length 1, it simply controls how many times A is repeated along one dimension.

Example 1:
tile([1,2], 2) =
array([1, 2, 1, 2])

tile([1,2], 4) =
array([1, 2, 1, 2, 1, 2, 1, 2])

tile([1,2], (1,2)) =    # the 1 means the result has a single row; the 2 repeats [1,2] twice
array([[1, 2, 1, 2]])

tile([1,2], (3,3)) =    # the first 3 gives three rows; the second 3 repeats [1,2] three times in each row
array([[1, 2, 1, 2, 1, 2],
       [1, 2, 1, 2, 1, 2],
       [1, 2, 1, 2, 1, 2]])

tile([1,2], (2,2,2))    # the first 2 gives two blocks along the outermost dimension, the second 2 gives two rows in each block, and the last 2 repeats [1,2] twice in each row; the nesting is layered

Result:
array([[[1, 2, 1, 2],
        [1, 2, 1, 2]],     # dimension dividing line here; note the number of brackets

       [[1, 2, 1, 2],
        [1, 2, 1, 2]]])

And so on
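
To connect this back to classify0, here is a minimal sketch (the input vector and data set are invented for illustration) showing how tile stacks the input vector so it can be subtracted from every row of the sample matrix in one step:

from numpy import array, tile

inX = array([0.0, 0.0])
dataSet = array([[1.0, 1.1],
                 [1.0, 1.0],
                 [0.0, 0.1]])
m = dataSet.shape[0]

diffMat = tile(inX, (m, 1)) - dataSet   # inX is repeated m times, then subtracted row by row
print(diffMat)
# [[-1.  -1.1]
#  [-1.  -1. ]
#  [ 0.  -0.1]]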

3. sqDiffMat = diffMat**2: squares each of the differences just obtained.
4. sqDistances = sqDiffMat.sum(axis=1); distances = sqDistances**0.5: these two lines sum the squared differences along each row and then take the square root, which gives the Euclidean distance from the input point to every sample point.
5. sortedDistIndicies = distances.argsort(): sorts the distances between the input data and every point in increasing order; argsort returns the index of each value rather than the value itself.
6. The for loop selects the k points with the smallest distances. sortedDistIndicies holds a row of indices such as (2, 3, 0, 1): the 2 means the sample with index 2 in the data set is closest to the input data, and so on, with 1 the farthest. Since the data set and the labels correspond one to one, looping i from 0 up to k picks out the label values of the k closest samples. The second line of the loop counts these labels and stores the counts in classCount.
7. sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True): sorted can order many kinds of data; here it sorts the label counts accumulated over the k nearest samples from largest to smallest. The result is saved in sortedClassCount, so sortedClassCount[0][0] is the label that appears most often among the k nearest points. The usage of sorted is described in Appendix 1.
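
As a quick check of steps 5-7, here is a minimal sketch (the distances and labels are invented for illustration) of how argsort and the vote dictionary behave:

from numpy import array
import operator

distances = array([1.3, 0.2, 0.9, 0.1])
labels = ['A', 'B', 'A', 'B']
k = 3

sortedDistIndicies = distances.argsort()   # array([3, 1, 2, 0]); indices from nearest to farthest
classCount = {}
for i in range(k):
    voteIlabel = labels[sortedDistIndicies[i]]
    classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
# classCount is now {'B': 2, 'A': 1}
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(sortedClassCount[0][0])              # prints: B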

Practical application of the k-nearest neighbor algorithm: predicting matches on a dating site

I will not describe the background in detail; in short, the goal is to take an input sample with three features: 1. the number of frequent flyer miles earned per year, 2. the percentage of time spent playing video games, and 3. the number of liters of ice cream consumed per week, and classify it based on these three features. The principle is still to compute the distance between this sample and the others; the only difference is that the distance is now measured in three-dimensional space.

1. First step: parse the data.
Let's look at what the data looks like:

The first column is the number of miles flown, the second column is the percentage of time spent gaming, and the third is the number of liters of ice cream eaten. The fourth column is the label value, of which there are three kinds: did not like, somewhat attractive, very attractive. The book provides code to parse the data; the goal is to read these values into an array in a regular form, which makes later processing convenient. The specific code:

from numpy import zeros

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())     # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))   # prepare the matrix to return
    classLabelVector = []                   # prepare the label vector to return
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

The code above is run as follows:
Input: datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt'). Note that the argument here is datingTestSet2.txt; the file name used in the book will report an error:

ValueError: invalid literal for int() with base 10: 'largeDoses'

Then look at the results of the conversion:
datingDataMat:

The resulting labels:

In this way the data read from the file is in the format we need, and the algorithm can be used to process it later.

2. Step two: normalize the values.
As the data shows, the values in the first column are several powers of 10 larger than those in the next two columns, so when the distance between points is computed, the contribution of the second and third columns shrinks or even disappears entirely. Therefore all of the data is normalized into the range 0-1 by a fixed calculation, so that adding and subtracting values across the features becomes meaningful.

The following formula converts a value into the range 0-1:

newValue = (oldValue - minValue)/(maxValue - minValue)

where minValue and maxValue are, respectively, the smallest and largest values of that feature in the data. For example, if a feature ranges from 0 to 100,000, a raw value of 25,000 becomes (25000 - 0)/(100000 - 0) = 0.25.

The Python code is implemented as follows:

from numpy import zeros, shape, tile

def autoNorm(dataSet):
    minVals = dataSet.min(0)                         # column-wise minimums of dataSet
    maxVals = dataSet.max(0)                         # column-wise maximums
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))    # oldValue - minValue
    normDataSet = normDataSet / tile(ranges, (m, 1)) # element-wise divide
    return normDataSet, ranges, minVals
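
As a quick sanity check, here is a minimal sketch that runs autoNorm (as defined above) on a small made-up matrix; the values are invented for illustration, and each column ends up in the range 0-1:

from numpy import array

toy = array([[40000.0, 8.0, 0.5],
             [10000.0, 2.0, 1.5],
             [70000.0, 6.0, 1.0]])
normToy, ranges, minVals = autoNorm(toy)
print(normToy)
# [[ 0.5         1.          0.        ]
#  [ 0.          0.          1.        ]
#  [ 1.          0.66666667  0.5       ]]
print(ranges)    # [ 60000.      6.      1.]
print(minVals)   # [ 10000.      2.      0.5]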

The results of running autoNorm on the dating data are as follows:

3. Step three: test the algorithm

Straight to the code:

def datingClassTest():
    hoRatio = 0.10       # hold out 10%
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i])
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print "the total error rate is: %f" % (errorCount / float(numTestVecs))

The first two lines load the data from the file and then normalize it. m = normMat.shape[0] is the total number of samples, and numTestVecs = int(m * hoRatio) is the number of samples held out for testing; with hoRatio = 0.1, the test data is 10% of the whole set. Each test sample, together with the remaining data and label values, is then fed into the classifier classify0, which was analyzed earlier. The first argument is the data to be tested, normMat[i, :], which means "take row i of normMat"; i runs from 0 up to numTestVecs. For example:

normMat[0, :]
array([ 0.44832535,  0.39805139,  0.56233353])

normMat[1, :]
array([ 0.15873259,  0.34195467,  0.98724416])

The other arguments, normMat[numTestVecs:m, :] and datingLabels[numTestVecs:m], are the known data and labels from row numTestVecs to the end; they serve as the training set, while the rows before numTestVecs are the data to be tested.
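
A rough sketch of how the hold-out split works (the row count 1000 matches the size of the book's dating data set, but treat the numbers as illustrative):

m = 1000                        # total number of rows in normMat
hoRatio = 0.10
numTestVecs = int(m * hoRatio)
print(numTestVecs)              # 100
# rows 0 .. numTestVecs-1 are used one at a time as test points;
# rows numTestVecs .. m-1 (normMat[numTestVecs:m, :]) and their labels form the training set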

Here is the result of the run:

You can see that the final error rate is 6.4%, which is a good result. From here, we can change the error rate by adjusting the value of hoRatio and the variable k.
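
For example, here is a rough sketch of sweeping k to see its effect on the error rate. The helper errorRateForK is hypothetical (not from the book); it assumes file2matrix, autoNorm, and classify0 above are defined and that datingTestSet2.txt is available:

def errorRateForK(k, hoRatio=0.10):
    # same procedure as datingClassTest, but with k as a parameter
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        result = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                           datingLabels[numTestVecs:m], k)
        if result != datingLabels[i]:
            errorCount += 1.0
    return errorCount / float(numTestVecs)

for k in (1, 3, 5, 7, 9):
    print("k = %d, error rate = %f" % (k, errorRateForK(k)))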

Everything so far only tests the algorithm, with the test data taken directly from the existing data set. Next, we build a complete program around it:

Here's the full code:

from numpy import array

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(raw_input("percentage of time spent playing video games?"))
    ffMiles = float(raw_input("frequent flier miles earned per year?"))
    iceCream = float(raw_input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print "You will probably like this person: ", resultList[classifierResult - 1]

The first line is the list of labels, with three categories.
Lines 2-4 ask the user to enter the percentage of time spent playing games, the number of miles flown, and the liters of ice cream consumed.

The next two lines read the data from the file, convert it, and save the normalization parameters.

Finally, the normalized input is passed to the classifier, and the classification result is printed.

Here is the result of the operation:

Appendix 1: the sorted function

Python's built-in sorting function sorted can sort a list or any other iterable. According to the official documentation (http://docs.python.org/2/library/functions.html?highlight=sorted#sorted), the Python 2 prototype is:

sorted(iterable[, cmp[, key[, reverse]]])

Parameter explanation:

(1) iterable specifies the list or iterable to be sorted;

(2) cmp is a function used to compare elements during sorting; you can pass a named function or a lambda, for example:

students is a list whose members each have three fields. When sorting with sorted you can supply your own comparison; for example, to sort by the third field, the code can be written as:
students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
sorted(students, key=lambda student: student[2])

(3) key is a function that specifies which item of each element to sort by; its use is illustrated by the same example. The code is as follows:
sorted(students, key=lambda student: student[2])

The lambda passed to key takes the third field of each element student (that is, student[2]), so when sorted runs it orders all the elements of students by their third field.

With the operator.itemgetter function mentioned above you can do the same thing; for example, to sort by the third field of each student, you can write:
sorted(students, key=operator.itemgetter(2))
The sorted function can also sort on multiple keys; for example, to sort by the second field and then by the third field, you can write:
sorted(students, key=operator.itemgetter(1, 2))

That is, it sorts by the second field first, and then by the third field.
(4) The reverse parameter needs no further explanation: it is a bool indicating ascending or descending order. The default is False (ascending); when set to True, the sort is in descending order.
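
Putting the pieces together, here is a minimal runnable sketch of the sorted calls above; the expected outputs in the comments assume the students list shown earlier:

import operator

students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

print(sorted(students, key=lambda student: student[2]))
# [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

print(sorted(students, key=operator.itemgetter(1, 2)))
# [('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]

print(sorted(students, key=operator.itemgetter(2), reverse=True))
# [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]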
