First, the idea of the k-nearest neighbor algorithm: we have a collection of sample data, called the training set, in which every sample carries a label; that is, for each sample (a record that can be represented as an n-dimensional vector) we know which class it belongs to. When new, unlabeled data arrives, we compare each feature of the new data (each element of its vector) with the corresponding features of the samples in the training set, and look at the labels of the most similar samples. Since the training set can be very large, we take only the k most similar samples and assign the new data the label that occurs most frequently among those k samples.
The general workflow of the k-nearest neighbor algorithm:
(1) Collect data: use local data, or crawl it from web pages.
(2) Prepare the data: convert it into a structured form that is easy to work with.
(3) Analyze the data: any method can be used.
(4) Train the algorithm: this step does not apply to k-nearest neighbors.
(5) Test the algorithm: compute the error rate, where error rate = number of test errors / total number of tests.
(6) Use the algorithm: feed in sample data, output structured results, and determine which class the new data belongs to.
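The error-rate formula in step (5) can be sketched in a few lines of Python; the helper name `error_rate` and the sample labels below are hypothetical, just to illustrate the formula:

```python
def error_rate(predictions, true_labels):
    # Error rate = number of misclassified test samples / total number of test samples
    errors = sum(1 for p, t in zip(predictions, true_labels) if p != t)
    return errors / len(true_labels)

# Hypothetical run: 2 mistakes out of 10 test samples gives an error rate of 0.2
preds = ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B', 'A', 'A']
truth = ['A', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'A', 'A']
print(error_rate(preds, truth))  # -> 0.2
```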
Second, an example of using the k-nearest neighbor algorithm
I am using the Spyder development environment with Python 3.5; Spyder ships with the NumPy library. Create a new file knn.py and complete this chapter's experiments in that file.
Write a data-generation function in knn.py:
from numpy import *
import operator

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
In the Spyder console, enter:

>>> import knn
>>> group, labels = knn.createDataSet()
>>> group
array([[ 1. ,  1.1],
       [ 1. ,  1. ],
       [ 0. ,  0. ],
       [ 0. ,  0.1]])
>>> labels
['A', 'A', 'B', 'B']
The above hints indicate that the function is correct.
Third, the k-nearest neighbor classification function
def classify(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Compute the Euclidean distance from inX to every training sample
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # Let the k nearest neighbors vote on the label
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
To verify, enter in the Spyder console:

>>> knn.classify([0, 0], group, labels, 3)

The output should be 'B'.
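To see why 'B' comes out, the distance step can be checked by hand. This standalone sketch recomputes the Euclidean distances from the query point [0, 0] to the same four training points:

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

# Euclidean distance from [0, 0] to each training point
distances = np.sqrt(((group - np.array([0.0, 0.0])) ** 2).sum(axis=1))
print(distances)  # roughly [1.49, 1.41, 0.0, 0.1]

# The three nearest points are the two 'B' samples and one 'A' sample,
# so the majority vote among k=3 neighbors is 'B'
nearest = [labels[i] for i in distances.argsort()[:3]]
print(nearest)  # -> ['B', 'B', 'A']
```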
Fourth, example: improving matches on a dating site
Helen has been collecting dating data for some time. She keeps the data in the text file Datingdata.txt, one sample per line, 1000 lines in total (she may have dated 1000 people, scary ^_^). Each sample has the following 3 features:
1. Number of frequent flyer miles earned per year
2. Percentage of time spent playing video games
3. Number of ice cream litres consumed per week
The data is stored in a text file, with the values on each line separated by whitespace. Before the data can be fed to the classifier, it must be converted into a format the classifier can process. In knn.py, a function named file2matrix handles this conversion.
def file2matrix(filename):
    fr = open(filename, 'r')
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

returnMat, classLabelVector = file2matrix('Datingdata.txt')
When I ran this program, I kept getting the error message "could not convert string to float: '12 34 56'". My workaround was to change the separators in the text file from spaces to ',' and to change
listFromLine = line.split('\t')
to
listFromLine = line.split(',')
This solves the problem, but it is not the best approach and needs to be improved.
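One possible improvement that avoids editing the data file: calling str.split() with no argument splits on any run of whitespace (spaces, tabs, or newlines), so it handles both separator styles at once. The sample line below is hypothetical:

```python
# str.split() with no argument splits on any run of whitespace,
# so tab-separated and space-separated lines both parse the same way
line = "40920 8.326976 0.953952 3"   # hypothetical space-separated sample
print(line.split())                  # -> ['40920', '8.326976', '0.953952', '3']

# Tabs, repeated spaces, and newlines are all treated as one separator
print("a\tb  c\nd".split())          # -> ['a', 'b', 'c', 'd']
```

With this change, `listFromLine = line.split()` works regardless of which whitespace character separates the columns.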
"Machine learning Combat" learning note a K proximity algorithm