Chapter Content
K-Nearest Neighbor classification algorithm
Parsing and importing data from a text file
Creating a scatter plot with Matplotlib
Normalized values
2.1 An overview of the k-nearest neighbor algorithm
Simply put, the k-nearest neighbor algorithm classifies a sample by measuring the distances between the feature values of different samples.
The first machine learning algorithm described in this book is the k-nearest neighbor algorithm (kNN). It works as follows: we have a collection of sample data, also known as a training sample set, and a label for each item in the sample set; that is, we know which category each item in the sample set belongs to. When new data without a label is input, each feature of the new data is compared with the features in the sample set, and the algorithm extracts the classification label of the most similar (nearest-neighbor) sample. In general, we look only at the k most similar items in the sample data set, which is the source of the k in the k-nearest neighbor algorithm; k is usually an integer no greater than 20. Finally, the most frequent classification among the k most similar items is chosen as the classification of the new data.
2.1.1 Preparation: Importing data using Python
2.1.2 Parsing data from a text file
The function's purpose is to use the k-nearest neighbor algorithm to assign each item of data to a class. Its pseudocode is as follows:
For each point in the dataset with an unknown category label, do the following:
(1) Calculate the distance between each point in the known-category dataset and the current point;
(2) Sort the distances in ascending order;
(3) Select the k points with the smallest distance to the current point;
(4) Determine the frequency of occurrence of each category among these k points;
(5) Return the most frequent category among the k points as the predicted classification of the current point.
The Python function classify0() is shown in Listing 2-1.
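Listing 2-1 itself does not appear in this excerpt; the following is a minimal sketch of what classify0() might look like, implementing the five pseudocode steps above with NumPy (the exact body in Listing 2-1 may differ):

```python
import numpy as np
import operator

def classify0(inX, dataSet, labels, k):
    """Classify input vector inX against training set dataSet (one row per sample)."""
    # (1) Euclidean distance from inX to every point in the training set
    diffMat = dataSet - np.asarray(inX)
    distances = np.sqrt((diffMat ** 2).sum(axis=1))
    # (2)-(3) Sort distances in ascending order and take the k nearest points
    kNearest = distances.argsort()[:k]
    # (4) Count how often each label occurs among those k points
    classCount = {}
    for i in kNearest:
        voteLabel = labels[i]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # (5) Return the most frequent label
    return max(classCount.items(), key=operator.itemgetter(1))[0]
```

For example, with a small training set `group` of four points labeled 'A', 'A', 'B', 'B', calling `classify0([0, 0], group, labels, 3)` returns the majority label among the three points nearest the origin.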
2.1.3 How to test the classifier
We have now constructed our first classifier using the k-nearest neighbor algorithm, and we can check whether the classifier's answers match our expectations. The reader may ask, "Under what circumstances does the classifier go wrong?" or "Is the answer always correct?" The answer is no: the classifier does not always get the right result, and we can use a variety of methods to measure its accuracy.
In addition, the performance of a classifier is affected by many factors, such as the classifier's settings and the data set. Different algorithms may behave quite differently on different data sets, which is why this part of the book devotes so much space to classification algorithms. To test a classifier's effectiveness, we can use data whose answers are already known; of course, we must not reveal those answers to the classifier, and we then check whether the classifier's results agree with the expected results. With a large amount of test data we can compute the classifier's error rate: the number of times the classifier gives a wrong result divided by the total number of test runs. The error rate is a common evaluation measure, used primarily to evaluate a classifier's performance on a data set. A perfect classifier has an error rate of 0, and the worst classifier has an error rate of 1.0, in which case it never finds a correct answer. The reader will see actual data examples in later chapters.
The example described in the previous section works, but it is not very useful in practice. The remaining two sections of this chapter apply the k-nearest neighbor algorithm to real-world problems: first we will use it to improve the matching results of a dating site, and then to build a handwriting recognition system. We will use the handwriting recognition system's test program to measure the effectiveness of the k-nearest neighbor algorithm.
2.2 Example: Using the k-nearest neighbor algorithm to improve matches on a dating site
2.2.1 Preparing data: Parsing data from a text file
2.2.2 Analyzing data: Creating a scatter plot with matplotlib
The scatter function provided by the Matplotlib library supports personalizing the points on a scatter plot. Re-enter the code above and use the following parameters when calling the scatter function:
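The specific parameters are not reproduced in this excerpt. As a sketch, one common approach is to scale each point's marker size and color by its class label so the classes become visually distinguishable; `datingDataMat` and `datingLabels` below are random stand-ins for the parsed dating data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in an interactive session
import matplotlib.pyplot as plt

# Stand-in data: 100 samples with 3 features and class labels 1-3
rng = np.random.default_rng(0)
datingDataMat = rng.random((100, 3))
datingLabels = rng.integers(1, 4, size=100)

fig, ax = plt.subplots()
# Marker size (s) and color (c) both vary with the class label
sc = ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
                s=15.0 * datingLabels, c=15.0 * datingLabels)
ax.set_xlabel("feature 2")
ax.set_ylabel("feature 3")
fig.savefig("dating_scatter.png")
```

The `s` and `c` keyword arguments accept per-point arrays, which is what makes this per-class personalization possible with a single scatter call.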
2.2.3 Preparing data: Normalized values
When dealing with feature values that have such different ranges, we usually normalize them, for example rescaling the values to the range 0 to 1 or -1 to 1. The following formula converts a feature value from an arbitrary range into a value in the interval from 0 to 1:

newValue = (oldValue - min) / (max - min)

Here min and max are the smallest and largest values of that feature in the data set.
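As a sketch, the formula can be applied column by column with NumPy; the function name and return values here are illustrative:

```python
import numpy as np

def autoNorm(dataSet):
    """Rescale every feature column of dataSet into the interval [0, 1]."""
    minVals = dataSet.min(axis=0)   # per-column minimum
    maxVals = dataSet.max(axis=0)   # per-column maximum
    ranges = maxVals - minVals
    # newValue = (oldValue - min) / (max - min), applied to each column
    normDataSet = (dataSet - minVals) / ranges
    return normDataSet, ranges, minVals
```

Returning `ranges` and `minVals` along with the normalized matrix lets the caller later normalize new, unseen samples with exactly the same transformation.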
2.2.4 Testing the algorithm: Verifying the classifier as a complete program
We have processed the data as required in the previous section; now we will test the classifier's effectiveness. If the classifier's accuracy meets the requirements, Helen can use the software to process the list of candidates provided by the dating site. One very important task in machine learning is evaluating an algorithm's correctness. Usually we use only 90% of the available data as training samples to train the classifier, and use the remaining 10% to test it and measure its accuracy. Later chapters of this book describe some more advanced methods for accomplishing the same task; here we use the most basic approach. Note that the 10% of test data should be chosen randomly. Because the data Helen provides is not sorted for any particular purpose, we can simply take any 10% of it without affecting its randomness.
We mentioned earlier that the error rate can be used to measure a classifier's performance. For a classifier, the error rate is the number of misclassified results divided by the total number of test items; a perfect classifier has an error rate of 0, and a classifier with an error rate of 1.0 gives no correct classification results at all. In the code we define a counter variable; each time the classifier misclassifies an item, the counter is incremented by 1, and the counter's final value divided by the number of data points is the error rate.
To test the classifier, create the function datingClassTest in the knn.py file. This function is self-contained, so you can test the classifier at any time by calling it from the Python runtime environment. Enter the following program code in the knn.py file.
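The program code itself is not included in this excerpt. A self-contained sketch of the idea (hold out the first 10% of the rows as test data, classify each held-out row against the remaining 90%, and count errors) might look like this; the small nearest-neighbor helper stands in for the classify0 function from earlier in the chapter:

```python
import numpy as np

def classify0(inX, dataSet, labels, k):
    """Majority vote among the k training points nearest to inX."""
    distances = np.sqrt(((dataSet - np.asarray(inX)) ** 2).sum(axis=1))
    kLabels = [labels[i] for i in distances.argsort()[:k]]
    return max(set(kLabels), key=kLabels.count)

def datingClassTest(dataMat, labels, hoRatio=0.10, k=3):
    """Return the error rate on a held-out fraction hoRatio of the data."""
    m = dataMat.shape[0]
    numTest = int(m * hoRatio)
    errorCount = 0
    for i in range(numTest):
        # Rows 0..numTest-1 are the test set; the remaining rows are training data
        result = classify0(dataMat[i], dataMat[numTest:], labels[numTest:], k)
        if result != labels[i]:
            errorCount += 1
    # Error rate = misclassifications / total number of test items
    return errorCount / float(numTest)
```

Holding out the *first* 10% of the rows is acceptable here only because, as noted above, the data is not sorted for any particular purpose.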
2.2.5 using algorithms: Building a complete and usable system
We have now tested the classifier on the data, and we can use it to classify people for Helen. We will give Helen a small program: Helen finds someone on the dating site and enters that person's information, and the program outputs a prediction of how much she will like that person.
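The program itself is not shown in this excerpt. As a sketch, such a helper would normalize the new person's features using the training set's minimums and ranges and then run the classifier; the function name, the three "liking" levels, and the 1-3 label scheme below are all illustrative:

```python
import numpy as np

LIKE_LEVELS = ["not at all", "in small doses", "in large doses"]  # illustrative labels

def classify_person(features, train_data, train_labels, k=3):
    """Predict how much Helen will like a person described by `features`."""
    min_vals = train_data.min(axis=0)
    ranges = train_data.max(axis=0) - min_vals
    # Normalize the training set and the new input with the same min/range
    norm_train = (train_data - min_vals) / ranges
    norm_in = (np.asarray(features, dtype=float) - min_vals) / ranges
    # k-nearest-neighbor majority vote
    distances = np.sqrt(((norm_train - norm_in) ** 2).sum(axis=1))
    k_labels = [train_labels[i] for i in distances.argsort()[:k]]
    label = max(set(k_labels), key=k_labels.count)
    return LIKE_LEVELS[label - 1]
```

Normalizing the new input with the training set's own min and range is important; otherwise the new point would sit on a different scale than the training data.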
2.3 Example: Handwriting recognition system
In this section we construct, step by step, a handwriting recognition system that uses a k-nearest neighbor classifier. For simplicity, the system built here recognizes only the digits 0 to 9; see Figure 2.6. The digits to be recognized have already been processed with graphics software into images of the same color and size: black-and-white images 32 pixels wide by 32 pixels high. Although storing images in text format does not use memory efficiently, we convert the images to text format for ease of understanding.
2.3.1 Preparing data: Converting an image to a test vector
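The conversion routine is not shown in this excerpt. As a sketch, assuming each text image is 32 lines of 32 '0'/'1' characters (as described above), it can be flattened into a 1x1024 NumPy vector suitable for the classifier:

```python
import numpy as np

def img2vector(filename):
    """Flatten a 32x32 text image of '0'/'1' characters into a 1x1024 vector."""
    returnVect = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            line = fr.readline()
            for j in range(32):
                # Row i, column j of the image maps to position 32*i + j
                returnVect[0, 32 * i + j] = int(line[j])
    return returnVect
```

Once every image is a 1024-element vector, the same distance-based classifier used for the dating data can be applied unchanged.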
2.3.2 Testing the algorithm: Using the k-nearest neighbor algorithm to recognize handwritten digits
2.4 Chapter summary
The k-nearest neighbor algorithm is the simplest and most effective algorithm for classifying data. This chapter showed how to construct classifiers with the k-nearest neighbor algorithm through two examples.
The k-nearest neighbor algorithm is an instance-based learning method; to use it we must have training sample data close to the actual data. The algorithm must store the entire data set, and if the training set is large this requires a large amount of storage space. In addition, because a distance must be computed to every item in the data set, the algorithm can be very time-consuming in practice.
Another drawback of the k-nearest neighbor algorithm is that it gives no information about the underlying structure of the data, so we cannot know what an average or typical instance sample looks like. In the next chapter we will use probability-based measurements to handle the classification problem; that algorithm solves this issue.
Chapter 2: The k-Nearest Neighbor Algorithm