A conceptual atlas of machine learning
II. What is machine learning
Machine learning is a recently popular field; its basic definitions can easily be found on Baidu Baike, Wikipedia, or elsewhere online, so they are not repeated here.
There are two general modes of solving a problem:
One is called model driven: by studying the physical or chemical mechanism of the object, we build a model of it and use that model to solve the problem. Newton's three laws are a familiar example. For a relationship y = F(x), we know the input x and the mechanism model F(), and we solve for the output y we want.
The other is called data driven. As the problems people face become more and more complex, the cost of finding a mechanism model keeps growing, while the cost of acquiring data keeps shrinking. Researchers therefore began to look at the problem from another angle: can the data itself be analyzed to get what I want? That is, I know some samples (x, y), or I only know x, and I want to analyze them to obtain the object's model F(), so that the next time I am given an x I can get the y I want. Speaking loosely, any such method of analyzing data can be counted as machine learning.
So the basic elements a machine learning system should normally include are: training data, a model with parameters, a loss function, and a training algorithm. The training data needs no further explanation. The model with parameters is used to approximate F(x). The loss function is an index for evaluating the model, for example the classification accuracy. The training algorithm, also called the optimization function, is used to continuously update the parameters of the model so as to minimize the loss function and obtain a better model, or learning machine.
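As a concrete, purely illustrative sketch of these four elements, the snippet below trains a one-parameter linear model with plain gradient descent; the data, the model form, and the learning rate are assumptions made for the example, not anything prescribed above.

```python
import numpy as np

# 1. Training data: samples (x, y) drawn from y ≈ 2x plus a little noise
x = np.linspace(0, 1, 50)
y = 2.0 * x + 0.05 * np.random.randn(50)

# 2. Model with one trainable parameter w, used to approximate F(x)
def model(x, w):
    return w * x

# 3. Loss function: mean squared error between predictions and labels
def loss(w):
    return np.mean((model(x, w) - y) ** 2)

# 4. Training algorithm: gradient descent keeps updating w to shrink the loss
w, lr = 0.0, 0.5
for _ in range(200):
    grad = np.mean(2.0 * (model(x, w) - y) * x)   # d(loss)/dw
    w -= lr * grad

print(f"learned w = {w:.3f}, final loss = {loss(w):.5f}")
```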
III. Sample data
The sample data is the (x, y) mentioned above, where x is called the input data and y is called the output data, or by its more professional name, the label. Usually x and y are high-dimensional matrices. Taking x as an example:
x = (x_1, x_2, x_3, ..., x_I)
where x_i represents the i-th input sample, such as the i-th text or the i-th picture. x_i can be a one-dimensional text vector, a two-dimensional image matrix, a three-dimensional video tensor, or an even higher-dimensional data type. Taking a one-dimensional vector as an example:
x_i = (x_1^i, x_2^i, x_3^i, ..., x_n^i)
where x_n^i represents the value of the n-th element of sample x_i, for example the grayscale value of the n-th pixel after flattening an image.
The label y takes different forms according to the task. Taking the simplest n-class classification problem as an example, y_i is an n-dimensional one-hot vector: one element is 1, the rest are 0, and the position of the 1 indicates which category the sample belongs to.
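For instance, in a hypothetical 5-class problem the label for the third class could be built like this (a small numpy sketch; the helper name one_hot is made up for the example):

```python
import numpy as np

def one_hot(label, num_classes):
    """Return an n-dimensional vector with a 1 at position `label` and 0 elsewhere."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

print(one_hot(2, 5))   # class index 2 of 5 classes -> [0. 0. 1. 0. 0.]
```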
IV. Datasets
The complete dataset is written as T = {(x_1, y_1), (x_2, y_2), (x_3, y_3), ..., (x_I, y_I)}. For a learning machine, not all of this data is used to train the model; it is divided into three parts: training data, cross-validation data, and test data. Training data: as the name suggests, the training data is used to train the learning model, and usually takes no less than half of the total data. Cross-validation data: cross-validation data is used to measure the quality of the model during training; most machine learning models are not obtained analytically but are optimized slowly, iteration by iteration, so the cross-validation data can be used to monitor how performance changes while the model trains. Test data: after the model has been trained, the test data is used to measure the performance of the final model, and this is the only measure that counts. The cross-validation score can only be used to monitor and assist training, not to represent how good the model is; even if cross-validation accuracy is 100%, if accuracy on the test data is only 10% the model cannot be accepted. The data left after the training split is usually divided evenly between the cross-validation and test sets.
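One common way to realize such a three-way split, assuming scikit-learn is available (the 60/20/20 proportions are only an example consistent with the rules of thumb above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)        # toy inputs
y = np.random.randint(0, 2, 1000)   # toy labels

# First carve off the training set (60%), then split the remainder evenly
# between cross-validation and test data.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_cv), len(X_test))   # 600 200 200
```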
V. Features
Feature is a special term in machine learning and pattern recognition. In traditional machine learning algorithms, the data dimension cannot be too high because of limits on computing performance and on the number of parameters. A single photo on a phone is several megabytes and may contain millions of pixels, so such high-dimensional data cannot be fed to the learning machine directly; instead, we extract a feature vector suited to the specific application. The feature vector serves two main purposes. Reducing the data dimension: by extracting a feature vector, the dimensionality of the original data is greatly reduced and the model's parameters are simplified. Improving model performance: a good feature extracts in advance the most critical part of the original data, and therefore improves the performance of the learning machine.
In the traditional machine learning field, how to extract good features is what everyone cares about most, so a large part of machine learning research became the search for good features, and a discipline called feature engineering was born. A classic example is pedestrian detection with HOG features: HOG mainly captures the contour information of objects, so it can be used for pedestrian detection.
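A hedged sketch of extracting HOG features with scikit-image, assuming that library is installed; the 128x64 "image" here is just a random placeholder standing in for a pedestrian window:

```python
import numpy as np
from skimage.feature import hog

# Stand-in for a 128x64 grayscale pedestrian window
image = np.random.rand(128, 64)

# HOG summarizes local gradient (contour) orientations into a feature
# vector that is much shorter than the raw pixel array.
features = hog(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

print(image.size, "->", features.shape[0])   # 8192 raw pixels -> 3780 HOG features
```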
VI. Model
The word "model" here may not be precisely chosen, but what I want to express is this: a set of trainable parameters used to approximate the F(x) mentioned above. In parameter space, F(x) is just a point, and my model is also a point; because the parameters can be changed, all I have to do is move my model's point as close as possible to the point of the real F(x). There are many model families in machine learning, but the most commonly used can be summed up as three. Network-based models: the most typical is the neural network; the model has several layers, each layer has several nodes, every pair of connected nodes has a changeable parameter, and through a large number of nonlinear neurons a neural network can approximate almost any function. Kernel-based models: typically the SVM and the Gaussian process. An SVM maps the input vectors through a kernel into a high-dimensional space and then finds hyperplanes that divide the data into categories; the kernel of the SVM can be adjusted. Statistical learning models: the simplest example is the Bayesian learning machine. Statistical methods use mathematical statistics to train the machine; the parameters of the model are usually statistical quantities such as means and variances, and the final goal is to maximize the expected probability of predicting correctly.
A good learning machine model should approximate well, be easy to implement in code, and have parameters that are easy to train.
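The three families can be tried side by side on a toy dataset with scikit-learn (a sketch under the assumption that scikit-learn is available; the hyperparameters are defaults, not tuned choices):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier   # network-based model
from sklearn.svm import SVC                        # kernel-based model
from sklearn.naive_bayes import GaussianNB         # statistical learning model

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

for model in (MLPClassifier(max_iter=1000, random_state=0),
              SVC(kernel="rbf"),
              GaussianNB()):
    model.fit(X, y)
    print(type(model).__name__, "training accuracy:", model.score(X, y))
```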
VII. Supervised learning and unsupervised learning
According to the task, learning machines can be divided into supervised learning and unsupervised learning. From a mathematical point of view, the difference is that in the former the label y of the samples is known, while in the latter it is not, so unsupervised learning is somewhat harder.
For example, a mother teaches her child to recognize digits. When the mother picks up a digit card and tells the child "this is the number 4, that is the number 6", and after a lot of such teaching the child can say which digit is on a new card, that is supervised learning. If the mother instead hands the child a pile of digit cards and asks the child to stack the cards by digit, only telling the child whether the result is good, then after a lot of practice the child learns how to stack the cards properly; that is an example of unsupervised learning. Put somewhat loosely, supervised learning can be regarded as a classification problem and unsupervised learning as a clustering problem.
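The classification-versus-clustering analogy in code form (an illustrative scikit-learn sketch on synthetic data; nothing here comes from the mother-and-child example itself):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Supervised: labels y are given, and the model learns to predict them.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Unsupervised: only X is given, and the model groups similar samples itself.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("supervised predictions:", clf.predict(X[:5]))
print("unsupervised cluster ids:", km.labels_[:5])
```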
Of course, there are two special types, called semi-supervised learning and reinforcement learning. Semi-supervised learning means that labels are known for some samples but not for others. Reinforcement learning is another special case; to avoid confusion it is not explained here, and I will introduce it in a separate blog post for those who are interested.
Supervised learning is simple and efficient, but unsupervised learning is more useful, because manually tagging sample labels is expensive and time-consuming.
VIII. Loss function
Strictly speaking, the loss function should be called the objective function, because one of the objective functions in statistical learning is to maximize the expected probability of predicting correctly; here we only consider the common loss functions.
The loss function is an important index of how well the model approximates the target: the larger its value, the larger the model's prediction error, so what we need to do is update the parameters of the model to minimize the value of the loss function. There are many commonly used loss functions; the simplest is the 0-1 loss function:
L(Y, f(x)) = 0 if Y = f(x), and 1 if Y ≠ f(x)
This loss function is very easy to understand: the loss is 0 when the prediction is right and 1 when it is wrong, so for a perfect learning machine the loss value would be 0. Of course, the least-squares error, the cross-entropy error, and other loss functions are also very common. The loss used during training is the total (or average) loss over all training samples. With a loss function defined, training the model becomes a typical optimization problem.
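The 0-1 loss and the cross-entropy loss can be written down in a few lines of numpy (a sketch; the function names and the small epsilon guard are choices made for the example):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0 when the prediction is right, 1 when it is wrong, averaged over samples."""
    return np.mean(y_true != y_pred)

def cross_entropy_loss(y_true_one_hot, y_prob):
    """Cross-entropy between one-hot labels and predicted class probabilities."""
    eps = 1e-12   # avoid log(0)
    return -np.mean(np.sum(y_true_one_hot * np.log(y_prob + eps), axis=1))

y_true = np.array([0, 1, 2])
y_pred = np.array([0, 2, 2])
print(zero_one_loss(y_true, y_pred))   # 1/3: one of three predictions is wrong
```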
IX. Optimization function
We now have the model and the loss (objective) function; next we need something that keeps updating the model parameters according to the loss value, and this is the optimization function. Its job is to find, in parameter space, the solution that minimizes the loss function. Gradient descent is the best-known optimization function, and it is usually described with the image of walking down a mountain: if we are standing on a mountainside and our goal is to reach the lowest point (to minimize the loss function), a very simple idea is to find the steepest downhill direction at the current position and take a step in that direction, as shown in the figure below.
Of course this method has a problem: it can fall into a local optimum (a local pit) and fail to get out, so various better optimization functions have gradually been developed. A good optimization function should have two properties: the ability to jump out of local optima and find the global optimum, and a fast convergence rate.
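A small numpy illustration of "walking downhill" and of getting stuck in a local pit; the one-dimensional loss surface is artificial and chosen only so that it has a local minimum near w ≈ -1.4 and a deeper global minimum near w ≈ 1.7:

```python
# Artificial loss surface with two minima (values chosen purely for illustration)
loss = lambda w: 0.1 * w**4 - 0.5 * w**2 - 0.3 * w
grad = lambda w: 0.4 * w**3 - 1.0 * w - 0.3   # derivative of the loss

def gradient_descent(w, lr=0.05, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)   # step in the steepest downhill direction
    return w

print("start at -2.0 ->", round(gradient_descent(-2.0), 3))   # ends in the local pit near -1.4
print("start at  2.0 ->", round(gradient_descent(2.0), 3))    # ends near the global minimum ~1.7
```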
X. Generalization ability, underfitting and overfitting
Generalization ability refers to a model's ability to predict unknown data, which is the essential purpose of a learning method. The most common approach is to evaluate generalization ability through the test error, but this evaluation depends on a finite test data set, so it is not completely reliable; for this reason some researchers study the generalization error itself in order to characterize generalization ability better.
Underfitting and overfitting are two phenomena to be avoided as far as possible during model training; both mean that the model fails to reach a satisfactory generalization ability. Underfitting occurs when the model's complexity is too low, so it cannot express the data well and its predictions are poor on both the training samples and the test samples. Overfitting is the opposite: the model's complexity is too high, so it predicts the training samples very well but the test samples poorly, and in the end the generalization ability is still not good.
As shown in the figure below, panels 1 and 4 show underfitting and panels 3 and 6 show overfitting, while a good model should look like panels 2 and 5: the complexity is appropriate and the generalization ability is strong.
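The same phenomena can also be seen numerically by fitting polynomials of different degrees to noisy data (a numpy sketch; degrees 1, 3, and 10 are arbitrary stand-ins for "too simple", "about right", and "too complex"):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(20)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train error {train_err:.3f}, test error {test_err:.3f}")
# Typically degree 1 underfits (both errors large), degree 10 overfits
# (tiny training error, larger test error), and degree 3 generalizes best.
```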
XI. Bias, error and variance
Bias, error, and variance are concepts that are easy to confuse. First, the relationship:
Error² = bias² + variance
The error reflects the accuracy of the whole model. The bias reflects the gap between the model's output and the true value, that is, the accuracy of the model itself. The variance reflects the gap between each model output and the expectation of the model's outputs, that is, the stability of the model. As shown in the figure below, as the complexity of the model increases, the bias of the model's predictions becomes smaller, but the variance becomes larger and the predicted results become more scattered.
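A rough simulation of this trade-off (an illustrative numpy sketch, not a derivation of the formula above): fit a too-simple and a too-flexible polynomial to many fresh noisy training sets and compare the squared bias and the variance of their predictions at one test point.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)
x0 = 0.3                       # a single test input
preds = {1: [], 7: []}         # degree 1 = simple model, degree 7 = complex model

for _ in range(500):           # many independent training sets
    x = rng.uniform(0, 1, 30)
    y = true_f(x) + 0.3 * rng.standard_normal(30)
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(coeffs, x0))

for degree, p in preds.items():
    p = np.array(p)
    bias2 = (p.mean() - true_f(x0)) ** 2   # systematic gap between average prediction and truth
    variance = p.var()                     # spread of predictions across training sets
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```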
XII. Machine learning and deep learning
At present, deep learning usually refers to learning networks based on deep neural networks. Compared with traditional neural networks, a deep learning network has much higher model complexity, so the raw data can be fed into the learning machine directly without manually extracted features. Therefore, leaving the mathematics aside, the most essential difference between traditional machine learning and deep learning is that deep learning can train highly complex models, so features no longer need to be extracted by hand; roughly speaking,
Deep learning ≈ feature extraction + traditional machine learning method
XIII. Precision and Recall
Precision and recall are two measures widely used in information retrieval and statistical classification to evaluate the quality of results. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved, and measures how precise the retrieval system is; recall is the ratio of the number of relevant documents retrieved to the number of relevant documents in the document library, and measures how complete the retrieval is.
Put simply, precision asks how many of the retrieved items (documents, web pages, and so on) are correct; recall asks how many of all the correct items were retrieved.
Precision, recall, and the F value are important evaluation indices for selecting targets out of a mixed environment. First look at the definitions of these indices:
Precision = number of correct entries extracted / total number of entries extracted
Recall = number of correct entries extracted / number of correct entries in the sample
Both values lie between 0 and 1; the closer a value is to 1, the higher the precision or recall.
F = 2 * precision * recall / (precision + recall) (the F value is the harmonic mean of precision and recall)
Take this example: a pond contains 1400 carp, 300 shrimp, and 300 turtles. The goal is to catch carp. A large net brings up 700 carp, 200 shrimp, and 100 turtles. The indices are then:
Precision = 700 / (700 + 200 + 100) = 70%
Recall = 700 / 1400 = 50%
F value = 70% * 50% * 2 / (70% + 50%) = 58.3%
Now see what happens if all the carp, shrimp, and turtles in the pond are caught:
Precision = 1400 / (1400 + 300 + 300) = 70%
Recall = 1400 / 1400 = 100%
F value = 70% * 100% * 2 / (70% + 100%) = 82.35%
So precision is the proportion of target items among everything that was caught; recall, as the name suggests, is the proportion of the target category that was retrieved from the region of interest; and the F value combines the two indices into one overall figure.
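The carp numbers above can be checked with a few lines of Python (the helper function is written just for this example):

```python
def precision_recall_f(true_positives, retrieved, relevant):
    precision = true_positives / retrieved
    recall = true_positives / relevant
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value

# Net 1: 700 carp among 1000 items netted, 1400 carp in the pond
print(precision_recall_f(700, 700 + 200 + 100, 1400))    # (0.7, 0.5, 0.583...)
# Net 2: the whole pond is caught
print(precision_recall_f(1400, 1400 + 300 + 300, 1400))  # (0.7, 1.0, 0.8235...)
```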
Of course we want precision to be as high as possible and recall to be as high as possible too, but in some cases the two are contradictory. In an extreme case, if we retrieve only one result and it is correct, precision is 100% but recall is very low; if we return everything, recall is 100% but precision will be very low. Therefore, in different situations you have to judge for yourself whether you want higher precision or higher recall. If you are doing experimental research, you can draw a precision-recall curve to help with the analysis.
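If scikit-learn is available, a precision-recall curve can be computed from predicted scores roughly like this (an illustrative sketch on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]           # probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_test, scores)
for p, r in list(zip(precision, recall))[::20]:    # a few points along the trade-off
    print(f"precision {p:.2f}  recall {r:.2f}")
```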