Dr. Hangyuan Li's "Talking About My Understanding of Machine Learning" (machine learning and natural language processing)
Date: 2015-01-14
Source: Sina Weibo, Hangyuan Li
Counting the time, I have been working on machine learning algorithms for nearly eight months now. Although I have not reached the point of mastery, at least in the process of becoming familiar with the algorithms, my ability to choose and build them has grown a little. To tell the truth, machine learning is very, very hard. To fully understand an algorithm's process, characteristics, and implementation, and to choose the right method for the right data and tune it to get the best result — I think that is impossible without eight or ten years of hard work. In fact, every problem in the field of artificial intelligence, including pattern recognition, machine learning, search, planning, and so on, could stand as a discipline of its own. I don't think anyone can master every aspect of AI, but if you can master any one of these directions, at least within today's cutting-edge areas, that is no small achievement.
This post serves as my academic summary for 2014 and lays out my current understanding of machine learning. I hope readers will criticize freely and exchange ideas with me!
Machine learning, in my opinion, is the process of making machines imitate how people learn. The goal of machine learning is to let the machine learn "how people recognize things": we want the machine to learn from things the same way people learn from them, and that is the process of machine learning. There is a classic question in machine learning:
"Suppose there is a color-rich oil painting in which a dense forest is drawn, and a monkey sits on a tree and eats something in a crooked neck tree far away from the forest." If we let a person find the position of the monkeys, it is normal to point out the monkeys in less than a second, and even some people can see the monkey at first sight. ”
So the question is: why can a person recognize the monkey in an image mixed with thousands of colors? In our daily lives things are everywhere; how do we identify so many different kinds of content? Perhaps you have already thought of the answer: experience. Yes, experience. Empiricism tells us that everything we know is acquired through learning. For example, when someone mentions a monkey, the various monkeys we have seen immediately appear in our minds; as long as the features of the monkey in the picture match the monkeys in our memory, we can identify that there is a monkey in the picture. In the extreme case, when the features of the monkey in the picture exactly match what we know about a certain class of monkeys, we can even determine which kind of monkey it is.
The other situation is when we recognize things wrongly. In fact, people's error rate in recognizing things is sometimes quite high. For example, when we come across a word we do not know, we subconsciously read the part we do know. Take the Chinese idiom 如火如荼 ("in full swing"): haven't some friends, like me, once misread the last character 荼 (tú) as 茶 (chá, "tea")? The reason we make such mistakes is that, without having actually seen the word before, we subconsciously use experience to interpret the unknown.
Now that technology is so advanced, some clever people have considered whether a machine could imitate the human way of recognizing things and achieve machine recognition — and thus machine learning was born.
Fundamentally, recognition is the result of classification. When we see a four-legged creature, we may immediately classify it as an animal, because the four-legged, living things we usually see are, more than 90% of the time, animals. Here a question of probability is involved. We tend to recognize the things around us with high accuracy because the subconscious mind records almost all the features of the things our eyes see. For example, when we join a new group, at first we don't know anyone and sometimes cannot match faces to names; the main reason is that we have not yet grasped people's features and cannot classify the people around us by the features we already have. At that stage we often have this feeling: hey, aren't you Zhang San? Oh no, you must be Li Si. This is probability in classification: the result may be A, may be B, or may be something else entirely, mainly because our brain has not collected enough features to classify accurately. Once everyone is familiar with each other, we can tell who is who, and in extreme cases we can even recognize a person just by their voice, which shows that we have grasped that person's features very precisely.
So, I think, there are four basic steps in how people recognize things: learning, extracting features, recognizing, and classifying.
So can a machine imitate this process to achieve recognition? The answer is yes, but it is not that easy. There are three difficulties. First, the human brain has countless neurons exchanging and processing data, and current machines cannot match that processing capacity. Second, people extract the features of things subconsciously; when information is extracted unconsciously, the error is very large. Third, and most important, human experience comes from every moment of a person's life — that is, people are learning all the time. How can we make a machine learn on its own in every respect? Therefore, I think the main reason the field of artificial intelligence has not yet reached human-level ability is that machines have no subconscious. The human subconscious is not fully controlled by conscious thought, yet it improves our ability to recognize things. We cannot load a subconscious into a machine, because whatever we deliberately load is subjective, and the role of the human subconscious cannot be reproduced in a machine. So, given the current state of development, fully human-like intelligence will not arrive any time soon. Even so, machines whose "thinking" differs greatly from ours can still help in our lives. Our commonly used online translation, search systems, expert systems, and so on are all products of machine learning.
So, how is machine learning realized?
On the whole, machine learning imitates the way people recognize things: learning, extracting features, recognizing, and classifying. Because a machine cannot, like a human mind, naturally choose a classification method according to the characteristics of the data, the choice of machine learning method still has to be made manually. At present there are three main families of machine learning methods: supervised learning, semi-supervised learning, and unsupervised learning. Supervised learning adjusts the parameters of a classifier using a set of samples with known classes until it reaches the required performance; in plain words, it infers the unknown from the known. Representative methods include Naive Bayes, SVM, decision trees, KNN, neural networks, and logistic regression. Semi-supervised learning mainly considers how to train and classify using a small number of labeled samples together with a large number of unlabeled samples, that is, classifying a large amount of unknown content based on a small amount of known content. Representative methods include expectation maximization, generative models, and graph-based algorithms. Unsupervised learning adjusts the classifier using samples whose classes are unknown — that is, the machine learns on its own from unlabeled data. Representative methods include Apriori, FP-growth, k-means, and the currently popular deep learning. Of the three, unsupervised learning is the most "intelligent" and has the potential to give machines a kind of initiative, but its development is relatively slow. Supervised learning is not entirely reliable: to infer the unknown from the known, you would have to learn every possibility of a thing, which is impossible in reality — even people cannot do it. Semi-supervised learning is the middle way: since unsupervised learning is difficult and supervised learning is not fully reliable, take a compromise and use the strengths of each. The current situation is that supervised learning techniques are mature while unsupervised learning is still in its infancy, so using supervised methods to realize semi-supervised learning is the current mainstream. But these methods can only extract information, not make truly effective predictions (people figured: since we cannot get more, let us first look at what is in hand — and so data mining appeared). A small sketch of the supervised/unsupervised contrast follows.
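As a minimal sketch of that contrast (my own illustration, assuming scikit-learn is available; the dataset is just a convenient built-in one): the supervised classifier is given the known labels to fit, while the unsupervised method sees only the raw features and groups them itself.

```python
# Minimal sketch: supervised vs. unsupervised learning with scikit-learn (assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: the classifier is shown the known class labels y_train.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("supervised accuracy:", clf.score(X_test, y_test))

# Unsupervised: k-means sees only the features and invents its own groups.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
print("cluster assignments for some test points:", km.predict(X_test)[:10])
```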
There are very many machine learning methods, and they are quite mature. I will pick a few to talk about.
The first is SVM. Because I mostly do text processing, I am most familiar with SVM. SVM, the support vector machine, maps data as points into a multi-dimensional space, finds the optimal hyperplane that separates the classes, and then classifies according to that hyperplane. SVM generalizes well to data outside the training set: the generalization error rate is low, the computational cost is modest, and the results are easy to interpret, but it is quite sensitive to the tuning parameters and the choice of kernel function. Personally I feel SVM is the best method for binary classification, and it is essentially limited to binary classification; if you want to use SVM for multi-class problems, you can still do it by combining multiple binary classifiers in the feature space.
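A hedged sketch of that last point (scikit-learn assumed, not part of the original post): a binary SVC wrapped in a one-vs-rest scheme, so several binary classifiers together handle a multi-class problem.

```python
# Sketch: SVM is natively binary; multi-class is handled by combining
# several binary classifiers (one-vs-rest shown here). scikit-learn assumed.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the three wine classes gets its own binary SVM under the hood.
model = make_pipeline(StandardScaler(),
                      OneVsRestClassifier(SVC(kernel="rbf", C=1.0)))
model.fit(X_train, y_train)
print("multi-class accuracy from binary SVMs:", model.score(X_test, y_test))
```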
At the core of SVM is SMO, the sequential minimal optimization algorithm. SMO is essentially one of the fastest solvers for this quadratic programming problem; its core is to find the optimal parameters α, from which the separating hyperplane is computed. SMO decomposes the large optimization problem into a series of small optimization problems, which greatly simplifies the solution process.
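The following is only a rough sketch of that idea, written from the commonly taught "simplified SMO" (the second multiplier is chosen at random instead of by the full heuristics, and a linear kernel is assumed); the function name `simplified_smo` is my own, and this is not a production solver.

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5):
    """Simplified SMO sketch (linear kernel, y in {-1, +1}): solve the dual QP
    by repeatedly optimizing pairs of multipliers alpha_i, alpha_j."""
    m = X.shape[0]
    alphas, b = np.zeros(m), 0.0
    K = X @ X.T                                   # linear kernel matrix
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            Ei = (alphas * y) @ K[:, i] + b - y[i]        # prediction error on x_i
            if (y[i] * Ei < -tol and alphas[i] < C) or (y[i] * Ei > tol and alphas[i] > 0):
                j = np.random.choice([k for k in range(m) if k != i])
                Ej = (alphas * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alphas[i], alphas[j]
                if y[i] != y[j]:
                    L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue
                alphas[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alphas[j] - aj_old) < 1e-5:
                    continue
                alphas[i] = ai_old + y[i] * y[j] * (aj_old - alphas[j])
                b1 = b - Ei - y[i]*(alphas[i]-ai_old)*K[i, i] - y[j]*(alphas[j]-aj_old)*K[i, j]
                b2 = b - Ej - y[i]*(alphas[i]-ai_old)*K[i, j] - y[j]*(alphas[j]-aj_old)*K[j, j]
                if 0 < alphas[i] < C:
                    b = b1
                elif 0 < alphas[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alphas * y) @ X                          # recover the separating hyperplane
    return w, b
```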
Another important component of SVM is the kernel function. The main role of the kernel function is to map data from a low-dimensional space into a high-dimensional one. I will not go into the details here, because there is too much to cover; in short, the kernel function lets SVM handle nonlinear data very well, without having to compute the mapping explicitly.
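A small sketch of that point (mine, assuming scikit-learn): on data that is not linearly separable, a linear SVM struggles, while an RBF-kernel SVM separates it easily without us ever constructing the high-dimensional mapping ourselves.

```python
# Sketch: the kernel trick handles nonlinear data without an explicit mapping.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: impossible to split with a straight line in 2-D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))  # near chance
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))        # near 1.0
```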
The second is KNN. KNN compares the features of a test sample with the data in the training set and then takes the class label of the nearest neighbors in the sample set; in other words, KNN classifies by measuring the distance between feature vectors. The idea of KNN is simple: compute the distance between the test sample and the known samples. KNN is accurate, insensitive to outliers, and makes no assumptions about the input data; it is simple and effective. But its drawback is just as obvious: the computational complexity is too high. To classify a single sample you have to compute distances to all the data, which is a terrible thing in a big-data setting. Moreover, when the categories overlap in feature space, the classification accuracy of KNN is not very high. So KNN is suited to small amounts of data, where extreme accuracy is not required.
Two things strongly affect KNN's results: data normalization and the distance computation. If the data are not normalized, then when the ranges of different features differ greatly the final result will be heavily distorted. The second is the distance computation, which should be regarded as the core of KNN; the most common distance formula is the Euclidean distance, our usual way of measuring the distance between vectors.
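A minimal from-scratch sketch of those two points (my own, NumPy only): min-max normalization so that features with large ranges do not dominate, and a Euclidean-distance neighbor vote.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature to [0, 1] so no single feature dominates the distance."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins + 1e-12), mins, maxs

def knn_predict(X_train, y_train, x, k=5):
    """Classify one sample by majority vote of its k Euclidean nearest neighbors."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = y_train[np.argsort(dists)[:k]]            # labels of the k closest points
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# Toy usage: two features with very different ranges, hence the normalization.
X_train = np.array([[1.0, 200], [1.2, 220], [3.0, 900], [3.2, 950]])
y_train = np.array([0, 0, 1, 1])
X_norm, mins, maxs = min_max_normalize(X_train)
x_new = (np.array([1.1, 210]) - mins) / (maxs - mins + 1e-12)
print(knn_predict(X_norm, y_train, x_new, k=3))   # expected: class 0
```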
Personally, I feel KNN's greatest value is that it can keep computing as data arrives over time — when samples can only be acquired one by one as time passes, KNN shows its worth. As for its other strengths, yes, it has them, but many other methods have them too, and often do better.
The third is Naive Bayes, abbreviated NB (which in Chinese internet slang also means "awesome"). Why is it awesome? Because it is a classification method built on Bayesian probability. The Bayesian approach can be traced back hundreds of years; it has a deep probabilistic foundation and very high reliability. Why is it called "naive"? Because it rests on a given assumption: the attributes are mutually independent given the target value. For example, take the sentence "I like you": the assumption is that there is no connection between "I", "like", and "you". Think about it — that is almost impossible. Marx tells us that everything is connected, and the attributes of one and the same thing are connected all the more. Therefore, applying the NB algorithm naively is not very effective, and most implementations improve on it in some way to fit the needs of the data.
The NB algorithm is used a great deal in text classification, because a text's category depends mainly on keywords, and NB-based text classification centers on word frequencies. But because of the assumption mentioned above, the method does not work as well for Chinese, where context-dependent and ambiguous usage is everywhere; applied directly to English, the results are quite good. As for the core algorithm, the main idea is all in Bayes' theorem, so there is not much more to say.
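A hedged sketch of word-frequency-based NB text classification (scikit-learn assumed; the tiny corpus is invented purely for illustration):

```python
# Sketch: Naive Bayes text classification driven by word frequencies.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (made up for illustration): two categories, sports vs. finance.
docs = ["the team won the match", "great goal in the second half",
        "stocks fell sharply today", "the market rallied on earnings"]
labels = ["sports", "sports", "finance", "finance"]

# CountVectorizer turns each document into word counts; MultinomialNB
# classifies from those frequencies under the independence assumption.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["the striker scored a late goal"]))      # likely 'sports'
print(model.predict(["investors watched the stock market"]))  # likely 'finance'
```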
The fourth is regression. There are many kinds of regression — logistic regression, ridge regression, and so on — and they can be divided in many ways according to different needs. Here I mainly want to talk about logistic regression. Why? Because logistic regression is mainly used for classification rather than for numeric prediction. Regression means fitting a line to a set of data points; logistic regression means building a regression formula on existing data to form a classification boundary. Its computational cost is low, it is easy to understand and implement, most of the time goes into training, and once training is done classification is very fast; but it is prone to underfitting and its classification accuracy is not high. The main reason is that logistic regression is essentially a linear fit, and many real-world things are not linear. The regression approach itself has limitations: even with quadratic or cubic curve fitting, you can only fit a small portion of the data well, not most of it. Then why include it here at all? Because although regression is unsuitable in most cases, once it does fit, the effect is very good.
Logistic regression is in fact based on a curve, and a "line", being continuous, has one big problem: at a jump in the data it would need a "step", and it is hard for a line to turn suddenly. So for logistic regression we use the sigmoid function, which behaves like a smoothed version of the Heaviside step function, to represent that transition. Passing the data through the sigmoid gives the classification result.
To optimize the parameters of logistic regression, we need an optimization method called gradient ascent. Its core idea is that as long as we search along the gradient direction of the function, we will find the function's best parameters. However, this method has to traverse the whole data set every time it updates the regression coefficients, which is not ideal for big data, so an improved version, stochastic gradient ascent, is needed: it updates the regression coefficients using only one sample point at a time, which is far more efficient.
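A minimal NumPy sketch of the pieces above (my own; it follows the usual gradient-ascent formulation of logistic regression, not necessarily the author's exact code):

```python
import numpy as np

def sigmoid(z):
    """Smooth stand-in for the Heaviside step: squashes any value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(X, y, lr=0.01, n_iter=500):
    """Batch gradient ascent: every update scans the whole data set."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        error = y - sigmoid(X @ w)      # labels minus current predictions
        w += lr * X.T @ error           # step along the log-likelihood gradient
    return w

def stochastic_gradient_ascent(X, y, lr=0.01, n_iter=50):
    """Stochastic variant: each update uses only one (randomly ordered) sample."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        for i in np.random.permutation(m):
            error = y[i] - sigmoid(X[i] @ w)
            w += lr * error * X[i]
    return w

# Toy usage: a linearly separable 1-D problem with a bias column prepended.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
w = gradient_ascent(X, y)
print((sigmoid(X @ w) > 0.5).astype(int))   # expected: [0 0 1 1]
```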
The fifth is the decision tree. As far as I know, the decision tree is the simplest and also one of the most commonly used classification methods. A decision tree classifies data based on a tree structure; personally it feels to me like the B+ tree from data structures. A decision tree is a predictive model that represents a mapping from object attributes to object values. Decision trees have low computational complexity, produce output that is easy to understand, are insensitive to missing intermediate values, and can handle irrelevant features; they are better than KNN at revealing the intrinsic meaning of the data. Their drawbacks are that they are prone to overfitting and that construction is time-consuming. Another problem is that if you do not draw the tree structure, the details of the classification are hard to understand; therefore, generate the decision tree first, then draw it, and only then can you understand the classification process well.
The core of the decision tree is how to split. The branching of the tree is the foundation of the decision tree, and the standard way to decide it is information entropy. The concept of entropy is a headache and easily confuses people; simply put, it measures the complexity of the information — the more mixed the information, the higher the entropy. So the core of the decision tree is to split the data set by computing information entropy.
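A small sketch of that idea (mine, NumPy only): Shannon entropy of a label set, and choosing the split feature that yields the largest information gain.

```python
import numpy as np

def shannon_entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions: higher = more mixed."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split_feature(X, y):
    """Pick the (categorical) feature whose split gives the largest information gain."""
    base = shannon_entropy(y)
    best_gain, best_feature = -1.0, None
    for f in range(X.shape[1]):
        new_entropy = 0.0
        for v in np.unique(X[:, f]):
            subset = y[X[:, f] == v]
            new_entropy += len(subset) / len(y) * shannon_entropy(subset)
        gain = base - new_entropy          # entropy reduction from this split
        if gain > best_gain:
            best_gain, best_feature = gain, f
    return best_feature, best_gain

# Toy usage: feature 0 perfectly predicts the label, feature 1 does not.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
print(best_split_feature(X, y))   # expected: (0, 1.0)
```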
I also have to mention a more special kind of classification method: AdaBoost. AdaBoost is the representative classifier of the boosting family. Boosting is a meta-algorithm (an ensemble algorithm): it takes the results of other methods as its input — in other words, a way of combining other algorithms. Bluntly put, it trains a classifier on the data set many times over; in each round the weight of correctly classified samples is lowered and the weight of misclassified samples is increased, and the process iterates until the required performance is reached. AdaBoost has a low generalization error rate, is easy to implement, can be applied on top of most classifiers, and needs little parameter tuning, but it is sensitive to outliers. It is not an independent method; it must be built on base learners in order to improve their performance. Personally, I think the saying "AdaBoost is the best classification method" is wrong; it should be "AdaBoost is a good way of optimizing other classifiers."
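A hedged scikit-learn sketch of boosting on top of a weak base learner (depth-1 decision stumps, the classic AdaBoost setup); the dataset is just a convenient built-in one:

```python
# Sketch: AdaBoost built on top of a weak base learner (a depth-1 decision stump).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)   # the weak learner being boosted
# Note: the keyword is `estimator` in recent scikit-learn versions
# (older releases call it `base_estimator`).
boosted = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)
boosted.fit(X_train, y_train)

print("single stump accuracy:", stump.fit(X_train, y_train).score(X_test, y_test))
print("boosted accuracy:", boosted.score(X_test, y_test))
```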
Well, I have said a lot and I am a little dizzy; there are a few more methods I will write about in the coming days. In general, the machine learning approach is to use existing data as experience to let the machine learn, and then use what it has learned to guide later decisions. For now, with the help of distributed processing and cloud technology, classifying big data is feasible, and once training succeeds the classification efficiency is considerable — much as people see problems more accurately the older they get. In these eight months, from the first understanding to implementing things step by step, from the logic of a requirement to choosing a method for its implementation, every day has been hard work, but every day has also been tense and exciting. Every day I wonder what kind of classification I can achieve by learning this; honestly, just thinking about it is exciting. The main reason I walked away from being a programmer was that I did not like doing what I already knew how to do, because there was nothing to look forward to in that work. Now I can use data analysis to get results I could not have imagined, which satisfies my curiosity and also makes me happy in my work. Perhaps I am still far from what society needs technically, but I am full of confidence: I do not feel bored, I do not feel hesitant — a little powerless at times, but firm in attitude.
"Reprint" Dr. Hangyuan Li's "Talking about my understanding of machine learning" machine learning and natural language processing