Dr. Hangyuan Li: On my understanding of machine learning

Source: Internet
Author: User
Tags: svm

Original: http://www.itongji.cn/article/06294DH015.html

There are many machine learning methods, and most of them are quite mature. I'll pick a few to talk about.

The first is SVM. Since I mostly do text processing, SVM is the method I know best. SVM, the support vector machine, maps data as points into a high-dimensional space and then finds the optimal separating hyperplane, classifying new data according to that hyperplane. SVM predicts data outside the training set well: the generalization error rate is low, the computational cost is modest, and the results are easy to interpret, but it is sensitive to parameter tuning and to the choice of kernel function. My personal feeling is that SVM is the best method for binary classification, but it is also limited to binary classification; if you want to use SVM for multiclass problems, you can combine multiple binary classifiers in the vector space.
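To make this concrete, here is a minimal sketch using scikit-learn's SVC on synthetic data; the dataset, the RBF kernel, and the one-vs-rest wrapper are my own illustrative choices, not something from the original article:

```python
# Hedged sketch: toy data and parameter choices are illustrative only.
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Binary classification: SVM's home turf.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = SVC(kernel="rbf", C=1.0)   # C and the kernel are the sensitive knobs
clf.fit(X, y)
print(clf.predict(X[:5]))

# Multiclass via multiple binary classifiers (one-vs-rest):
X3, y3 = make_blobs(n_samples=300, centers=3, random_state=0)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X3, y3)
print(ovr.predict(X3[:5]))
```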

At SVM's core is SMO, the sequential minimal optimization algorithm. SMO is essentially the fastest quadratic-programming optimizer for SVM training; its job is to find the optimal Lagrange multipliers α, from which the separating hyperplane is then computed. SMO decomposes the large optimization problem into a series of small ones, optimizing two multipliers at a time, which greatly simplifies the solution process.
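For a feel of the mechanics, here is a toy version of Platt's *simplified* SMO with a linear kernel, written from the standard textbook description; the variable names, tolerances, and the random choice of the second multiplier are my own simplifications (a real solver picks the pair heuristically):

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5):
    """Toy simplified SMO for a linear-kernel SVM; y must be in {-1, +1}."""
    n = X.shape[0]
    alpha = np.zeros(n)
    b = 0.0
    K = X @ X.T                                   # linear kernel matrix
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]   # error on sample i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(n) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                if y[i] != y[j]:                    # box constraints for the pair
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                b1 = b - Ei - y[i]*(alpha[i]-ai_old)*K[i, i] - y[j]*(alpha[j]-aj_old)*K[i, j]
                b2 = b - Ej - y[i]*(alpha[i]-ai_old)*K[i, j] - y[j]*(alpha[j]-aj_old)*K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Toy usage on two separable clusters:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
alpha, b = simplified_smo(X, y)
w = (alpha * y) @ X                      # recover the hyperplane normal
print(np.sign(X @ w + b)[:5], y[:5])     # predictions on the first few points
```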

The other important component of SVM is the kernel function, whose main job is to map data from a low-dimensional space into a high-dimensional one. I won't go into the details, because there is far too much to cover; in short, the kernel function handles nonlinear data very well, without ever having to compute the mapping explicitly.
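A tiny numerical check makes the "no explicit mapping" point concrete. For 2-D vectors, the kernel k(x, z) = (x · z)² equals an ordinary dot product in a 3-D feature space; this sketch (my own illustration) verifies that both routes give the same number:

```python
import numpy as np

def phi(v):
    # Explicit feature map for k(x, z) = (x . z)**2 in 2-D:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2        # kernel value, computed in the low-dim space
rhs = phi(x) @ phi(z)     # same value via the explicit high-dim mapping
print(lhs, rhs)           # both print 1.0
```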

The second is KNN. KNN compares the features of a test sample against the training data and takes the classification label of the nearest neighbors in the sample set; in other words, KNN classifies by measuring the distances between feature vectors. The idea is simple: compute the distance from the test point to the training samples. KNN offers high precision, is insensitive to outliers, makes no assumptions about the input data, and is simple and effective, but its drawback is equally obvious: the computational complexity is too high. To classify a single point you must compute distances to all of the data, a terrible prospect in a big-data setting. Furthermore, KNN's accuracy suffers when the classes overlap. KNN is therefore best suited to small data sets where extreme accuracy is not required.
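The whole idea fits in a few lines of NumPy (function and variable names are mine). Note that each call scans the entire training set, which is exactly the complexity problem just mentioned:

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Label x by majority vote among its k nearest training samples."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distance to every sample
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(np.array([0.2, 0.1]), X_train, y_train))  # -> "B"
```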

Two things strongly affect KNN's results: data normalization and the distance computation. If the data is not normalized, the final result is badly skewed whenever the ranges of different features differ greatly. The distance computation is the core of KNN; the most common formula is the Euclidean distance, our usual notion of vector distance.
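A common fix is min-max normalization, which rescales every feature to [0, 1] before distances are computed; a minimal sketch, assuming numeric feature columns with distinct minimum and maximum:

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each feature column to [0, 1] so that features with a
    wide range don't dominate the Euclidean distance."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)      # assumes every column has hi > lo

# A feature in the thousands would swamp one in [0, 1] without this step:
X = np.array([[1000.0, 0.5], [2000.0, 0.1], [1500.0, 0.9]])
print(min_max_normalize(X))
```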

My personal feeling is that KNN's greatest value is in settings where samples can only be acquired one by one over time: since there is no training phase, new samples can simply be added as they arrive, and that is where KNN shines. As for its other capabilities, yes, it can do them, but so can many other methods.

The third is naive Bayes, abbreviated NB (which in Chinese slang also reads as "awesome", and fittingly so, because it is a classification method grounded in Bayesian probability). The Bayesian approach can be traced back hundreds of years; it has a deep probabilistic foundation and very high reliability. Why is it called "naive"? Because it rests on a given assumption: once the target value is given, the attributes are independent of one another. For example, for the sentence "I like you", it assumes there is no connection between "I", "like", and "you". Think about it: that is almost never true. Marx tells us that things are interconnected, and the attributes of the same thing are connected more strongly still. So using NB naively is not very effective, and most practical uses of the method add some improvement to fit the data.

NB is used a great deal in text classification, because a text's category depends mainly on keywords, and text classification builds NB around word frequencies. But because of the assumption mentioned above, the method does not work as well for Chinese, where context-dependent phrasing is too common; applied directly to English, the results are good. As for the core algorithm, the main ideas are all in Bayes' rule, so there is not much more to say.
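To make the "word frequencies plus Bayes' rule" idea concrete, here is a toy multinomial naive Bayes with Laplace smoothing; the corpus, names, and smoothing constant are invented for illustration:

```python
import numpy as np
from collections import Counter

def train_nb(docs, labels):
    """Toy multinomial naive Bayes with Laplace (+1) smoothing.

    docs: list of token lists; labels: list of class names.
    Under the 'naive' assumption, P(doc | c) factorizes, e.g.
    P("I like you" | c) = P(I|c) * P(like|c) * P(you|c).
    """
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    total = {c: sum(word_counts[c].values()) for c in classes}

    def log_prob(word, c):   # smoothing keeps unseen words from zeroing the product
        return np.log((word_counts[c][word] + 1) / (total[c] + len(vocab)))

    def classify(doc):       # work in log space to avoid underflow
        scores = {c: np.log(prior[c]) + sum(log_prob(w, c) for w in doc)
                  for c in classes}
        return max(scores, key=scores.get)

    return classify

classify = train_nb([["good", "great"], ["bad", "awful"], ["good", "fun"]],
                    ["pos", "neg", "pos"])
print(classify(["good", "awful"]))   # -> "pos"
```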

The fourth is regression. There are many kinds of regression: logistic regression, ridge regression, and so on, divided according to need. Here I will mainly talk about logistic regression. Why? Because logistic regression is mainly used for classification rather than prediction. Regression means fitting a straight line to a set of data points; logistic regression means building a regression formula from existing data to draw a classification boundary. The computational cost is low, it is easy to understand and implement, and most of the time goes into training; once training is done, classification is very fast. But it underfits easily and its classification accuracy is not high, mainly because logistic regression is essentially a linear fit, and many things in reality are not linear. The regression approach has inherent limitations: even quadratic or cubic curve fitting captures only a small part of the data and cannot fit most of it. So why include it here? Because although regression is unsuitable most of the time, once it does fit, the results are very good.

Logistic regression is really based on a curve, and a "line", as a continuous representation, has a big problem: jumpy data produces a "step" phenomenon, and a line has a hard time turning suddenly. So logistic regression uses the sigmoid function, a smooth stand-in for the Heaviside step function, to represent the transition; passing the regression output through the sigmoid yields the classification result.
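The sigmoid itself is one line; this sketch shows how it smooths the hard 0/1 jump of the Heaviside step into a gradual transition:

```python
import numpy as np

def sigmoid(z):
    """Smooth, differentiable stand-in for the Heaviside step function."""
    return 1.0 / (1.0 + np.exp(-z))

# Far below zero -> ~0, at zero -> 0.5, far above zero -> ~1:
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# Classification rule: predict class 1 when sigmoid(score) > 0.5.
```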

To optimize the parameters of logistic regression, we need an optimization method called gradient ascent. Its core idea is that the best parameters of a function can be found by always searching along the function's gradient direction. However, this method must traverse the entire data set every time the regression coefficients are updated, which is not ideal for big data. Hence the improved "stochastic gradient ascent" algorithm, which updates the regression coefficients using only one sample point at a time and is therefore much more efficient.
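A compact sketch of both variants for logistic regression, assuming labels in {0, 1} and a design matrix X whose first column is a constant bias term; the learning rates and iteration counts are arbitrary illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ascent(X, y, lr=0.01, n_iter=1000):
    """Batch gradient ascent: every update scans the whole data set."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        error = y - sigmoid(X @ w)   # prediction error on every sample
        w += lr * (X.T @ error)      # gradient of the log-likelihood
    return w

def stochastic_grad_ascent(X, y, lr=0.01, n_epochs=50):
    """Stochastic variant: update with one sample point at a time."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in np.random.permutation(len(y)):
            error = y[i] - sigmoid(X[i] @ w)
            w += lr * error * X[i]
    return w

# Toy demo: first column is the bias term.
X = np.array([[1.0, 0.2, 0.1], [1.0, 0.4, 0.3], [1.0, 2.0, 1.8], [1.0, 2.2, 2.1]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(sigmoid(X @ grad_ascent(X, y)))  # class-0 rows score below class-1 rows
```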

The fifth is the decision tree. As far as I know, the decision tree is the simplest and most commonly used classification method. A decision tree classifies data on the basis of tree theory; my personal feeling is that it resembles the B+ tree from data structures. A decision tree is a predictive model representing a mapping between object attributes and object values. Decision trees have low computational complexity, produce output that is easy to understand, are insensitive to missing intermediate values, and can handle irrelevant feature data; they are better than KNN at revealing the intrinsic meaning of the data. Their drawbacks are a tendency to overfit and time-consuming construction. Another problem is that if you do not draw the tree structure, the classification details are hard to follow. So: generate the decision tree, then draw it, and the classification process becomes much easier to understand.
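On the "draw the tree" point: scikit-learn can print the learned structure as text, which is usually enough to follow the classification logic; a minimal sketch (the Iris data and depth limit are my own choices):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # entropy-based splits
clf.fit(iris.data, iris.target)

# Print the learned tree so the classification details become visible.
print(export_text(clf, feature_names=list(iris.feature_names)))
```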

The core of the decision tree is the split: how the tree branches is the foundation of the whole method, and the best way to decide is with information entropy. Entropy is a headache-inducing concept that easily confuses people; simply put, it measures the complexity of information. The more mixed the information, the higher the entropy. So the core of the decision tree is to split the data set by computing information entropy.
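Shannon entropy and the information gain of a candidate split take only a few lines; this sketch (names are mine) is the computation a tree builder repeats for every candidate split before picking the best one:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array: the more mixed, the higher."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    left, right = labels[feature <= threshold], labels[feature > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

labels = np.array([0, 0, 1, 1, 1])
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(entropy(labels))                          # ~0.971 bits
print(information_gain(feature, labels, 2.0))   # perfect split -> gain ~0.971
```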

Finally, I have to mention a more special classification method: AdaBoost. AdaBoost is the representative classifier of the boosting family, and boosting is a meta-algorithm (an ensemble algorithm): it takes the results of other methods as its input, that is, it is a way of combining other algorithms. Bluntly put, one classifier is trained on the data set over and over; each round, the weights of correctly classified samples are lowered and the weights of misclassified samples are raised, and this iterates until the requirement is met. AdaBoost has a low generalization error rate, is easy to code, can be applied on top of most classifiers, and needs no parameter tuning, but it is sensitive to outliers. It is not a standalone method: it must be built on base methods, whose performance it improves. Personally, I think the saying "AdaBoost is the best classification method" is wrong; it should be "AdaBoost is a better way to optimize other classifiers".
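A toy AdaBoost loop on top of scikit-learn decision stumps shows the weight mechanics described above; the round count, the epsilon floor, and the ±1 label convention are my own choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """Toy AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)              # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                   # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)   # up-weight the misclassified samples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)

# Demo on a toy problem, mapping labels from {0, 1} to {-1, +1}:
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, random_state=0)
y = 2 * y - 1
stumps, alphas = adaboost_fit(X, y)
print((adaboost_predict(X, stumps, alphas) == y).mean())  # training accuracy
```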
