Having covered features, we should now talk about choosing and implementing a model. Although I have touched many machine learning methods and models, it is only recently that I have gained even a rough grasp of supervised learning, so I will summarize the scattered bits of knowledge while introducing the models. (Who made me so forgetful?)
The basic framework of supervised learning
Tianqi Chen's article on boosted trees lists the key concepts of supervised learning; I copy them here to reinforce the impression:
Elements of supervised learning: samples (labels), model, parameters, objective function, optimization method
I. Models and Parameters
The model specifies how to predict the output y_i from a given input x_i. The most common models, such as linear models (including linear regression and logistic regression), use a linear combination to make the prediction: ŷ_i = ∑_j w_j x_ij. The predicted value ŷ_i can be interpreted in different ways: we can use it directly as a regression output, pass it through a sigmoid to get a probability, or treat it as a ranking score (a small numpy sketch follows the notation list below). Whether a linear model is used for regression, classification, or ranking depends on how ŷ_i is interpreted (and on the corresponding objective function we design). Parameters are what we need to learn; in a linear model the parameters are the linear coefficients w.
- Notations: x_i denotes the i-th training example
- Model: how to make the prediction ŷ_i given x_i
- Parameters: the things we need to learn from data
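As a quick illustration, here is a minimal numpy sketch (toy data and weights are made up and nothing is learned here) of how the same linear score ŷ_i = ∑_j w_j x_ij supports the three interpretations above:

```python
import numpy as np

# Toy data: 3 samples, 2 features; the weights w are assumed, not learned here
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0]])
w = np.array([0.4, -0.2])

scores = X @ w                          # linear score: y_hat_i = sum_j w_j * x_ij

regression_pred = scores                # 1) use the score directly as a regression output
probs = 1.0 / (1.0 + np.exp(-scores))   # 2) sigmoid transform -> probability (logistic regression)
ranking = np.argsort(-scores)           # 3) order the samples by score -> a ranking

print(regression_pred, probs, ranking)
```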
II. Objective function: loss + regularization
The model and its parameters specify how we make predictions from a given input, but they do not tell us how to find good parameter values; for that we need the objective function. A general objective function consists of the following two parts:
- Objective function: training loss + regularization term
Common loss terms take the form L = ∑_{i=1}^{n} l(y_i, ŷ_i), such as the squared error l(y_i, ŷ_i) = (y_i − ŷ_i)², or the logistic loss l(y_i, ŷ_i) = y_i·ln(1 + e^{−ŷ_i}) + (1 − y_i)·ln(1 + e^{ŷ_i}). For linear models, the common regularization terms are the L2 and L1 penalties. This design of the objective function comes from an important concept in statistical learning, the bias-variance tradeoff. Bias can be understood as the error of the best model we could train if we had infinite data, while variance is the error introduced by the randomness of having only finite data. The loss term in the objective encourages the model to fit the training data as well as possible, so that the final model has relatively low bias. The regularization term encourages a simpler model: when the model is simple, the results fitted on finite data are less sensitive to randomness and less prone to overfitting, which makes the final predictions more stable.
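To make the "loss + regularization" split concrete, here is a minimal sketch, assuming a linear model with a squared-error loss and an L2 penalty on toy data; the actual terms depend on the task:

```python
import numpy as np

def objective(w, X, y, lam):
    """Objective = training loss (squared error) + L2 regularization."""
    y_hat = X @ w
    loss = np.sum((y - y_hat) ** 2)   # pushes the model to fit the data (low bias)
    penalty = lam * np.sum(w ** 2)    # pushes toward a simpler model (low variance)
    return loss + penalty

# Toy data, for illustration only
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
print(objective(np.array([1.0, 2.0]), X, y, lam=0.1))
```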
III. Optimization method
Given the model and the objective function, how do we learn the parameters, i.e., optimize the objective? This is the learning problem. Different models often have their own optimization methods, which are detailed in the sections describing each model.
Overview of common models: tree ensembles
In today's data competitions, the most efficient and effective models are tree models built on classification or regression trees. Below is a brief summary of the material I have gone through:
Basics: classification and regression trees
Decision trees should be familiar to everyone: they are essentially a way of partitioning the feature space with hyperplanes. Each split (choosing a split node) divides the current region in two, so every leaf node corresponds to a disjoint region of the space. A sample is routed to a leaf node according to its feature values, and that leaf gives the classification result. A regression tree can be seen as an extension of the classification tree; the difference is that each leaf node stores a real-valued score, which is used to complete a numerical prediction task.
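A minimal sklearn sketch on made-up data, showing that a classification tree and a regression tree differ only in what the leaf stores (a class label versus a real-valued score):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up data: same inputs, one label set for classification, one for regression
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y_class = [0, 0, 1, 1]            # class labels  -> classification tree
y_real = [0.1, 0.4, 1.9, 2.8]     # real targets  -> regression tree

clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_real)

print(clf.predict([[1.5, 1.5]]))  # leaf stores a class label
print(reg.predict([[1.5, 1.5]]))  # leaf stores a real-valued score
```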
Boosting and bagging
Ensemble learning is an important part of machine learning, and boosting and bagging are two common flavors. The so-called ensemble simply "ties" a number of "weak" classifiers together to form a "strong" classifier. (I'll form the head!)
Boosting: the most common variant is AdaBoost (Adaptive Boosting). Initialize every training sample with an equal weight 1/N, then train for T rounds. After each round, assign larger weights to the misclassified samples (i.e., make the learning algorithm focus on the harder examples in later rounds). Each round produces a function (classifier) whose weight is determined by its error rate, yielding a sequence of prediction functions h_1, …, h_T in which the better-performing functions receive larger weights. The final prediction function H judges a new example by "weighted majority voting" for classification problems and by "weighted averaging" for regression problems.
(Sequential: the k-th classifier is trained with extra attention to the samples the (k−1)-th classifier got wrong. Like an assembly line: each worker tells the next which parts are harder, "experience" is passed along, and those who learn well get a bigger "say".)
Bagging: short for bootstrap aggregating. Train multiple rounds; each round's training set is formed by randomly drawing n samples with replacement from the initial training set (so an initial training sample may appear several times or not at all). Training yields a sequence of prediction functions h_1, …, h_n, and the final prediction function H judges a new example by "majority voting" for classification problems and by "simple averaging" for regression problems.
(Can be done in parallel: the k-th classifier depends only on its own randomly selected training sample. Like a parallel job: each person learns a random part of the material, and everyone has an equal vote.)
The difference between bagging and boosting: the main difference lies in the sampling scheme. Bagging uses uniform sampling with replacement, while boosting samples (or reweights) according to the error rate, so boosting generally achieves better classification accuracy than bagging, although on some datasets boosting can degrade performance (overfit). Bagging's training sets are chosen at random and are independent of one another, whereas boosting's training sets depend on the results of the previous rounds. Bagging's prediction functions carry no weights, while boosting's are weighted. Bagging's prediction functions can be generated in parallel, whereas boosting's can only be generated sequentially. For time-consuming learners such as neural networks, bagging can save a great deal of time through parallel training.
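A small sklearn sketch (toy data, default base learners, which are decision trees, and illustrative parameter values) contrasting the two: bagging trains its trees independently on bootstrap samples and can run in parallel, while AdaBoost trains them sequentially by reweighting misclassified samples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier

# Toy data for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging: each tree sees its own bootstrap sample; trees are independent,
# so they can be trained in parallel (n_jobs=-1) and vote with equal weight.
bag = BaggingClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X, y)

# Boosting (AdaBoost): trees are trained one after another, with misclassified
# samples reweighted each round; each tree's vote is weighted by its error rate.
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

print("bagging accuracy:", bag.score(X, y))
print("boosting accuracy:", boost.score(X, y))
```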
Random Forest
Official version:
A random forest is a classifier built in a randomized way from multiple decision trees (CART trees). Its output class is determined by a majority vote of the individual trees.
The randomness shows up in two places: (1) when training each tree, a subset of the training samples is drawn (bootstrap sampling), and the remaining data is used to estimate the error; (2) at each node, a random subset of all features is considered when computing the best split.
The main advantages of random forests: (1) on large, high-dimensional data they train quickly and are not prone to overfitting; (2) prediction is very fast; (3) they are robust to noise and errors in the training data.
The training process of a random forest can be summarized as follows (note the row sampling, column sampling, and the corresponding sklearn parameters):
(1) Given a training set S, a test set T, and feature dimension F.
Taking sklearn as an example, determine the parameters: the number of CART trees n_estimators, the depth of each tree max_depth, the number of features considered at each node max_features, and the termination criteria: the minimum number of samples required to split a node min_samples_split, and the minimum impurity decrease (information gain) required for a split min_impurity_decrease.
(2) From S, draw with replacement a training set S(i) of the same size as S ("row sampling"), use it as the sample set at the root node, and start training from the root.
(3) If the termination condition is met at the current node, mark it as a leaf node. For a classification problem, the leaf's prediction is the class c(j) with the most samples in the current node's sample set, and the probability p is the proportion of c(j) in that set; for a regression problem, the prediction is the average of the sample values at the current node. Then continue training the other nodes. If the termination condition is not met, randomly select f features out of the F features without replacement ("column sampling"). Using these f features, find the single feature k and its threshold th that give the best split, send the samples whose k-th feature value is less than th to the left child, and the rest to the right child.
(4) Repeat (2)-(3) until all the CART trees have been trained.
The prediction process using random forests is as follows:
(1) Starting from the root node of the current tree, compare the sample against the current node's threshold th to decide whether to go to the left child (< th) or the right child (>= th), until a leaf node is reached, and output its prediction.
(2) Repeat (1) until all T trees have output a prediction. For a classification problem, the final output is the class with the largest total predicted probability across all trees, i.e., the accumulated p for each c(j); for a regression problem, the output is the average of the outputs of all trees.
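Tying the parameters above to code, a hedged sklearn sketch on toy data; the parameter values are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the training set S
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of CART trees
    max_depth=10,          # depth limit for each tree
    max_features="sqrt",   # features considered at each node (column sampling)
    min_samples_split=5,   # termination: minimum samples needed to split a node
    n_jobs=-1,             # trees are independent, so build them in parallel
    random_state=0,
).fit(X, y)                # row sampling (bootstrap) happens inside fit

# Prediction: every tree votes; the forest aggregates the per-class probabilities
print(rf.predict(X[:3]))
print(rf.predict_proba(X[:3]))
```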
Popular version:
Why does a random forest work so well? Isn't it just a pile of weak learners bundled together? In short, the predictions made by trees that are 99.9% uncorrelated cover all kinds of situations and cancel each other out, while the predictions of the few excellent trees rise above this "noise" and produce a good overall prediction. Random forest is one of the most "versatile" learning methods: you can throw almost anything at it and it will basically work. (It is very handy in data competitions: it runs fast and its parameters are relatively simple. If you have just come up with a few features and do not know whether they help, let RF test them for you; if you have just improved a model and do not know how well it generalizes, let RF be your comparison; and if you have no idea at all, well, just pick RF, because it will do better than your guess.)
You think that's all? Remember feature selection from the previous article? Yes, RF can handle that for you too: it copes with very high-dimensional data (lots of features) without explicit feature selection, and after training it can tell you which features matter most. It trains quickly and supports parallelism. Only 998, the universal classifier, take it home!
GBDT - Gradient Boosted Decision Tree
Boosted trees go by various "aliases", such as GBDT, GBRT (Gradient Boosted Regression Tree) and MART (Multiple Additive Regression Tree); LambdaMART is also a boosted-tree variant. The method was first proposed by Friedman.
GBDT consists of three concepts: the regression Decision Tree (DT), Gradient Boosting (GB), and Shrinkage (a learning technique used by most implementations). DT needs no further explanation beyond being a regression tree; GB is a much bigger topic, so as usual, an official version and a popular version:
Official version
Generally speaking, the loss function describes the training error of the model: the larger the loss, the more error-prone the model (setting aside the bias-variance tradeoff for now). Learning keeps driving the loss function down, which means the model keeps improving, and the direction of the gradient is the direction in which a function decreases fastest (basic calculus); the gradient descent method (SGD) in neural networks is built on the same idea.
In GBDT, the algorithm flow is as follows:
(0) Set an initial value
(1) Build M trees (iterate M times)
(2) Apply a logistic transformation to the function estimate f(x)
(3) For each of the K classes, perform the following operations (which can also be viewed as vector operations):
- compute the gradient direction in which the residual decreases
- for each sample point x and its residual-decreasing gradient direction, fit a decision tree with J leaf nodes
- once the tree is built, compute the gain of each leaf node (this gain is used at prediction time)
- merge the newly obtained decision tree with the previous ones to form the new model
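The per-class gradient-and-tree loop above is what library implementations run internally. A minimal sklearn sketch on toy multi-class data (parameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy 3-class data
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

gbdt = GradientBoostingClassifier(
    n_estimators=100,    # M: number of boosting iterations
    max_leaf_nodes=8,    # J: leaf nodes per tree
    learning_rate=0.1,   # shrinkage step (see the Shrinkage discussion below)
    random_state=0,
).fit(X, y)              # internally fits one regression tree per class per round

print(gbdt.predict(X[:5]))
```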
Popular version:
Before Gradient Boost, the original Boost algorithms assigned a weight to every sample at the start; at the end of each step they increased the weights of the misclassified points and decreased the weights of the correctly classified ones, so that the misclassified points received "serious attention". After N iterations we obtain N simple classifiers (base learners) and combine them into a final model.
Gradient Boost, by contrast, builds each new model to reduce the residual left by the previous one, and to eliminate that residual we can build the new model in the gradient direction in which the residual decreases. That is:
In Gradient Boost, each new model is built to reduce the residual of the previous model along the gradient direction.
So what is a residual? Simple: the residual is the part of the true value that is left over after subtracting the predicted value.
Copy an example:
For example, A's true age is 18, but the first tree predicts 12, a difference of 6, so the residual is 6. In the second tree we therefore set A's target to 6 and learn again; if the second tree really puts A into a leaf worth 6, the sum of the two trees equals A's true age. If the second tree instead predicts 5, A still has a residual of 1, so A's target in the third tree becomes 1, and learning continues.
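Here is a tiny from-scratch sketch of the same idea on made-up one-dimensional data with a squared loss, where each new tree is simply fit to the residual left by the trees before it:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: predict "age" from a single made-up feature
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([18.0, 20.0, 25.0, 30.0])

pred = np.zeros_like(y)
for _ in range(3):                        # three boosting rounds
    residual = y - pred                   # e.g. 18 - 12 = 6 in the story above
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    pred += tree.predict(X)               # the trees' outputs add up toward the target

print(pred)   # gets closer to the true ages as rounds accumulate
```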
The idea of Shrinkage is that taking many small steps toward the result avoids overfitting more easily than approaching the result in a few big jumps. In other words, it does not fully trust any single tree; it believes each tree only learns a small part of the truth, so it accumulates only a little from each one and compensates by learning more trees.
Without shrinkage:
y(i+1) = residual(y1 ~ yi), where residual(y1 ~ yi) = y_true − y(1 ~ i)
y(1 ~ i) = SUM(y1, …, yi)
Shrinkage leaves the first equation unchanged and only modifies the second:
y(1 ~ i) = y(1 ~ i−1) + step * yi
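In code, shrinkage is a one-line change to the accumulation step: add only step * tree.predict(X) instead of the full tree output. A self-contained sketch on the same kind of toy data (step = 0.1 is an illustrative value):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([18.0, 20.0, 25.0, 30.0])
pred = np.zeros_like(y)

step = 0.1                                # shrinkage: trust only a fraction of each tree
for _ in range(200):                      # many small steps instead of a few big ones
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    pred += step * tree.predict(X)        # y(1~i) = y(1~i-1) + step * yi

print(pred)
```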
GBDT is suitable for all kinds of regression problems, and it also performs well on classification problems such as binary classification.
XGBoost - eXtreme Gradient Boosting
XGBoost, which shines in Kaggle competitions, is a C++ implementation of the gradient boosting machine. First, the articles and lecture slides by Tianqi Chen:
Boosted Tree http://www.52cs.org/?p=429
Handout http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
Some related material:
XGBoost guide and practice http://vdisk.weibo.com/s/vlQWp3erG2yo/1431658679
XGBoost parameters
Http://xgboost.readthedocs.io/en/latest/parameter.html
XGBoost parameter tuning (translated)
http://blog.csdn.net/zzlzzh/article/details/50770054
In simple terms, unlike traditional GBDT, which uses only first-order derivative information, XGBoost performs a second-order Taylor expansion of the loss function and also adds a regularization term to the objective function, which balances the decrease of the objective against the complexity of the model and helps avoid overfitting.
Its advantage is speed: it runs fast and can train in parallel. Its drawback is that it has many parameters, and tuning them is not so easy.
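A minimal sketch using the xgboost Python package on toy data; the parameter values below are illustrative defaults, not a tuned configuration:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=200,       # number of boosted trees
    max_depth=6,            # depth of each tree
    learning_rate=0.1,      # shrinkage
    reg_lambda=1.0,         # L2 regularization on the leaf weights (the regular term)
    subsample=0.8,          # row sampling per tree
    colsample_bytree=0.8,   # column sampling per tree
    n_jobs=-1,              # parallel tree construction
)
model.fit(X, y)
print(model.predict(X[:5]))
```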