Fusion Models (Aggregation Models)
If we have several hypotheses, each somewhat consistent with our learning goal, we can combine them to make better predictions; such models are called fusion (aggregation) models.
A fusion model is a way to get better predictions by mixing and combining a set of hypotheses.
The following lists four different ways of combining hypotheses, with their mathematical representations:
- Select: when there are multiple hypotheses, choose the one with the smallest validation error as the final hypothesis we trust most: G(x) = g_t*(x), where t* = argmin_t E_val(g_t).
- Uniform vote: give every hypothesis one vote and aggregate all the votes: G(x) = sign(sum_t g_t(x)).
- Linear combination: give different hypotheses different numbers of votes according to how much we trust each, which subsumes the first two cases: G(x) = sign(sum_t α_t·g_t(x)), with α_t >= 0.
- Conditional combination: attach to each hypothesis a function q_t(x) that gives its number of votes under different conditions: G(x) = sign(sum_t q_t(x)·g_t(x)).
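As a concrete sketch of the four combination schemes above, the toy code below uses three hand-picked hypotheses; the validation errors, weights, and vote functions are all made-up assumptions, not values from the text.

```python
import numpy as np

# Three toy hypotheses g_t mapping x to a +/-1 prediction (assumed for illustration)
g_list = [lambda x: 1.0, lambda x: -1.0, lambda x: 1.0 if x > 0 else -1.0]

x = 0.5

# 1) Select: trust the single hypothesis with the smallest validation error
#    G(x) = g_{t*}(x),  t* = argmin_t E_val(g_t)
val_errors = [0.3, 0.4, 0.1]                 # assumed validation errors
best = int(np.argmin(val_errors))
g_select = g_list[best](x)

# 2) Uniform vote: every hypothesis gets exactly one vote
#    G(x) = sign(sum_t g_t(x))
g_uniform = np.sign(sum(g(x) for g in g_list))

# 3) Linear combination: hypothesis t gets alpha_t >= 0 votes
#    G(x) = sign(sum_t alpha_t * g_t(x))
alphas = [0.2, 0.5, 1.0]                     # assumed weights
g_linear = np.sign(sum(a * g(x) for a, g in zip(alphas, g_list)))

# 4) Conditional combination: the number of votes q_t(x) depends on x
#    G(x) = sign(sum_t q_t(x) * g_t(x))
q_funcs = [lambda x: 1.0, lambda x: 0.0 if x > 0 else 2.0, lambda x: 1.0]
g_cond = np.sign(sum(q(x) * g(x) for q, g in zip(q_funcs, g_list)))

print(g_select, g_uniform, g_linear, g_cond)
```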
Example
Suppose we have some weak hypotheses (for example, in the middle figure, classifiers whose decision boundaries are only horizontal or vertical lines). By combining these weak classifiers we can separate the data effectively and obtain a strong classifier (a combination of weak classifiers).
The fusion model makes the model more powerful by combining weak classifiers, similar to the capability of the feature transformations described earlier; and by mixing and blending, it obtains a better-generalizing hypothesis, similar to the regularization introduced earlier. A fusion model therefore combines the effects of feature transformation and regularization, and a well-designed fusion model can in theory yield a good hypothesis.
Uniform Blending for classification problems
Here, as described above, the prediction for each data point is obtained by letting every hypothesis vote on it. This vote embodies the principle of the minority yielding to the majority: the majority opinion corrects the minority opinion, and the minority opinion may simply be a mistake.
Ultimately, a more complex classification boundary is obtained through this mechanism of democratic voting.
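A quick Monte Carlo sketch of why majority voting corrects minority mistakes: assume T = 25 independent classifiers that are each correct with probability 0.7 (all numbers here are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, p = 10000, 25, 0.7      # n test points, T voters, per-voter accuracy p (assumed)

# Each g_t independently predicts the true label (+1) correctly with probability p
votes = np.where(rng.random((T, n)) < p, 1, -1)

single_acc = (votes[0] == 1).mean()        # accuracy of one classifier alone
majority = np.sign(votes.sum(axis=0))      # uniform vote; T is odd, so no ties
ensemble_acc = (majority == 1).mean()      # accuracy of the majority vote

print(single_acc, ensemble_acc)
```

The majority vote is wrong only when 13 or more of the 25 voters are wrong at once, which is far rarer than a single voter being wrong.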
Uniform Blending for regression problems
In the regression problem, the final hypothesis is simply the average of the individual hypotheses.
The intuition is that, for the same input x, some hypotheses underestimate the target, g_t(x) < f(x), while others overestimate it, g_t(x) > f(x). Averaging lets the underestimates and overestimates cancel, reducing the error and giving a more stable and more accurate estimate.
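A minimal numerical sketch of this cancellation effect, assuming the individual g_t(x) scatter symmetrically around the true value f(x); the target value and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
f = 3.0                                  # true target value f(x) at a fixed x (assumed)
g_t = f + rng.normal(0, 1, size=50)      # 50 hypotheses, some over- and some under-estimating

errors_single = (g_t - f) ** 2           # squared error of each individual g_t
error_avg = (g_t.mean() - f) ** 2        # squared error of the uniform average G(x)

print(errors_single.mean(), error_avg)
```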
Theoretical analysis of Uniform blending
Here we analyze, for an arbitrary input x, the relationship between the average squared error of the individual g_t(x) and the squared error of their average G(x) with respect to the target f(x).
The resulting identity tells us that avg((g_t - f)^2) and (G - f)^2 are related, with the non-negative term avg((g_t - G)^2) in between: avg((g_t - f)^2) = avg((g_t - G)^2) + (G - f)^2.
Averaging this identity over the test distribution gives the corresponding relation for the prediction error: avg(E_out(g_t)) = avg(E[(g_t - G)^2]) + E_out(G) >= E_out(G). The average E_out(g_t) is at least E_out(G), which shows that, in theory, the error of uniform blending is no larger than the average prediction error of the g_t, so its predictions are better.
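The identity avg((g_t - f)^2) = avg((g_t - G)^2) + (G - f)^2 can be checked numerically; the prediction values below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
f = 2.0                                  # true target value at a fixed x (assumed)
g_t = rng.normal(1.5, 0.8, size=100)     # predictions of 100 hypotheses (assumed)
G = g_t.mean()                           # the uniform blend

lhs = ((g_t - f) ** 2).mean()            # average squared error of the individual g_t
rhs = ((g_t - G) ** 2).mean() + (G - f) ** 2   # spread around consensus + consensus error
print(lhs, rhs)
```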
Summary
Now suppose that each time we draw N data points from the data distribution to construct a g_t, and we average T such g_t to get G. Taking the limit of this process as T goes to infinity, we obtain the expectation ḡ (g bar).
Below we use ḡ in place of the G of the previous section to get the relationship between the g_t and ḡ:
- ḡ represents the common view of the g_t: their consensus.
- avg(E_out(g_t)) represents the average performance of the algorithm over different datasets, i.e. the algorithm's expected performance.
- avg(E[(g_t - ḡ)^2]) represents the difference between the g_t and the consensus, describing how divergent and dispersed the opinions of the g_t are; it is called the variance.
- E_out(ḡ) represents how well the consensus performs; it is called the bias.
We can see that the aim of averaging is to eliminate the variance term, so as to obtain a more stable performance.
Linear Blending
Suppose we already have some hypotheses g_t. Linear blending assigns a different number of votes to each g_t, i.e. gives each g_t a different weight α_t; the result is a linear combination of the g_t.
(1) Obtaining the α that minimizes the training error
So how do we get the best α_t? The natural idea is to choose the α_t that minimizes the training error, i.e. min_α E_in(α).
The equation above is similar to the linear regression on transformed features introduced previously, except that it carries the constraint α_t >= 0. So we can treat the g_t(·) as a feature transform and then use the linear regression model to solve for α.
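A sketch of solving for α by ordinary least squares, treating the g_t(·) as a feature transform. The toy data, the three hypotheses, and the dropped α_t >= 0 constraint are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)   # toy regression target (assumed)

# Three fixed hypotheses g_t acting as a feature transform phi(x) = (g_1(x), ..., g_T(x))
g_list = [lambda x: x, lambda x: x ** 2, lambda x: np.sin(3 * x)]
Z = np.column_stack([g(X[:, 0]) for g in g_list])

# Solve min_alpha ||Z alpha - y||^2 by ordinary least squares (ignoring alpha_t >= 0)
alpha, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(alpha)
```

Because least squares searches over all linear combinations of the g_t, the blended training error can never be worse than that of any single g_t taken with a fixed coefficient.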
(2) Ignoring the constraint on α
The linear regression model we used before places no constraints on its coefficients, so how do we deal with the constraint α_t >= 0?
The equation above tells us that when α_t < 0, we can view α_t·g_t(x) as the positive coefficient |α_t| applied to the reversed hypothesis -g_t(x).
Consider the binary classification case: if some hypothesis has an error rate of 99%, then using it in reverse gives a hypothesis with an error rate of 1%, i.e. an accuracy of 99%. From this point of view, we need not worry about the sign of α_t.
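A tiny sketch of this sign-flip argument with made-up labels: a hypothesis that is wrong everywhere becomes perfect when reversed.

```python
import numpy as np

y = np.array([1, 1, -1, 1, -1, 1, 1, 1, -1, 1])   # made-up binary labels
g = -y                                            # a hypothesis wrong on every point

err_g = (g != y).mean()           # error rate of g
err_flipped = ((-g) != y).mean()  # error rate of the reversed hypothesis -g

# alpha_t * g_t(x) with alpha_t < 0 equals |alpha_t| * (-g_t(x)),
# so a negative weight just means voting for the opposite of g_t
print(err_g, err_flipped)
```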
(3) Selection of the g_t
The blending algorithm needs some g_t to work with, so how are the g_t usually obtained? Generally, each g is obtained from a different model by seeking the smallest E_in.
But selecting the best g_t by minimizing E_in incurs a large model-complexity cost and easily leads to overfitting, so we instead select the g using the smallest validation error.
In fact, the blending algorithm chooses the α_t that minimizes the validation error E_val, not E_in; and, so that the validation error E_val is independent of the previously obtained hypotheses, the hypotheses used are the g_t^- learned from the training set D_train.
The specific process is as follows:
- Learn a set of hypotheses g_1^-, ..., g_T^- from the smaller training set D_train.
- Transform each data point (x_n, y_n) in the validation set D_val into Z-space: z_n = (g_1^-(x_n), ..., g_T^-(x_n)).
- Learn the blending weights α with a linear model on the transformed data (z_n, y_n).
- For the final output G, use hypotheses g_t retrained on all of the available data together with the learned α; the g_t^- are no longer used.
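This two-level procedure (learn g_t^- on the training set, fit α on the validation set, then retrain on all the data) can be sketched as follows; the toy data, the polynomial-degree hypotheses, and the plain least-squares blend are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data, split into a smaller training set and a validation set (all assumed)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=300)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

def fit_poly(X, y, degree):
    """Fit a 1-D polynomial; each degree plays the role of one hypothesis g_t."""
    return np.polynomial.polynomial.polyfit(X[:, 0], y, degree)

def predict_poly(coef, X):
    return np.polynomial.polynomial.polyval(X[:, 0], coef)

degrees = [1, 3, 5]

# Step 1: learn g_t^- from the smaller training set D_train
g_minus = [fit_poly(X_train, y_train, d) for d in degrees]

# Step 2: transform each validation point into z_n = (g_1^-(x_n), ..., g_T^-(x_n))
Z_val = np.column_stack([predict_poly(c, X_val) for c in g_minus])

# Step 3: learn the blending weights alpha with a linear model on (z_n, y_n)
alpha, *_ = np.linalg.lstsq(Z_val, y_val, rcond=None)

# Step 4: retrain each g_t on ALL the data and blend with the learned alpha
g_full = [fit_poly(X, y, d) for d in degrees]

def G(X_new):
    Z = np.column_stack([predict_poly(c, X_new) for c in g_full])
    return Z @ alpha

print(((G(X) - y) ** 2).mean())
```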
In the same way, we can use a nonlinear model in this process, which makes the model more powerful and extends the method to conditional blending; the drawback is that overfitting may occur.
If the difference between g and g^- is not quite clear, take a look at the validation summary in "Machine Learning Basics".
Bagging (Bootstrap Aggregation)
Up to now we have assumed that the g_t were obtained beforehand and then blended; can we instead learn the g_t and combine them within one process?
How to get different g
First, let's consider the ways in which we can obtain different g:
- Different models give different g.
- The same model with different parameters gives different g.
- If the algorithm has a random component, different runs give different g.
- Different datasets give different g; for example, when cross-validation is performed, different cuts of the dataset give different g^-.
Can we get different g from a single dataset? That is what we do next.
Once again, recall the theoretical result presented earlier: the performance of the algorithm can be decomposed into bias and variance.
The implication behind this theory is that the consensus of many g is better than a single opinion g_t.
The earlier derivation required many different datasets, but we now have only one dataset available, so what do we do?
The ḡ in the equation above is obtained by learning a g on each of infinitely many datasets and then averaging. To approximate ḡ, first, we use a finite but large T in place of the infinite limit; second, we use the statistical technique of bootstrapping to simulate new datasets from the existing data.
Bootstrapping
A bootstrap sample is obtained by drawing uniformly at random from the original N data points, recording the drawn point and putting it back (sampling with replacement), and repeating this N times; the resulting dataset is called a bootstrap sample in statistics.
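A minimal sketch of drawing one bootstrap sample with NumPy; the dataset here is just a made-up set of indices.

```python
import numpy as np

rng = np.random.default_rng(5)
data = np.arange(10)                      # original dataset of N = 10 examples (assumed)

# A bootstrap sample: draw N times uniformly WITH replacement from the N examples
boot = rng.choice(data, size=len(data), replace=True)

print(boot)                               # some examples repeat, others never appear
print(len(np.unique(boot)) / len(data))   # about 63% of distinct examples on average
```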
Bagging
The method of bootstrap aggregation (bagging) uses the bootstrapping mechanism to generate a series of different g_t, which are then combined by uniform voting.
Example
The following example shows the classification boundary obtained with the pocket algorithm. The steps are: use the bootstrap method to obtain 25 different datasets, then run the pocket algorithm on each dataset for 1000 iterations to get 25 classification lines (gray lines), and finally combine these lines to obtain the final nonlinear boundary (black line).
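A runnable sketch of bagging in the same spirit, substituting simple axis-aligned decision stumps for the pocket algorithm of the example; the data and all parameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy 2-D data with a diagonal boundary; axis-aligned stumps stand in for the
# pocket-trained lines of the example (data and parameters assumed)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

def fit_stump(X, y):
    """Pick the best axis-aligned threshold classifier (a weak base learner)."""
    best = (2.0, 0, 0.0, 1)
    for dim in range(2):
        for thresh in X[:, dim]:
            for sign in (1, -1):
                pred = sign * np.where(X[:, dim] > thresh, 1, -1)
                err = (pred != y).mean()
                if err < best[0]:
                    best = (err, dim, thresh, sign)
    return best[1:]

def predict_stump(stump, X):
    dim, thresh, sign = stump
    return sign * np.where(X[:, dim] > thresh, 1, -1)

# Bagging: fit each base learner on its own bootstrap sample, then uniform-vote
T, n = 25, len(X)
stumps = []
for _ in range(T):
    idx = rng.choice(n, size=n, replace=True)
    stumps.append(fit_stump(X[idx], y[idx]))

bagged = np.sign(sum(predict_stump(s, X) for s in stumps))
print((bagged == y).mean())
```

Each stump alone can only draw a horizontal or vertical boundary, but the uniform vote over the bootstrap-trained stumps can trace a staircase-like approximation to the diagonal boundary.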
When reprinting, please credit the author Jason Ding and the source:
Gitcafe Blog Home page (http://jasonding1354.gitcafe.io/)
GitHub Blog Home page (http://jasonding1354.github.io/)
CSDN Blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Baidu Search jasonding1354 access to my blog homepage
"Machine Learning Basics": Blending and Bagging