An Overview of Multi-Task Learning in Deep Neural Networks

Translated from: http://sebastianruder.com/multi-task/

1. Preface

In machine learning, we typically care about optimizing a particular metric, whether that is a score on a standard benchmark or a business KPI. To do so, we train a single model or an ensemble of models to perform the desired task, then fine-tune and tweak them until performance no longer improves. While this usually yields acceptable performance on one task, it ignores information that could help us do even better on the metric we care about. Specifically, this information comes from the training signals of related tasks. By sharing representations between related tasks, we enable our model to generalize better on the original task. This approach is called multi-task learning (MTL), and it is the focus of this blog post.

Multi-task learning goes by many names: joint learning, learning to learn, and learning with auxiliary tasks are only some of them. Generally, as soon as you find yourself optimizing more than one loss function, you are effectively doing multi-task learning (in contrast to single-task learning). In those scenarios, it helps to think explicitly about what you are trying to do in terms of MTL and to draw insights from it.

Even if you are optimizing only one loss, as is typical, chances are there is an auxiliary task that will help you improve on your main task. Rich Caruana [1] summarizes the goal of MTL succinctly: "MTL improves generalization by leveraging the domain-specific information contained in the training signals of related tasks". In this blog post, we give a brief overview of research on multi-task learning, with an emphasis on MTL for deep neural networks. First, we look at the motivation for MTL. Next, we introduce the two most common methods for MTL in deep learning. We then describe the mechanisms that make MTL work in practice. Before turning to more advanced neural-network-based methods, we review some background from the earlier MTL literature. We then introduce recently proposed MTL methods for deep neural networks. Finally, we discuss commonly used types of auxiliary tasks and what makes an auxiliary task a good choice for MTL.

2. Motivation

Multi-task learning can be motivated in several ways.

(1) From a biological perspective, we can see multi-task learning as being inspired by human learning. When learning a new task, we usually apply the knowledge we acquired from related tasks. For instance, a baby first learns to recognize faces and can then apply this knowledge to recognize other objects.

(2) From a pedagogical perspective, we often first learn tasks that give us the skills needed to master more complex techniques. This holds for martial arts just as much as for programming. To take an example from popular culture, in the film The Karate Kid, Mr. Miyagi has the karate kid perform seemingly unrelated tasks such as sanding the floor and waxing a car. In hindsight, these turn out to equip him with exactly the skills he needs to learn karate.

(3) From a machine learning perspective, we can view multi-task learning as a form of inductive transfer. Inductive transfer improves a model by introducing an inductive bias, which causes the model to prefer some hypotheses over others. A common form of inductive bias is L1 regularization, which leads to a preference for sparse solutions. In multi-task learning, the inductive bias is provided by the auxiliary tasks, which cause the model to prefer hypotheses that explain more than one task. As we will see shortly, this generally leads to solutions that generalize better.

3. Two multi-task learning methods for deep learning

So far, we have looked at the theoretical motivation for multi-task learning. To make the idea more concrete, we now present the two most commonly used ways to perform multi-task learning in deep neural networks: hard and soft sharing of hidden-layer parameters.

(1) Hard parameter sharing. Hard parameter sharing is the most common approach to MTL in neural networks and goes back to [2]. It is generally applied by sharing the hidden layers between all tasks while keeping several task-specific output layers. Hard parameter sharing greatly reduces the risk of overfitting. In fact, Baxter [3] showed that the risk of overfitting the shared parameters is an order N smaller than the risk of overfitting the task-specific parameters (the output layers), where N is the number of tasks. This makes sense intuitively: the more tasks we learn simultaneously, the more our model has to find a representation that captures all of them, and the lower the chance of overfitting on the original task.

(2) Soft parameter sharing. In soft parameter sharing, each task has its own model with its own parameters. The distance between the parameters of the models is then regularized to encourage the parameters to be similar. Duong et al. [4], for instance, use the L2 distance for regularization, while Yang et al. [5] use the trace norm. The constraints used for soft parameter sharing in deep neural networks are strongly inspired by regularization techniques for MTL that were developed for other models, which we discuss below.
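To make the difference between the two sharing schemes concrete, here is a minimal pure-Python sketch. All weights, layer sizes, and the names `hard_sharing_forward` and `l2_distance_penalty` are hypothetical and chosen only for illustration; a real model would use a deep learning framework and learned parameters.

```python
def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, z) for z in v]

# Hard sharing: one hidden layer shared by both tasks, one output head per task.
shared_W = [[0.5, -0.2], [0.1, 0.3]]   # hypothetical shared hidden-layer weights
shared_b = [0.0, 0.0]
head_A = ([[1.0, -1.0]], [0.0])        # hypothetical output layer for task A
head_B = ([[0.5, 0.5]], [0.1])         # hypothetical output layer for task B

def hard_sharing_forward(x):
    h = relu(dense(x, shared_W, shared_b))   # computed once, reused by every task
    return dense(h, *head_A), dense(h, *head_B)

# Soft sharing: each task keeps its own parameters; a penalty pulls them together.
def l2_distance_penalty(params_a, params_b, strength):
    """Squared L2 distance between two flattened parameter vectors, scaled."""
    return strength * sum((a - b) ** 2 for a, b in zip(params_a, params_b))

out_a, out_b = hard_sharing_forward([1.0, 2.0])
penalty = l2_distance_penalty([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], strength=0.5)
```

In the hard-sharing case the gradient of every task loss flows into `shared_W`, whereas in the soft-sharing case the penalty would simply be added to the sum of the per-task losses.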

4. Why does multi-task learning work?

Even though the inductive bias obtained through multi-task learning is an intuitively plausible explanation, we need to look at the underlying mechanisms to understand MTL better. Most of these were first proposed by Caruana (1998). For all examples, we assume that we have two related tasks A and B, which rely on a common hidden-layer representation F.

(1) Implicit data augmentation. MTL effectively increases the number of training examples. All tasks are at least somewhat noisy, so when we train a model on task A, our aim is to learn a good representation for A that ideally ignores the data-dependent noise and generalizes well. As different tasks have different noise patterns, a model that learns two tasks simultaneously is able to learn a more general representation. Learning only task A bears the risk of overfitting to A, while learning A and B jointly averages the noise patterns and lets the model obtain a better representation F.

(2) Attention focusing. If a task is very noisy, or the data is limited and high-dimensional, it can be difficult for a model to distinguish between relevant and irrelevant features. MTL helps the model focus its attention on the features that actually matter, because other tasks provide additional evidence for the relevance or irrelevance of those features.

(3) Eavesdropping. Some feature G may be easy to learn for task B but difficult to learn for task A. This might be because A interacts with the feature in a more complex way, or because other features impede the model's ability to learn G. Through MTL, we can allow the model to eavesdrop, i.e. to learn G through task B. The simplest way to do this is through hints [6], i.e. directly training the model to predict the most important features.

(4) Representation bias. MTL biases the model to prefer representations that other tasks also prefer. This also helps the model generalize to new tasks: a hypothesis space that performs well for a sufficiently large number of training tasks will also perform well on novel tasks, as long as they come from the same environment [7].

(5) Regularization. MTL acts as a regularizer by introducing an inductive bias. As such, it reduces the risk of overfitting as well as the Rademacher complexity of the model, i.e. its ability to fit random noise.

5. Multi-task learning in non-neural models

To better understand MTL in deep neural networks, we first review the existing literature on MTL for linear models, kernel methods, and Bayesian algorithms. In particular, we discuss two ideas that have been pervasive throughout the history of multi-task learning: (1) enforcing sparsity across tasks through norm regularization; and (2) modelling the relationships between tasks. Note that much of the MTL literature deals with a homogeneous setting, in which each task is assumed to have a single output. For example, the multi-class MNIST problem is typically cast as 10 binary classification tasks. More recent approaches deal with a heterogeneous setting, in which each task corresponds to a distinct set of outputs.

5.1 Block-sparsity regularization

To better connect the following approaches, we first introduce some notation. We have $T$ tasks. For each task $t$, we have a model $m_t$ with parameters $a_t$ of dimensionality $d$. We write the parameters as a column vector $a_t = [a_{1,t}, \ldots, a_{d,t}]^\top$. Stacking the column vectors $a_1, \ldots, a_T$ side by side forms a matrix $A \in \mathbb{R}^{d \times T}$. The $i$-th row of $A$ contains the parameters $a_{i,\cdot}$ corresponding to the $i$-th feature of every model, while the $j$-th column of $A$ contains the parameters $a_{\cdot,j}$ of the model for task $j$.

Many existing methods make a sparsity assumption about the model parameters. Argyriou et al. [8] assume that all models share a small set of features. In terms of the task parameter matrix A, this means that all but a few rows are 0, so that only a few features are used across all tasks. To enforce this, they generalize the L1 norm to the multi-task setting. Recall that the L1 norm is a constraint on the sum of the absolute values of the parameters, which forces all but a few parameters to be exactly 0. It is also known as lasso (least absolute shrinkage and selection operator).

In the single-task setting, the L1 norm is computed from the parameter vector $a_t$ of the single task $t$ only. In the multi-task setting, it is computed over the task parameter matrix $A$: we first compute an $\ell_q$ norm across each row $a_i$ (the parameters of the $i$-th feature across all tasks), which yields a vector $b = [\|a_1\|_q, \ldots, \|a_d\|_q]$, and then compute the $\ell_1$ norm of this vector, which forces all but a few entries of $b$, i.e. rows of $A$, to be 0.
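The row-wise computation described above can be sketched in a few lines of pure Python. The function names `lq_norm` and `mixed_l1_lq`, the inner-norm exponent `q`, and the example matrix are illustrative assumptions, not part of any particular library.

```python
def lq_norm(row, q):
    """l_q norm of one row; q may be float('inf') for the max norm."""
    if q == float("inf"):
        return max(abs(v) for v in row)
    return sum(abs(v) ** q for v in row) ** (1.0 / q)

def mixed_l1_lq(A, q):
    """Return (penalty, b): b holds one l_q norm per feature row of A,
    and the penalty is the l_1 norm of b (b is non-negative, so a plain sum)."""
    b = [lq_norm(row, q) for row in A]
    return sum(b), b

# Hypothetical parameter matrix: 3 features (rows) x 2 tasks (columns).
A = [[3.0, 4.0],
     [0.0, 0.0],    # a feature that no task uses: contributes nothing
     [1.0, -1.0]]

penalty_l1_l2, b_l2 = mixed_l1_lq(A, q=2)
penalty_l1_linf, b_linf = mixed_l1_lq(A, q=float("inf"))
```

Because the outer norm is an L1 norm over per-row magnitudes, minimizing it tends to drive whole rows of `A` to zero at once, rather than individual entries.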

Depending on what constraint we would like to place on each row, we can use a different $\ell_q$ norm. In general, we refer to these as mixed-norm constraints $\ell_1/\ell_q$. They are also known as block-sparsity regularization, because they lead to entire rows of $A$ being set to 0. Zhang and Huang [9] use $\ell_1/\ell_\infty$ regularization, while Argyriou et al. [8] use a mixed $\ell_1/\ell_2$ norm. The latter is also known as group lasso and was first proposed by Yuan and Lin [10]. Argyriou et al. [8] also show that the non-convex group lasso problem can be made convex by penalizing the trace norm of $A$. This forces $A$ to be low-rank and thereby constrains the column parameter vectors to lie in a low-dimensional subspace. Lounici et al. [11] furthermore establish upper bounds for the use of group lasso in multi-task learning.

While block-sparse regularization is intuitively plausible, it depends heavily on the extent to which features are shared across tasks. Negahban et al. [12] show that if features do not overlap much between tasks, $\ell_1/\ell_q$ regularization may actually be worse than simple element-wise $\ell_1$ regularization. For this reason, Jalali et al. [13] improve on block-sparse models by combining block-sparse and element-wise sparse regularization. They decompose the task parameter matrix $A$ into two matrices $B$ and $S$, where $A = B + S$. $B$ is then forced to be block-sparse using block-sparsity regularization, while $S$ is made element-wise sparse using lasso. Liu et al. [14] propose a distributed version of group-sparse regularization.

5.2 Learning task relationships

The group-sparsity constraint forces the model to consider only a few features and to share these features among all tasks. All of the approaches above therefore assume that the tasks used in multi-task learning are closely related. In practice, however, a task is not necessarily closely related to every other available task. In those cases, sharing information with an unrelated task can actually hurt performance, a phenomenon known as negative transfer. Rather than sparsity, we would then like to use prior knowledge indicating that some tasks are related and others are not. In this scenario, a constraint that enforces a clustering of tasks is more appropriate. Evgeniou et al. [15] propose such a clustering constraint, which penalizes both the norms of the task column vectors $a_{\cdot,1}, \ldots, a_{\cdot,T}$ and their variance:

$$\Omega = \|\bar{a}\|^2 + \frac{\lambda}{T} \sum_{t=1}^{T} \|a_{\cdot,t} - \bar{a}\|^2$$

where $\bar{a} = \left(\sum_{t=1}^{T} a_{\cdot,t}\right) / T$ is the mean parameter vector. This penalty enforces a clustering of the task parameter vectors towards their mean, with the strength of the clustering controlled by $\lambda$. The authors apply this constraint to kernel methods, but it is equally applicable to linear models.
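As a small worked example of a mean-based clustering penalty, the sketch below assumes the penalty has the form "squared norm of the mean parameter vector, plus lambda/T times the summed squared distances of each task vector to that mean". The function names and the toy vectors are hypothetical.

```python
def mean_vector(columns):
    """Element-wise mean of a list of equally long parameter vectors."""
    T = len(columns)
    return [sum(col[i] for col in columns) / T for i in range(len(columns[0]))]

def sq_norm(v):
    return sum(x * x for x in v)

def mean_regularizer(columns, lam):
    """||a_bar||^2 + (lam / T) * sum_t ||a_t - a_bar||^2 (assumed form)."""
    a_bar = mean_vector(columns)
    T = len(columns)
    spread = sum(sq_norm([x - m for x, m in zip(col, a_bar)]) for col in columns)
    return sq_norm(a_bar) + lam / T * spread

# Two hypothetical task parameter vectors (the columns of A).
spread_out = mean_regularizer([[1.0, 0.0], [3.0, 0.0]], lam=2.0)
identical = mean_regularizer([[1.0, 2.0], [1.0, 2.0]], lam=2.0)
```

When all task vectors coincide, the second term vanishes and only the norm of the shared vector is penalized; a larger `lam` makes it more costly for tasks to deviate from the mean.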

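Evgeniou et al. [16] propose a similar constraint for SVMs. Their constraint is inspired by Bayesian methods and seeks to keep all models close to a mean model. The loss therefore trades off a large margin for each SVM against closeness to the mean model. Jacob et al. [17] make the assumptions behind cluster regularization more explicit by formalizing a cluster constraint on $A$, under the assumption that the number of clusters $C$ is known in advance. They decompose the penalty into three parts:

Global penalty, which measures how large the column parameter vectors are on average: $\Omega_{mean}(A) = \|\bar{a}\|^2$.

Between-cluster variance, which measures how far the cluster centres are from the overall mean: $\Omega_{between}(A) = \sum_{c=1}^{C} T_c \|\bar{a}_c - \bar{a}\|^2$, where $T_c$ is the number of tasks in the $c$-th cluster and $\bar{a}_c$ is the mean of the task parameter vectors in the $c$-th cluster.

Within-cluster variance, which measures how compact each cluster is: $\Omega_{within}(A) = \sum_{c=1}^{C} \sum_{t \in J(c)} \|a_{\cdot,t} - \bar{a}_c\|^2$, where $J(c)$ is the set of tasks in the $c$-th cluster.

Finally, the three parts are combined linearly: $\Omega(A) = \lambda_1 \Omega_{mean}(A) + \lambda_2 \Omega_{between}(A) + \lambda_3 \Omega_{within}(A)$.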
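The three-part decomposition can be checked numerically. The sketch below assumes that the global term is the squared norm of the overall mean vector, the between-cluster term sums the size-weighted squared distances of cluster means to the overall mean, and the within-cluster term sums squared distances of task vectors to their cluster mean. The cluster assignment is taken as given, and all names and numbers are hypothetical.

```python
def mean_vector(columns):
    T = len(columns)
    return [sum(col[i] for col in columns) / T for i in range(len(columns[0]))]

def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def cluster_penalties(columns, clusters):
    """columns: one parameter vector per task.
    clusters: list of lists of task indices (assumed known in advance)."""
    a_bar = mean_vector(columns)
    omega_mean = sq_dist(a_bar, [0.0] * len(a_bar))
    omega_between = 0.0
    omega_within = 0.0
    for members in clusters:
        centre = mean_vector([columns[t] for t in members])
        omega_between += len(members) * sq_dist(centre, a_bar)
        omega_within += sum(sq_dist(columns[t], centre) for t in members)
    return omega_mean, omega_between, omega_within

# Four hypothetical tasks forming two tight, well-separated clusters.
columns = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
clusters = [[0, 1], [2, 3]]
omega_mean, omega_between, omega_within = cluster_penalties(columns, clusters)

# Total spread of the task vectors around the overall mean.
total_spread = sum(sq_dist(col, mean_vector(columns)) for col in columns)
```

In this example the between-cluster and within-cluster terms add up to the total spread around the mean, so weighting them differently lets the penalty tolerate distance between clusters while still keeping each cluster compact.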

This constraint assumes that the clusters are known in advance, so the authors introduce a relaxation of the above penalty that removes this requirement. In other scenarios, tasks may not form clusters but may still have an inherent structure. Reference [18] extends the group lasso to tasks arranged in a tree structure, while [19] extends it to tasks with a graph structure.

While the approaches above model the relationships between tasks through norm regularization, other approaches do not. Reference [20] is the first to present a task clustering algorithm based on k-nearest neighbours, while [21] learns a common structure from multiple related tasks with an application to semi-supervised learning.

Much other work on learning task relationships in multi-task learning uses Bayesian methods. Reference [22] proposes a Bayesian neural network for multi-task learning that encourages similar parameters across tasks by placing a prior on the model parameters. Reference [23] extends Gaussian processes to MTL by inferring a shared covariance matrix. Because this is computationally very expensive, the authors adopt a sparse approximation scheme that selects the most informative examples. Reference [24] also uses Gaussian processes for MTL, assuming that all models are sampled from a common prior. Reference [25] places a Gaussian prior on each task-specific layer. To encourage similarity between tasks, the authors make the mean task-dependent and model a clustering of the tasks with a mixture distribution. Importantly, this requires the task characteristics that define the clusters, as well as the number of mixture components, to be specified in advance.

Building on this, [26] draws the distribution from a Dirichlet process, which lets the model learn both the similarity between tasks and the number of clusters. All tasks within the same cluster then share one model. Reference [27] proposes a hierarchical Bayesian model that learns a latent task hierarchy, while [28] uses Gaussian-process-based regularization for MTL and extends an earlier Gaussian process approach so that it scales to larger settings.

Other approaches focus on the online multi-task learning setting. Reference [29] adapts existing methods to the online setting. The authors also propose a multi-task extension of the regularized perceptron, which encodes the relatedness of tasks in a matrix. They use different forms of regularization to bias this task relatedness matrix, e.g. the closeness of the task characteristic vectors or the dimension of the spanned subspace. Note that, like some earlier approaches, these methods require the task characteristics that make up the matrix to be provided in advance. Reference [30] extends these methods by learning the task relationship matrix.

Reference [31] assumes that tasks form disjoint groups and that the tasks within each group lie in a low-dimensional subspace. Within each group, tasks share the same feature representation, whose parameters are learned jointly with the group assignment matrix using an alternating minimization scheme. Completely disjoint groups, however, may not be ideal. Reference [32] in turn allows two tasks from different groups to overlap by assuming that there is a small number of latent basis tasks. The parameter vector of every actual task $t$ is then modelled as a linear combination of these: $a_{\cdot,t} = L s_{\cdot,t}$, where $L$ is a matrix containing the parameter vectors of the $k$ latent tasks and $s_{\cdot,t} \in \mathbb{R}^k$ is a vector containing the coefficients of the linear combination. In addition, the linear combination is constrained to be sparse in the latent tasks, and the overlap in the sparsity patterns of two tasks controls how much they share. Finally, [33] learns a small pool of shared hypotheses and maps each task to a single hypothesis.
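The latent-basis idea can be illustrated with a tiny example in which each task's parameter vector is a sparse linear combination of a few latent task vectors. The sketch assumes the latent task vectors are stored as the columns of a matrix `L` with one row per feature, so that the product has one entry per feature; the matrix, the coefficient vectors, and the function name are hypothetical.

```python
def task_parameters(L, s):
    """Compute a_t = L s_t, where L has one column per latent task."""
    return [sum(L[i][j] * s[j] for j in range(len(s))) for i in range(len(L))]

# Hypothetical basis: d = 3 features, k = 2 latent tasks (the columns of L).
L = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Sparse coefficient vectors: tasks 1 and 2 each use a single latent task,
# so they share nothing; task 3 overlaps with both.
s_1 = [2.0, 0.0]
s_2 = [0.0, 3.0]
s_3 = [1.0, 1.0]

a_1 = task_parameters(L, s_1)
a_2 = task_parameters(L, s_2)
a_3 = task_parameters(L, s_3)

def shared_latent_tasks(s_a, s_b):
    """Indices of latent tasks with non-zero weight in both coefficient vectors."""
    return [j for j, (x, y) in enumerate(zip(s_a, s_b)) if x != 0.0 and y != 0.0]
```

Here `shared_latent_tasks` makes the sharing pattern explicit: two tasks exchange information only through the latent tasks that both of them use.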

6. Recent advances in multi-task learning with deep neural networks

While many recent deep learning approaches use multi-task learning as part of their model, either explicitly or implicitly, they all rely on the two approaches introduced earlier: hard and soft parameter sharing. In contrast, only a few papers have looked at developing better mechanisms for MTL in deep neural networks.

6.1 Deep Relationship Networks

In MTL for computer vision, approaches often share the convolutional layers while treating the fully connected layers as task-specific. Reference [34] improves on these models by proposing Deep Relationship Networks. In addition to the structure of shared and task-specific layers, the authors place matrix priors on the fully connected layers, which allow the model to learn the relationships between tasks. This is similar to some of the Bayesian models we have seen before. The approach, however, still relies on a pre-defined sharing structure, which may be adequate for well-studied computer vision problems but is error-prone for novel tasks.

6.2 Fully-Adaptive Feature Sharing

Starting from the other extreme, [35] proposes a bottom-up approach. It starts with a thin network and widens it greedily and dynamically during training, using a criterion that promotes grouping of similar tasks. The widening procedure, which dynamically creates branches, is shown in Figure 4. However, the greedy method may not find a model that is globally optimal, and assigning each branch to exactly one task does not allow the model to learn more complex interactions between tasks.

6.3 Cross-Stitch Networks

Reference [36] starts with two separate model architectures, just as in soft parameter sharing. The authors then use what they call cross-stitch units to let the model determine how the task-specific networks make use of the other task's knowledge, by learning a linear combination of the outputs of the previous layers. The architecture is shown in Figure 5. Cross-stitch units are placed only after pooling layers and fully connected layers.
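A cross-stitch unit as described above amounts to a small mixing matrix applied to the activations of the two task networks. The following sketch assumes a 2x2 matrix `alpha` whose entries would be learned during training; here they are fixed, hypothetical values, and the function name is illustrative.

```python
def cross_stitch(x_a, x_b, alpha):
    """Mix the activations of two task networks position by position.
    alpha[0] gives the weights for task A's output, alpha[1] for task B's."""
    out_a = [alpha[0][0] * a + alpha[0][1] * b for a, b in zip(x_a, x_b)]
    out_b = [alpha[1][0] * a + alpha[1][1] * b for a, b in zip(x_a, x_b)]
    return out_a, out_b

x_a = [1.0, 2.0]   # hypothetical activations from task A's previous layer
x_b = [3.0, 4.0]   # hypothetical activations from task B's previous layer

# Identity mixing: the two networks stay completely separate.
sep_a, sep_b = cross_stitch(x_a, x_b, [[1.0, 0.0], [0.0, 1.0]])

# Mostly task-specific, with a little information borrowed from the other task.
mix_a, mix_b = cross_stitch(x_a, x_b, [[0.9, 0.1], [0.2, 0.8]])
```

With identity mixing the unit reduces to two independent single-task networks, while equal weights everywhere would make both networks see the same averaged features; learning `alpha` lets the model settle somewhere in between.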

6.4 Low supervision

In contrast, recent MTL work in natural language processing has focused on finding better task hierarchies. Reference [37] shows that low-level NLP tasks, such as part-of-speech tagging and named entity recognition, should be supervised at lower layers when they are used as auxiliary tasks.

6.5 A Joint Many-Task Model

Building on this finding, [38] pre-defines a hierarchical architecture consisting of several NLP tasks, shown in Figure 6, and uses it as a joint model for multi-task learning.

6.6 Weighting losses with uncertainty

Instead of learning the structure of sharing, [39] takes an orthogonal approach by considering the uncertainty of each task. The authors adjust each task's relative weight in the cost function by deriving a multi-task loss function based on maximizing a likelihood with task-dependent uncertainty. Their architecture for per-pixel depth regression, semantic segmentation, and instance segmentation is shown in Figure 7.
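The text above does not give the loss formula, so the following is only a sketch under a stated assumption: each task loss is scaled by 1 / (2 * sigma^2) and a log(sigma) term is added so that the uncertainty cannot grow without bound, which is one form commonly used for regression-style losses. The parameterization through the log-variance `s = log(sigma^2)` and all names and values are hypothetical.

```python
import math

def uncertainty_weighted_loss(task_losses, log_variances):
    """Sum over tasks of 0.5 * exp(-s) * loss + 0.5 * s, with s = log(sigma^2).
    Assumed form; in training, each s would be a learnable parameter."""
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total

# With unit variance for both tasks, the losses are simply averaged.
equal = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])

# Declaring the second task noisier (sigma^2 = 4) shrinks its weight to a
# quarter, at the price of the added 0.5 * log(4) term.
noisy = uncertainty_weighted_loss([2.0, 4.0], [0.0, math.log(4.0)])
```

Using the log-variance rather than sigma itself keeps the weight positive without any explicit constraint, which is a common design choice for this kind of learned weighting.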

6.7 Tensor factorization for multi-task learning

More recent work seeks to generalize existing approaches to MTL to deep learning. Reference [40] generalizes some of the existing factorization approaches using tensor factorization, splitting the model parameters of every layer into shared and task-specific parameters.

6.8 Sluice Networks

Finally, we turn to the Sluice Networks of [41], a model that generalizes many deep-learning-based multi-task learning approaches. As shown in Figure 8, the model learns which subspaces of each layer should be shared, and at which layers the network has learned the best representations of the input sequences.

6.9 What should I share in my model?

Having surveyed these recent approaches, let us briefly summarize what should be shared in deep multi-task learning models. Most approaches in the history of MTL have focused on the scenario in which tasks are drawn from the same distribution. While this scenario is favourable for sharing, it does not always hold. To develop more robust multi-task models, we have to be able to deal with unrelated or only loosely related tasks.

Early work on MTL for deep learning required the sharing structure between tasks to be specified in advance. This strategy does not scale and heavily biases the multi-task architecture. Hard parameter sharing, first proposed in the 1990s, is still the norm some 20 years later. While it is useful in many scenarios, hard parameter sharing quickly breaks down if the tasks are not closely related or require reasoning at different levels. Recent approaches have therefore looked at learning what to share, and they generally outperform hard parameter sharing. In addition, giving our models the capacity to learn a task hierarchy is helpful, particularly in cases that involve different granularities.

As mentioned at the outset, we are doing multi-task learning as soon as we optimize more than one loss function. Rather than constraining our model to compress the knowledge of all tasks into the same parameter space, it is helpful to draw on the advances in MTL discussed above and to enable our model to learn how the tasks should interact with each other.

7. Auxiliary tasks

Multi-task learning is a natural fit for situations in which we want predictions for several tasks at once. Such scenarios are common in finance or economics forecasting, where we might want to predict the values of many possibly related indicators, and in bioinformatics, where we might want to predict the symptoms of multiple diseases simultaneously. In most situations, however, we care about performance on only one task. In this section, we therefore look at how to find a suitable auxiliary task so that we can still benefit from multi-task learning.

7.1 Related tasks

Using a related task as an auxiliary task is the classical choice for MTL. To get an idea of what a related task can be, we present some prominent examples. Caruana (1997) uses tasks that predict different characteristics of the road as auxiliary tasks for predicting the steering direction of a self-driving car. Reference [42] uses head pose estimation and facial attribute inference as auxiliary tasks for facial landmark detection. Reference [43] jointly learns query classification and web search. Reference [44] jointly predicts the class and the position of an object in an image. Reference [45] jointly predicts the phoneme duration and frequency profile for text-to-speech.

7.2 Adversarial tasks

Often, labelled data for a related task is not available. In some circumstances, however, we have access to a task that is the opposite of what we want to achieve. Such data can be exploited with an adversarial loss, which is not minimized; instead, a gradient reversal layer is used to maximize the training error. This setup has been applied successfully in domain adaptation [46]. The adversarial task in this case is predicting the domain of the input. By reversing the gradient of the adversarial task, its loss is maximized. This benefits the main task, because it forces the model to learn representations that cannot distinguish between the domains.
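The behaviour of a gradient reversal layer can be shown without any framework: it passes features through unchanged in the forward pass and flips the sign of the gradient in the backward pass. The class below is a hand-written sketch with hypothetical names and a hypothetical `scale` factor; in practice this would be implemented as a custom autograd operation in a deep learning library.

```python
class GradientReversal:
    """Identity in the forward pass, negated (and scaled) gradient in the backward pass."""

    def __init__(self, scale=1.0):
        self.scale = scale  # hypothetical strength of the reversed gradient

    def forward(self, features):
        # The domain classifier sees exactly the shared features.
        return list(features)

    def backward(self, grad_output):
        # The shared layers receive the opposite of the domain classifier's
        # gradient, so a descent step on it increases the domain loss.
        return [-self.scale * g for g in grad_output]

layer = GradientReversal(scale=0.5)
features = [0.2, -1.0, 3.0]
passed_through = layer.forward(features)
reversed_grad = layer.backward([1.0, -2.0, 0.0])
```

The domain classifier itself is still trained to minimize its loss; only the layers below the reversal point are pushed towards features that make the domains indistinguishable.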

7.3 Hints

As mentioned before, MTL can be used to learn features that are not easy to learn from the original task alone. An effective way to achieve this is to use hints, i.e. to predict these features as an auxiliary task. Recent examples come from natural language processing: [47] predicts whether an input sentence contains a positive or negative sentiment word as an auxiliary task for sentiment analysis, and [48] predicts whether a sentence contains a name as an auxiliary task for name error detection.

7.4 Focusing attention

An auxiliary task can also be used to focus attention on parts of an image that the network might otherwise ignore. For instance, when learning to steer, a single-task model may ignore fine details of the image such as lane markings. Predicting the lane markings as an auxiliary task forces the model to learn to represent them, and this knowledge can then be used for the main task. Similarly, for facial recognition, we can predict the location of facial landmarks as an auxiliary task, since these are often distinctive.

7.5 Quantization Smoothing

For many tasks, the training objective is quantized: a continuous scale would be more plausible, but the available labels form a discrete set. This is the case in many scenarios that require human assessment to collect data, such as predicting the risk of a disease or sentiment analysis (positive, negative, neutral). In these cases, less quantized auxiliary tasks can help, because their smoother objective makes them easier to learn.

7.6 Predicting inputs

In some scenarios, it is impractical to use certain features as inputs because they are unhelpful for predicting the target output. They may nevertheless be able to guide the learning of the task. In those cases, the features can be used as outputs rather than inputs. Reference [49] presents several problems where this applies in practice.

7.7 Using the future to predict the present

In many situations, some features only become available after the predictions have to be made. For instance, a self-driving car can measure obstacles and road markings more accurately once it is passing them. Caruana (1997) also gives the example of pneumonia prediction, where additional diagnostic results only become available afterwards. In these examples, the additional data cannot be used as features, because it is not available as input at prediction time. It can, however, be used as an auxiliary task that imparts additional knowledge to the model during training.

7.8 Representation learning

The goal of an auxiliary task in MTL is to enable the model to learn shared representations that help the main task. All auxiliary tasks discussed so far do this implicitly: because they are closely related to the main task, learning them is likely to yield representations that benefit the main task. A more explicit approach is to use an auxiliary task that is known to let a model learn transferable representations. The language modelling objective used by Cheng et al. (2015) and in [50] plays this role. Similarly, an autoencoder objective can be used as an auxiliary task.

8. Which auxiliary tasks help the main task?

We have discussed various auxiliary tasks that can be used for MTL even when we only care about one task. We still do not know, however, which auxiliary task will be useful in practice. Finding an auxiliary task largely rests on the assumption that it should be related to the main task in some way and that it should help with learning the main task.

However, we still do not have a good notion of when two tasks should be considered related or similar. Caruana (1997) defines two tasks as similar if they use the same features to make a decision. Baxter [7] argues, on theoretical grounds only, that related tasks share a common optimal hypothesis class, i.e. they have the same inductive bias. Reference [51] proposes that two tasks are F-related if the data for both tasks can be generated from a fixed probability distribution using a set of transformations F. While this definition covers tasks that address the same classification problem, it does not apply to tasks that deal with different problems. Xue et al. (2007) finally argue that two tasks are similar if their classification boundaries, i.e. their parameter vectors, are close.

Despite these early theoretical advances in understanding task relatedness, there has been little recent progress. Task similarity is not binary but lies on a spectrum. More similar tasks should benefit more from MTL, and less similar tasks less. Allowing our models to learn what to share may let us temporarily work around the lack of theory and make better use of even loosely related tasks. We nevertheless need a more principled notion of task similarity to understand how to select auxiliary tasks.

Reference [52] finds that auxiliary tasks with compact and uniform label distributions are preferable for sequence labelling problems, which has been confirmed in experiments. In addition, [53] finds that gains are more likely for main tasks that plateau quickly when they are paired with non-plateauing auxiliary tasks.

These experiments, however, have so far been limited in scope, and the recent findings provide only first clues towards a deeper understanding of multi-task learning in neural networks.

9. Conclusion

This overview has reviewed both the history of multi-task learning and the latest progress on multi-task learning in deep neural networks. Although MTL is used more and more frequently, the 20-year-old hard parameter sharing paradigm is still the dominant approach for MTL in neural networks. Recent work on learning what to share, however, looks promising. At the same time, our understanding of tasks, including their similarity, relationship, hierarchy, and benefit for MTL, is still limited, and we need to learn more about them to better understand the generalization capabilities of multi-task learning in deep neural networks.

10. References

[0] Sharing representations across related tasks: understanding multi-task learning in deep neural networks in one article: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650728311&idx=1&sn=62b2dcc82657d1ce3bf91fd6a1197699

[1] Caruana, R. (1998). Multitask Learning. Autonomous Agents and Multi-Agent Systems, 27(1), 95-133.

[2] Caruana, R. (1993). Multitask Learning: A Knowledge-Based Source of Inductive Bias. Proceedings of the Tenth International Conference on Machine Learning.

[3] Baxter, J. (1997). A Bayesian/Information Theoretic Model of Learning to Learn via Multiple Task Sampling. Machine Learning, 28, 7-39.

[4] Duong, L., Cohn, T. et al. (2015). Low Resource Dependency Parsing: Cross-lingual Parameter Sharing in a Neural Network Parser. ACL 2015.

[5] Yang, Y. et al. (2017). Trace Norm Regularized Deep Multi-Task Learning. ICLR 2017 Workshop.

[6] Abu-Mostafa, Y. S. (1990). Learning from Hints in Neural Networks. Journal of Complexity.

[7] Baxter, J. (2000). A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research.

[8] Argyriou, A. (2007). Multi-Task Feature Learning. NIPS 2007.

[9] Zhang, C. and Huang, J. (2008). Model Selection Consistency of the Lasso Selection in High-Dimensional Linear Regression. Annals of Statistics.

[10] Yuan, M. and Lin, Y. (2006). Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society.

[11] Lounici, K. et al. (2009). Taking Advantage of Sparsity in Multi-Task Learning. stat.

[12] Negahban, S. et al. (2008). Joint Support Recovery under High-Dimensional Scaling: Benefits and Perils of ℓ1,∞-Regularization. NIPS 2008.

[13] Jalali, A. et al. (2010). A Dirty Model for Multi-Task Learning. NIPS 2010.

[14] Liu, S. et al. (2016). Distributed Multi-Task Relationship Learning. AISTATS 2016.

[15] Evgeniou, T. et al. (2005). Learning Multiple Tasks with Kernel Methods. Journal of Machine Learning Research.

[16] Evgeniou, T. et al. (2004). Regularized Multi-Task Learning. KDD 2004.

[17] Jacob, L. et al. (2009). Clustered Multi-Task Learning: A Convex Formulation. NIPS 2009.

[18] Kim, S. and Xing, E. P. (2010). Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity. ICML 2010.

[19] Chen, X. et al. (2010). Graph-Structured Multi-Task Regression and an Efficient Optimization Method for General Fused Lasso.

[20] Thrun, S. et al. (1996). Discovering Structure in Multiple Learning Tasks: The TC Algorithm. ICML 1996.

[21] Ando, R. K. et al. (2005). A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. JMLR.

[22] Heskes, T. (2000). Empirical Bayes for Learning to Learn. ICML 2000.

[23] Lawrence, N. D. et al. (2004). Learning to Learn with the Informative Vector Machine. ICML 2004.

[24] Yu, K. et al. (2005). Learning Gaussian Processes from Multiple Tasks. ICML 2005.

[25] Bakker, B. et al. (2003). Task Clustering and Gating for Bayesian Multi-Task Learning. JMLR.

[26] Xue, Y. et al. (2007). Multi-Task Learning for Classification with Dirichlet Process Priors. JMLR.

[27] Daumé III, H. (2009). Bayesian Multitask Learning with Latent Hierarchies.

[28] Zhang, Y. et al. (2010). A Convex Formulation for Learning Task Relationships in Multi-Task Learning. UAI 2010.

[29] Cavallanti, G. et al. (2010). Linear Algorithms for Online Multitask Classification. JMLR.

[30] Saha, A. et al. (2011). Online Learning of Multiple Tasks and Their Relationships. JMLR.

[31] Kang, Z. et al. (2011). Learning with Whom to Share in Multi-Task Feature Learning. ICML 2011.

[32] Kumar, A. et al. (2012). Learning Task Grouping and Overlap in Multi-Task Learning. ICML 2012.

[33] Crammer, K. et al. (2012). Learning Multiple Tasks Using Shared Hypotheses. NIPS 2012.

[34] Long, M. et al. (2015). Learning Multiple Tasks with Deep Relationship Networks.

[35] Lu, Y. et al. (2016). Fully-Adaptive Feature Sharing in Multi-Task Networks with Applications in Person Attribute Classification.

[36] Misra, I. et al. (2016). Cross-Stitch Networks for Multi-Task Learning. CVPR 2016.

[37] Søgaard, A. et al. (2016). Deep Multi-Task Learning with Low Level Tasks Supervised at Lower Layers. ACL 2016.

[38] Hashimoto, K. (2016). A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks.

[39] Kendall, A. et al. (2017). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics.

[40] Yang, Y. et al. (2017). Deep Multi-Task Representation Learning: A Tensor Factorization Approach. ICLR 2017.

[41] Ruder, S. (2017). Sluice Networks: Learning What to Share between Loosely Related Tasks.

[42] Zhang, Z. (2014). Facial Landmark Detection by Deep Multi-Task Learning. ECCV 2014.

[43] Liu, X. et al. (2015). Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval.

[44] Girshick, R. (2015). Fast R-CNN. ICCV 2015.

[45] Arik, S. O. et al. (2017). Deep Voice: Real-Time Neural Text-to-Speech. ICML 2017.

[46] Ganin, Y. (2015). Unsupervised Domain Adaptation by Backpropagation. ICML 2015.

[47] Yu, J. (2016). Learning Sentence Embeddings with Auxiliary Tasks for Cross-Domain Sentiment Classification. EMNLP 2016.

[48] Cheng, H. (2015). Open-Domain Name Error Detection Using a Multi-Task RNN. EMNLP 2015.

[49] Caruana, R. et al. (1997). Promoting Poor Features to Supervisors: Some Inputs Work Better as Outputs. NIPS 1997.

[50] Rei, M. (2017). Semi-Supervised Multitask Learning for Sequence Labeling. ACL 2017.

[51] Ben-David, S. et al. (2003). Exploiting Task Relatedness for Multiple Task Learning. Learning Theory and Kernel Machines.

[52] Alonso, H. M. et al. (2017). When is Multi-Task Learning Effective? Multitask Learning for Semantic Sequence Prediction under Varying Data Conditions. EACL 2017.

[53] Bingel, J. et al. (2017). Identifying Beneficial Task Relations for Multi-Task Learning in Deep Neural Networks. EACL 2017.
