Minimalist Notes: Cross-Stitch Networks for Multi-task Learning
Paper: https://arxiv.org/abs/1604.03539
This article studies how sharing network weights at different layers affects multi-task learning and, on that basis, proposes the cross-stitch unit to automatically learn the optimal sharing structure.
First, the article takes AlexNet as a base, splits off task-specific branches at different layers, and measures the resulting performance for each split point. Two pairs of visually related tasks are used for multi-task learning: <attribute classification, object detection> and <surface normal prediction, semantic segmentation>. The experimental results (see the figure in the paper) show that for the <attribute classification, detection> pair, no matter at which layer the task-specific branches start, the two tasks cannot both be improved at once, suggesting that this pair is inherently conflicting and unsuitable for joint training. For <surface normal prediction, semantic segmentation>, splitting at intermediate layers improves both tasks simultaneously, which indicates that the two tasks are strongly related, and also that the choice of the split layer has a large impact on final performance; a minimal sketch of such a split-at-layer network follows.
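To make the setup concrete, here is a small PyTorch sketch (not the authors' code) of a network that shares an AlexNet-style trunk up to a chosen block and then splits into two task-specific branches. The block sizes, the `split_at` parameter, and the head shapes are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

def make_trunk():
    # A small AlexNet-like stand-in trunk; the real experiments use AlexNet.
    return nn.ModuleList([
        nn.Sequential(nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2)),
        nn.Sequential(nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2)),
        nn.Sequential(nn.Conv2d(192, 384, 3, padding=1), nn.ReLU()),
        nn.Sequential(nn.Conv2d(384, 256, 3, padding=1), nn.ReLU()),
        nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2)),
    ])

class SplitAtLayerNet(nn.Module):
    """Shares the trunk up to `split_at`; later blocks and heads are per task."""
    def __init__(self, split_at, head_a, head_b):
        super().__init__()
        trunk = make_trunk()
        self.shared = nn.Sequential(*trunk[:split_at])
        self.branch_a = nn.Sequential(*copy.deepcopy(trunk[split_at:]), head_a)
        self.branch_b = nn.Sequential(*copy.deepcopy(trunk[split_at:]), head_b)

    def forward(self, x):
        h = self.shared(x)
        return self.branch_a(h), self.branch_b(h)

# Example: branch after the third conv block; each task gets its own head.
def head(num_outputs):
    return nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, num_outputs))

net = SplitAtLayerNet(split_at=3, head_a=head(20), head_b=head(64))
out_a, out_b = net(torch.randn(2, 3, 224, 224))
```

Sweeping `split_at` from the last block down to the first reproduces the "where to branch" experiment described above.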
The article then proposes the cross-stitch unit. The idea is simple: run two networks with identical architectures, one per task, and at each corresponding layer form a linear combination of the two networks' feature-map channels, which is then fed into the next layer of each network (the combination parameters are learned). (However, this channel-wise linear combination leads to unstable training for the <attribute classification, detection> pair.) A sketch of such a unit is given below.
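A minimal sketch of such a unit, assuming one learnable 2x2 mixing matrix per channel (the per-channel granularity is a choice made here, not necessarily the paper's); `CrossStitchUnit` and its argument names are invented for this sketch:

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Learnable linear mixing of two tasks' activations at one layer."""
    def __init__(self, num_channels, alpha_same=0.9, alpha_diff=0.1):
        super().__init__()
        # One 2x2 mixing matrix per channel, initialised as a convex
        # combination that favours each task's own activations.
        init = torch.tensor([[alpha_same, alpha_diff],
                             [alpha_diff, alpha_same]])
        self.alpha = nn.Parameter(init.repeat(num_channels, 1, 1))  # (C, 2, 2)

    def forward(self, x_a, x_b):
        # x_a, x_b: (N, C, H, W) activations from the two task networks.
        x = torch.stack([x_a, x_b], dim=2)                     # (N, C, 2, H, W)
        mixed = torch.einsum('cij,ncjhw->ncihw', self.alpha, x)
        return mixed[:, :, 0], mixed[:, :, 1]                  # back to two streams

# Example: mix the two tasks' activations after a layer with 64 channels.
unit = CrossStitchUnit(num_channels=64)
a, b = unit(torch.randn(2, 64, 28, 28), torch.randn(2, 64, 28, 28))
```

In the paper such units are inserted at several layers between the two AlexNet streams, so the amount of sharing at each depth is learned rather than hand-designed.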
Then comes the ablative analysis. The paper points out that the cross-stitch unit's parameters are initialized to positive values summing to 1 (a convex combination), although this constraint is not enforced during training, and that giving these parameters a higher learning rate than the backbone network helps accelerate convergence (see the optimizer sketch below). In addition, first training the two task-specific networks separately and then adding cross-stitch units for fine-tuning outperforms direct multi-task training from scratch. The article also experiments with different initial cross-stitch weights and inspects the trained weight values (see the table in the paper); the results show that the initialization has a large impact, and within the limited settings tested, the larger the ratio $\alpha_s : \alpha_d$ (the weight on a task's own activations versus the other task's), the better the performance.
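A small sketch of how the learning-rate detail might be wired up in PyTorch, reusing the `CrossStitchUnit` sketch above; the specific learning rates and the 10x factor are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class TinyTwoTaskNet(nn.Module):
    """Toy stand-in: one conv per task joined by a cross-stitch unit."""
    def __init__(self):
        super().__init__()
        self.conv_a = nn.Conv2d(3, 8, 3, padding=1)
        self.conv_b = nn.Conv2d(3, 8, 3, padding=1)
        self.stitch = CrossStitchUnit(num_channels=8)  # from the sketch above

    def forward(self, x):
        return self.stitch(self.conv_a(x), self.conv_b(x))

model = TinyTwoTaskNet()

# Give the cross-stitch parameters a larger learning rate than the backbone;
# the 10x factor here is an illustrative choice, not a value from the paper.
stitch_params = [p for n, p in model.named_parameters() if 'stitch' in n]
backbone_params = [p for n, p in model.named_parameters() if 'stitch' not in n]
optimizer = torch.optim.SGD(
    [{'params': backbone_params, 'lr': 1e-3},
     {'params': stitch_params, 'lr': 1e-2}],
    momentum=0.9,
)
```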
Finally, the paper also uses this structure to tackle the problem of having too few training samples, and the experiments show that the multi-task setup improves classification performance when training data is scarce.