Paper Note Series - DARTS: Differentiable Architecture Search

Summary

My understanding: the operations between nodes are discrete, since one of several candidate operations must be chosen for each edge. The authors use a softmax relaxation to make this choice continuous, so the architecture search task becomes learning the continuous variables \(\alpha = \{\alpha^{(i,j)}\}\) together with the weights \(w\). (Here \(\alpha\) can be understood as the encoding of the architecture.)

\(w\) and \(\alpha\) are then updated iteratively, which is a bilevel optimization problem; the details are described in 3. Approximation.

Objective

The authors' research goal:

Perform architecture search within a continuous domain, so that the architecture can be optimized using gradient descent.

Problem Statement
    • Architecture search in discrete domains

NAS and ENAS essentially search for architectures in a discrete space, and the paper criticizes such methods as follows: they treat architecture search as a black-box optimization problem over a discrete domain, so a large number of candidate architectures must be sampled and evaluated to find a suitable one, which makes the computational cost very large.

Original:

An inherent cause of inefficiency for the dominant approaches, e.g. based on RL, evolution, MCTS (Negrinho and Gordon, 2017), SMBO (Liu et al., 2017a) or Bayesian optimization (Kandasamy et al., 2018), is the fact that architecture search is treated as a black-box optimization problem over a discrete domain, which leads to a large number of architecture evaluations required.

    • Earlier architecture search in continuous domains

DARTS is not the first method to search in a continuous domain. Earlier work (Saxena and Verbeek, 2016; Ahmed and Torresani, 2017; Shin et al., 2018) also performed architecture search in continuous domains, but mainly to fine-tune specific properties of the architecture, such as convolution kernel sizes or branching patterns. DARTS differs from these methods in that it can discover high-performance architectures with complex graph topologies within a rich search space, and it can produce both convolutional and recurrent models.

Method(s)

The following are the ideas in this section:

1. First describe the search space in general form.

2. A simple continuous relaxation scheme[1] is then introduced for this search space, which yields a differentiable learning objective for the joint optimization of the architecture and its weights.

3. Finally, an approximation is proposed to make the algorithm computationally feasible and efficient.

1. Search Space

Following prior experience, this paper uses a cell as the basic unit of architecture search. The learned cells can be stacked to form a convolutional network or connected recursively to form a recurrent network.

A cell is a directed acyclic graph consisting of an ordered sequence of \(N\) nodes. Each node \(x^{(i)}\) is a latent representation (e.g., a feature map in a CNN), and each directed edge \((i,j)\) is associated with an operation \(o^{(i,j)}\) that transforms \(x^{(i)}\).

Assume that each cell has two input nodes and one output node. For convolutional cells, the input nodes are defined as the outputs of the previous two cells (Zoph et al., 2017). For recurrent cells, the input nodes are the current input and the hidden state from the previous time step. The output of the cell is obtained by applying a reduction operation (e.g., concatenation) to all intermediate nodes. Each intermediate node is computed as follows:

\[x^{(j)} = \sum_{i<j} o^{(i,j)}\bigl(x^{(i)}\bigr)\]

The candidate operations for \(o^{(i,j)}\) include a special zero operation, which indicates that there is no connection between the two nodes. The task of learning the cell therefore reduces to learning the operation on each of its edges.
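To make the cell structure concrete, here is a minimal sketch (my own toy code, not the authors' implementation) of how a cell evaluates its intermediate nodes once an operation has been chosen for every edge; the toy operations (identity, scale, zero) stand in for real candidates such as convolutions or pooling:

```python
# Toy cell evaluation following x^(j) = sum_{i<j} o^(i,j)(x^(i)).
# The operation set and tensor shapes here are illustrative assumptions.
import numpy as np

def identity(x):           # candidate op: skip connection
    return x

def scale(x, factor=0.5):  # stand-in for a parametric op (e.g. a convolution)
    return factor * x

def zero(x):               # the special zero op: no connection between nodes
    return np.zeros_like(x)

def compute_cell(inputs, chosen_ops):
    """inputs: the two input-node tensors.
    chosen_ops[(i, j)]: the operation selected for edge i -> j (i < j)."""
    nodes = list(inputs)                      # nodes 0 and 1 are the cell inputs
    last = max(j for _, j in chosen_ops)      # index of the last intermediate node
    for j in range(len(inputs), last + 1):
        # Each intermediate node sums the chosen op applied to every predecessor.
        nodes.append(sum(chosen_ops[(i, j)](nodes[i]) for i in range(j)))
    # The cell output applies a reduction (here: concatenation) to the intermediate nodes.
    return np.concatenate(nodes[len(inputs):], axis=-1)

x0, x1 = np.ones((1, 4)), 2 * np.ones((1, 4))
ops = {(0, 2): identity, (1, 2): scale,                    # edges into node 2
       (0, 3): zero, (1, 3): identity, (2, 3): scale}      # edges into node 3
print(compute_cell([x0, x1], ops).shape)                   # (1, 8)
```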

2. Continuous Relaxation and Optimization

Let \(\mathcal{O}\) denote the set of candidate operations (e.g., convolution, max pooling, and so on), where each operation is written as a function \(o(\cdot)\).

To make the search space continuous, the categorical choice of a particular operation is relaxed into a softmax over all possible operations:

\[\bar{o}^{(i,j)}(x) = \sum_{o\in\mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'\in\mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}\, o(x) \tag{1}\]

where the operation mixing weights for a pair of nodes \((i,j)\) are parameterized by a vector \(\alpha^{(i,j)}\) of dimension \(|\mathcal{O}|\).

After this relaxation, the task of architecture search reduces to learning the set of continuous variables \(\alpha = \{\alpha^{(i,j)}\}\); \(\alpha\) can therefore be viewed as the encoding of the architecture.

At the end of the search, each mixed operation \(\bar{o}^{(i,j)}\) is replaced by the most likely operation, i.e. \(o^{(i,j)} = \mathrm{argmax}_{o\in\mathcal{O}}\, \alpha_o^{(i,j)}\), so that a discrete architecture is obtained.
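As a concrete illustration of Equation (1) and of the final discretization step, here is a minimal NumPy sketch (not the reference implementation; the candidate operation set is an illustrative assumption): a mixed operation is a softmax-weighted sum of all candidate ops, and the final discrete choice is simply the argmax over \(\alpha\).

```python
# Continuous relaxation of the operation choice on a single edge (Eq. 1),
# plus the argmax discretization used at the end of the search.
import numpy as np

CANDIDATE_OPS = {
    "zero":     lambda x: np.zeros_like(x),
    "identity": lambda x: x,
    "double":   lambda x: 2.0 * x,   # stand-in for a learned op such as a conv
}

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op(x, alpha_edge):
    """Mixed operation o-bar^(i,j)(x) for one edge; alpha_edge has dimension |O|."""
    weights = softmax(alpha_edge)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))

def discretize(alpha_edge):
    """After search: replace the mixed op by the most likely operation."""
    names = list(CANDIDATE_OPS)
    return names[int(np.argmax(alpha_edge))]

alpha_edge = np.array([-1.0, 0.3, 1.2])   # architecture parameters for one edge
x = np.ones(4)
print(mixed_op(x, alpha_edge))            # softmax-weighted blend of all candidate ops
print(discretize(alpha_edge))             # -> "double"
```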

To jointly learn the architecture \(\alpha\) and the weights \(w\) within all the mixed operations, DARTS uses gradient descent to optimize the loss.

In the following, \(\mathcal{L}_{train}\) and \(\mathcal{L}_{val}\) denote the training and validation losses respectively; both are determined by \(\alpha\) and \(w\). The goal is to find \(\alpha^*\) that minimizes the validation loss \(\mathcal{L}_{val}(w^*, \alpha^*)\), where the weights \(w^* = \mathrm{argmin}_w\, \mathcal{L}_{train}(w, \alpha^*)\) are obtained by minimizing the training loss. This is expressed by the following bilevel problem:

\[\min_{\alpha} \;\; \mathcal{L}_{val}\bigl(w^*(\alpha), \alpha\bigr) \tag{2}\]
\[\text{s.t.} \;\;\; w^*(\alpha) = \mathrm{argmin}_{w}\; \mathcal{L}_{train}(w, \alpha) \tag{3}\]

Here s.t. stands for "subject to": Equation (2) is minimized under the condition that Equation (3) is satisfied, i.e. the outer problem over \(\alpha\) is evaluated at the weights \(w^*(\alpha)\) obtained from the inner problem.
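The following toy example (my own, not from the paper) makes the bilevel structure of Equations (2)-(3) explicit: for every candidate \(\alpha\), the inner problem must in principle be re-solved to obtain \(w^*(\alpha)\) before the outer validation loss can even be evaluated, which is exactly the cost the approximation in the next section avoids. The quadratic losses are made up purely for illustration.

```python
# Naive (exact) bilevel optimization on made-up quadratic losses.
import numpy as np

def L_train(w, alpha):   # inner objective, minimized over the weights w
    return (w - alpha) ** 2

def L_val(w, alpha):     # outer objective, minimized over the architecture alpha
    return (w - 1.0) ** 2 + 0.1 * alpha ** 2

def inner_solve(alpha, steps=200, lr=0.1):
    """w*(alpha) = argmin_w L_train(w, alpha), found by plain gradient descent."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2.0 * (w - alpha)          # gradient step on dL_train/dw
    return w

# Every change of alpha requires re-solving the inner problem from scratch --
# this is the expense that motivates the approximation in Section 3.
candidates = np.linspace(-2.0, 2.0, 41)
best = min(candidates, key=lambda a: L_val(inner_solve(a), a))
print("best alpha:", round(float(best), 2))   # about 0.9 for these toy losses
```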

3. Approximation

Solving this bilevel optimization problem exactly is difficult, because whenever \(\alpha\) changes, \(w^*(\alpha)\) has to be recomputed by solving Equation (3).

The paper therefore proposes an approximate iterative optimization procedure, in which \(w\) and \(\alpha\) are optimized by alternating gradient descent steps in the weight space and in the architecture space (see Alg. 1 for the algorithm).

Algorithm Description:

At step \(k\), given the current architecture \(\alpha_{k-1}\), we obtain \(w_k\) by taking a gradient step on \(\mathcal{L}_{train}(w_{k-1}, \alpha_{k-1})\). Then, keeping \(w_k\) fixed, we update the architecture to \(\alpha_k\) by minimizing the validation loss evaluated after a single virtual gradient step on the weights (Equation 4), where \(\xi\) is the learning rate of this virtual step.

\[\mathcal{L}_{val}(w', \alpha_{k-1}) = \mathcal{L}_{val}\bigl(w_k - \xi \nabla_w \mathcal{L}_{train}(w_k, \alpha_{k-1}),\; \alpha_{k-1}\bigr) \tag{4}\]

The architecture gradient is obtained by differentiating Equation (4) with respect to \(\alpha\), as shown in Equation (5) (the step index \(k\) is omitted for brevity):

\[\nabla_\alpha \mathcal{L}_{val}(w', \alpha) - \xi\, \nabla^2_{\alpha,w} \mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha) \tag{5}\]

The second term contains an expensive matrix-vector product. However, a derivative can be approximated by a central finite difference:

\[f'(x) \approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}\]

This gives:

\[\nabla^2_{\alpha,w} \mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha) \approx \frac{\nabla_\alpha \mathcal{L}_{train}(w^+, \alpha) - \nabla_\alpha \mathcal{L}_{train}(w^-, \alpha)}{2\epsilon} \tag{6}\]

where \(w^{+} = w + \epsilon \nabla_{w'} \mathcal{L}_{val}(w', \alpha)\) and \(w^{-} = w - \epsilon \nabla_{w'} \mathcal{L}_{val}(w', \alpha)\).

Evaluating this finite difference requires only two forward passes for the weights and two backward passes for \(\alpha\), reducing the complexity from \(\mathcal{O}(|\alpha|\,|w|)\) to \(\mathcal{O}(|\alpha| + |w|)\).
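Continuing the toy quadratic example from above (again my own sketch, not the paper's code), the approximate architecture gradient of Equations (5)-(6) can be written out directly; the hand-written derivatives below belong to the toy losses only.

```python
# Approximate architecture gradient with a finite-difference Hessian-vector
# product, using the same toy quadratic losses as in the previous sketch.

def dLtrain_dw(w, alpha):      return 2.0 * (w - alpha)    # d/dw  (w - alpha)^2
def dLtrain_dalpha(w, alpha):  return -2.0 * (w - alpha)   # d/da  (w - alpha)^2
def dLval_dw(w, alpha):        return 2.0 * (w - 1.0)      # d/dw  (w - 1)^2 + 0.1 a^2
def dLval_dalpha(w, alpha):    return 0.2 * alpha          # d/da  (w - 1)^2 + 0.1 a^2

def arch_gradient(w, alpha, xi=0.01, eps=1e-2):
    # Virtual weight step w' = w - xi * dL_train/dw  (Equation 4)
    w_virtual = w - xi * dLtrain_dw(w, alpha)
    # Vector entering the Hessian-vector product: dL_val/dw at (w', alpha)
    v = dLval_dw(w_virtual, alpha)
    # Central finite difference around w+ and w-  (Equation 6)
    w_plus, w_minus = w + eps * v, w - eps * v
    hvp = (dLtrain_dalpha(w_plus, alpha) - dLtrain_dalpha(w_minus, alpha)) / (2 * eps)
    # Architecture gradient (Equation 5): direct term minus xi * Hessian-vector product
    return dLval_dalpha(w_virtual, alpha) - xi * hvp

print(arch_gradient(w=0.3, alpha=0.5))   # this value would be used to update alpha
```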

4. Deriving Discrete Architecture

After the continuous architecture encoding \(\alpha\) is obtained, the discrete architecture is derived as follows: for each intermediate node, retain the top-\(k\) strongest incoming operations (excluding the zero operation) from distinct preceding nodes, where the strength of an operation is its softmax weight; the paper uses \(k = 2\) for convolutional cells and \(k = 1\) for recurrent cells.
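A minimal sketch of this derivation step (the operation names, \(\alpha\) values, and the exclusion of the zero op when ranking follow my reading of the paper and are illustrative, not the reference implementation):

```python
# Derive the discrete edges and operations feeding one intermediate node
# from the learned architecture parameters alpha.
import numpy as np

OP_NAMES = ["zero", "identity", "conv_3x3", "max_pool"]

def derive_node(alphas_into_node, k=2):
    """alphas_into_node: {predecessor index: alpha vector over OP_NAMES}."""
    best = {}
    for pred, alpha in alphas_into_node.items():
        weights = np.exp(alpha) / np.exp(alpha).sum()     # softmax over ops
        # Ignore the "zero" op when ranking edge strength.
        idx = max((i for i in range(len(OP_NAMES)) if OP_NAMES[i] != "zero"),
                  key=lambda i: weights[i])
        best[pred] = (OP_NAMES[idx], weights[idx])
    # Retain the k strongest edges, each from a distinct predecessor.
    kept = sorted(best.items(), key=lambda kv: kv[1][1], reverse=True)[:k]
    return [(pred, op) for pred, (op, _) in kept]

alphas = {0: np.array([0.1, 1.5, 0.2, -0.3]),   # edge from node 0
          1: np.array([2.0, 0.1, 0.3, 0.2]),    # edge from node 1
          2: np.array([0.0, 0.2, 1.1, 0.4])}    # edge from node 2
print(derive_node(alphas, k=2))   # -> [(0, 'identity'), (2, 'conv_3x3')]
```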

Evaluation

How do the authors evaluate their method? Are there any open questions, or ideas worth borrowing?

Conclusion

The contributions are as follows:

    • Introduces a novel algorithm for differentiable network architecture search that applies to both convolutional and recurrent structures.
    • Experiments show that the method is highly competitive.
    • Achieves remarkable architecture search efficiency (2.83% test error on CIFAR-10 after 4 GPU days of search; 56.1 perplexity on PTB after 6 hours of search), which is attributed to the use of gradient-based optimization rather than non-differentiable search techniques.
    • Shows that the architectures learned by DARTS on CIFAR-10 and PTB are transferable to ImageNet and WikiText-2.
Notes

Questions: What does relaxation mean here? Why does applying a softmax make the operation choice continuous? What does Equation (1) mean? What exactly is \(\alpha\)?



Original post by Marsggbo

2018-8-5

1. baike.baidu.com/item/%e6%9d%be%e5%bc%9b%e6%b3%95/12508962?fr=aladdin
