Microsoft & USTC propose NAO, a new method for automatic neural architecture design

Recently, Tie-Yan Liu and colleagues from Microsoft and the University of Science and Technology of China (USTC) published a paper introducing a new method for automatic neural architecture design, NAO, which consists of three parts: an encoder, a performance predictor, and a decoder. Experiments show that the architectures found by this method perform strongly on the CIFAR-10 image classification task and the PTB language modeling task, matching or surpassing the previous best architecture search methods while using significantly less computation.

From decades ago [13, 22] to today [48, 49, 28, 39, 8], the automatic design of neural network architectures has long been of interest to the machine learning community. Recent algorithms for automatic architecture design generally fall into two categories: methods based on reinforcement learning (RL) [48, 49, 37, 3] and methods based on evolutionary algorithms (EA) [42, 35, 39, 28, 38]. In RL-based approaches, the choice of an architectural component is treated as an action; a sequence of actions defines a neural network architecture, and its accuracy on the development set serves as the reward. In EA-based approaches, the search proceeds by mutating and recombining architectural components, and the better-performing architectures are kept to continue evolving.

It is easy to see that both the RL-based and EA-based methods essentially search in a discrete architecture space, because the choices that define a neural network architecture are typically discrete, such as the filter size in a CNN or the connection topology within an RNN cell. However, directly searching for the optimal architecture in a discrete space is inefficient, since the size of the search space grows exponentially with the number of choices. This paper proposes a new approach that optimizes the network architecture by mapping it into a continuous vector space (a network embedding) and performing gradient-based optimization in that space. On the one hand, similar to distributed representations in natural language processing, a continuous representation of an architecture is more compact and captures topological information more effectively; on the other hand, because the space is smoother, optimization in the continuous space is much easier than direct search in the discrete space.

The researchers call this optimization-based approach Neural Architecture Optimization (NAO), shown in Figure 1. The core of NAO is an encoder model that maps a neural network architecture to a continuous representation (the blue arrow on the left of Figure 1). On top of this continuous representation, a regression model is built to approximate the architecture's final performance, such as its classification accuracy on the development set (the yellow part in the middle of Figure 1). This regression model is similar to the performance predictors used in previous studies [4, 27, 11]. What differs is how the performance predictor is used: previous work [27] used the predictor as a heuristic to select generated architectures and speed up the search, whereas the new method directly optimizes against this module and obtains the continuous representation of a better network via gradient descent (the bottom-middle black arrow in Figure 1). The optimized representation is then used to produce a new neural network architecture that is predicted to perform better. To achieve this, NAO's other key module is a decoder that recovers a discrete architecture from its continuous representation (the red arrow on the right of Figure 1). The decoder is an LSTM model with an attention mechanism that enables precise recovery. The three components, encoder, performance predictor, and decoder, are trained jointly in a multi-task setting, which benefits the continuous representation: the decoder's reconstruction objective further improves the quality of the architecture embedding and makes performance prediction more effective.
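To make these three components concrete, here is a minimal, hypothetical sketch in PyTorch (not the authors' released code; the class names, the token-sequence encoding of an architecture, and the hidden sizes are all illustrative assumptions). The prediction and reconstruction losses are summed for joint training:

```python
import torch
import torch.nn as nn

class NAOModel(nn.Module):
    """Minimal sketch of NAO's three parts: encoder, performance predictor, decoder."""
    def __init__(self, vocab_size, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.predictor = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, arch_tokens):
        # Encoder: map the discrete architecture (a token sequence) to a
        # continuous representation e_x.
        enc_states, _ = self.encoder(self.embed(arch_tokens))
        e_x = enc_states.mean(dim=1)                 # one vector per architecture
        # Performance predictor: estimate dev-set accuracy from e_x.
        perf = self.predictor(e_x).squeeze(-1)
        # Decoder: reconstruct the token sequence, conditioned on e_x
        # (the paper uses an attention-based LSTM decoder and teacher forcing
        # with shifted inputs; both are simplified away here).
        h0 = e_x.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        dec_states, _ = self.decoder(self.embed(arch_tokens), (h0, c0))
        logits = self.out(dec_states)
        return e_x, perf, logits

def joint_loss(model, arch_tokens, dev_accuracy):
    """Multi-task objective: performance regression plus architecture reconstruction."""
    _, perf, logits = model(arch_tokens)
    pred_loss = nn.functional.mse_loss(perf, dev_accuracy)
    rec_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), arch_tokens.reshape(-1))
    return pred_loss + rec_loss
```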

Figure 1: The overall framework of NAO. The original architecture x is mapped to a continuous representation e_x by the encoder network. e_x is then optimized into e_x' by maximizing the output of the performance predictor f, and the decoder network converts e_x' into a new architecture x'.
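The step from e_x to e_x' in the figure amounts to gradient ascent on the predictor's output, followed by decoding. Below is a hedged sketch that reuses the hypothetical NAOModel above; the step size eta, the number of steps, and the greedy decoding loop are illustrative choices, not details taken from the paper:

```python
import torch

def optimize_embedding(model, arch_tokens, eta=0.1, steps=10):
    """Move an architecture's embedding e_x toward higher predicted performance."""
    e_x, _, _ = model(arch_tokens)
    e = e_x.detach().requires_grad_(True)
    for _ in range(steps):
        perf = model.predictor(e).sum()              # f(e): predicted performance
        grad, = torch.autograd.grad(perf, e)
        e = (e + eta * grad).detach().requires_grad_(True)   # e <- e + eta * df/de
    return e.detach()

def decode_architecture(model, e, max_len, start_token=0):
    """Greedily decode a discrete architecture x' from the optimized embedding e_x'."""
    h = (e.unsqueeze(0), torch.zeros_like(e).unsqueeze(0))
    token = torch.full((e.size(0), 1), start_token, dtype=torch.long)
    tokens = []
    for _ in range(max_len):
        dec_state, h = model.decoder(model.embed(token), h)
        token = model.out(dec_state).argmax(dim=-1)  # most likely next choice
        tokens.append(token)
    return torch.cat(tokens, dim=1)
```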

The researchers conducted a number of experiments to verify the effectiveness of NAO on image classification and language modeling tasks. In the architecture space commonly used in prior work [48, 49, 37, 27], the architecture found by NAO achieves a 2.07% test set error rate on CIFAR-10 (with cutout regularization [12]). In addition, on the PTB dataset the discovered architecture achieves a perplexity of 55.9, surpassing the previous best neural architecture search methods. The researchers also show that, combined with the weight sharing mechanism recently proposed in ENAS [37] to reduce the large cost of training the child-model parameter space, the method can efficiently discover powerful convolutional and recurrent architectures, for example within less than 10 hours on 1 GPU. The researchers will release the code and models soon.

Paper: Neural Architecture Optimization

Paper Link: https://arxiv.org/abs/1808.07233

Abstract: Automatic neural architecture design has proven very helpful for discovering powerful neural network structures. Existing methods, whether based on reinforcement learning (RL) or evolutionary algorithms (EA), conduct the search in a discrete space, which is highly inefficient. In this paper, we propose an automatic neural architecture design method based on continuous optimization. We call this new approach Neural Architecture Optimization (NAO). The method has three key components: (1) an encoder, which embeds/maps neural network architectures into a continuous space; (2) a predictor, which takes the continuous representation of a network as input and predicts its accuracy; and (3) a decoder, which maps the continuous representation of a network back to its architecture. The performance predictor and the encoder allow us to perform gradient-based optimization in the continuous space to find the embedding of a new, potentially more accurate architecture. This better embedding is then decoded back into a network by the decoder. Experiments show that the architectures discovered by this method perform strongly on the CIFAR-10 image classification task and the PTB language modeling task, matching or surpassing the best previous architecture search methods while using significantly fewer computing resources. Specifically, we obtain a 2.07% test set error rate on the CIFAR-10 image classification task and a test set perplexity of 55.9 on the PTB language modeling task. The best architectures discovered on both tasks can be successfully transferred to other tasks, such as CIFAR-100 and WikiText-2. Furthermore, combined with the recently proposed weight sharing mechanism, we discover powerful architectures on CIFAR-10 and PTB with very limited computational resources (10 hours on a GPU), with an error rate of 3.53% on the former task and a perplexity of 56.3 on the latter.

Table 1: Performance of different CNN models on the CIFAR-10 dataset.

B is the number of nodes within a cell. N is the number of times the discovered normal cell is unfolded to form the final CNN architecture. F denotes the filter size. #op is the number of different operations available for a branch within a cell, an indicator of the scale of the architecture space handled by an automatic architecture design algorithm. M is the total number of network architectures trained to achieve the reported performance. "/" indicates that the criterion is not meaningful for a particular algorithm. NAONet-WS denotes the architecture discovered by NAO with the weight sharing method.
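For readers unfamiliar with this cell-based search space, the sketch below shows one common way that N copies of a discovered normal cell and interleaved reduction cells are assembled into the final CNN, starting from F filters. The NASNet-style three-stage layout and the cell factories are assumptions for illustration, not details given in this article:

```python
import torch.nn as nn

def build_network(normal_cell, reduction_cell, N, F, num_classes=10):
    """Unfold a discovered normal cell N times per stage, with a reduction cell
    between stages that halves resolution and doubles the filter count
    (assumed NASNet-style layout; `normal_cell` and `reduction_cell` are
    factories returning nn.Module instances for a given channel count)."""
    layers, channels = [], F
    for stage in range(3):
        layers += [normal_cell(channels) for _ in range(N)]
        if stage < 2:
            layers.append(reduction_cell(channels))
            channels *= 2
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(channels, num_classes)]
    return nn.Sequential(*layers)
```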

Table 2: Performance of different CNN models on the CIFAR-100 dataset. NAONet denotes the optimal architecture that NAO discovered on CIFAR-10.

Table 3: Performance of different models and techniques on the PTB dataset. As in the CIFAR-10 experiments, NAO-WS denotes NAO with the weight sharing mechanism.

Table 4: Performance of different models and techniques on the WikiText-2 (WT2) dataset. NAONet denotes the optimal architecture that NAO discovered on PTB.
