SqueezeNet Paper Translation

Source: Internet
Author: User
Tags: dnn mxnet keras

Paper address: SqueezeNet
Translated by: Mu Ling
Time: November 2016.
Article link: http://blog.csdn.net/u014540717

1 Introduction and Motivation

Much of the recent research on deep convolutional neural networks (CNNs) has focused on improving accuracy on computer vision datasets. For a given accuracy level, there are typically multiple CNN architectures that achieve it. Given equivalent accuracy, a CNN architecture with fewer parameters has several advantages:
1. More efficient distributed training. Communication among servers is the limiting factor in the scalability of distributed CNN training. For distributed data-parallel training, communication overhead is proportional to the number of parameters in the model (Iandola et al., 2016). In short, small models train faster because they require less communication.
2. Less overhead when exporting new models to clients. For autonomous driving, companies such as Tesla periodically copy new models from their servers to customers' cars. This practice is often referred to as an over-the-air (OTA) update (translator's note: readers who have flashed firmware will know the term). Consumer Reports has found that the safety of Tesla's Autopilot semi-autonomous driving functionality has gradually improved with recent over-the-air updates (Consumer Reports, 2016). However, over-the-air updates of today's typical CNN/DNN models can require large data transfers: with AlexNet, this would require 240MB of communication from the server to the car. Smaller models require less communication, making frequent updates more feasible.
3. Feasible FPGA and embedded deployment. FPGAs often have less than 10MB of on-chip memory and no off-chip memory. For inference, a sufficiently small model can be stored directly on the FPGA instead of being bottlenecked by memory bandwidth (Qiu et al., 2016), enabling the FPGA to process video streams in real time. In addition, when CNNs are deployed on application-specific integrated circuits (ASICs), a sufficiently small model can be stored directly on-chip, and smaller models may enable the ASIC to fit on a smaller die.

As you can see, smaller CNN architectures have several advantages. With this in mind, we focus directly on the problem of identifying CNN architectures that have fewer parameters but equivalent accuracy compared to well-known models. We have discovered such an architecture, which we call SqueezeNet. In addition, we propose a more disciplined approach to searching the design space of novel CNN architectures.

The remainder of this article is organized as follows. In Section 2 we review the related work. Then, in Sections 3 and 4, we describe and evaluate the SqueezeNet architecture. After that, we turn our attention to understanding how CNN architectural design choices impact model size and accuracy. We gain this understanding by exploring the design space of SqueezeNet-like architectures. In Section 5, we perform design space exploration on the CNN microarchitecture, which we define as the organization and dimensionality of individual layers and modules. In Section 6, we perform design space exploration on the CNN macroarchitecture, which we define as the high-level organization of layers in a CNN. Finally, we conclude in Section 7. In short, Sections 3 and 4 are useful for CNN researchers as well as practitioners who simply want to apply SqueezeNet to a new application. The remaining sections are aimed at advanced researchers who intend to design their own CNN architectures.

2 Related Work

2.1 Model Compression

The overarching goal of our work is to identify a model that has very few parameters while preserving accuracy. One sensible way to approach this problem is to take an existing CNN model and compress it in a lossy fashion. In fact, a research community has grown up around the topic of model compression, and several approaches have been reported. A fairly straightforward approach by Denton et al. is to apply singular value decomposition (SVD) to a pretrained CNN model (Denton et al., 2014). Han et al. developed Network Pruning, which starts with a pretrained model, replaces parameters that fall below a threshold with zeros to form a sparse matrix, and then performs several iterations of training on the sparse CNN (Han et al., 2015b). More recently, Han et al. extended this work by combining Network Pruning with quantization (to 8 bits or fewer) and Huffman encoding to create an approach called Deep Compression (Han et al., 2015a), and further designed a hardware accelerator called EIE (Han et al., 2016a) that operates directly on the compressed model, achieving substantial speedups and energy savings.
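As a rough illustration of the pruning step described above (a minimal NumPy sketch of magnitude thresholding, not Han et al.'s actual implementation), the idea is simply to zero out small weights and then fine-tune:

```python
import numpy as np

# Magnitude-based pruning: parameters whose absolute value falls below a
# threshold are replaced with zero, producing a sparse weight matrix that
# is then fine-tuned for several training iterations (Han et al., 2015b).
def prune(weights, threshold):
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

weights = np.random.randn(256, 512).astype(np.float32)  # illustrative layer
pruned, mask = prune(weights, threshold=0.5)
print(f"sparsity: {1.0 - mask.mean():.2%}")  # fraction of weights zeroed out
```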

2.2 CNN Microarchitecture

Convolutions have been used in artificial neural networks for at least 25 years; LeCun et al. helped to popularize CNNs for digit recognition applications in the late 1980s (LeCun et al., 1989). In neural networks, convolution filters are typically 3D, with height, width, and channels as the key dimensions. When applied to images, a CNN filter usually has 3 channels in its first layer (i.e., RGB), and in each subsequent layer Li, the filters have the same number of channels as layer Li-1 has filters. The early work of LeCun et al. (LeCun et al., 1989) uses 5x5xChannels filters, and the recent VGG (Simonyan & Zisserman, 2014) architectures make extensive use of 3x3 filters. Models including Network-in-Network (Lin et al., 2013) and the GoogLeNet family of architectures (Szegedy et al., 2014; Ioffe & Szegedy, 2015; Szegedy et al., 2015; 2016) use 1x1 filters in some layers.

With the trend toward designing very deep CNNs, manually selecting filter dimensions for each layer becomes cumbersome. To address this, various higher-level building blocks, or modules, composed of multiple convolution layers with a specific fixed organization have been proposed. For example, the GoogLeNet papers propose Inception modules, which comprise filters of several different dimensionalities, usually including 1x1 and 3x3, sometimes 5x5 (Szegedy et al., 2014) and sometimes 1x3 and 3x1 (Szegedy et al., 2015). Many such modules are then combined, perhaps with additional ad-hoc layers, to form a complete network. We use the term CNN microarchitecture to refer to the particular organization and dimensions of the individual modules.

2.3 CNN Macroarchitecture

While the CNN microarchitecture refers to individual layers and modules, we define the CNN macroarchitecture as the system-level organization of multiple modules into an end-to-end CNN architecture.

Perhaps the most widely studied CNN macroarchitecture topic in the recent literature is the impact of depth (i.e., number of layers) in networks. Simonyan and Zisserman proposed the VGG (Simonyan & Zisserman, 2014) family of CNNs with 12 to 19 layers and reported that deeper networks produce higher accuracy on the ImageNet-1k dataset (Deng et al., 2009). K. He et al. proposed deeper CNNs with up to 30 layers that deliver even higher ImageNet accuracy (He et al., 2015a).

The choice of connections across multiple layers or modules is an emerging area of CNN macroarchitectural research. Residual Networks (ResNet) (He et al., 2015b) and Highway Networks (Srivastava et al., 2015) each propose connections that skip over multiple layers, for example additively connecting the activations of layer 3 to the activations of layer 6; we refer to these as bypass connections. The authors of ResNet provide an A/B comparison of a 34-layer CNN with and without bypass connections; adding bypass connections improves Top-5 ImageNet accuracy by 2 percentage points.

2.4 Neural Network Design Space Exploration

Neural networks (including deep convolutional neural networks) have a large design space, with numerous options for microarchitectures, macroarchitectures, solvers, and other hyperparameters. It seems natural that the community would want to gain intuition about how these factors affect a neural network's accuracy (i.e., the shape of the design space). Much of the work on design space exploration (DSE) of neural networks has focused on developing automated approaches for finding neural network architectures that deliver higher accuracy. These automated DSE approaches include Bayesian optimization (Snoek et al., 2012), simulated annealing (Ludermir et al., 2006), randomized search (Bergstra & Bengio, 2012), and genetic algorithms (Stanley & Miikkulainen, 2002). To their credit, each of these papers provides a case in which the proposed DSE approach produces a neural network architecture with higher accuracy than a representative baseline. However, these papers make no attempt to provide intuition about the shape of the neural network design space. Later in this article, we eschew automated approaches; instead, we refactor CNNs in such a way that we can carry out principled A/B comparisons to investigate how CNN architectural decisions influence model size and accuracy.

In the sections that follow, we first propose and evaluate the SqueezeNet architecture with and without model compression. We then explore the impact of design choices in microarchitecture and macroarchitecture for SqueezeNet-like CNN architectures.

3 SqueezeNet: Preserving Accuracy with Few Parameters

In this section, we begin by outlining our design strategies for CNN architectures with few parameters. Then we introduce the Fire module, our new building block out of which to build CNN architectures. Finally, we use our design strategies to construct SqueezeNet, which is composed mainly of Fire modules.

3.1 Architectural Design Strategies

Our overarching objective in this article is to identify a CNN architecture that has very few parameters while maintaining competitive accuracy. To achieve this, we employ three main strategies when designing CNN architectures:

Strategy 1. Replace 3x3 filters with 1x1 filters. Given a budget of a certain number of convolution filters, we choose to make the majority of these filters 1x1, since a 1x1 filter has 9x fewer parameters than a 3x3 filter.

Strategy 2. Decrease the number of input channels to 3x3 filters. Consider a convolution layer that is composed entirely of 3x3 filters. The total quantity of parameters in this layer is (number of input channels) * (number of filters) * (3 * 3). So, to maintain a small total number of parameters in a CNN, it is important not only to decrease the number of 3x3 filters (see Strategy 1 above), but also to decrease the number of input channels to the 3x3 filters. We decrease the number of input channels to 3x3 filters using squeeze layers, which we describe in the next section.
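To make the arithmetic concrete, here is a small illustrative calculation; the channel and filter counts below are made up for the example and are not taken from SqueezeNet:

```python
# Illustrative arithmetic for Strategy 2: the weights of a 3x3 conv layer
# number (input channels) * (number of filters) * (3 * 3), ignoring biases.
def conv3x3_params(in_channels, num_filters):
    return in_channels * num_filters * 3 * 3

# 256 input channels feeding 256 3x3 filters:
print(conv3x3_params(256, 256))            # 589,824 weights

# Squeezing the input down to 32 channels first (32 1x1 filters),
# then applying the same 256 3x3 filters:
squeeze = 256 * 32 * 1 * 1                  # 8,192 weights in the 1x1 squeeze layer
print(squeeze + conv3x3_params(32, 256))    # 8,192 + 73,728 = 81,920 weights
```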

Strategy 3. Downsample late in the network so that convolution layers have large activation maps. In a convolutional network, each convolution layer produces an output activation map with a spatial resolution that is at least 1x1 and often much larger than 1x1. The height and width of these activation maps are controlled by: (1) the size of the input data (e.g., 256x256 images) and (2) the choice of layers in which to downsample in the CNN architecture. Most commonly, downsampling is engineered into CNN architectures by setting the stride to greater than 1 in some of the convolution or pooling layers (e.g., Szegedy et al., 2014; Simonyan & Zisserman, 2014; Krizhevsky et al., 2012). If early layers in the network have large strides, then most layers will have small activation maps. Conversely, if most layers in the network have a stride of 1, and the strides greater than 1 are concentrated toward the end of the network, then many layers in the network will have large activation maps. Our intuition is that, all else held equal, large activation maps (due to delayed downsampling) can lead to higher classification accuracy. Indeed, K. He and H. Sun applied delayed downsampling to four different CNN architectures, and in each case delayed downsampling led to higher classification accuracy (He & Sun, 2015).
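The following toy calculation illustrates the intuition; the layer counts and stride placements are invented for the example, not taken from any particular architecture:

```python
# Illustrative only: each stride-2 convolution or pooling layer roughly halves
# the height and width of every activation map that follows it.
def activation_sizes(input_size, strides):
    sizes, size = [], input_size
    for s in strides:
        size //= s
        sizes.append(size)
    return sizes

# Early downsampling: most layers see small activation maps.
print(activation_sizes(224, [2, 2, 2, 1, 1, 1, 1]))  # [112, 56, 28, 28, 28, 28, 28]
# Delayed downsampling: most layers see large activation maps.
print(activation_sizes(224, [1, 1, 1, 1, 2, 2, 2]))  # [224, 224, 224, 224, 112, 56, 28]
```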


Figure 1: Microarchitectural view: organization of convolution filters in the Fire module. In this example, $s_{1x1} = 3$, $e_{1x1} = 4$, and $e_{3x3} = 4$. We illustrate the convolution filters but not the activations.

Strategies 1 and 2 are about judiciously decreasing the number of parameters in a CNN while attempting to preserve accuracy. Strategy 3 is about maximizing accuracy on a limited budget of parameters. Next, we describe the Fire module, the building block of our CNN architectures that enables us to successfully employ Strategies 1, 2, and 3.

3.2 The Fire Module

We define the Fire module as follows. A Fire module comprises a squeeze convolution layer (which has only 1x1 filters) feeding into an expand layer that has a mix of 1x1 and 3x3 convolution filters; we illustrate this in Figure 1. The liberal use of 1x1 filters in Fire modules is an application of Strategy 1 from Section 3.1. We expose three tunable dimensions (hyperparameters) in a Fire module: $s_{1x1}$, $e_{1x1}$, and $e_{3x3}$. In a Fire module, $s_{1x1}$ is the number of filters in the squeeze layer (all 1x1), $e_{1x1}$ is the number of 1x1 filters in the expand layer, and $e_{3x3}$ is the number of 3x3 filters in the expand layer. When we use Fire modules, we set $s_{1x1}$ to be less than ($e_{1x1}$ + $e_{3x3}$), so the squeeze layer helps to limit the number of input channels to the 3x3 filters, as per Strategy 2 in Section 3.1.
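To make the module concrete, here is a minimal sketch in Keras, one of the frameworks listed in Section 3.3.1. The function and argument names are ours, not from the released configuration; the zero-padding and ReLU choices follow the details given in Section 3.3.1.

```python
# A minimal Keras (tf.keras) sketch of the Fire module, assuming the
# hyperparameter names s1x1, e1x1, e3x3 from Section 3.2.
from tensorflow.keras import layers

def fire_module(x, s1x1, e1x1, e3x3):
    # Squeeze layer: 1x1 convolutions only (Strategy 1), with
    # s1x1 < e1x1 + e3x3 so that few channels reach the 3x3 filters (Strategy 2).
    squeeze = layers.Conv2D(s1x1, (1, 1), activation='relu')(x)
    # Expand layer: a mix of 1x1 and 3x3 filters, implemented as two parallel
    # convolutions whose outputs are concatenated along the channel axis
    # (see item 6 in Section 3.3.1); 'same' padding keeps heights/widths equal.
    expand_1x1 = layers.Conv2D(e1x1, (1, 1), activation='relu')(squeeze)
    expand_3x3 = layers.Conv2D(e3x3, (3, 3), padding='same', activation='relu')(squeeze)
    return layers.Concatenate(axis=-1)([expand_1x1, expand_3x3])
```

For example, calling fire_module(x, s1x1=16, e1x1=64, e3x3=64) yields a module whose squeeze layer emits only 16 channels into an expand layer of 128 filters, satisfying $s_{1x1} < e_{1x1} + e_{3x3}$.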

3.3 The SqueezeNet Architecture

We now describe the SqueezeNet CNN architecture. As shown in Figure 2, SqueezeNet begins with a standalone convolution layer (conv1), followed by 8 Fire modules (fire2-9), and ends with a final convolution layer (conv10). We gradually increase the number of filters per Fire module from the beginning to the end of the network. SqueezeNet performs max-pooling with a stride of 2 after layers conv1, fire4, fire8, and conv10; these relatively late placements of pooling follow Strategy 3 from Section 3.1. We present the full SqueezeNet architecture in Table 1.
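The following Keras sketch assembles the macroarchitecture described above. It is a sketch rather than the released configuration: the filter counts are the ones published in Table 1 of the SqueezeNet paper (v1.0), and the final global average pooling and softmax follow the released Caffe model, so treat the exact numbers as assumptions if you are reproducing a different variant.

```python
from tensorflow.keras import layers, Model, Input

def fire(x, s1x1, e1x1, e3x3):
    # Fire module as sketched in Section 3.2.
    sq = layers.Conv2D(s1x1, 1, activation='relu')(x)
    e1 = layers.Conv2D(e1x1, 1, activation='relu')(sq)
    e3 = layers.Conv2D(e3x3, 3, padding='same', activation='relu')(sq)
    return layers.Concatenate()([e1, e3])

inputs = Input(shape=(224, 224, 3))
x = layers.Conv2D(96, 7, strides=2, activation='relu')(inputs)  # conv1
x = layers.MaxPooling2D(3, strides=2)(x)                        # maxpool after conv1
x = fire(x, 16, 64, 64)      # fire2
x = fire(x, 16, 64, 64)      # fire3
x = fire(x, 32, 128, 128)    # fire4
x = layers.MaxPooling2D(3, strides=2)(x)                        # maxpool after fire4
x = fire(x, 32, 128, 128)    # fire5
x = fire(x, 48, 192, 192)    # fire6
x = fire(x, 48, 192, 192)    # fire7
x = fire(x, 64, 256, 256)    # fire8
x = layers.MaxPooling2D(3, strides=2)(x)                        # maxpool after fire8
x = fire(x, 64, 256, 256)    # fire9
x = layers.Dropout(0.5)(x)                                      # dropout after fire9 (Section 3.3.1)
x = layers.Conv2D(1000, 1, activation='relu')(x)                # conv10
x = layers.GlobalAveragePooling2D()(x)                          # average pooling after conv10, per Table 1
outputs = layers.Activation('softmax')(x)
model = Model(inputs, outputs)
model.summary()
```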

Figure 2: Macroarchitectural view of our SqueezeNet architecture. Left: SqueezeNet (Section 3.3); Middle: SqueezeNet with simple bypass (Section 6); Right: SqueezeNet with complex bypass (Section 6).

3.3.1 Other SqueezeNet Details

For brevity, we have omitted a number of details and design choices about SqueezeNet from Table 1 and Figure 2. We provide these design choices below. The intuition behind these choices may be found in the papers cited below.

1. So that the output activations from the 1x1 and 3x3 filters have the same height and width, we add a 1-pixel border of zero-padding to the input data of the 3x3 filters in the expand modules.

2. ReLU (Nair & Hinton, 2010) is applied to the activations of the squeeze and expand layers.

3. Dropout (Srivastava et al., 2014) with a ratio of 50% is applied after the fire9 module.

4. Note the lack of fully-connected layers in SqueezeNet; this design choice was inspired by the NiN (Lin et al., 2013) architecture.

5. When training SqueezeNet, we begin with a learning rate of 0.04 and linearly decrease the learning rate throughout training, as described in (Mishkin et al., 2016); a sketch of this schedule appears after this list. For further details on the training protocol (e.g., batch size, learning rate, parameter initialization), please refer to our Caffe-compatible configuration files, located here: https://github.com/DeepScale/SqueezeNet.

6. The Caffe framework does not natively support a convolution layer that contains multiple filter resolutions (e.g., 1x1 and 3x3) (Jia et al., 2014). To get around this, we implement our expand layer with two separate convolution layers: a layer with 1x1 filters and a layer with 3x3 filters. We then concatenate the outputs of these layers together in the channel dimension. This is numerically equivalent to implementing one layer that contains both 1x1 and 3x3 filters.
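Here is a minimal sketch of the linear learning-rate schedule from item 5 above. The starting rate of 0.04 comes from the text; the total iteration count is a placeholder to be taken from the released solver configuration.

```python
# Linear learning-rate decay: start at 0.04 and decrease linearly toward zero
# over the course of training (cf. Mishkin et al., 2016).
def linear_lr(iteration, total_iterations, base_lr=0.04):
    return base_lr * (1.0 - iteration / float(total_iterations))

TOTAL_ITERATIONS = 170_000  # placeholder; see the solver file in the SqueezeNet repo
for it in (0, 85_000, 170_000):
    print(it, round(linear_lr(it, TOTAL_ITERATIONS), 4))  # 0.04, 0.02, 0.0
```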

We released the SqueezeNet configuration files in a format defined by the Caffe CNN framework. However, Caffe is not the only CNN framework; several others exist, including MXNet (Chen et al., 2015a), Chainer (Tokui et al., 2015), Keras (Chollet, 2016), and Torch (Collobert et al., 2011). Each of these has its own native format for representing a CNN architecture. That said, most of these libraries use the same underlying computational back-ends, such as cuDNN (Chetlur et al., 2014) and MKL-DNN (Das et al., 2016). The research community has ported the SqueezeNet CNN architecture to be compatible with a number of other CNN software frameworks:

MXNet (Chen et al., 2015a) port of SqueezeNet: (Haria, 2016)
Chainer (Tokui et al., 2015) port of SqueezeNet: (Bell, 2016)
Keras (Chollet, 2016) port of SqueezeNet: (DT42, 2016)
Torch (Collobert et al., 2011) port of the SqueezeNet Fire modules: (Waghmare, 2016)

4 Evaluation of SqueezeNet

We now turn our attention to evaluating SqueezeNet. In each of the CNN model compression papers reviewed in Section 2.1, the goal was to compress an AlexNet (Krizhevsky et al., 2012) model that was trained to classify images using the ImageNet (Deng et al., 2009) (ILSVRC 2012) dataset. Therefore, we use AlexNet and the associated model compression results as a basis for comparison when evaluating SqueezeNet.

Table 1: SqueezeNet architectural dimensions. (The formatting of this table was inspired by the Inception2 paper (Ioffe & Szegedy, 2015).)

In Table 2, we review SqueezeNet in the context of recent model compression results. The SVD-based approach is able to compress a pretrained AlexNet model by a factor of 5x, while diminishing top-1 accuracy to 56.0% (Denton et al., 2014). Network Pruning achieves a 9x reduction in model size while maintaining the baseline of 57.2% top-1 and 80.3% top-5 accuracy on ImageNet (Han et al., 2015b). Deep Compression achieves a 35x reduction in model size while still maintaining the baseline accuracy level (Han et al., 2015a). Now, with SqueezeNet, we achieve a 50x reduction in model size compared to AlexNet, while meeting or exceeding the top-1 and top-5 accuracy of AlexNet. We summarize all of the aforementioned results in Table 2.

It appears that we have surpassed the state-of-the-art results from the model compression community: even when using uncompressed 32-bit values to represent the model, SqueezeNet is smaller than the best compressed AlexNet models while maintaining or exceeding the baseline accuracy.
