ResNet Paper Translation

Source: Internet
Author: User
Tags: nets

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. Instead of learning unreferenced functions, we explicitly reformulate the layers as learning residual functions with reference to the layer inputs. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8 times deeper than VGG nets [40] but still with lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won first place in the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are the foundation of our ILSVRC & COCO 2015 submissions, where we also won first place on the ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation tasks.

1. Introduction

Deep convolutional neural networks [22,21] have led to a series of breakthroughs in image classification [21,49,39]. Deep networks naturally integrate low/mid/high-level features [49] and classifiers in an end-to-end multilayer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth). Recent evidence [40,43] reveals that network depth is of crucial importance, and the leading results [40,43,12,16] on the challenging ImageNet dataset [35] all exploit "very deep" [40] models, with a depth of sixteen [40] to thirty [16] layers.

Many other non-trivial visual recognition tasks [7,11,6,32,27] have also benefited greatly from very deep models. Driven by the significance of depth, a question arises: is learning better networks as simple as stacking more layers? One obstacle to answering this question is the notorious problem of vanishing/exploding gradients [14,1,8], which hampers convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23,8,36,12] and intermediate normalization layers [16], which enable networks with many layers to start converging under stochastic gradient descent (SGD) with back-propagation [22]. When deeper networks are able to start converging, a degradation problem is exposed: as the network depth increases, accuracy becomes saturated (which may be unsurprising) and then degrades rapidly. Unexpectedly, this degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [10,41] and confirmed by our experiments.

Figure 1. Training error (left) and test error (right) on CIFAR-10 for 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus higher test error. Similar phenomena on ImageNet are shown in Figure 4.

A typical example is shown in Figure 1. The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction for the deeper model: the added layers are identity mappings, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers are unable to find solutions that are comparably good or better than the constructed solution, or are unable to do so in feasible time. In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping that each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping.

Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping F(x) := H(x) - x. The original mapping is then recast as F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. In the extreme case, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

The formulation F(x) + x can be realized by feedforward neural networks with "shortcut connections" (Figure 2). Shortcut connections [2,33,48] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Figure 2). Identity shortcut connections add neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by SGD with back-propagation, and can be easily implemented using common libraries such as Caffe [19] without modifying the solvers.
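For readers who want to see the F(x) + x idea in code, the following is a minimal PyTorch sketch of a two-layer residual block with an identity shortcut (the paper's own implementation uses Caffe); the class name BasicResidualBlock and the layer sizes are illustrative assumptions rather than code from the paper, and the batch-norm placement follows the implementation details given later in Section 3.4.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Minimal sketch of a two-layer residual block: y = F(x, {Wi}) + x."""
    def __init__(self, channels):
        super().__init__()
        # F(x) is two 3x3 convolutions, each followed by batch norm; ReLU sits
        # between them, and a second ReLU is applied after the addition.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # first nonlinear layer
        out = self.bn2(self.conv2(out))           # second layer, no ReLU yet
        return self.relu(out + x)                 # identity shortcut, then ReLU
```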

We present comprehensive experiments on ImageNet [35] to show the degradation problem and evaluate our method. We show that: 1) our extremely deep residual nets are easy to optimize, whereas the counterpart "plain" nets (which simply stack layers) exhibit higher training error as the depth increases; 2) our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks. Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not specific to a particular dataset. We successfully trained models with over 100 layers on this dataset, and explored models with over 1000 layers.

On the ImageNet classification dataset [35], we obtain excellent results with extremely deep residual nets. Our 152-layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [40]. Our ensemble has a 3.57% top-5 error on the ImageNet test set and won first place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and led us to win first place in the ILSVRC & COCO 2015 competitions on ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. This strong evidence shows that the residual learning principle is generic, and we expect it to be applicable to other vision and non-vision problems.

2. Related Work

Residual representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and the Fisher vector [30] can be formulated as a probabilistic version of VLAD [18]. Both are powerful shallow representations for image retrieval and classification [4,47]. For vector quantization, encoding residual vectors [17] has been shown to be more effective than encoding original vectors.

In low-level vision and computer graphics, the widely used Multigrid method [3] reformulates a system of partial differential equations (PDEs) as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [44,45], which relies on variables that represent residual vectors between two scales. It has been shown [3,44,45] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify optimization.

Shortcut connections. The practices and theories that lead to shortcut connections [2,33,48] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) was to add a linear layer connected from the network input to the output [33,48].

In [43,24], a few intermediate layers are directly connected to auxiliary classifiers to address vanishing/exploding gradients. The papers of [38,37,31,46] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [43], an "inception" layer is composed of a shortcut branch and a few deeper branches. Concurrent with our work, "highway networks" [41,42] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our parameter-free identity shortcuts. When a gated shortcut is "closed" (approaching zero), the layers in highway networks represent non-residual functions. By contrast, our formulation always learns residual functions; our identity shortcuts are never closed, so all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).

3. Deep Residual Learning

3.1. Residual Learning

Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire network), with x denoting the input to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual function, i.e., H(x) - x (assuming the input and output are of the same dimensions). So rather than expecting the stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) - x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired function (as hypothesized), the ease of learning might be different. This reformulation is motivated by the counterintuitive phenomena of the degradation problem (Figure 1, left).

As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulty approximating identity mappings with multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings. In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping than to learn the function as a new one. We show by experiments (Figure 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

3.2. Identity Mapping by Shortcuts

We adopt residual learning to every few stacked layers. A building block is shown in Figure 2. Formally, in this paper we consider a building block defined as:

y = F(x, {Wi}) + x.    (1)

Here x and y are the input and output vectors of the layers considered. The function F(x, {Wi}) represents the residual mapping to be learned. For the example in Figure 2, which has two layers, F = W2 σ(W1 x), in which σ denotes ReLU [29] and the biases are omitted to simplify notation. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y), see Figure 2).

The shortcut connections in Eqn. (1) introduce neither extra parameters nor computational complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition). The dimensions of x and F in Eqn. (1) must be equal. If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection Ws by the shortcut connection to match the dimensions:

y = F(x, {Wi}) + Ws x.    (2)

We can also use a square matrix Ws in Eqn. (1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, so Ws is only used when matching dimensions. The form of the residual function F is flexible. The experiments in this paper involve a function F that has two or three layers (Figure 5), while more layers are possible. But if F has only a single layer, Eqn. (1) becomes similar to a linear layer, y = W1 x + x, for which we have not observed advantages. We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function F(x, {Wi}) can represent multiple convolutional layers, in which case the element-wise addition is performed on two feature maps, channel by channel.
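As an illustration of Eqn. (2), a common way to realize Ws for convolutional feature maps is a 1x1 convolution on the shortcut path; the helper name projection_shortcut below and the batch-norm after the projection are assumptions for this sketch, not details taken from the paper.

```python
import torch.nn as nn

def projection_shortcut(in_channels, out_channels, stride=1):
    """Sketch of Ws in Eqn. (2): a 1x1 convolution that matches the channel
    count (and, with stride 2, the spatial size) of F(x)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm2d(out_channels),
    )
```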

3.3. Network Architectures

We have tested various plain/residual nets and observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.

Plain network. Our plain baseline (Figure 3, middle) is mainly inspired by the philosophy of VGG nets [40] (Figure 3, left).

The convolutional layers mostly have 3x3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Figure 3 (middle). It is worth noticing that our model has fewer filters and lower complexity than VGG nets [40] (Figure 3, left). Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).

Residual network. Based on the above plain network, we insert shortcut connections (Figure 3, right), which turn the network into its counterpart residual version. The identity shortcuts (Eqn. (1)) can be used directly when the input and output are of the same dimensions (solid-line shortcuts in Figure 3). When the dimensions increase (dotted-line shortcuts in Figure 3), we consider two options: (A) the shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions (this option introduces no extra parameters); (B) the projection shortcut in Eqn. (2) is used to match dimensions (done by 1x1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
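Option (A) is less commonly shown in reference implementations than the 1x1 projection, so here is a minimal sketch of a parameter-free zero-padding shortcut in PyTorch; the function name pad_identity_shortcut and the use of strided slicing for the spatial subsampling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pad_identity_shortcut(x, out_channels, stride=2):
    """Option (A): identity shortcut across a dimension increase.
    Spatially subsample by strided slicing, then zero-pad the extra channels."""
    if stride > 1:
        x = x[:, :, ::stride, ::stride]      # halve the feature map size
    extra = out_channels - x.size(1)         # number of zero channels to append
    # pad format for 4D tensors: (W_left, W_right, H_top, H_bottom, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, extra))
```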

3.4. Implementation

Our implementation for ImageNet follows the practice in [21,40]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [40]. A 224x224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [12] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 x 10^4 iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [13], following the practice in [16]. In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [40,12] and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).
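As an illustration of the optimization settings described above, here is a minimal PyTorch sketch; the placeholder model and the plateau-scheduler parameters are assumptions, since the paper only states that the rate is divided by 10 when the error plateaus.

```python
import torch

# Minimal sketch of the ImageNet optimization setup described above; `model` is
# a placeholder standing in for the plain/residual network defined elsewhere.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # initial learning rate
    momentum=0.9,       # momentum as stated above
    weight_decay=1e-4,  # weight decay as stated above
)
# Dividing the learning rate by 10 when the error plateaus can be approximated
# with a plateau scheduler (its patience value is an assumption); call
# scheduler.step(val_error) after each evaluation.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)
```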

4. Experiments

4.1. ImageNet Classification

We evaluate our method on the 1000-class ImageNet 2012 classification dataset [35]. The models are trained on the 1.28 million training images and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, as reported by the test server. We evaluate both the top-1 and top-5 error rates.

Plain networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is shown in Figure 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.

Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Figure 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

Table 2. Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameters compared to their plain counterparts. Figure 4 shows the training procedures.

The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Figure 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem:

Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameters compared to their plain counterparts.

The 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain net is a subspace of that of the 34-layer one. We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN [16], which ensures that forward propagated signals have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither the forward nor the backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent. We conjecture that the deep plain nets may have exponentially low convergence rates, which impacts the reduction of training error.

The reason for such optimization difficulties will be studied in the future.

Residual networks. Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3x3 filters as in Figure 3 (right). In the first comparison (Table 2 and Figure 4, right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A). So they have no extra parameters compared to the plain counterparts.

We have three major observations from Table 2 and Figure 4. First, the situation is reversed with residual learning: the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and generalizes to the validation data. This indicates that the degradation problem is well addressed in this setting, and we manage to obtain accuracy gains from increased depth. Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Figure 4, right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems. Last, we also note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Figure 4, right vs. left). When the net is "not overly deep" (18 layers here), the current SGD solver is still able to find good solutions for the plain net. In this case, the ResNet eases the optimization by providing faster convergence at the early stage.

Table 3. Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our testing. ResNet-50/101/152 use option B, which only uses projections for increasing dimensions.

Table 4. Error rates (%) of single-model results on the ImageNet validation set (except where reported on the test set).

Table 5. Error rates (%) of ensembles. The top-5 error is on the ImageNet test set, as reported by the test server.

Identity vs. projection shortcuts. We have shown that parameter-free identity shortcuts help with training. Next we investigate projection shortcuts (Eqn. (2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Figure 4, right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections. Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A; we argue that this is because the zero-padded dimensions in A have no residual learning. C is marginally better than B, and we attribute this to the extra parameters introduced by the many (thirteen) projection shortcuts. But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures introduced below.

Deeper bottleneck architectures. Next we describe our deeper nets for ImageNet. Because of concerns about the training time that we can afford, we modify the building block as a bottleneck design. For each residual function F, we use a stack of 3 layers instead of 2 (Figure 5). The three layers are 1x1, 3x3, and 1x1 convolutions, where the 1x1 layers are responsible for reducing and then increasing (restoring) the dimensions, leaving the 3x3 layer a bottleneck with smaller input/output dimensions. Figure 5 shows an example, where both designs have similar time complexity.

Figure 5. Deeper residual functions F for ImageNet. Left: a building block (on 56x56 feature maps) as in Figure 3 for ResNet-34. Right: a "bottleneck" building block for ResNet-50/101/152.

The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Figure 5 (right) were replaced with a projection, one can show that the time complexity and model size would be doubled, as the shortcut connects the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.

50-layer ResNet: We replace each 2-layer block in the 34-layer net with this 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs.
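The bottleneck design with option B shortcuts could be sketched in PyTorch as follows; the class name BottleneckBlock and the exact channel arguments are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Sketch of the 3-layer bottleneck: 1x1 (reduce), 3x3, 1x1 (restore)."""
    def __init__(self, in_channels, bottleneck_channels, out_channels, stride=1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, bottleneck_channels, 1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, 3,
                      stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # Identity shortcut when dimensions match; otherwise a 1x1 projection (option B).
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.layers(x) + self.shortcut(x))

# Example: a conv2_x bottleneck of ResNet-50 maps 64 -> 64 -> 256 channels.
block = BottleneckBlock(64, 64, 256)
```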

101-layer and 152-layer ResNets: We construct 101-layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than the VGG-16/19 nets (15.3/19.6 billion FLOPs). The 50/101/152-layer ResNets are considerably more accurate than the 34-layer ones (Tables 3 and 4). We do not observe the degradation problem, and thus enjoy significant accuracy gains from the greatly increased depth. The benefits of depth are witnessed for all evaluation metrics (Tables 3 and 4).

Comparisons with state-of-the-art methods. In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets achieve very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (only two 152-layer ones at the time of submitting). This leads to a 3.57% top-5 error on the test set (Table 5). This entry won first place in ILSVRC 2015.

4.2. CIFAR-10 and Analysis

We conducted more studies on the CIFAR-10 dataset [20], which consists of 50k training images and 10k test images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviors of extremely deep networks rather than on pushing the state-of-the-art results, so we intentionally use simple architectures as follows. The plain/residual architectures follow the form in Figure 3 (middle/right). The network inputs are 32x32 images with the per-pixel mean subtracted. The first layer is a 3x3 convolution. Then we use a stack of 6n layers with 3x3 convolutions on feature maps of sizes {32, 16, 8} respectively, with 2n layers for each feature map size. The numbers of filters are {16, 32, 64} respectively. Subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. There are in total 6n + 2 stacked weighted layers. The following table summarizes the architecture:

output map size   32x32   16x16   8x8
# layers          1+2n    2n      2n
# filters         16      32      64

When shortcut connections are used, they are connected to the pairs of 3x3 layers (totally 3n shortcuts), and on this dataset we use identity shortcuts in all cases, so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts. We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [12] and BN [16], but without dropout. These models are trained with a mini-batch size of 128 on two GPUs.
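To make the 6n + 2 construction concrete, here is a self-contained PyTorch sketch; the helper names (conv_bn, CifarBlock, make_cifar_resnet) are assumptions, and the shortcut uses the zero-padding identity variant so that, as stated above, the residual model adds no parameters over the plain one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn(in_ch, out_ch, stride=1):
    """3x3 convolution followed by batch norm, the repeated unit of these nets."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class CifarBlock(nn.Module):
    """Two 3x3 layers with an identity shortcut (zero-padded when channels grow)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.f1, self.f2 = conv_bn(in_ch, out_ch, stride), conv_bn(out_ch, out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.stride, self.pad = stride, out_ch - in_ch

    def forward(self, x):
        out = self.f2(self.relu(self.f1(x)))
        shortcut = x[:, :, ::self.stride, ::self.stride]   # subsample if stride > 1
        if self.pad:
            shortcut = F.pad(shortcut, (0, 0, 0, 0, 0, self.pad))  # zero-pad channels
        return self.relu(out + shortcut)

def make_cifar_resnet(n, num_classes=10):
    """Assemble the 6n + 2 layer CIFAR-10 network: n blocks per feature-map size."""
    layers = [conv_bn(3, 16), nn.ReLU(inplace=True)]             # first 3x3 conv layer
    blocks = [(16, 16, 1)] * n                                   # 32x32 feature maps
    blocks += [(16, 32, 2)] + [(32, 32, 1)] * (n - 1)            # 16x16 feature maps
    blocks += [(32, 64, 2)] + [(64, 64, 1)] * (n - 1)            # 8x8 feature maps
    layers += [CifarBlock(i, o, s) for i, o, s in blocks]
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes)]
    return nn.Sequential(*layers)

# n = 3, 5, 7, 9 gives the 20/32/44/56-layer networks; n = 18 gives a 110-layer net.
model = make_cifar_resnet(n=3)
out = model(torch.randn(2, 3, 32, 32))   # -> shape (2, 10)
```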

We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, a schedule determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32x32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32x32 image. We compare n = {3, 5, 7, 9}, leading to 20-, 32-, 44-, and 56-layer networks. Figure 6 (left) shows the behaviors of the plain nets. The deep plain nets suffer from increased depth, exhibiting higher training error when going deeper. This phenomenon is similar to that on ImageNet (Figure 4, left) and on MNIST (see [41]), suggesting that such an optimization difficulty is a fundamental problem. Figure 6 (middle) shows the behaviors of the ResNets. Also similar to the ImageNet cases (Figure 4, right), our ResNets manage to overcome the optimization difficulty and demonstrate accuracy gains when the depth increases. We further explore n = 18, which leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging.

Table 6. Classification error on the CIFAR-10 test set. All methods use data augmentation. For ResNet-110, we run it 5 times and show "best (mean ± std)" as in [42].

So we use a warm-up learning rate of 0.01 to train until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training.

The rest of the learning schedule is as done previously. This 110-layer network converges well (Figure 6, middle). It has fewer parameters than other deep and thin networks such as FitNet [34] and Highway [41] (Table 6), yet is among the state-of-the-art results (6.43%, Table 6).
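The base CIFAR-10 training schedule and augmentation described above (0.1 divided by 10 at 32k and 48k iterations, 4-pixel padding with random cropping and flipping) could be set up as in the following sketch; the torchvision transform composition and the placeholder model are assumptions, and the paper's per-pixel mean subtraction is not shown here.

```python
import torch
import torchvision.transforms as T

# Data augmentation as described above: pad 4 pixels per side, take a random
# 32x32 crop, and apply a random horizontal flip.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# Learning-rate schedule: start at 0.1, divide by 10 at 32k and 48k iterations,
# and stop at 64k iterations; here the scheduler is stepped once per iteration.
model = torch.nn.Linear(10, 10)  # placeholder for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[32000, 48000], gamma=0.1)
```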


Figure 6. Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers.

Figure 7. Standard deviations (std) of layer responses on CIFAR-10. The responses are the outputs of each 3x3 layer, after BN and before nonlinearity. Top: the layers are shown in their original order. Bottom: the responses are ranked in descending order.

Analysis of layer responses. Figure 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3x3 layer, after BN and before the other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Figure 7 shows that ResNets have generally smaller responses than their plain counterparts. These results support our basic motivation (Section 3.1) that the residual functions might in general be closer to zero than the non-residual functions. We also notice that deeper ResNets have smaller magnitudes of responses, as evidenced by the comparisons among ResNet-20, 56, and 110 in Figure 7. When there are more layers, an individual layer of a ResNet tends to modify the signal less.
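One way to reproduce this kind of per-layer response measurement is with forward hooks, as in the sketch below; hooking the BatchNorm2d outputs is an assumption that mirrors "after BN and before nonlinearity", and the function name collect_response_stds is illustrative.

```python
import torch
import torch.nn as nn

def collect_response_stds(model, inputs):
    """Sketch: record the std of each batch-norm output via forward hooks."""
    stds, handles = [], []
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            handles.append(module.register_forward_hook(
                lambda m, inp, out: stds.append(out.detach().std().item())))
    model.eval()
    with torch.no_grad():
        model(inputs)
    for h in handles:
        h.remove()
    return stds

# Example with the CIFAR model sketched earlier (assumed to be in scope):
# print(collect_response_stds(make_cifar_resnet(3), torch.randn(8, 3, 32, 32)))
```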

Exploring over 1000 layers. We explore an aggressively deep model of over 1000 layers. We set n = 200, which leads to a 1202-layer network, trained as described above. Our method shows no optimization difficulty, and this 10^3-layer network is able to achieve a training error of <0.1% (Figure 6, right). Its test error is still fairly good (7.93%, Table 6).

But there are still open problems with such aggressively deep models. The test result of this 1202-layer network is worse than that of our 110-layer network, although both have similar training error. We argue that this is because of overfitting. The 1202-layer network may be unnecessarily large (19.4M parameters) for this small dataset. Strong regularization such as maxout [9] or dropout [13] is applied to obtain the best results ([9,25,24,34]) on this dataset.

In this paper, we use no maxout/dropout and simply impose regularization through deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. Combining with stronger regularization may improve results, which we will study in the future.

Table 7. Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets, using baseline Faster R-CNN. See also the appendix for better results.

Table 8. Object detection mAP (%) on the COCO validation set, using baseline Faster R-CNN. See also the appendix for better results.

4.3. Object Detection on PASCAL and MS COCO

Our method has good generalization performance on other recognition tasks. Table 7 and Table 8 show the object detection baseline results on PASCAL VOC 2007 and 2012 [5] and COCO [26]. We adopt Faster R-CNN [] as the detection method. Here we are interested in the improvements of replacing VGG-16 [40] with ResNet-101. The detection implementation (see appendix) is the same for both models, so the gains can only be attributed to better networks. Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO's standard metric (mAP@[.5, .95]), which is a 28% relative improvement. This gain is solely due to the learned representations. Based on deep residual nets, we won first place in several tracks of the ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
